---
name: kubernetes-troubleshooter
version: 1.0.0
---

# Kubernetes Troubleshooter - Initialization

Systematically debug Kubernetes failures across pods, networking, storage, RBAC, deployments, and nodes. Provides a symptom-driven decision tree, exact kubectl commands for diagnosis, root cause analysis, and step-by-step resolution for EKS, GKE, AKS, and self-managed clusters.

## Package Structure

```
kubernetes-troubleshooter/
├── SKILL.md    # Main skill prompt (copy to AI assistant)
└── INIT.md     # This initialization file
```

## Files to Generate

None required - this is a prompt-only skill.

## Post-Installation Steps

### Claude Code Users

```bash
# Copy the skill into the Claude Code skills directory
# (create the directory first; copying into an existing target would nest the folder)
mkdir -p ~/.claude/skills
cp -r kubernetes-troubleshooter ~/.claude/skills/
```

### Other AI Assistants (ChatGPT, Gemini, Copilot)

1. Open `SKILL.md`
2. Copy all content after the frontmatter (after the second `---`)
3. Paste into your AI assistant as a system prompt or initial message
4. Describe your Kubernetes issue to begin troubleshooting
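
Step 2 can also be done from a shell. The following is a minimal sketch, assuming `SKILL.md` opens with a frontmatter block delimited by two `---` lines; the function name `skill_body` and the output file name are hypothetical.

```bash
# Print everything after the closing frontmatter delimiter (the second `---` line).
# Later `---` lines (for example horizontal rules) are kept as ordinary content.
skill_body() {
  awk 'seen >= 2 { print; next } /^---[[:space:]]*$/ { seen++ }' "$1"
}

# Usage: skill_body SKILL.md > prompt.txt
```

Paste the resulting text into the assistant as described in step 3.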

## Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| `{{error_message}}` | `CrashLoopBackOff` | The error message or symptom (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, DNS failure, 503 errors) |
| `{{resource_type}}` | `pod` | Kubernetes resource type: pod, service, deployment, ingress, pvc, statefulset, daemonset, job, node |
| `{{cluster_type}}` | `EKS` | Managed provider or self-managed: EKS, GKE, AKS, self-managed, k3s, minikube |
| `{{namespace}}` | `default` | Kubernetes namespace where the issue is occurring |
| `{{symptoms}}` | `pod not starting` | Observed symptoms: pod not starting, service unreachable, deployment stuck, evictions |

## Quick Usage Examples

### CrashLoopBackOff with Database Connection Error

```
My pod is stuck in CrashLoopBackOff:

Error: failed to connect to database at postgres:5432 - connection refused
Container: api-server
Namespace: production
Cluster: EKS (us-east-1)

The database pod is running and healthy. This started after a deployment
update 30 minutes ago.
```

### ImagePullBackOff on Private ECR Registry

```
Pods in my new deployment are stuck in ImagePullBackOff:

Image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/myapp:v2.3.1
Namespace: staging
Cluster: EKS
Error in events: "unauthorized: authentication required"

I'm using a new ECR repository that was created yesterday.
```

### Service Unreachable After Deployment

```
After deploying a new version, the service is returning 503 errors:

Service: frontend-svc (ClusterIP, port 80 -> targetPort 3000)
Deployment: frontend (3 replicas, all Running)
Namespace: production
Cluster: GKE
Ingress: GCE Ingress with managed certificate

The pods show Ready 1/1 but the service has no endpoints.
```

### PVC Stuck in Pending on AKS

```
My StatefulSet pod is stuck in Pending because the PVC cannot bind:

PVC: data-cassandra-0
StorageClass: managed-premium
Size: 100Gi
AccessMode: ReadWriteOnce
Namespace: databases
Cluster: AKS

Events show: "waiting for first consumer to be created before binding"
but the pod itself is also Pending.
```

### HPA Not Scaling Under Load

```
My HPA is not scaling up even though response times are degrading:

Deployment: order-service (currently 2 replicas)
HPA target: 70% CPU utilization
Current CPU from kubectl top: 95%
But HPA shows: <unknown>/70%
Namespace: production
Cluster: EKS
```

## Troubleshooting Coverage

### Pod Failures (8 patterns)

| Status | Root Causes Covered |
|--------|-------------------|
| CrashLoopBackOff | App config errors, dependency unavailable, liveness probe failure, bad entrypoint, exit code analysis (0, 1, 126, 127, 137, 139, 143) |
| ImagePullBackOff | Wrong tag, private registry auth, ECR token expiry, GCR/Artifact Registry permissions |
| OOMKilled | Memory limits too low, memory leaks, JVM heap misconfiguration |
| Pending | Insufficient resources, node affinity/taints, PVC not bound, ResourceQuota exceeded |
| Init:Error | Dependency not ready, permission errors, missing ConfigMaps |
| Evicted | Disk/memory/PID pressure on nodes |
| CreateContainerConfigError | Missing ConfigMap/Secret, invalid SecurityContext |
| RunContainerError | Invalid command, volume conflicts, security violations |
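
The exit codes listed for CrashLoopBackOff follow common shell and signal conventions: codes above 128 usually mean 128 plus a signal number. The sketch below maps them to their conventional meanings; the function name `explain_exit_code` is hypothetical, and the descriptions are the usual interpretations rather than guarantees about any particular application.

```bash
# Map a container exit code to its conventional meaning.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly (check the command and restartPolicy)" ;;
    1)   echo "general application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (128+9), often OOMKilled" ;;
    139) echo "SIGSEGV (128+11), segmentation fault" ;;
    143) echo "SIGTERM (128+15), termination requested" ;;
    *)   # Anything else above 128 is treated as 128 + signal number
         if [ "$1" -gt 128 ] 2>/dev/null; then
           echo "terminated by signal $(( $1 - 128 ))"
         else
           echo "application-defined error"
         fi ;;
  esac
}

# Reading the last exit code from a live cluster (<pod> is a placeholder):
# kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```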

### Networking (6 areas)

| Issue | Diagnostic Approach |
|-------|-------------------|
| Service unreachable | Endpoint verification, selector match, port mapping, debug pod testing |
| DNS failures | CoreDNS health, ndots config, search domains, FQDN resolution |
| Ingress not working | IngressClass, backend endpoints, TLS certificates, controller logs, provider annotations |
| NetworkPolicy blocking | Policy listing, podSelector matching, connectivity testing |
| Cross-namespace | FQDN format, namespaceSelector in NetworkPolicies |
| External connectivity | Egress policies, NAT gateway, cloud firewall rules |
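
For the DNS and cross-namespace rows, two small helpers may clarify the naming rules. This is a sketch that assumes the default cluster domain `cluster.local` (your cluster may use another) and the usual pod resolver setting `ndots:5`; both function names are hypothetical.

```bash
# Build the fully qualified name of a Service: <service>.<namespace>.svc.<cluster-domain>.
# The cluster domain defaults to "cluster.local"; pass a third argument to override it.
service_fqdn() {
  local svc=$1 ns=$2 domain=${3:-cluster.local}
  printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}

# Succeed if the resolver would try the search domains before the name itself.
# That happens when the name has fewer dots than ndots and no trailing dot.
uses_search_list_first() {
  local name=$1 ndots=${2:-5} dots
  case "$name" in *.) return 1 ;; esac   # trailing dot marks an absolute name
  dots=${name//[^.]/}                    # strip everything except the dots
  [ "${#dots}" -lt "$ndots" ]
}

# Usage:
# service_fqdn postgres production
# uses_search_list_first api.example.com && echo "search domains are tried first"
```

Appending a trailing dot to an external hostname is one way to skip the search list for that lookup.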

### Storage (3 areas)

| Issue | Diagnostic Approach |
|-------|-------------------|
| PVC Pending | StorageClass existence, CSI provisioner, capacity limits, access mode mismatch |
| Mount errors | Permission (fsGroup), SubPath validation, readOnly conflicts |
| StatefulSet | PVC lifecycle, zone-locked PV, WaitForFirstConsumer binding |
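
The storage checks above might start with something like the following sketch. The function names are hypothetical and every argument is a placeholder for your own namespace, claim, and StorageClass.

```bash
# Show a PVC's events and the StorageClass it requests.
pvc_check() {
  local ns=$1 pvc=$2
  kubectl describe pvc "$pvc" -n "$ns"
  kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}{"\n"}'
}

# Print the volume binding mode of a StorageClass (Immediate or WaitForFirstConsumer).
storageclass_binding_mode() {
  kubectl get storageclass "$1" -o jsonpath='{.volumeBindingMode}{"\n"}'
}

# Usage:
# pvc_check databases data-cassandra-0
# storageclass_binding_mode managed-premium
```

With `WaitForFirstConsumer`, a claim stays Pending until a pod that uses it is scheduled, so a Pending pod is usually the thing to investigate first.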

### RBAC (3 areas)

| Issue | Diagnostic Approach |
|-------|-------------------|
| 403 Forbidden | ServiceAccount identification, auth can-i testing, RoleBinding inspection |
| ServiceAccount | IRSA (EKS), Workload Identity (GKE/AKS), annotation verification |
| Scope confusion | Role vs ClusterRole, namespaced vs cluster-scoped resources |
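
The `auth can-i` test for a 403 can be run as the ServiceAccount itself through impersonation. A minimal sketch, assuming your own user is allowed to impersonate ServiceAccounts; the function name `sa_can_i` is hypothetical and the usage values are placeholders.

```bash
# Ask the API server whether a ServiceAccount may perform <verb> on <resource>
# in its own namespace. Prints "yes" or "no".
sa_can_i() {
  local ns=$1 sa=$2 verb=$3 resource=$4
  kubectl auth can-i "$verb" "$resource" \
    --as="system:serviceaccount:${ns}:${sa}" \
    --namespace "$ns"
}

# Usage: sa_can_i production api-server get secrets
```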

### Deployments (3 areas)

| Issue | Diagnostic Approach |
|-------|-------------------|
| Rollout stuck | ReplicaSet status, pod failures, PodDisruptionBudget, strategy analysis |
| Rollback | History inspection, undo commands, revision targeting |
| HPA | Metrics-server health, resource requests, min/max replicas, stabilization window (cooldown) |
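
The rollout and rollback rows correspond to the `kubectl rollout` subcommands. The sketch below wraps them in two hypothetical helper functions; the 60-second timeout is an arbitrary choice, and the deployment name and namespace are placeholders.

```bash
# Report rollout progress (giving up after 60s) and list recorded revisions.
rollout_check() {
  local ns=$1 deploy=$2
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=60s
  kubectl rollout history "deployment/${deploy}" -n "$ns"
}

# Roll back to the previous revision, or to a specific one if a number is given.
rollout_undo() {
  local ns=$1 deploy=$2 revision=${3:-}
  if [ -n "$revision" ]; then
    kubectl rollout undo "deployment/${deploy}" -n "$ns" --to-revision="$revision"
  else
    kubectl rollout undo "deployment/${deploy}" -n "$ns"
  fi
}

# Usage:
# rollout_check production frontend
# rollout_undo production frontend 3
```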

### Nodes (4 conditions)

| Condition | Diagnostic Approach |
|-----------|-------------------|
| NotReady | Kubelet status, container runtime, network partition, certificates |
| DiskPressure | Container image cleanup, log rotation, storage expansion |
| MemoryPressure | Top consumers, resource limits, cluster autoscaler |
| PIDPressure | Process count, podPidsLimit configuration |

### Managed K8s (3 providers)

| Provider | Common Issues Covered |
|----------|---------------------|
| EKS | ECR auth, VPC CNI IP exhaustion, IRSA/OIDC, ALB Controller, EBS zone locking |
| GKE | Autopilot minimums, Workload Identity, BackendConfig, Cloud DNS, auto-repair |
| AKS | Azure Disk latency, NSG blocking, federated credentials, AGIC health probes, Spot eviction |

## Companion Skills

- **Docker Expert** - Container image building and optimization for K8s workloads
- **DevOps Expert** - CI/CD pipelines and infrastructure automation
- **CI/CD Pipeline AI Optimizer** - Pipeline optimization for K8s deployments
- **Monitoring & Alerting Designer** - Proactive detection before issues escalate
- **Incident Response Playbook Builder** - Handle major cluster incidents with structured procedures

## Best Practices

**Do:**
- Check `kubectl describe` and events before deeper debugging
- Use `kubectl logs --previous` for crash logs from restarted containers
- Keep a debug pod image available (`nicolaka/netshoot`) for network troubleshooting
- Set resource requests and limits on all production workloads
- Use `WaitForFirstConsumer` volume binding mode for zone-aware storage
- Test RBAC permissions with `kubectl auth can-i` before deploying
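
The first two checks can be combined into a single first-pass routine. This is a sketch only; the function name `pod_triage` is hypothetical, the 100-line tail is an arbitrary choice, and the namespace and pod name are placeholders.

```bash
# First-pass triage for a failing pod: description, related events, previous logs.
pod_triage() {
  local ns=$1 pod=$2
  kubectl describe pod "$pod" -n "$ns"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp \
    --field-selector "involvedObject.name=${pod}"
  # --previous reads logs from the last terminated instance of the container;
  # it returns an error if the container has never restarted.
  kubectl logs "$pod" -n "$ns" --previous --tail=100
}

# Usage: pod_triage production api-server-7d9f8c6b5-x2k4q
```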

**Don't:**
- Skip checking endpoints when a service is unreachable (most common miss)
- Ignore exit codes - they pinpoint the exact failure category
- Set memory limits equal to requests (allow some headroom for spikes)
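- Forget that the default pod resolver setting `ndots:5` can slow external DNS resolution (names with fewer than five dots are tried against every search domain first)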
- Assume the same troubleshooting applies across EKS/GKE/AKS (provider-specific gotchas differ)
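
The endpoint check from the first item might look like the sketch below. The function name `svc_endpoints` is hypothetical and the arguments are placeholders.

```bash
# Compare a Service's endpoints with its selector and the labels on the pods.
svc_endpoints() {
  local ns=$1 svc=$2
  # An empty ENDPOINTS column means no Ready pod matches the selector
  kubectl get endpoints "$svc" -n "$ns"
  # Print the selector the Service uses to pick pods
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}{"\n"}'
  # List pod labels to compare against that selector
  kubectl get pods -n "$ns" --show-labels
}

# Usage: svc_endpoints production frontend-svc
```

If the selector and the pod labels differ, the Service has nothing to route to even when every pod reports Ready.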

---
Downloaded from [Find Skill.ai](https://findskill.ai)
