Monitoring and Alerting Designer
Design complete observability systems with SLO-based alerts, multi-burn-rate rules, alert fatigue reduction, and integration with incident response.
Usage Example
Design an SLO-based alerting strategy for our checkout service with 99.99% availability and p99 latency < 500 ms. We receive more than 200 alerts per day, many of them false positives.
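To put the 99.99% target from the example prompt in perspective, here is a minimal sketch of the error budget arithmetic. It assumes a 30-day evaluation window (the default in the customization table, not stated in the prompt), and the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of total unavailability an availability SLO tolerates in a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    window_minutes = window_days * 24 * 60
    # The error budget is the fraction of the window that is allowed to fail.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Assumed 30-day window; compare a few common availability targets.
    for slo in (99.9, 99.95, 99.99):
        print(f"{slo}% over 30d: {error_budget_minutes(slo, 30):.2f} min of budget")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of budget per 30 days. With so little budget to spend, the alerts that page someone need to be both fast and precise.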
How to Use This Skill
Copy the skill using the button above
Paste it into your AI assistant (Claude, ChatGPT, etc.)
Fill in your details below (optional) and copy them to include with your prompt
Send and start chatting with your AI
Suggested Customization
| Description | Default | Your Value |
|---|---|---|
| Target SLO percentage (e.g., 99.95 for 99.95% availability) | 99.95 | |
| Time window for SLO evaluation (e.g., 30d, 7d, 1h) | 30d | |
| Burn rate multiplier for critical/page alerts | 14.4 | |
| Burn rate multiplier for warning/ticket alerts | 1.0 | |
| Target monitoring platform (prometheus, datadog, dynatrace, grafana) | prometheus | |
| Distributed tracing backend (jaeger, zipkin, tempo, datadog) | jaeger | |
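The default burn rate multipliers in the table are easier to interpret as a share of the error budget. The sketch below assumes the 30d SLO window from the table and pairs each multiplier with the alert window commonly recommended in the Google SRE Workbook (1 hour for 14.4, 3 days for 1.0). Those pairings and both function names are assumptions, not part of the skill itself.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: float = 30) -> float:
    """Fraction of the error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def hours_to_exhaustion(burn_rate: float, slo_window_days: float = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return slo_window_days * 24 / burn_rate


if __name__ == "__main__":
    # Assumed pairings: page at 14.4x over 1h, ticket at 1.0x over 3d (72h).
    for name, rate, hours in (("page", 14.4, 1), ("ticket", 1.0, 72)):
        print(f"{name}: {budget_consumed(rate, hours):.0%} of budget in {hours}h, "
              f"exhausted after {hours_to_exhaustion(rate):.0f}h")
```

Under these assumptions, a burn rate of 14.4 sustained for one hour spends about 2% of a 30-day budget and would exhaust it in 50 hours. A burn rate of 1.0 spends the budget exactly over the full window.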
Design comprehensive observability systems that provide real-time visibility into system health, performance, and reliability. Create SLO-based alerting strategies with multi-burn-rate rules, reduce alert fatigue through intelligent optimization, and integrate monitoring with incident response workflows for faster resolution.
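As an illustration of the multi-burn-rate idea, the sketch below evaluates rules that fire only when both a long and a short window exceed the burn rate threshold, so an alert clears soon after the error rate recovers. The `BurnRateRule` class, the `firing_severities` function, and the window pairings (1h/5m and 3d/6h) are hypothetical choices for this example. A real deployment would typically express the same logic as rules in the target monitoring platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    severity: str
    burn_rate: float
    long_window: str   # key into the measured error ratios
    short_window: str  # shorter window that confirms the burn is still ongoing


# Assumed pairings; multipliers match the table defaults (14.4 and 1.0).
RULES = (
    BurnRateRule("page", 14.4, "1h", "5m"),
    BurnRateRule("ticket", 1.0, "3d", "6h"),
)


def firing_severities(error_ratios: dict[str, float],
                      slo_percent: float = 99.95) -> list[str]:
    """Return severities whose long AND short windows both exceed the threshold."""
    budget = (100 - slo_percent) / 100  # allowed error ratio
    fired = []
    for rule in RULES:
        threshold = rule.burn_rate * budget
        if (error_ratios[rule.long_window] > threshold
                and error_ratios[rule.short_window] > threshold):
            fired.append(rule.severity)
    return fired


if __name__ == "__main__":
    # Hypothetical measurements: error ratio per window for one service.
    ongoing = {"1h": 0.01, "5m": 0.02, "3d": 0.0003, "6h": 0.001}
    recovered = {"1h": 0.01, "5m": 0.001, "3d": 0.0008, "6h": 0.001}
    print(firing_severities(ongoing))
    print(firing_severities(recovered))
```

Requiring the short window to agree is what keeps a brief spike from paging long after it has ended. The cost is that each rule needs two measurements.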
Research Sources
This skill was created using research from these authoritative sources:
- From Monitoring to Observability: A Paradigm Shift in IT Operations. Comprehensive guide on the shift from traditional monitoring to observability, covering logs, metrics, and traces.
- Ways to Alert on Significant Events (Google SRE Workbook). Official Google approach to multi-burn-rate and multi-window SLO-based alerting strategies.
- Designing Tomorrow's Observability: Software Architect's Guide. Deep dive into observability architecture, tool selection, and implementation patterns.
- Monitoring Distributed Cloud-Based Microservices. Framework for monitoring cloud microservices, covering APM, infrastructure health, and log aggregation.
- Intelligent Alerting with AI-Powered Anomaly Detection. Modern ML approaches to noise reduction, including predictive alerting and Holt-Winters forecasting.
- SLO Monitoring Guide - Measuring Service Reliability. Practical guide on SLO setup, SLI definition, and actionable threshold configuration.
- How We Use Sloth for SLO Monitoring with Prometheus. Real-world implementation of multi-window, multi-burn-rate alerting at Mattermost.
- Observability Best Practices - Embrace.io. Best practices including actionable alerts, cross-department collaboration, and data quality.