Monitoring and Alerting Designer
Designs comprehensive observability systems with SLO-based alerting, multi-burn-rate rules, alert fatigue reduction, and incident response integration for distributed systems and microservices.
How to Use This Skill
1. Copy the skill using the button above.
2. Paste it into your AI assistant (Claude, ChatGPT, etc.).
3. Fill in your information below (optional) and copy it to include with your prompt.
4. Send it and start chatting with your AI.
Suggested Customization
| Description | Default | Your Value |
|---|---|---|
| Target SLO percentage (e.g., 99.95 for 99.95% availability) | 99.95 | |
| Time window for SLO evaluation (e.g., 30d, 7d, 1h) | 30d | |
| Burn rate multiplier for critical/page alerts | 14.4 | |
| Burn rate multiplier for warning/ticket alerts | 1.0 | |
| Target monitoring platform (prometheus, datadog, dynatrace, grafana) | prometheus | |
| Distributed tracing backend (jaeger, zipkin, tempo, datadog) | jaeger | |
Design comprehensive observability systems that provide real-time visibility into system health, performance, and reliability. Create SLO-based alerting strategies with multi-burn-rate rules, reduce alert fatigue through intelligent optimization, and integrate monitoring with incident response workflows for faster resolution.
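To make the default values in the table concrete, here is a minimal Python sketch of the arithmetic behind multi-window, multi-burn-rate alerting. It is an illustration under stated assumptions, not part of the skill itself: the names `BurnRateRule`, `error_budget`, `error_ratio_threshold`, `budget_consumed`, and `should_fire` are hypothetical, and the window lengths (1 hour with a 5 minute confirmation window for pages, 3 days with a 6 hour confirmation window for tickets) are assumed values in the style of the Google SRE Workbook rather than settings taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One alerting rule (hypothetical structure, for illustration only)."""
    severity: str              # e.g. "page" or "ticket"
    burn_rate: float           # multiple of the "exactly on budget" error rate
    long_window_hours: float   # main evaluation window (assumed value)
    short_window_hours: float  # confirmation window (assumed value)


def error_budget(slo_percent: float) -> float:
    """Fraction of requests allowed to fail, e.g. 99.95 -> about 0.0005."""
    return 1.0 - slo_percent / 100.0


def error_ratio_threshold(slo_percent: float, burn_rate: float) -> float:
    """Observed error ratio above which a rule with this burn rate trips."""
    return burn_rate * error_budget(slo_percent)


def budget_consumed(burn_rate: float, window_hours: float,
                    slo_window_days: float) -> float:
    """Share of the whole SLO-window budget spent if this burn rate
    is sustained for window_hours."""
    return burn_rate * window_hours / (slo_window_days * 24.0)


def should_fire(rule: BurnRateRule, slo_percent: float,
                long_window_error_ratio: float,
                short_window_error_ratio: float) -> bool:
    """Fire only when BOTH windows exceed the threshold, so the alert
    stops soon after the error rate recovers."""
    threshold = error_ratio_threshold(slo_percent, rule.burn_rate)
    return (long_window_error_ratio > threshold
            and short_window_error_ratio > threshold)


# Defaults from the customization table; window lengths are assumptions.
PAGE = BurnRateRule("page", burn_rate=14.4,
                    long_window_hours=1.0, short_window_hours=5.0 / 60.0)
TICKET = BurnRateRule("ticket", burn_rate=1.0,
                      long_window_hours=72.0, short_window_hours=6.0)

if __name__ == "__main__":
    slo, slo_days = 99.95, 30.0
    for rule in (PAGE, TICKET):
        threshold = error_ratio_threshold(slo, rule.burn_rate)
        spent = budget_consumed(rule.burn_rate, rule.long_window_hours, slo_days)
        print(f"{rule.severity}: fires above {threshold:.4%} errors, "
              f"{spent:.0%} of the budget spent per long window")
```

Under these assumptions, a burn rate of 14.4 sustained for one hour spends about 2% of a 30-day error budget, and a burn rate of 1.0 sustained for three days spends about 10%. Requiring the short window to agree with the long one is what keeps a page from lingering after the error rate has already recovered.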
Research Sources
This skill was built using research from these trusted sources:
- "From Monitoring to Observability: A Paradigm Shift in IT Operations": comprehensive guide on the shift from traditional monitoring to observability, covering logs, metrics, and traces
- "Ways to Alert on Significant Events" (Google SRE Workbook): official Google approach to multi-burn-rate and multi-window SLO-based alerting strategies
- "Designing Tomorrow's Observability: Software Architect's Guide": deep dive into observability architecture, tool selection, and implementation patterns
- "Monitoring Distributed Cloud-Based Microservices": framework for monitoring cloud microservices, covering APM, infrastructure health, and log aggregation
- "Intelligent Alerting with AI-Powered Anomaly Detection": modern ML approaches to noise reduction, including predictive alerting and Holt-Winters forecasting
- "SLO Monitoring Guide - Measuring Service Reliability": practical guide on SLO setup, SLI definition, and actionable threshold configuration
- "How We Use Sloth for SLO Monitoring with Prometheus": real-world implementation of multi-window, multi-burn-rate alerting at Mattermost
- "Observability Best Practices - Embrace.io": best practices including actionable alerts, cross-department collaboration, and data quality