Platform monitoring, in simple terms, is how a team or an engineer proactively guards against service disruption and gains visibility into service performance, using tools that expose relevant metrics and let you set alerts for unwanted events. The practice is closely associated with Site Reliability Engineering (SRE).
Site Reliability Engineering (SRE)
Site Reliability Engineering focuses on creating scalable and highly reliable services and software systems. SRE is not just about ensuring your systems are available; it's about maintaining a balance between reliability and feature development. Below are some key principles:
Service Level Objectives (SLOs): Define measurable SLOs that reflect the user experience. These are the reliability and performance targets you commit to maintaining, and they set expectations for how the service should behave.
Error Budgets: SREs manage error budgets, which represent the acceptable amount of downtime or errors in a given period. While a service stays within its error budget, the team can focus on feature development; once the budget is spent, the focus shifts to reliability work.
Automation: Automate repetitive operational tasks, such as scaling, provisioning, and recovery, to reduce the risk of human error and improve efficiency.
Monitoring and Alerting: Implement robust monitoring and alerting systems to detect issues proactively and respond promptly.
Incident Management: Develop well-defined incident management processes and SOPs (Standard Operating Procedures) to minimize downtime and learn from incidents to prevent recurrence.
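To make the relationship between an SLO and its error budget concrete, here is a minimal sketch in Python. The 99.9% target, the 30-day window, and the function names are illustrative assumptions, not values prescribed by any particular team or tool.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed by an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime / budget


# Assumed example: a 99.9% availability SLO measured over 30 days.
budget = error_budget(0.999, timedelta(days=30))
remaining = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=10))
print(f"Allowed downtime: {budget}, budget remaining: {remaining:.0%}")
```

With these assumed numbers the budget works out to roughly 43 minutes of downtime per 30 days, which gives a team a concrete figure to weigh against the risk of each release.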
Getting Started with Platform Monitoring Tools
Effective platform monitoring is a cornerstone of SRE. Here's how to get started:
1. Define Monitoring Objectives
Begin by defining what you need to monitor. Consider factors like user experience, system performance, resource utilization, and security. Identify key performance indicators (KPIs) that align with your SLOs.
2. Choose Monitoring Tools
Select monitoring tools that align with your monitoring objectives. Some popular choices include:
Prometheus: An open-source monitoring and alerting toolkit designed for reliability and scalability.
Grafana: A platform for creating, sharing, and exploring dashboards and data visualizations.
New Relic: A comprehensive application performance monitoring (APM) solution.
Datadog: A cloud-based monitoring and analytics platform. It serves as a central place where applications log events and report configurable metrics, which can then drive alerts that are triaged according to their priority.
FireHydrant: An incident management tool that can be used for incident orchestration, logging, and playbook automation.
PagerDuty: An on-call and escalation tool that integrates with the tools above to route escalations across configurable levels, e.g., an SME engineer, the operations center, or the incident owner.
3. Instrument Your Code
Instrumentation involves adding code to your application to collect data and metrics. These metrics can include response times, error rates, resource usage, and more. Proper instrumentation is crucial for effective monitoring.
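As a rough illustration of what instrumentation looks like in practice, the sketch below wraps a function with a decorator that records request counts, error counts, and latency. The in-memory dictionaries, the `instrumented` decorator, and the `handle_checkout` handler are all hypothetical; a real service would typically export these values through the client library of its chosen monitoring tool.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metric store, used here only to keep the sketch
# self-contained. A real service would export these to its monitoring tool.
REQUEST_COUNT = defaultdict(int)
ERROR_COUNT = defaultdict(int)
LATENCY_SECONDS = defaultdict(list)


def instrumented(name):
    """Record call count, error count, and latency for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            REQUEST_COUNT[name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                ERROR_COUNT[name] += 1
                raise
            finally:
                # Latency is recorded for both successful and failed calls.
                LATENCY_SECONDS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def handle_checkout(order_total):
    """Hypothetical request handler used only to demonstrate the decorator."""
    if order_total < 0:
        raise ValueError("order total cannot be negative")
    return {"status": "ok", "total": order_total}
```

Keeping the measurement in a decorator means the business logic stays unchanged, and the error rate for a given name can be derived as `ERROR_COUNT[name] / REQUEST_COUNT[name]`.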
4. Set Up Alerts
Configure alerts based on your monitoring data and SLOs. Alerts should be actionable, meaning they indicate a problem that requires immediate attention. Alerting channels can include phone calls, push notifications, SMS, emails, Slack notifications, and dedicated groups or channels created for the incident. Avoid alert fatigue by setting meaningful thresholds.
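One common way to keep thresholds meaningful is to alert only when a condition persists, rather than on a single bad reading. The sketch below is one possible approach, not how any specific tool implements it; the `ErrorRateAlert` class, the 5% threshold, and the three-check requirement are assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Fire only when the error rate stays above a threshold for several checks."""

    def __init__(self, threshold: float, consecutive_breaches: int):
        self.threshold = threshold
        # Keep only the most recent N breach results.
        self.recent = deque(maxlen=consecutive_breaches)

    def observe(self, errors: int, requests: int) -> bool:
        """Record one evaluation window and return True if the alert should fire."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Assumed policy: page only after three consecutive windows above 5% errors.
alert = ErrorRateAlert(threshold=0.05, consecutive_breaches=3)
for errors, requests in [(10, 100), (10, 100), (10, 100)]:
    should_page = alert.observe(errors, requests)
```

A short spike that recovers on its own never fills the window with breaches, so it does not page anyone, while a sustained problem does.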
5. Implement Observability
Observability goes beyond monitoring and includes the ability to explore and understand your system's behavior. It often involves logging, distributed tracing, and structured data.
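Structured logging is often the easiest of these to adopt first. The sketch below uses Python's standard `logging` module to emit one JSON object per log line with a correlation ID attached; the `JsonFormatter` class, the `payments` logger name, and the `trace_id` field are hypothetical names chosen for this example.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is supplied by the caller through the `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to the standard error stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payments")
trace_id = uuid.uuid4().hex  # hypothetical per-request correlation ID
logger.info("charge accepted", extra={"trace_id": trace_id})
```

Because every line is machine-parseable and carries the same `trace_id` for a given request, log entries from different services can be filtered and joined when exploring an incident.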
6. Create Dashboards
Build dashboards that display key metrics and data visualizations. Dashboards make it easy to track the health of your systems and provide a quick overview for incident response.
7. Practice Incident Response
Develop incident response procedures and conduct regular drills. Ensure that your team knows how to react when an incident occurs, and focus on resolving issues quickly while learning from each incident to prevent future occurrences.
Incident Management and Reporting
An incident is an event that disrupts a service or reduces its quality. An incident is resolved when the affected service is functioning as expected again. It is important to review each incident afterwards in a retrospective, both to understand its causes and effects and to plan how to prevent, mitigate, or better handle similar situations in the future.
Creating Internal Processes
Incident Template: A document that provides, at a glance, the data needed to understand what happened, the current status, and the impact of the incident. It can include fields such as the environment, the impacted product, service, or functionality, the author of the document, and the assigned teams, along with links to group conversations (e.g., a FireHydrant-generated Slack channel link) or an internal ticket, with as many relevant references as possible.
Service outage: Depending on the nature of your business, it is important to keep your users updated on the status of your services, including outages and their impact.
Incident reports: Teams can collate multiple incidents, showing the most relevant information at a glance, like the service details, a summary of the incident, impact, more details, time to recovery, measures taken, and associated tickets/docs.
Incident retrospectives: Based on past incidents, teams can create action plans to improve how efficiently they handle incidents and risks. Examples include knowledge-sharing initiatives such as bi-monthly cross-team Q&A sessions on common issues of different priority levels, quick mitigation strategies, and standard operating procedures.
Standard Operating Procedures (SOPs): These range from basic practices like rolling back releases to steps for troubleshooting, debugging, or drilling down to the source of an issue. They can vary from team to team. They should outline:
- How to verify that the issue is happening
- Procedure to follow to fix issues
- Which team or person should be contacted at each escalation level, and what information should be provided to them.
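The incident template and report fields described in this section can be captured in a small data structure, which also makes it straightforward to compute time to recovery across incidents. The sketch below is one possible shape; the `Incident` class, its field names, and the `mean_time_to_recovery` helper are hypothetical and should be adapted to whatever your team actually tracks.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Fields loosely mirror the incident template and report items above."""
    title: str
    environment: str
    service: str
    author: str
    impact: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    assigned_teams: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)  # tickets, chat links, docs

    @property
    def status(self) -> str:
        return "resolved" if self.resolved_at else "open"

    @property
    def time_to_recovery(self) -> Optional[timedelta]:
        """Elapsed time from start to resolution, or None while still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


def mean_time_to_recovery(incidents: List[Incident]) -> Optional[timedelta]:
    """Average recovery time across resolved incidents, or None if there are none."""
    durations = [i.time_to_recovery for i in incidents if i.time_to_recovery is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Tracking the mean time to recovery over successive reporting periods gives retrospectives a simple trend to discuss alongside the qualitative findings.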