Site Reliability Engineering (SRE) Course Content
1. Introduction to Site Reliability Engineering (SRE)
- Overview of SRE: What SRE is and how it originated at Google. SRE vs. DevOps: Key differences and similarities.
- Core Concepts in SRE: SLIs, SLOs, SLAs, Error Budgets, and their role in balancing feature development vs reliability.
- Principles of SRE: Automating operations, measuring reliability, and reducing toil.
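Illustrative sketch for the core concepts above: one way the relationship between an SLO and its error budget could be worked through in Python. The function name `error_budget`, the request-based availability SLI, and the sample numbers are assumptions made for this example, not part of the course material.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures

    return {
        "sli": 1.0 - failed_requests / total_requests,  # measured availability
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,         # 1.0 means the budget is spent
        "budget_remaining": 1.0 - consumed,  # negative means the SLO is breached
    }


# Hypothetical numbers: a 99.9% SLO over one million requests with 400 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
for key, value in report.items():
    print(f"{key}: {value:.4f}")
```

With these made-up numbers roughly 40% of the budget is consumed, which is the kind of signal a team might use when weighing feature releases against reliability work.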
2. Monitoring and Observability Tools
Prometheus (Core and Advanced Topics)
- Introduction to Prometheus: Overview, architecture, components.
- Setting up Prometheus: Configuration, exporters, scraping metrics.
- Prometheus Metrics Types: Counters, Gauges, Histograms, and Summaries.
- Prometheus Query Language (PromQL): Writing and using complex queries.
- Advanced PromQL Features: Subqueries, vector matching (joins), regex label matchers, alerting expressions, and anomaly detection.
- Alerting with Prometheus: Configuring alert rules, integrating with Alertmanager.
- Federation in Prometheus: Multi-cluster monitoring and scaling Prometheus.
- Prometheus Security: Authentication, TLS, and securing endpoints.
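Illustrative sketch for the counter metric type and rate-style queries listed above: a plain-Python approximation of how an increase can be derived from raw counter samples when the counter resets. This is a simplification written for this outline; Prometheus's own `rate()` and `increase()` additionally extrapolate to the window boundaries, which is not modeled here. The function names and sample values are assumptions.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across time-ordered counter samples.

    A counter only goes up, so any drop between two samples is treated
    as a reset (for example, the monitored process restarted).
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # After a reset the counter restarts from zero, so the whole
            # current value counts as new increase.
            total += curr
    return total


def per_second_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second increase over the window covered by the samples."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return counter_increase(samples) / window_seconds


# Hypothetical scrape values for a request counter; the drop from 160 to 5
# stands in for a process restart.
scraped = [100, 130, 160, 5, 25]
print(counter_increase(scraped))
print(per_second_rate(scraped, window_seconds=60))
```

A gauge, by contrast, can be read directly from its latest sample, which is why reset handling applies to counters only.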
Grafana (Core and Advanced Topics)
- Introduction to Grafana: Visualizing monitoring data from Prometheus.
- Creating Dashboards: Visualizing time-series and log data in different formats.
- Advanced Querying in Grafana: SQL, InfluxQL, PromQL, and combining multiple data sources.
- Grafana Plugins: Installing, configuring, and using plugins for enhanced visualization.
- Alerting with Grafana: Setting up thresholds, webhooks, and notifications.
- High Availability (HA) Setup: Ensuring Grafana’s uptime and backing up dashboards.
Kubernetes (Core and Advanced Topics)
- Introduction to Kubernetes: Core components, Pods, Nodes, Services, Deployments.
- Kubernetes Cluster Setup: Installing and configuring Kubernetes.
- Kubernetes Networking: Understanding the networking model and ingress controllers.
- Kubernetes Operators: Automating application lifecycle management.
- Helm Charts: Packaging, deploying, and managing Kubernetes applications.
- Kubernetes Security: RBAC, secrets management, securing Kubernetes clusters.
- Kubernetes Federation: Managing multi-cluster Kubernetes environments.
3. Infrastructure Automation and Configuration Management Tools
Terraform (Core and Advanced Topics)
- Introduction to Terraform: Understanding infrastructure as code.
- Terraform Configuration: Writing configurations, managing resources.
- State Management: Remote backends, state versioning, and locking.
- Advanced Terraform Features: Modules, workspaces, CI/CD pipelines integration.
- Terraform Cloud and Enterprise: Collaborative infrastructure management.
Ansible (Core and Advanced Topics)
- Introduction to Ansible: Playbooks, Roles, Inventory, and Modules.
- Writing Ansible Playbooks: Creating simple and complex automation tasks.
- Ansible Vault: Encrypting sensitive data for security.
- Dynamic Inventory: Integrating Ansible with cloud providers for automated discovery.
- Ansible Tower/AWX: Managing, scheduling, and auditing automation jobs centrally with Ansible Tower or AWX.
- Ansible Collections: Using and creating reusable Ansible collections.
4. Incident Management and Automation Tools
PagerDuty (Core and Advanced Topics)
- Introduction to PagerDuty: Incident tracking and management.
- Advanced On-Call Scheduling: Configuring escalations, time zones, rotations.
- Integrating PagerDuty with Monitoring Tools: Triggering incidents based on alerts.
- Incident Collaboration: Multi-team collaboration during major incidents.
- PagerDuty API: Automating incident creation, updates, escalations.
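Illustrative sketch for the integration and API topics above: building the JSON body that a monitoring tool might send to trigger a PagerDuty incident. The field names follow the PagerDuty Events API v2 event format as commonly documented (`routing_key`, `event_action`, `dedup_key`, `payload`), but should be checked against the current PagerDuty reference. The `alert` dictionary shape, the helper name `build_trigger_event`, and the routing key are hypothetical, and nothing is sent over the network.

```python
import json

# Severity values accepted in the event payload, per the Events API v2 format.
SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, alert: dict) -> dict:
    """Translate a monitoring alert (hypothetical shape) into a trigger event."""
    severity = alert.get("severity", "error")
    if severity not in SEVERITIES:
        severity = "error"  # fall back rather than emit an invalid value

    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Reusing the same dedup key lets repeated alerts update one incident
        # instead of opening a new one each time.
        "dedup_key": f"{alert['name']}:{alert['source']}",
        "payload": {
            "summary": f"{alert['name']} firing on {alert['source']}",
            "source": alert["source"],
            "severity": severity,
        },
    }


# Hypothetical alert as it might arrive from an alerting pipeline.
example_alert = {"name": "HighErrorRate", "source": "checkout-api", "severity": "critical"}
event = build_trigger_event("EXAMPLE_ROUTING_KEY", example_alert)
print(json.dumps(event, indent=2))
```

In a real integration this body would be sent as an HTTP POST to the Events API endpoint, and later `acknowledge` or `resolve` events would reuse the same `dedup_key`.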
5. Containerization and CI/CD Tools
Docker (Core and Advanced Topics)
- Introduction to Docker: What Docker is and why it is essential.
- Docker Compose: Managing multi-container applications.
- Docker Swarm: Setting up a Docker Swarm cluster.
- Advanced Docker Builds: Multi-stage builds and optimizing image layers.
- Docker Security: Best practices for securing containers and images.
Jenkins (Core and Advanced Topics)
- Introduction to Jenkins: Setting up Jenkins for CI/CD pipelines.
- Jenkins Pipeline: Writing Declarative and Scripted pipelines.
- Jenkins Integration with Kubernetes: Running Jenkins agents in Kubernetes.
- Jenkins Security: Managing roles, permissions, and credentials securely.
- Jenkins Blue Ocean: Visualizing and debugging pipelines with Blue Ocean.
6. Chaos Engineering Tools
Gremlin (Core and Advanced Topics)
- Introduction to Chaos Engineering: Key principles of fault injection.
- Using Gremlin for Fault Injection: Simulating resource exhaustion, latency, etc.
- Gremlin Attack Types: Simulating network failures, CPU spikes, memory leaks.
- Post-Incident Analytics: Analyzing system performance after chaos experiments.
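Illustrative sketch for the fault-injection principle above: a small Python decorator that adds random latency and failures to a function call. This is not Gremlin's API; Gremlin injects faults at the host, container, or network level, whereas this example works at the application level purely to show the idea. The names `inject_faults` and `fetch_profile` and all parameter values are assumptions.

```python
import random
import time
from functools import wraps


def inject_faults(failure_rate: float, max_delay_s: float,
                  rng: random.Random, sleep=time.sleep):
    """Wrap a function so each call may be delayed and may fail on purpose."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Latency injection: wait a random time up to max_delay_s.
            sleep(rng.uniform(0.0, max_delay_s))
            # Failure injection: raise with probability failure_rate.
            if rng.random() < failure_rate:
                raise RuntimeError("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# A seeded generator keeps the experiment repeatable.
chaos_rng = random.Random(42)


@inject_faults(failure_rate=0.3, max_delay_s=0.01, rng=chaos_rng)
def fetch_profile(user_id: int) -> dict:
    """Hypothetical dependency call standing in for a real service."""
    return {"user_id": user_id, "status": "ok"}


# Run a small experiment and record how often the caller sees a failure.
failures = 0
for attempt in range(20):
    try:
        fetch_profile(attempt)
    except RuntimeError:
        failures += 1
print(f"{failures} of 20 calls failed under injected faults")
```

Counting failures and timing calls under injection gives the raw data for the analysis step after an experiment: whether retries, timeouts, and fallbacks behaved as expected.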