Observability in Serverless
Overview of tracing, metrics, logging, and alerting strategies for serverless applications, with multi-cloud considerations and recommended tooling.
Simple Explanation
What it is
Observability is how you see what your serverless system is doing through logs, metrics, and traces.
Why we need it
When something breaks, you need evidence, not guesses. Observability makes that possible.
Benefits
- Faster diagnosis of errors and slow requests.
- Clear visibility across services and regions.
- Better reliability because problems are found early.
Tradeoffs
- More setup for dashboards and alerts.
- Ongoing cost for log storage and tracing.
Real-world examples (architecture only)
- Trace shows slow database call -> Fix query.
- Alert triggers on error spike -> Rollback.
What This Lesson Covers
- Logging strategy and structured logs
- Metrics that matter (latency, errors, saturation)
- Distributed tracing and correlation IDs
- Alerting thresholds and SLOs
- Multi-cloud visibility patterns
Core Concepts
- Logs: Details of a single event
- Metrics: Aggregated trends over time
- Traces: End-to-end request flow
- SLOs: Reliability targets
- Alerting: Automated notifications when a metric breaches a threshold or SLO
Structured Logging (Python)
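import json
from datetime import datetime, timezone


def log_event(level, message, **data):
    # Emit one JSON object per line so log tooling can parse and filter fields.
    print(json.dumps({
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **data,
    }))


def handler(event, context):
    # aws_request_id comes from the AWS Lambda context object.
    log_event("INFO", "Request received", requestId=context.aws_request_id)
    # ...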
Correlation IDs
Add a request ID to every log line so you can trace one request across services.
def get_request_id(event, context):
    # "headers" may be null, so fall back to {} before reading x-request-id.
    headers = event.get("headers") or {}
    return headers.get("x-request-id") or context.aws_request_id
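To trace one request across services, the same ID also has to travel on outgoing calls. The following is a minimal sketch of that idea, not a definitive implementation. The helper names (resolve_request_id, outgoing_headers, log_line) are hypothetical, and it assumes downstream services read the same x-request-id header.

```python
import json

# Header name taken from the lesson; downstream services are assumed to read it too.
CORRELATION_HEADER = "x-request-id"


def resolve_request_id(event, fallback_id):
    """Return the caller-supplied request ID, or fallback_id if none was sent."""
    headers = event.get("headers") or {}
    # HTTP header names are case-insensitive, so normalize keys before the lookup.
    lowered = {str(key).lower(): value for key, value in headers.items()}
    return lowered.get(CORRELATION_HEADER) or fallback_id


def outgoing_headers(request_id, extra=None):
    """Build headers for a downstream call that carry the same request ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = request_id
    return headers


def log_line(request_id, message):
    """Serialize one log line that always includes the request ID."""
    return json.dumps({"requestId": request_id, "message": message})
```

In a handler, the fallback would typically be context.aws_request_id, and the dictionary from outgoing_headers would be passed to whatever HTTP client makes the downstream call.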
Metrics That Matter
- Error rate (percentage of failed requests)
- Latency (p50, p95, p99)
- Throttles (invocations rejected by concurrency or rate limits)
- Cost drivers (duration, memory, external calls)
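As an illustration of how the first two metrics are derived from raw samples, here is a small sketch that computes error rate and latency percentiles. It uses the nearest-rank percentile method, which is one common choice; monitoring backends may interpolate differently. The function names and the shape of the input are assumptions made for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so pct=0 still returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, failed_count):
    """Summarize one window of requests: error rate plus p50/p95/p99 latency."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("summarize() needs at least one request")
    return {
        "error_rate_pct": 100 * failed_count / total,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Percentiles are preferred over averages here because a handful of very slow requests can hide behind a healthy-looking mean.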
Tracing Strategy
- Sample a percentage of requests
- Always trace errors
- Add annotations (userId, region, tier)
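The three rules above can be sketched as a single sampling decision. This is a hedged example, not the behavior of any particular tracing service: should_trace and trace_annotations are hypothetical names, the 5% default rate is an arbitrary assumption, and the decision is assumed to be made once the request outcome is known. Hashing the request ID keeps the decision stable, so every service that sees the same ID makes the same choice.

```python
import hashlib


def should_trace(request_id, is_error, sample_rate=0.05):
    """Decide whether to keep the trace for this request."""
    if is_error:
        return True  # errors are always traced, regardless of the sample rate
    # Map the request ID to a stable bucket in [0, 1) using the first 4 hash bytes.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


def trace_annotations(user_id, region, tier):
    """Collect the searchable annotations suggested in the list above."""
    return {"userId": user_id, "region": region, "tier": tier}
```

A deterministic hash is used instead of random.random() so that a request is either traced in every service it touches or in none of them.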
Multi-Cloud Observability
Patterns:
- Standardize log format (JSON)
- Use consistent metric names
- Mirror dashboards across clouds
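One way to apply the first two patterns is to route all logging and metric naming through a shared helper, so every function emits the same fields whichever cloud it runs on. The sketch below is illustrative only: the field names (cloud, region, service) and the service.metric naming scheme are assumptions, not a standard.

```python
import json
from datetime import datetime, timezone


def canonical_metric_name(service, metric):
    """Build one metric name used on every cloud, e.g. 'orders_api.latency_ms'."""
    return f"{service}.{metric}".lower().replace("-", "_")


def standard_log(cloud, region, service, level, message, **fields):
    """Return one JSON log line with the same core fields on every provider."""
    record = {
        **fields,  # extra fields first, so they cannot overwrite the core fields
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cloud": cloud,
        "region": region,
        "service": service,
        "level": level,
        "message": message,
    }
    return json.dumps(record)
```

For example, standard_log("aws", "us-east-1", "orders-api", "INFO", "Request received", requestId="r1") and the same call with "gcp" produce records that one dashboard query can filter by the cloud field.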
Alerting & SLOs
Define reliability targets and alert on breaches:
- API read: p99 latency < 100 ms
- API error rate: < 1%
- Queue delay: < 2 minutes
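The targets above can be written down as data and checked against measured values. The following is a minimal sketch with hypothetical metric keys (api_read_p99_ms, api_error_rate_pct, queue_delay_s); in practice the alerting rules would usually live in the monitoring service rather than in application code.

```python
# Targets from the list above; each metric must stay strictly below its limit.
SLO_TARGETS = {
    "api_read_p99_ms": 100,      # p99 < 100 ms
    "api_error_rate_pct": 1,     # error rate < 1%
    "queue_delay_s": 120,        # queue delay < 2 minutes
}


def find_breaches(measurements, targets=SLO_TARGETS):
    """Return the names of metrics whose measured value is at or above its limit."""
    breaches = []
    for name, limit in targets.items():
        value = measurements.get(name)
        # Metrics with no measurement are skipped rather than treated as breaches.
        if value is not None and value >= limit:
            breaches.append(name)
    return breaches
```

An alert would fire for every name returned. A queue delay of exactly 120 seconds counts as a breach because the target is a strict "less than".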
Project
Design an observability plan for a serverless API.
Deliverables:
- List the logs you will capture
- Define 3 key metrics and alert thresholds
- Describe how you will trace a request end-to-end
Email your work to [email protected].
References
- AWS CloudWatch: https://docs.aws.amazon.com/cloudwatch/
- AWS X-Ray: https://docs.aws.amazon.com/xray/
- Google Cloud Logging: https://cloud.google.com/logging/docs
- Cloud Monitoring: https://cloud.google.com/monitoring/docs
- Cloud Trace: https://cloud.google.com/trace/docs