Observability in Serverless

Overview of tracing, metrics, logging, and alerting strategies for serverless applications, with multi-cloud considerations and recommended tooling.

Simple Explanation

What it is

Observability is how you see what your serverless system is doing through logs, metrics, and traces.

Why we need it

When something breaks, you need evidence, not guesses. Observability makes that possible.

Benefits

Faster diagnosis of errors and slow requests.
Clear visibility across services and regions.
Better reliability because problems are found early.

Tradeoffs

More setup for dashboards and alerts.
Ongoing cost for log storage and tracing.

Real-world examples (architecture only)

Trace shows slow database call -> Fix query.
Alert triggers on error spike -> Rollback.

Diagrams: Trace flow, Alert flow

What This Lesson Covers

Logging strategy and structured logs
Metrics that matter (latency, errors, saturation)
Distributed tracing and correlation IDs
Alerting thresholds and SLOs
Multi-cloud visibility patterns

Core Concepts

Logs: Details of a single event
Metrics: Aggregated trends over time
Traces: End-to-end request flow
SLOs: Reliability targets
Alerting: Automated notifications when a metric crosses a threshold

Structured Logging (Python)

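import json
from datetime import datetime, timezone


def log_event(level, message, **data):
	# Emit one JSON object per line so log backends can parse and index the fields.
	print(json.dumps({
		# datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
		"timestamp": datetime.now(timezone.utc).isoformat(),
		"level": level,
		"message": message,
		**data,
	}))


def handler(event, context):
	# aws_request_id is set on the context object by the AWS Lambda runtime.
	log_event("INFO", "Request received", requestId=context.aws_request_id)
	# ...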

Correlation IDs

Add a request ID to every log line so you can trace one request across services.

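def get_request_id(event, context):
	# Headers can be missing or null, so fall back to an empty dict before the lookup.
	headers = event.get("headers") or {}
	return headers.get("x-request-id") or context.aws_request_id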
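Resolving the ID is only half of the pattern: it also has to appear on every log line and travel with each downstream call. The following is a minimal sketch of that, not a prescribed implementation. The helper names `make_logger` and `outbound_headers` are hypothetical, and the `x-request-id` header name is assumed to match the lookup above.

```python
import json

REQUEST_ID_HEADER = "x-request-id"  # assumed header name, same as the lookup above


def outbound_headers(request_id, extra=None):
    """Build headers for a downstream call so the next service sees the same ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers


def make_logger(request_id):
    """Return a log function that stamps every line with the request ID."""
    def log(level, message, **data):
        line = json.dumps({
            "level": level,
            "message": message,
            "requestId": request_id,
            **data,
        })
        print(line)
        return line
    return log
```

A handler would call `make_logger(get_request_id(event, context))` once at the start, then pass `outbound_headers(...)` to whatever HTTP client it uses for downstream requests.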

Metrics That Matter

Error rate (percentage of failed requests)
Latency (p50, p95, p99)
Throttles (requests rejected by rate or concurrency limits)
Cost drivers (duration, memory, external calls)
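To make the first two metrics concrete, here is a small sketch that derives an error rate and latency percentiles from raw request samples. It assumes each sample is a `(latency_ms, succeeded)` tuple and uses the nearest-rank percentile definition. Monitoring backends may interpolate differently, so treat the numbers as illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize a list of (latency_ms, succeeded) tuples."""
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, ok in requests if not ok)
    return {
        "error_rate_pct": 100 * failures / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Percentiles are reported instead of an average because a small number of very slow requests can hide behind a healthy-looking mean.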

Tracing Strategy

Sample a percentage of requests
Always trace errors
Add annotations (userId, region, tier)
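The three rules above can be sketched as a single keep-or-drop decision. This is an illustration under stated assumptions: the 5% rate is an arbitrary example value, the function names are hypothetical, and the decision is made after the request finishes, since "always trace errors" requires knowing the outcome.

```python
import random

SAMPLE_RATE = 0.05  # assumed example value: keep 5% of successful requests


def should_trace(is_error, sample_rate=SAMPLE_RATE, rng=random.random):
    """Keep every error trace; keep a random fraction of the rest."""
    if is_error:
        return True
    return rng() < sample_rate


def build_annotations(user_id, region, tier):
    """Key-value annotations attached to a trace so it can be filtered later."""
    return {"userId": user_id, "region": region, "tier": tier}
```

Passing `rng` as a parameter keeps the sampling decision deterministic in tests while defaulting to `random.random` in normal use.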

Multi-Cloud Observability

Patterns:

Standardize log format (JSON)
Use consistent metric names
Mirror dashboards across clouds
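One way to apply the first two patterns is a thin shared helper that every function imports, whichever cloud it runs on. The sketch below is an assumption-laden illustration: the field layout, the `ALLOWED_CLOUDS` set, and the `service.measurement` naming scheme are all hypothetical choices, not a standard.

```python
import json
from datetime import datetime, timezone

ALLOWED_CLOUDS = {"aws", "gcp"}  # assumed set of providers for this sketch


def metric_name(service, measurement):
    """Same metric name on every cloud; the provider belongs in a dimension, not the name."""
    return f"{service}.{measurement}".lower().replace(" ", "_")


def log_record(cloud, service, level, message, **fields):
    """Serialize a log line with one field layout regardless of provider."""
    if cloud not in ALLOWED_CLOUDS:
        raise ValueError(f"unknown cloud: {cloud}")
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cloud": cloud,
        "service": service,
        "level": level,
        "message": message,
        **fields,
    })
```

Because both clouds emit the same keys and metric names, a dashboard query written for one provider can be mirrored on the other by changing only the `cloud` filter.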

Alerting & SLOs

Define reliability targets and alert on breaches:

API read: p99 < 100ms
API error rate: < 1%
Queue delay: < 2 minutes
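The targets above can be expressed as data and checked against observed values. This is a minimal sketch, assuming the metric keys shown (which are hypothetical names) and treating any value at or above the limit as a breach, since each target is written as a strict "less than".

```python
SLO_TARGETS = {
    "api_read_p99_ms": 100,      # API read: p99 < 100ms
    "api_error_rate_pct": 1.0,   # API error rate: < 1%
    "queue_delay_seconds": 120,  # Queue delay: < 2 minutes
}


def find_breaches(observed, targets=SLO_TARGETS):
    """Return the sorted names of metrics whose observed value is not below its target."""
    return sorted(
        name
        for name, limit in targets.items()
        if name in observed and observed[name] >= limit
    )
```

For example, an observed p99 of 140 ms with a 0.4% error rate would report only `api_read_p99_ms` as breached, and that name is what an alert would carry.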

Project

Design an observability plan for a serverless API.

Deliverables:

List the logs you will capture
Define 3 key metrics and alert thresholds
Describe how you will trace a request end-to-end

Email your work to [email protected].

References

AWS CloudWatch: https://docs.aws.amazon.com/cloudwatch/
AWS X-Ray: https://docs.aws.amazon.com/xray/
Google Cloud Logging: https://cloud.google.com/logging/docs
Cloud Monitoring: https://cloud.google.com/monitoring/docs
Cloud Trace: https://cloud.google.com/trace/docs
