Distributed Tracing: X-Ray (AWS) & Cloud Trace (GCP)
Tracing is observability for request flows. When a serverless function calls multiple services (databases, APIs, storage), distributed tracing shows the complete journey: where time is spent, where errors occur, and which services are slow. Both AWS and GCP provide managed tracing platforms.
Simple Explanation
What it is
Tracing follows a single request as it moves through many services, so you can see the full path and timing.
Why we need it
In serverless, one request can touch five or ten services. Without tracing, you only see isolated logs and cannot connect the dots.
Benefits
- End-to-end visibility across services.
- Pinpoints bottlenecks quickly.
- Speeds up incident response by showing where failures happen.
Tradeoffs
- Extra setup for instrumentation.
- Sampling means you may not see every request.
Real-world examples (architecture only)
- Checkout request -> API -> Lambda -> Database -> Queue.
- Image upload -> Storage -> Function -> Thumbnail -> Notify.
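To make this concrete, here is a minimal sketch of how the checkout flow decomposes into one trace of nested spans. It uses the OpenTelemetry SDK with a console exporter purely for illustration; the span names are invented for this example.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")
with tracer.start_as_current_span("checkout"):                  # API receives the request
    with tracer.start_as_current_span("lambda_handler"):        # function runs
        with tracer.start_as_current_span("db_write"):          # database call
            pass
        with tracer.start_as_current_span("queue_publish"):     # hand off to a queue
            pass
# Each span records its own start time and duration; together they form one trace.
Running this prints each span with its parent span ID, which is exactly the parent/child structure that tracing UIs visualize as a timeline.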
Part 1: AWS X-Ray
What Is X-Ray?
AWS X-Ray traces requests across multiple services:
Request hits API Gateway
↓ (X-Ray records timing, duration, errors)
Lambda calls DynamoDB
↓ (X-Ray records DynamoDB latency)
Lambda calls S3
↓ (X-Ray records S3 latency)
Response returned
X-Ray shows: Full service path + timing for each hop
Enable X-Ray
Step 1: Update IAM Role
Give the function permission to write traces and turn on active tracing:
HelloWorld:
  Type: AWS::Serverless::Function
  Properties:
    Tracing: Active   # SAM shorthand for the Lambda tracing config (Mode: Active)
    Policies:
      - Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - xray:PutTraceSegments
              - xray:PutTelemetryRecords
            Resource: '*'
Step 2: Install X-Ray SDK
pip install aws-xray-sdk
Step 3: Instrument Code
import json
import os
import boto3
from aws_xray_sdk.core import patch_all, xray_recorder
patch_all()  # patches boto3 (and other supported libraries) so AWS calls appear as subsegments
ddb = boto3.client("dynamodb")
s3 = boto3.client("s3")
def handler(event, context):
    # In Lambda the top-level segment is managed by the service and cannot be
    # annotated directly, so custom annotations go on a subsegment.
    with xray_recorder.in_subsegment("handler_work") as subsegment:
        subsegment.put_annotation("userId", event.get("userId"))
        subsegment.put_annotation("environment", os.environ.get("ENV", "dev"))
        dynamo_result = ddb.get_item(
            TableName=os.environ.get("TABLE_NAME"),
            Key={"id": {"S": event.get("id")}},
        )
        s3.get_object(
            Bucket=os.environ.get("BUCKET"),
            Key=event.get("id"),
        )
        subsegment.put_metadata("response", {"dynamoDb": dynamo_result, "s3": "omitted"})
    return {"statusCode": 200, "body": json.dumps({"success": True})}
X-Ray Service Map
The service map is a visual representation of your architecture generated from trace data: each node is a service (API Gateway, Lambda, DynamoDB, S3) and each edge shows the latency and error rate of calls between them.
Query Traces
Filter and find specific traces:
service("myfunction") AND http.status >= 400
// Finds traces where function returned 4xx or 5xx
http.url CONTAINS "/orders" AND http.method = "POST"
// Finds POST requests to /orders
annotation.userId = "user-123"
// Finds traces for specific user
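The same filters work programmatically. A sketch using boto3's X-Ray client (assumes default credentials and region; the one-hour time window is arbitrary):
import datetime
import boto3
xray = boto3.client("xray")
now = datetime.datetime.utcnow()
response = xray.get_trace_summaries(
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    FilterExpression='annotation.userId = "user-123"',
)
for summary in response.get("TraceSummaries", []):
    # each summary carries the trace ID, total duration, and error flags
    print(summary["Id"], summary.get("Duration"), summary.get("HasError"))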
Performance Analysis
- Go to X-Ray Console
- Click Service Graph
- Hover over a connection to see response times
- Click service to drill into specific traces
- Identify slow DynamoDB queries, S3 operations, etc.
Error Investigation
- Filter: error = true
- Click a failed trace
- See the stack trace, error message, and exception details
Custom Annotations
Add metadata for filtering:
# In Lambda, annotations go on a subsegment (the Lambda-managed segment is read-only)
with xray_recorder.in_subsegment("order_processing") as subsegment:
    subsegment.put_annotation("orderId", event.get("orderId"))
    subsegment.put_annotation("region", os.environ.get("AWS_REGION"))
    subsegment.put_annotation("tier", event.get("userTier"))  # premium/standard/free
# Later: filter traces with annotation.tier = "premium"
Sampling Strategy
X-Ray's default rule samples the first request each second plus 5% of any additional requests. For high-volume apps, define your own rule:
SamplingRule:
  Type: AWS::XRay::SamplingRule
  Properties:
    SamplingRule:
      RuleName: MyRule
      Priority: 100
      Version: 1
      ReservoirSize: 100      # Always trace up to 100 requests/sec
      FixedRate: 0.05         # Then trace 5% of the remainder
      URLPath: '/api/*'
      Host: 'api.example.com'
      HTTPMethod: 'GET'
      ServiceType: '*'
      ServiceName: 'myapp'
      ResourceARN: '*'
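Centralized rules like the one above are evaluated automatically inside Lambda. Outside Lambda, the Python SDK can also be given local rules; a sketch, assuming the documented version-2 local-rule format:
from aws_xray_sdk.core import xray_recorder
# Local sampling rules: trace the first matching request each second, then 5% of the rest.
local_rules = {
    "version": 2,
    "rules": [
        {
            "description": "API traffic",
            "host": "*",
            "http_method": "*",
            "url_path": "/api/*",
            "fixed_target": 1,
            "rate": 0.05,
        }
    ],
    "default": {"fixed_target": 1, "rate": 0.05},
}
xray_recorder.configure(sampling_rules=local_rules)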
Lambda Insights (Lightweight Alternative)
If X-Ray is too heavy, use Lambda Insights:
HelloWorld:
  Type: AWS::Serverless::Function
  Properties:
    Policies:
      - CloudWatchLambdaInsightsExecutionRolePolicy   # lets the extension publish metrics
    Layers:
      # AWS-published Lambda Insights extension layer; replace the version (1) with
      # the current LambdaInsightsExtension version for your region
      - !Sub 'arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:1'
Provides: CPU usage, memory allocation, duration, cold starts (without full tracing).
Part 2: Google Cloud Trace
What Is Cloud Trace?
Google Cloud Trace captures latency data for requests to Cloud Functions and connected services (incoming requests are sampled automatically; downstream calls appear once instrumented):
Cloud Function receives request
↓ (Cloud Trace auto-captures)
Calls Firestore
↓ (Firestore latency recorded)
Calls Cloud Storage
↓ (Cloud Storage latency recorded)
Returns response
Cloud Trace shows: Service map + latencies for every hop
Enable Cloud Trace
Step 1: Install SDK
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-gcp-trace
Step 2: Initialize at Function Startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
import functions_framework
from google.cloud import firestore
from google.cloud import storage
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
tracer = trace.get_tracer(__name__)
db = firestore.Client()
storage_client = storage.Client()
@functions_framework.http
def process_order(request):
    payload = request.get_json(silent=True) or {}
    order_id = payload.get("id")
    with tracer.start_as_current_span("process_order"):
        doc = db.collection("orders").document(order_id).get()
        bucket = storage_client.bucket("order-files")
        bucket.blob(f"{order_id}.json").download_as_bytes()
    return ({"success": True, "order": doc.to_dict()}, 200)
Manual Tracing
Add custom spans for application logic:
from opentelemetry import trace
import functions_framework
# Reuses the tracer provider and Firestore client (db) configured at startup above.
tracer = trace.get_tracer(__name__)
def validate_order(payload):
    if "id" not in payload:
        raise ValueError("order payload must include an id")
@functions_framework.http
def custom_tracing(request):
    payload = request.get_json(silent=True) or {}
    with tracer.start_as_current_span("process_order"):
        with tracer.start_as_current_span("validate_order"):
            validate_order(payload)
        with tracer.start_as_current_span("save_order"):
            db.collection("orders").add(payload)
    return ({"success": True}, 200)
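If a step fails, the span can record the exception so the trace view flags it. A small sketch using the standard OpenTelemetry status API (the risky_step function is invented for illustration):
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
def risky_step(payload):
    with tracer.start_as_current_span("risky_step") as span:
        try:
            if "id" not in payload:
                raise ValueError("order payload is missing an id")
        except ValueError as exc:
            span.record_exception(exc)  # attaches the exception as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise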
View Traces in Cloud Console
- Open Cloud Trace in Cloud Console
- Select Trace List
- See recent traces with latency
- Click trace to drill down:
- Firestore queries and latency
- Cloud Storage operations
- Cloud Function execution time
- Network latency
Trace Details
Each trace shows a timeline of spans: the root request, each downstream call (Firestore, Cloud Storage), per-span latency, and any custom attributes you attached.
Performance Analysis
- Go to Cloud Trace Analysis Reports
- View latency percentiles (median, p95, p99)
- Click on slow traces (p99) to understand why
- Look for patterns (slow on certain input sizes, users, etc.)
Query & Filter
Use Cloud Logging to find specific traces:
resource.type="cloud_function"
resource.labels.function_name="processOrder"
httpRequest.latency > "1s"
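The same filter can be run from code with the Cloud Logging client library; a sketch assuming google-cloud-logging is installed and default credentials are available:
from google.cloud import logging as cloud_logging
client = cloud_logging.Client()
log_filter = (
    'resource.type="cloud_function" '
    'resource.labels.function_name="processOrder" '
    'httpRequest.latency > "1s"'
)
for entry in client.list_entries(filter_=log_filter, page_size=10):
    # entry.trace holds the Cloud Trace resource name, linking this log line to its trace
    print(entry.timestamp, entry.trace)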
Sampling Rate
Balance cost and visibility by configuring one sampler on the TracerProvider (pick a single option below):
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Lightweight sampling: 5% of requests
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.05)))
# High-traffic app: sample 1%
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.01)))
# Development: trace everything
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(1.0)))
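When your function is called by services that already propagate trace context, it is common to wrap the ratio sampler in ParentBased so child spans follow the caller's sampling decision; a sketch:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Respect the caller's decision when trace context is propagated in;
# otherwise sample 5% of new (root) traces.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))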
Cloud Profiler
For deeper performance analysis, use Cloud Profiler (CPU/memory):
# pip install google-cloud-profiler
import googlecloudprofiler
googlecloudprofiler.start(
    service='order-processing',
    service_version='1.0.0',
)
Auto-profiles CPU and memory, shows which functions take most time.
AWS X-Ray vs. GCP Cloud Trace
| Feature | X-Ray | Cloud Trace |
|---|---|---|
| Auto-instrumentation | SDK required; patch_all() covers supported clients | Incoming requests auto-traced; client calls need OTel instrumentation |
| Setup | Install SDK, wrap clients | Import agent at startup |
| Service map | Generated once tracing is enabled | Auto-generated from traces |
| Custom spans | Manual annotation | Manual span creation |
| Sampling | Configurable rules | Per-second rate or percentage |
| Pricing | Free tier: 100k traces/month | Free tier: 2.5M trace spans/month |
| Query language | Custom filter syntax | Cloud Logging syntax |
| Integration | CloudWatch, SNS alarms | Cloud Logging, Cloud Alerting |
| Performance profiling | Lambda Insights | Cloud Profiler |
| Setup complexity | Medium (manual wrapping) | Low (auto-instrumentation) |
Key Differences
- Auto-instrumentation: Cloud Trace samples incoming requests automatically; X-Ray needs its SDK in your code, though patch_all() covers most AWS clients in one call
- Cost: Cloud Trace cheaper for high-volume apps (free tier is larger)
- Setup: Cloud Trace is simpler out-of-the-box; X-Ray offers more control
- Sampling: X-Ray uses centralized rules (more flexible); Cloud Trace via OpenTelemetry uses a simple trace-ratio sampler
Best Practices (Both Platforms)
- Sample strategically — 100% sampling = high cost; 1-10% gives good visibility
- Annotate contextually — Include user ID, request type, environment
- Don't trace sensitive data — Exclude passwords, tokens, PII
- Correlate with logs — Use request/trace IDs to link traces to log entries (see the sketch after this list)
- Alert on latency — Set alarms on p99 latency trends
- Review traces regularly — Look for slow services, cascading failures
- Mind retention — Both platforms auto-delete old trace data after about 30 days; export anything you need to keep longer
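For the log-correlation practice above, the key is writing the active trace ID into every structured log line. A minimal sketch covering both platforms; the field names other than logging.googleapis.com/trace are conventions, and the GOOGLE_CLOUD_PROJECT variable is an assumption you may need to replace with your project ID:
import json
import os
from opentelemetry import trace
def log_with_trace(message, **fields):
    # AWS Lambda exposes the current X-Ray trace header in this environment variable.
    xray_header = os.environ.get("_X_AMZN_TRACE_ID")
    # OpenTelemetry exposes the current span context in code.
    ctx = trace.get_current_span().get_span_context()
    otel_trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else None
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project")  # assumption: set or replace
    entry = {
        "message": message,
        "xrayTraceHeader": xray_header,
        # Cloud Logging links a JSON log entry to Cloud Trace via this well-known field.
        "logging.googleapis.com/trace": (
            f"projects/{project}/traces/{otel_trace_id}" if otel_trace_id else None
        ),
        **fields,
    }
    print(json.dumps(entry))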
Hands-On: Multi-Cloud Tracing
AWS X-Ray
- Deploy function with tracing enabled:
sam deploy --template-file template.yaml
- Invoke function multiple times:
aws lambda invoke \
  --function-name ProcessOrder \
  --cli-binary-format raw-in-base64-out \
  --payload '{"orderId":"123"}' \
  response.json
- View service map in X-Ray Console
- Click on traces to see latency breakdown
Google Cloud Trace
- Deploy function:
gcloud functions deploy processOrder \
  --runtime python312 \
  --trigger-http \
  --entry-point process_order \
  --allow-unauthenticated
- Invoke function:
curl https://YOUR-FUNCTION-URL -X POST -H "Content-Type: application/json" -d '{"orderId":"123"}'
- View traces in Cloud Trace > Trace List
- Click trace to see breakdown of Firestore/Cloud Storage latencies
Key Takeaway
Tracing turns distributed systems from black boxes into glass boxes. Instrument your functions, sample smartly, and review traces regularly to understand where requests spend time and where they fail. Both X-Ray and Cloud Trace provide the visibility—your job is interpreting what you see and acting on it.