Distributed Tracing: X-Ray (AWS) & Cloud Trace (GCP)
Tracing is observability for request flows. When a serverless function calls multiple services (databases, APIs, storage), distributed tracing shows the complete journey: where time is spent, where errors occur, and which services are slow. Both AWS and GCP provide managed tracing platforms.
Simple Explanation
What it is
Tracing follows a single request as it moves through many services, so you can see the full path and timing.
Why we need it
In serverless, one request can touch five or ten services. Without tracing, you only see isolated logs and cannot connect the dots.
Benefits
- End-to-end visibility across services.
- Pinpoints bottlenecks quickly.
- Speeds up incident response by showing where failures happen.
Tradeoffs
- Extra setup for instrumentation.
- Sampling means you may not see every request.
Real-world examples (architecture only)
- Checkout request -> API -> Lambda -> Database -> Queue.
- Image upload -> Storage -> Function -> Thumbnail -> Notify.
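To make this concrete, here is a minimal sketch of how the checkout flow decomposes into one trace of nested spans. It uses the OpenTelemetry SDK with a console exporter purely for illustration; the span names are invented for this example.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-demo")
with tracer.start_as_current_span("checkout"):                  # API receives the request
    with tracer.start_as_current_span("lambda_handler"):        # function runs
        with tracer.start_as_current_span("db_write"):          # database call
            pass
        with tracer.start_as_current_span("queue_publish"):     # hand off to a queue
            pass
# Each span records its own start time and duration; together they form one trace.
Running this prints each span with its parent span ID, which is exactly the parent/child structure that tracing UIs visualize as a timeline.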
Part 1: AWS X-Ray
What Is X-Ray?
AWS X-Ray traces requests across multiple services:
Request hits API Gateway
↓ (X-Ray records timing, duration, errors)
Lambda calls DynamoDB
↓ (X-Ray records DynamoDB latency)
Lambda calls S3
↓ (X-Ray records S3 latency)
Response returned
X-Ray shows: Full service path + timing for each hop
Enable X-Ray
Step 1: Update IAM Role
Give the function permission to write traces and turn on active tracing:
HelloWorld:
  Type: AWS::Serverless::Function
  Properties:
    Tracing: Active   # SAM shorthand for the Lambda tracing config (Mode: Active)
    Policies:
      - Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - xray:PutTraceSegments
              - xray:PutTelemetryRecords
            Resource: '*'
Step 2: Install X-Ray SDK
pip install aws-xray-sdk
Step 3: Instrument Code
import json
import os
import boto3
from aws_xray_sdk.core import patch_all, xray_recorder
patch_all()  # patches boto3 (and other supported libraries) so AWS calls appear as subsegments
ddb = boto3.client("dynamodb")
s3 = boto3.client("s3")
def handler(event, context):
    # In Lambda the top-level segment is managed by the service and cannot be
    # annotated directly, so custom annotations go on a subsegment.
    with xray_recorder.in_subsegment("handler_work") as subsegment:
        subsegment.put_annotation("userId", event.get("userId"))
        subsegment.put_annotation("environment", os.environ.get("ENV", "dev"))
        dynamo_result = ddb.get_item(
            TableName=os.environ.get("TABLE_NAME"),
            Key={"id": {"S": event.get("id")}},
        )
        s3.get_object(
            Bucket=os.environ.get("BUCKET"),
            Key=event.get("id"),
        )
        subsegment.put_metadata("response", {"dynamoDb": dynamo_result, "s3": "omitted"})
    return {"statusCode": 200, "body": json.dumps({"success": True})}
X-Ray Service Map
The service map is a visual representation of your architecture generated from trace data: each node is a service (API Gateway, Lambda, DynamoDB, S3) and each edge shows the latency and error rate of calls between them.
Query Traces
Filter and find specific traces:
service("myfunction") AND http.status >= 400
// Finds traces where function returned 4xx or 5xx
http.url CONTAINS "/orders" AND http.method = "POST"
// Finds POST requests to /orders
annotation.userId = "user-123"
// Finds traces for specific user
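The same filters work programmatically. A sketch using boto3's X-Ray client (assumes default credentials and region; the one-hour time window is arbitrary):
import datetime
import boto3
xray = boto3.client("xray")
now = datetime.datetime.utcnow()
response = xray.get_trace_summaries(
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    FilterExpression='annotation.userId = "user-123"',
)
for summary in response.get("TraceSummaries", []):
    # each summary carries the trace ID, total duration, and error flags
    print(summary["Id"], summary.get("Duration"), summary.get("HasError"))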
Performance Analysis
- Go to X-Ray Console
- Click Service Graph
- Hover over a connection to see response times
- Click service to drill into specific traces
- Identify slow DynamoDB queries, S3 operations, etc.
Error Investigation
- Filter: error = true
- Click a failed trace
- See the stack trace, error message, and exception details
Custom Annotations
Add metadata for filtering:
# In Lambda, annotations go on a subsegment (the Lambda-managed segment is read-only)
with xray_recorder.in_subsegment("order_processing") as subsegment:
    subsegment.put_annotation("orderId", event.get("orderId"))
    subsegment.put_annotation("region", os.environ.get("AWS_REGION"))
    subsegment.put_annotation("tier", event.get("userTier"))  # premium/standard/free
# Later: filter traces with annotation.tier = "premium"
Sampling Strategy
X-Ray's default rule samples the first request each second plus 5% of any additional requests. For high-volume apps, define your own rule:
SamplingRule:
  Type: AWS::XRay::SamplingRule
  Properties:
    SamplingRule:
      RuleName: MyRule
      Priority: 100
      Version: 1
      ReservoirSize: 100      # Always trace up to 100 requests/sec
      FixedRate: 0.05         # Then trace 5% of the remainder
      URLPath: '/api/*'
      Host: 'api.example.com'
      HTTPMethod: 'GET'
      ServiceType: '*'
      ServiceName: 'myapp'
      ResourceARN: '*'
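Centralized rules like the one above are evaluated automatically inside Lambda. Outside Lambda, the Python SDK can also be given local rules; a sketch, assuming the documented version-2 local-rule format:
from aws_xray_sdk.core import xray_recorder
# Local sampling rules: trace the first matching request each second, then 5% of the rest.
local_rules = {
    "version": 2,
    "rules": [
        {
            "description": "API traffic",
            "host": "*",
            "http_method": "*",
            "url_path": "/api/*",
            "fixed_target": 1,
            "rate": 0.05,
        }
    ],
    "default": {"fixed_target": 1, "rate": 0.05},
}
xray_recorder.configure(sampling_rules=local_rules)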
Lambda Insights (Lightweight Alternative)
If X-Ray is too heavy, use Lambda Insights:
HelloWorld:
  Type: AWS::Serverless::Function
  Properties:
    Policies:
      - CloudWatchLambdaInsightsExecutionRolePolicy   # lets the extension publish metrics
    Layers:
      # AWS-published Lambda Insights extension layer; replace the version (1) with
      # the current LambdaInsightsExtension version for your region
      - !Sub 'arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:1'
Provides: CPU usage, memory allocation, duration, cold starts (without full tracing).
Part 2: Google Cloud Trace
What Is Cloud Trace?
Google Cloud Trace captures latency data for requests to Cloud Functions and connected services (incoming requests are sampled automatically; downstream calls appear once instrumented):
Cloud Function receives request
↓ (Cloud Trace auto-captures)
Calls Firestore
↓ (Firestore latency recorded)
Calls Cloud Storage
↓ (Cloud Storage latency recorded)
Returns response
Cloud Trace shows: Service map + latencies for every hop
Enable Cloud Trace
Step 1: Install SDK
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-gcp-trace
Step 2: Initialize at Function Startup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
import functions_framework
from google.cloud import firestore
from google.cloud import storage
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(CloudTraceSpanExporter())
)
tracer = trace.get_tracer(__name__)
db = firestore.Client()
storage_client = storage.Client()
@functions_framework.http
def process_order(request):
    payload = request.get_json(silent=True) or {}
    order_id = payload.get("id")
    with tracer.start_as_current_span("process_order"):
        doc = db.collection("orders").document(order_id).get()
        bucket = storage_client.bucket("order-files")
        bucket.blob(f"{order_id}.json").download_as_bytes()
    return ({"success": True, "order": doc.to_dict()}, 200)
Manual Tracing
Add custom spans for application logic:
from opentelemetry import trace
import functions_framework
# Reuses the tracer provider and Firestore client (db) configured at startup above.
tracer = trace.get_tracer(__name__)
def validate_order(payload):
    if "id" not in payload:
        raise ValueError("order payload must include an id")
@functions_framework.http
def custom_tracing(request):
    payload = request.get_json(silent=True) or {}
    with tracer.start_as_current_span("process_order"):
        with tracer.start_as_current_span("validate_order"):
            validate_order(payload)
        with tracer.start_as_current_span("save_order"):
            db.collection("orders").add(payload)
    return ({"success": True}, 200)
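If a step fails, the span can record the exception so the trace view flags it. A small sketch using the standard OpenTelemetry status API (the risky_step function is invented for illustration):
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
def risky_step(payload):
    with tracer.start_as_current_span("risky_step") as span:
        try:
            if "id" not in payload:
                raise ValueError("order payload is missing an id")
        except ValueError as exc:
            span.record_exception(exc)  # attaches the exception as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise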
View Traces in Cloud Console
- Open Cloud Trace in Cloud Console
- Select Trace List
- See recent traces with latency
- Click trace to drill down:
- Firestore queries and latency
- Cloud Storage operations
- Cloud Function execution time
- Network latency
Trace Details
Each trace shows a timeline of spans: the root request, each downstream call (Firestore, Cloud Storage), per-span latency, and any custom attributes you attached.
Performance Analysis
- Go to Cloud Trace Analysis Reports
- View latency percentiles (median, p95, p99)
- Click on slow traces (p99) to understand why
- Look for patterns (slow on certain input sizes, users, etc.)
Query & Filter
Use Cloud Logging to find specific traces:
resource.type="cloud_function"
resource.labels.function_name="processOrder"
httpRequest.latency > "1s"
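The same filter can be run from code with the Cloud Logging client library; a sketch assuming google-cloud-logging is installed and default credentials are available:
from google.cloud import logging as cloud_logging
client = cloud_logging.Client()
log_filter = (
    'resource.type="cloud_function" '
    'resource.labels.function_name="processOrder" '
    'httpRequest.latency > "1s"'
)
for entry in client.list_entries(filter_=log_filter, page_size=10):
    # entry.trace holds the Cloud Trace resource name, linking this log line to its trace
    print(entry.timestamp, entry.trace)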
Sampling Rate
Balance cost and visibility by configuring one sampler on the TracerProvider (pick a single option below):
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Lightweight sampling: 5% of requests
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.05)))
# High-traffic app: sample 1%
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.01)))
# Development: trace everything
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(1.0)))
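When your function is called by services that already propagate trace context, it is common to wrap the ratio sampler in ParentBased so child spans follow the caller's sampling decision; a sketch:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# Respect the caller's decision when trace context is propagated in;
# otherwise sample 5% of new (root) traces.
sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))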
Cloud Profiler
For deeper performance analysis, use Cloud Profiler (CPU/memory):
# pip install google-cloud-profiler
import googlecloudprofiler
googlecloudprofiler.start(
    service='order-processing',
    service_version='1.0.0',
)
Auto-profiles CPU and memory, shows which functions take most time.
AWS X-Ray vs. GCP Cloud Trace
| Feature | X-Ray | Cloud Trace |
|---|---|---|
| Auto-instrumentation | SDK required; patch_all() covers supported clients | Incoming requests auto-traced; client calls need OTel instrumentation |
| Setup | Install SDK, wrap clients | Import agent at startup |
| Service map | Generated once tracing is enabled | Auto-generated from traces |
| Custom spans | Manual annotation | Manual span creation |
| Sampling | Configurable rules | Per-second rate or percentage |
| Pricing | Free tier: 100k traces/month | Free tier: 2.5M trace spans/month |
| Query language | Custom filter syntax | Cloud Logging syntax |
| Integration | CloudWatch, SNS alarms | Cloud Logging, Cloud Alerting |
| Performance profiling | Lambda Insights | Cloud Profiler |
| Setup complexity | Medium (manual wrapping) | Low (auto-instrumentation) |
Key Differences
- Auto-instrumentation: Cloud Trace samples incoming requests automatically; X-Ray needs its SDK in your code, though patch_all() covers most AWS clients in one call
- Cost: Cloud Trace cheaper for high-volume apps (free tier is larger)
- Setup: Cloud Trace is simpler out-of-the-box; X-Ray offers more control
- Sampling: X-Ray uses centralized rules (more flexible); Cloud Trace via OpenTelemetry uses a simple trace-ratio sampler
Best Practices (Both Platforms)
- Sample strategically — 100% sampling = high cost; 1-10% gives good visibility
- Annotate contextually — Include user ID, request type, environment
- Don't trace sensitive data — Exclude passwords, tokens, PII
- Correlate with logs — Use request/trace IDs to link traces to log entries (see the sketch after this list)
- Alert on latency — Set alarms on p99 latency trends
- Review traces regularly — Look for slow services, cascading failures
- Mind retention — Both platforms auto-delete old trace data after about 30 days; export anything you need to keep longer
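For the log-correlation practice above, the key is writing the active trace ID into every structured log line. A minimal sketch covering both platforms; the field names other than logging.googleapis.com/trace are conventions, and the GOOGLE_CLOUD_PROJECT variable is an assumption you may need to replace with your project ID:
import json
import os
from opentelemetry import trace
def log_with_trace(message, **fields):
    # AWS Lambda exposes the current X-Ray trace header in this environment variable.
    xray_header = os.environ.get("_X_AMZN_TRACE_ID")
    # OpenTelemetry exposes the current span context in code.
    ctx = trace.get_current_span().get_span_context()
    otel_trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else None
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project")  # assumption: set or replace
    entry = {
        "message": message,
        "xrayTraceHeader": xray_header,
        # Cloud Logging links a JSON log entry to Cloud Trace via this well-known field.
        "logging.googleapis.com/trace": (
            f"projects/{project}/traces/{otel_trace_id}" if otel_trace_id else None
        ),
        **fields,
    }
    print(json.dumps(entry))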
Hands-On: Multi-Cloud Tracing
AWS X-Ray
- Deploy function with tracing enabled:
sam deploy --template-file template.yaml
- Invoke function multiple times:
aws lambda invoke \
  --function-name ProcessOrder \
  --cli-binary-format raw-in-base64-out \
  --payload '{"orderId":"123"}' \
  response.json
- View service map in X-Ray Console
- Click on traces to see latency breakdown
Google Cloud Trace
- Deploy function:
gcloud functions deploy processOrder \
  --runtime python312 \
  --trigger-http \
  --entry-point process_order \
  --allow-unauthenticated
- Invoke function:
curl https://YOUR-FUNCTION-URL -X POST -H "Content-Type: application/json" -d '{"orderId":"123"}'
- View traces in Cloud Trace > Trace List
- Click trace to see breakdown of Firestore/Cloud Storage latencies
Key Takeaway
Tracing turns distributed systems from black boxes into glass boxes. Instrument your functions, sample smartly, and review traces regularly to understand where requests spend time and where they fail. Both X-Ray and Cloud Trace provide the visibility—your job is interpreting what you see and acting on it.