
Debugging: AWS & GCP Strategies

Serverless applications are hard to debug—you can't SSH into the runtime, can't attach to a process. Instead, you rely on logs, metrics, and distributed tracing. Both AWS and GCP offer tools and best practices for finding and fixing issues quickly.


Simple Explanation

What it is

Debugging is the process of finding the real cause of a problem and proving the fix works.

Why we need it

In serverless, failures are often spread across services. You need a structured approach so you do not guess and hope.

Benefits

  • Faster root-cause discovery when incidents happen.
  • Less downtime because fixes are targeted.
  • Better confidence when shipping changes.

Tradeoffs

  • More tooling to learn (logs, traces, metrics).
  • Requires discipline to reproduce issues properly.

Real-world examples (architecture only)

  • Bug in payment flow -> Trace shows failure in third-party API.
  • Timeout spike -> Logs show slow database query.

Part 1: AWS Debugging

Debugging Strategies

1. Reproduction

Can you reproduce the issue locally?

# Get the exact event from CloudWatch
aws logs get-log-events \
--log-group-name /aws/lambda/myfunction \
--log-stream-name '2026/02/08/[$LATEST]abc123'

# Copy event JSON
# Test locally with SAM
sam local invoke MyFunction -e event.json
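
Once you have pulled the log events, you still need the event JSON itself. A minimal sketch of recovering it, assuming your handler printed the incoming event as a JSON line (the `extract_event` helper and the sample messages are hypothetical):

```python
import json


def extract_event(log_messages):
    """Return the last JSON object found among log messages.

    Assumes the handler logged the incoming event with
    print(json.dumps(event)) or similar.
    """
    event = None
    for message in log_messages:
        try:
            parsed = json.loads(message)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            event = parsed
    return event


# "message" fields as returned by get-log-events
messages = [
    "START RequestId: abc123",
    '{"id": "123", "action": "checkout"}',
    "END RequestId: abc123",
]
print(extract_event(messages))  # the event dict, ready to save as event.json
```

Save the result to event.json and feed it to `sam local invoke`.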

2. Isolation

Test components independently:

# Test Lambda handler separately
from index import handler

event = {"id": "123"}
print(handler(event, None))

# Test database connection
from db import connect_db

try:
    connect_db()
    print("Connected")
except Exception as exc:
    print(f"Connection failed: {exc}")

# Test external API
from api import call_api

print(call_api("https://api.example.com"))

3. Add Logging

Methodically add logs to narrow down the issue:

import json


def handler(event, context):
    print("1. Entry - Event:", event)

    data = json.loads(event.get("body") or "{}")
    print("2. Parsed - Data:", data)

    result = database.query(data)
    print("3. Query - Result:", result)

    formatted = format_result(result)
    print("4. Formatted:", formatted)

    return formatted

Gradually remove logs as you understand the flow.
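
Print statements work, but leveled logging lets you turn the detail up or down without editing code. A sketch using the standard logging module; note that the `LOG_LEVEL` environment variable is an illustrative convention, not a platform built-in, and inside Lambda (which pre-configures a root handler) you would set the logger's level directly instead of calling basicConfig:

```python
import json
import logging
import os
import sys

# LOG_LEVEL is an assumed env var: flip it to DEBUG while investigating
logging.basicConfig(
    stream=sys.stdout,
    level=os.environ.get("LOG_LEVEL", "INFO"),
    format="%(levelname)s %(message)s",
)
logger = logging.getLogger("handler")


def handler(event, context):
    logger.debug("entry event=%s", event)  # hidden unless LOG_LEVEL=DEBUG
    data = json.loads(event.get("body") or "{}")
    logger.info("parsed keys=%s", list(data))
    return {"statusCode": 200, "body": json.dumps(data)}


print(handler({"body": '{"id": 1}'}, None))
```

Deploy with LOG_LEVEL=DEBUG while chasing a bug, then flip back to INFO instead of deleting log lines.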

4. Breakpoint Debugging

Full IDE debugging with SAM:

sam local start-api --debug-port 5890

In VS Code, attach debugger:

// .vscode/launch.json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach to SAM (Python)",
      "type": "python",
      "request": "attach",
      "connect": {
        "host": "localhost",
        "port": 5890
      }
    }
  ]
}

Common Issues & Fixes

Lambda Timeout

Function takes too long:

Symptoms:

  • "Task timed out after X seconds"
  • Incomplete logs

Debug:

import time


def handler(event, context):
    start = time.time()

    # Your code, with checkpoints as needed:
    print("Duration so far:", int((time.time() - start) * 1000), "ms")

    print("Final duration:", int((time.time() - start) * 1000), "ms")

Fix:

  • Increase timeout in Lambda config
  • Optimize slow operations
  • Use async/await properly
  • Parallelize operations
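
The last bullet is often the biggest win: independent calls run sequentially add their latencies together. A sketch of parallelizing them with concurrent.futures, where `fetch` is a stand-in for any I/O-bound call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(source):
    # Stand-in for an independent I/O call (HTTP request, DB query, ...)
    time.sleep(0.1)
    return f"{source}-result"


sources = ["users", "orders", "inventory"]

start = time.time()
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    results = list(pool.map(fetch, sources))
elapsed = time.time() - start

print(results)
print(f"{elapsed:.2f}s")  # roughly the slowest single call, not the sum
```

With three 100 ms calls the wall-clock time drops from about 300 ms to about 100 ms.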

Out of Memory

Function uses too much memory:

Symptoms:

  • "Process exited before completing request"
  • Sudden termination

Debug:

import resource

# On Linux (the Lambda runtime), ru_maxrss is reported in kilobytes
usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Memory usage (KB):", usage_kb)

Fix:

  • Stream large files instead of loading in memory
  • Release unused references
  • Increase Lambda memory allocation
  • Use appropriate data structures
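
To make the first bullet concrete: iterating a file yields one line at a time, so memory stays flat no matter how large the object grows, while read() holds the entire payload at once. A self-contained sketch over a generated sample file:

```python
import os
import tempfile

# Build a sample "large" file
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    for i in range(1000):
        f.write(f"row-{i},{i}\n")

# Bad: open(path).read() loads the whole file into memory at once.
# Good: iterating processes one line at a time, so peak memory is
# one line, not the whole file.
total = 0
with open(path) as f:
    for line in f:
        total += int(line.rstrip().split(",")[1])

print(total)  # 499500, i.e. sum(range(1000))
```

The same pattern applies to S3/GCS objects: read them as streams rather than buffering the full body.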

Permission Denied

IAM role lacks permissions:

Symptoms:

  • "User: arn:aws:iam::... is not authorized to perform: s3:GetObject"

Debug: Check Lambda execution role:

aws iam get-role-policy \
--role-name MyLambdaRole \
--policy-name S3Access

Fix: Add permission to role:

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-bucket/*"
}

Cold Start Delays

First invocation is slow:

Symptoms:

  • First request takes 1-2 seconds
  • Subsequent requests are fast

Debug:

import time

_cold = True


def handler(event, context):
    global _cold
    # True only on the first invocation in a container; duration is
    # typically higher there than on warm starts
    print("Cold start:", _cold)
    _cold = False

Fix:

  • Optimize code bundle size (remove unused dependencies)
  • Prefer lightweight runtimes (Python over Java)
  • Reduce VPC overhead (if using VPC)
  • Set provisioned concurrency for critical functions
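
One of the cheapest optimizations is moving expensive initialization to module scope, so it runs once per container instead of once per request. A sketch with a stand-in client (`make_client` is hypothetical; in practice it would be a boto3 client or a DB connection pool):

```python
import time


def make_client():
    # Stand-in for an expensive client (boto3 session, DB pool, SDK init, ...)
    time.sleep(0.05)
    return object()


# Module scope: runs once per container, during the cold start only
CLIENT = make_client()


def handler(event, context):
    # Warm invocations reuse CLIENT with no per-request setup cost
    return id(CLIENT)


first = handler({}, None)
second = handler({}, None)
print(first == second)  # True: the same client serves every invocation
```

The flip side is to keep module scope lean: anything initialized there adds to the cold-start time itself, so only hoist work you will actually reuse.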

DynamoDB Not Found

Cannot access DynamoDB table:

Symptoms:

  • "ResourceNotFoundException"
  • "Requested resource not found"

Debug:

# Verify table exists
aws dynamodb describe-table --table-name Items

# Verify Lambda can access it (IAM check)
# Verify table name matches

Fix:

  • Check table name spelling (case-sensitive)
  • Verify Lambda IAM role has dynamodb:GetItem, etc.
  • Check table is in same region as Lambda

Debugging Tools

AWS X-Ray

Covered in detail in [Lesson 5]. X-Ray enables distributed tracing across services.

CloudWatch Logs Insights

Query logs to find patterns:

fields @timestamp, @message, @duration
| filter @message like /ERROR/
| stats count() as errors, avg(@duration) by @logStream
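
To see what that query computes, here is the same aggregation sketched in plain Python over made-up records (the `(logStream, message, duration)` rows stand in for parsed log events):

```python
import re
from collections import defaultdict

# (logStream, message, duration_ms) rows, standing in for parsed log events
records = [
    ("stream-a", "ERROR db timeout", 1200),
    ("stream-a", "ok", 90),
    ("stream-a", "ERROR db timeout", 1100),
    ("stream-b", "ERROR null id", 300),
]

stats = defaultdict(lambda: {"errors": 0, "total_ms": 0})
for stream, message, duration in records:
    if re.search(r"ERROR", message):  # the query's filter clause
        stats[stream]["errors"] += 1
        stats[stream]["total_ms"] += duration

# count() and avg(@duration), grouped by @logStream
for stream, s in sorted(stats.items()):
    print(stream, s["errors"], s["total_ms"] / s["errors"])
```

Grouping by @logStream is handy because each Lambda container writes its own stream, so a single bad container stands out immediately.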

AWS Lambda Insights

CloudWatch extension for performance:

  1. Add extension to Lambda
  2. View metrics: CPU, memory allocation, duration
  3. Identify performance bottlenecks

SAM Local Debugging

Debug locally before deployment:

# Run function locally with event
sam local invoke MyFunction -e event.json

# Start API locally and open a debug port
sam local start-api --debug-port 5890

# Attach IDE debugger to port 5890

Remote Debugging

Debug production issues with temporary logging:

import json
import os

DEBUG = os.environ.get("DEBUG") == "true"


def handler(event, context):
    if DEBUG:
        print("Full event:", json.dumps(event, indent=2))
        print("All env vars:", dict(os.environ))  # caution: may expose secrets
    # Your code

Enable for specific invocation:

# Note: --environment replaces the whole variable map; include every
# variable the function needs, not just DEBUG
aws lambda update-function-configuration \
--function-name MyFunction \
--environment Variables={DEBUG=true}

# Test
curl https://api.example.com/test

# Disable
aws lambda update-function-configuration \
--function-name MyFunction \
--environment Variables={DEBUG=false}
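
A caution on the handler above: dumping os.environ can leak credentials into logs that many people can read. A hedged sketch of masking likely-sensitive variables first (`redacted_env` and the keyword list are illustrative assumptions, not a standard API):

```python
import os

SENSITIVE = ("KEY", "SECRET", "TOKEN", "PASSWORD")


def redacted_env():
    """Copy of os.environ with likely-sensitive values masked."""
    return {
        name: "***" if any(word in name.upper() for word in SENSITIVE) else value
        for name, value in os.environ.items()
    }


os.environ["DB_PASSWORD"] = "hunter2"  # sample values for the demo
os.environ["STAGE"] = "prod"

env = redacted_env()
print(env["DB_PASSWORD"], env["STAGE"])  # *** prod
```

Print the redacted copy in your DEBUG branch instead of the raw environment.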

Part 2: GCP Debugging

GCP Debugging Strategies

1. Reproduction

Get the exact triggering event and test locally:

# View function execution logs
gcloud functions logs read my-function --limit 50

# Export specific log entries
gcloud logging read 'resource.type="cloud_function"' \
--format json > logs.json

# Test locally with Functions Framework
functions-framework --target my_function --debug

2. Isolation

Test components independently:

from db import connect_firestore
from api import call_api

try:
    connect_firestore()
    print("Connected")
except Exception as exc:
    print(f"Connection failed: {exc}")

print(call_api("https://api.example.com"))

3. Cloud Debugger

Cloud Debugger let you set snapshot-style breakpoints on a running function. Note: Google deprecated Cloud Debugger and shut it down in 2023, so the agent below only works on older projects; on current projects, lean on Cloud Logging, Cloud Trace, and Cloud Profiler instead.

import googleclouddebugger

googleclouddebugger.enable(
    module="my-function",
    version="1.0.0",
)


def my_function(request):
    print("Request:", request.get_json(silent=True))
    # Set breakpoints in Cloud Console
    return ("Hello", 200)

In Cloud Console, browse source code, set breakpoints, inspect variables.

4. Structured Logging for Debugging

Use JSON-formatted logs for powerful filtering:

import json
import uuid

import functions_framework
from google.cloud import logging as cloud_logging

logging_client = cloud_logging.Client()
log = logging_client.logger("debug-logs")


@functions_framework.http
def debug_demo(request):
    request_id = request.headers.get("x-request-id", str(uuid.uuid4()))

    log.log_struct({
        "requestId": request_id,
        "message": "Request received",
        "method": request.method,
        "path": request.path,
    }, severity="DEBUG")

    try:
        data = request.get_json(silent=True) or {}
        log.log_struct({
            "requestId": request_id,
            "message": "Body parsed",
            "data": data,
        }, severity="DEBUG")

        result = process_data(data)
        log.log_struct({
            "requestId": request_id,
            "message": "Processing complete",
            "result": result,
        }, severity="INFO")

        return (json.dumps(result), 200)
    except Exception as exc:
        log.log_struct({
            "requestId": request_id,
            "message": "Error occurred",
            "error": str(exc),
        }, severity="ERROR")

        return ({"error": str(exc)}, 500)

Cloud Logging Log Explorer

Filter and search logs in Cloud Console:

resource.type="cloud_function"
resource.labels.function_name="my-function"
severity="ERROR"
jsonPayload.requestId="abc-123"

Cloud Profiler

Identify performance bottlenecks:

import googlecloudprofiler

googlecloudprofiler.start(
    service="my-function",
    service_version="1.0",
)

Generate CPU and memory profiles automatically.


Common Issues & Fixes

Issue 1: Timeout

Function takes too long:

AWS Symptoms:

  • "Task timed out after X seconds"
  • CloudWatch shows incomplete logs

AWS Debug:

import time


def handler(event, context):
    start = time.time()

    # Your slow code

    print("Total time:", int((time.time() - start) * 1000), "ms")

GCP Symptoms:

  • "Error: function execution timeout"
  • Cloud Logging shows incomplete traces

GCP Debug:

import time
from datetime import datetime


def my_function(request):
    start = time.time()

    print("Starting at", datetime.utcnow().isoformat())
    # Your slow code
    print("Completed in", int((time.time() - start) * 1000), "ms")

    return ("Done", 200)

Fix (Both):

  • Increase timeout setting
  • Optimize slow database queries
  • Use async operations properly
  • Parallelize independent tasks
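
On AWS you can also guard against the hard timeout directly: the Lambda context object exposes get_remaining_time_in_millis(), so a handler can return partial results instead of being killed mid-write (GCP has no direct equivalent; there you would track the deadline yourself). `FakeContext` below is a stand-in for demonstration only:

```python
import time


class FakeContext:
    """Stand-in for the real Lambda context object, for demonstration only."""

    def __init__(self, deadline):
        self._deadline = deadline

    def get_remaining_time_in_millis(self):
        return max(0, int((self._deadline - time.time()) * 1000))


def handler(event, context):
    done = 0
    for item in event["items"]:
        # Bail out with partial results instead of being killed mid-write
        if context.get_remaining_time_in_millis() < 500:
            return {"processed": done, "truncated": True}
        time.sleep(0.01)  # stand-in for real per-item work
        done += 1
    return {"processed": done, "truncated": False}


ctx = FakeContext(deadline=time.time() + 5)
print(handler({"items": [1, 2, 3]}, ctx))  # {'processed': 3, 'truncated': False}
```

Pair this with a queue or Step Functions retry so the truncated remainder gets picked up later.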

Issue 2: Out of Memory

AWS Symptoms:

  • "Process exited before completing request"
  • Sudden termination in logs

AWS Debug:

import resource

usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Memory (KB):", usage_kb)

GCP Symptoms:

  • "Error: resource exhausted"
  • Function crashes with no error message

GCP Debug:

import resource
import threading
import time


def sample_memory():
    # Runs on a daemon thread so it never blocks the handler
    while True:
        usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"Memory: {round(usage_kb / 1024, 2)}MB")
        time.sleep(1)


threading.Thread(target=sample_memory, daemon=True).start()

Fix (Both):

  • Stream large files instead of loading entire file in memory
  • Release unused object references
  • Increase memory allocation
  • Use appropriate data structures (Set vs Array for millions of items)
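
The last bullet in numbers: membership tests scan a list element by element but hash straight into a set. A quick sketch (the timings are illustrative and vary by machine):

```python
import time

n = 200_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case for the list: -1 is absent, so every element gets compared
start = time.time()
-1 in as_list
list_time = time.time() - start

# Single hash lookup for the set
start = time.time()
-1 in as_set
set_time = time.time() - start

print(f"list: {list_time:.5f}s  set: {set_time:.7f}s")
```

Sets buy that speed with extra memory per element, so reach for them when lookups dominate, not as a blanket replacement.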

Issue 3: Permission Denied

AWS Symptoms:

  • "User: arn:aws:iam::... is not authorized to perform: s3:GetObject"

AWS Debug:

aws iam get-role-policy --role-name MyLambdaRole --policy-name S3Access

GCP Symptoms:

  • "Error: permission denied on resource"
  • "Cloud IAM says you don't have access to..."

GCP Debug:

gcloud functions describe my-function --format=json | grep serviceAccountEmail
# Check which roles that service account holds on the project
# (get-iam-policy on the account itself only shows who may impersonate it)
gcloud projects get-iam-policy PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:SA_EMAIL"

Fix (AWS):

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-bucket/*"
}

Fix (GCP):

gcloud projects add-iam-policy-binding PROJECT_ID \
--member serviceAccount:SA_EMAIL \
--role roles/storage.objectViewer

Issue 4: Cold Start Delays

AWS Symptoms:

  • First request takes 1-2 seconds
  • CloudWatch shows high Duration for first invocation

GCP Symptoms:

  • First request takes 0.5-2 seconds
  • Subsequent requests are fast

Debug (Both): Check logs for first vs second invocation. Cold starts are normal but can be optimized.

Fix (Both):

  • Reduce code bundle size (remove unused dependencies)
  • Use lightweight runtimes (prefer Python over Java)
  • Minimize initialization code outside handler
  • AWS: Use provisioned concurrency
  • GCP: Use min-instances setting

Issue 5: Firestore/DynamoDB Not Found

AWS Symptoms:

  • "ResourceNotFoundException: Requested resource not found"

AWS Debug:

aws dynamodb describe-table --table-name Items
aws iam get-role-policy --role-name LambdaRole --policy-name DynamoDB

GCP Symptoms:

  • "Error: failed to get document"
  • "PERMISSION_DENIED: permission denied"

GCP Debug:

gcloud firestore databases list
# Check which roles the function's service account holds on the project
gcloud projects get-iam-policy PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:FUNCTION_SERVICE_ACCOUNT"

Fix (AWS):

  • Check table name (case-sensitive)
  • Verify Lambda IAM role has dynamodb:* permissions
  • Verify table exists in same region

Fix (GCP):

  • Check Firestore database is initialized
  • Verify service account has roles/datastore.user role
  • Check database location is correct

AWS vs. GCP Debugging Tools

| Tool/Capability       | AWS                                      | GCP                            |
|-----------------------|------------------------------------------|--------------------------------|
| Local testing         | SAM CLI, docker-lambda                   | Functions Framework            |
| IDE debugging         | VS Code + SAM, IntelliJ                  | Cloud Code (VS Code)           |
| Production debugging  | X-Ray (distributed tracing)              | Cloud Debugger, Cloud Profiler |
| Log querying          | CloudWatch Logs Insights                 | Cloud Logging Log Explorer     |
| Performance profiling | Lambda Insights                          | Cloud Profiler                 |
| Live breakpoints      | N/A (CloudWatch RUM covers front-end apps, not Lambda) | Cloud Debugger |
| Error tracking        | CloudWatch + third-party                 | Error Reporting                |
| Memory/CPU graphs     | CloudWatch metrics                       | Cloud Monitoring               |

Key Differences

  • Local debugging: SAM offers more mature tooling; GCP uses Functions Framework (simpler)
  • Remote debugging: AWS uses X-Ray for tracing; GCP uses Cloud Debugger for live breakpoints
  • Log analysis: CloudWatch Insights uses custom query language; Cloud Logging uses simpler filter syntax
  • Performance: AWS has Lambda Insights extension; GCP has Cloud Profiler

Best Practices (Both Platforms)

  1. Log liberally in development — Be conservative in production
  2. Use log levels — DEBUG, INFO, WARN, ERROR (structured logging)
  3. Include context — Request ID, user ID, timestamps in every log
  4. Test error paths — Don't just test happy path
  5. Keep previous versions — For quick rollback during intensive debugging
  6. Correlate traces — Use request/trace IDs to follow requests across services
  7. Monitor memory — Set alerts for high memory usage trends
  8. Profile in production — GCP Profiler and AWS X-Ray work on real traffic
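
Practices 2, 3, and 6 combine into one pattern: emit one JSON object per log line carrying a request ID, and pass that ID to every downstream call. A sketch (the `log` and `handle_request` helpers and field names are illustrative):

```python
import json
import uuid


def log(level, message, **fields):
    # One JSON object per line: trivially filterable in CloudWatch or Cloud Logging
    print(json.dumps({"severity": level, "message": message, **fields}))


def handle_request(body, request_id=None):
    # Reuse an incoming ID if the caller sent one; otherwise mint a new one
    request_id = request_id or str(uuid.uuid4())
    log("INFO", "request received", requestId=request_id)
    result = {"ok": True}
    # Pass request_id to downstream calls (headers, message attributes) so
    # every service in the chain logs the same correlation ID
    log("INFO", "request complete", requestId=request_id)
    return result, request_id


result, rid = handle_request({"id": 1}, request_id="req-42")
print(rid)  # req-42
```

Both CloudWatch Logs Insights and the Log Explorer can then filter on requestId to stitch a request's whole path back together.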

Hands-On: Multi-Cloud Debugging

AWS Lambda

  1. Create function with intentional bug (ROLE_ARN is a placeholder for your Lambda execution role):
aws lambda create-function \
--function-name debug-demo \
--runtime python3.12 \
--role ROLE_ARN \
--handler lambda_function.handler \
--zip-file fileb://function.zip
  2. Invoke and check logs:
aws lambda invoke \
--function-name debug-demo \
--cli-binary-format raw-in-base64-out \
--payload '{"id": "123"}' \
response.json

aws logs tail /aws/lambda/debug-demo --follow
  3. Add debug logging and redeploy:
aws lambda update-function-code \
--function-name debug-demo \
--zip-file fileb://function-v2.zip

Google Cloud

  1. Deploy function:
gcloud functions deploy debug-demo \
--runtime python312 \
--trigger-http \
--allow-unauthenticated
  2. View logs:
gcloud functions logs read debug-demo --limit 50
  3. Query with Cloud Logging:
gcloud logging read \
'resource.type="cloud_function" AND resource.labels.function_name="debug-demo"' \
--limit 50

Key Takeaway

Debugging serverless requires different tools and mindset than traditional development. Structured logging is your best friend—add request IDs, log at every decision point, and use your platform's querying tools to find patterns. Local simulation helps catch issues early; production debugging relies on logs, metrics, and trace IDs.