Debugging: AWS & GCP Strategies
Serverless applications are hard to debug—you can't SSH into the runtime, can't attach to a process. Instead, you rely on logs, metrics, and distributed tracing. Both AWS and GCP offer tools and best practices for finding and fixing issues quickly.
Simple Explanation
What it is
Debugging is the process of finding the real cause of a problem and proving the fix works.
Why we need it
In serverless, failures are often spread across services. You need a structured approach so you do not guess and hope.
Benefits
- Faster root-cause discovery when incidents happen.
- Less downtime because fixes are targeted.
- Better confidence when shipping changes.
Tradeoffs
- More tooling to learn (logs, traces, metrics).
- Requires discipline to reproduce issues properly.
Real-world examples (architecture only)
- Bug in payment flow -> Trace shows failure in third-party API.
- Timeout spike -> Logs show slow database query.
Part 1: AWS Debugging
Debugging Strategies
1. Reproduction
Can you reproduce the issue locally?
# Get the exact event from CloudWatch
aws logs get-log-events \
--log-group-name /aws/lambda/myfunction \
--log-stream-name '2026/02/08/[$LATEST]abc123'
# Copy event JSON
# Test locally with SAM
sam local invoke MyFunction -e event.json
2. Isolation
Test components independently:
# Test Lambda handler separately
from index import handler
event = {"id": "123"}
print(handler(event, None))
# Test database connection
from db import connect_db
try:
    connect_db()
    print("Connected")
except Exception as exc:
    print(f"Connection failed: {exc}")
# Test external API
from api import call_api
print(call_api("https://api.example.com"))
3. Add Logging
Methodically add logs to narrow down the issue:
import json
def handler(event, context):
    print("1. Entry - Event:", event)
    data = json.loads(event.get("body") or "{}")
    print("2. Parsed - Data:", data)
    result = database.query(data)
    print("3. Query - Result:", result)
    formatted = format_result(result)
    print("4. Formatted:", formatted)
    return formatted
Gradually remove logs as you understand the flow.
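If you prefer not to remove them, the same checkpoints can use the stdlib logging module and be silenced with a level change instead of deletion. A sketch, with the query and formatting steps from the example above elided:

```python
import json
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)  # raise to INFO or WARNING in production

def handler(event, context):
    logger.debug("1. Entry - event: %s", event)
    data = json.loads(event.get("body") or "{}")
    logger.debug("2. Parsed - data: %s", data)
    # ... query and format as in the example above ...
    return {"statusCode": 200, "body": json.dumps(data)}
```

The `%s` arguments are only formatted when the level is enabled, so the checkpoints cost almost nothing once the level is raised.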
4. Breakpoint Debugging
Full IDE debugging with SAM:
sam local start-api --debug-port 5890
In VS Code, attach debugger:
// .vscode/launch.json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach to SAM (Python)",
      "type": "python",
      "request": "attach",
      "connect": {
        "host": "localhost",
        "port": 5890
      }
    }
  ]
}
Common Issues & Fixes
Lambda Timeout
Function takes too long:
Symptoms:
- "Task timed out after X seconds"
- Incomplete logs
Debug:
import time

def handler(event, context):
    start_time = time.time()
    print("Time remaining (ms):", context.get_remaining_time_in_millis())
    # Your code
    print("Handler duration:", int((time.time() - start_time) * 1000), "ms")
Fix:
- Increase timeout in Lambda config
- Optimize slow operations
- Use async/await properly
- Parallelize operations
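Independent I/O-bound calls can be parallelized with a thread pool; `fetch_one` here is a hypothetical stand-in for a slow downstream call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_one(item_id):
    # Hypothetical slow downstream call (API, database, etc.)
    time.sleep(0.1)
    return {"id": item_id}

def fetch_all(ids):
    # Run the independent lookups concurrently: the batch takes
    # roughly one call's latency instead of the sum of all of them
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_one, ids))
```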
Out of Memory
Function uses too much memory:
Symptoms:
- "Process exited before completing request"
- Sudden termination
Debug:
import resource
usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Memory usage (KB):", usage_kb)
Fix:
- Stream large files instead of loading in memory
- Release unused references
- Increase Lambda memory allocation
- Use appropriate data structures
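The streaming point can be illustrated without any cloud SDK: process a file in fixed-size chunks so peak memory stays bounded. With S3, the equivalent is iterating the object's body in chunks rather than calling `read()` once.

```python
import hashlib

def sha256_streaming(path, chunk_size=1024 * 1024):
    # Reads at most chunk_size bytes at a time, so memory use is
    # bounded no matter how large the file is
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```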
Permission Denied
IAM role lacks permissions:
Symptoms:
- "User: arn:aws:iam::... is not authorized to perform: s3:GetObject"
Debug: Check Lambda execution role:
aws iam get-role-policy \
--role-name MyLambdaRole \
--policy-name S3Access
Fix: Add permission to role:
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/*"
}
Cold Start Delays
First invocation is slow:
Symptoms:
- First request takes 1-2 seconds
- Subsequent requests are fast
Debug:
import time

start_time = time.time()  # module scope: set once, at cold start

def handler(event, context):
    duration_ms = int((time.time() - start_time) * 1000)
    print("Duration since init:", duration_ms, "ms")
# Cold invocations also report an "Init Duration" in the CloudWatch REPORT line
Fix:
- Optimize code bundle size (remove unused dependencies)
- Prefer lightweight runtimes (Python over Java)
- Reduce VPC overhead (if using VPC)
- Set provisioned concurrency for critical functions
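Keeping module-scope initialization small shortens the cold start itself; whatever setup you do keep should live at module scope so warm invocations reuse it instead of repeating it. A sketch, where `connect` is a hypothetical expensive setup call:

```python
import time

def connect():
    # Hypothetical expensive setup (SDK client, connection pool)
    time.sleep(0.2)
    return {"created_at": time.time()}

# Runs once, during the cold start; warm invocations reuse the result
CLIENT = connect()

def handler(event, context):
    # Warm invocations skip the 200 ms setup entirely
    return {"client_age_ms": int((time.time() - CLIENT["created_at"]) * 1000)}
```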
DynamoDB Not Found
Cannot access DynamoDB table:
Symptoms:
- "ResourceNotFoundException"
- "Requested resource not found"
Debug:
# Verify table exists
aws dynamodb describe-table --table-name Items
# Verify Lambda can access it (IAM check)
# Verify table name matches
Fix:
- Check table name spelling (case-sensitive)
- Verify Lambda IAM role has dynamodb:GetItem, etc.
- Check table is in same region as Lambda
Debugging Tools
AWS X-Ray
[Lesson 5] covers this in detail. Enables distributed tracing.
CloudWatch Logs Insights
Query logs to find patterns:
fields @timestamp, @message, @duration
| filter @message like /ERROR/
| stats count() as errors, avg(@duration) by @logStream
AWS Lambda Insights
CloudWatch extension for performance:
- Add extension to Lambda
- View metrics: CPU, memory allocation, duration
- Identify performance bottlenecks
SAM Local Debugging
Debug locally before deployment:
# Run function locally with event
sam local invoke MyFunction -e event.json
# Start API locally and expose a debugger port
sam local start-api --debug-port 5890
# Attach IDE debugger to port 5890
Remote Debugging
Debug production issues with temporary logging:
import json
import os
DEBUG = os.environ.get("DEBUG") == "true"
def handler(event, context):
    if DEBUG:
        print("Full event:", json.dumps(event, indent=2))
        # Caution: env vars often hold secrets; never leave this enabled
        print("All env vars:", dict(os.environ))
    # Your code
Enable for specific invocation:
aws lambda update-function-configuration \
--function-name MyFunction \
--environment Variables={DEBUG=true}
# Test
curl https://api.example.com/test
# Disable
aws lambda update-function-configuration \
--function-name MyFunction \
--environment Variables={DEBUG=false}
Part 2: GCP Debugging
GCP Debugging Strategies
1. Reproduction
Get the exact triggering event and test locally:
# View function execution logs
gcloud functions logs read my-function --limit 50
# Export specific log entries
gcloud logging read 'resource.type="cloud_function"' \
--format json > logs.json
# Test locally with Functions Framework
functions-framework --target myFunction --debug
2. Isolation
Test components independently:
from db import connect_firestore
from api import call_api
try:
    connect_firestore()
    print("Connected")
except Exception as exc:
    print(f"Connection failed: {exc}")
print(call_api("https://api.example.com"))
3. Cloud Debugger
Note: Google's Cloud Debugger service was deprecated and shut down in 2023; the open-source Snapshot Debugger is its successor. The original flow attached a real-time debugger to your function:
import googleclouddebugger

googleclouddebugger.enable(
    service="my-function",
    version="1.0.0",
)

def my_function(request):
    print("Request:", request.get_json(silent=True))
    # Set breakpoints in Cloud Console
    return ("Hello", 200)
In Cloud Console you could browse source code, set breakpoints, and inspect variables without stopping execution.
4. Structured Logging for Debugging
Use JSON-formatted logs for powerful filtering:
import json
import uuid
import functions_framework
from google.cloud import logging as cloud_logging
logging_client = cloud_logging.Client()
log = logging_client.logger("debug-logs")
@functions_framework.http
def debug_demo(request):
    request_id = request.headers.get("x-request-id", str(uuid.uuid4()))
    log.log_struct({
        "requestId": request_id,
        "message": "Request received",
        "method": request.method,
        "path": request.path,
    }, severity="DEBUG")
    try:
        data = request.get_json(silent=True) or {}
        log.log_struct({
            "requestId": request_id,
            "message": "Body parsed",
            "data": data,
        }, severity="DEBUG")
        result = process_data(data)
        log.log_struct({
            "requestId": request_id,
            "message": "Processing complete",
            "result": result,
        }, severity="INFO")
        return (json.dumps(result), 200)
    except Exception as exc:
        log.log_struct({
            "requestId": request_id,
            "message": "Error occurred",
            "error": str(exc),
        }, severity="ERROR")
        return (json.dumps({"error": str(exc)}), 500)
Cloud Logging Log Explorer
Filter and search logs in Cloud Console:
resource.type="cloud_function"
resource.labels.function_name="my-function"
severity="ERROR"
jsonPayload.requestId="abc-123"
Cloud Profiler
Identify performance bottlenecks:
import googlecloudprofiler

googlecloudprofiler.start(
    service="my-function",
    service_version="1.0",
)
Generate CPU and memory profiles automatically.
Common Issues & Fixes
Issue 1: Timeout
Function takes too long:
AWS Symptoms:
- "Task timed out after X seconds"
- CloudWatch shows incomplete logs
AWS Debug:
import time

def handler(event, context):
    start_time = time.time()
    print("Time remaining (ms):", context.get_remaining_time_in_millis())
    # Your slow code
    print("Total time:", int((time.time() - start_time) * 1000), "ms")
GCP Symptoms:
- "Error: function execution timeout"
- Cloud Logging shows incomplete traces
GCP Debug:
import time
from datetime import datetime, timezone

def my_function(request):
    start = time.time()
    print("Starting at", datetime.now(timezone.utc).isoformat())
    # Your slow code
    print("Completed in", int((time.time() - start) * 1000), "ms")
    return ("Done", 200)
Fix (Both):
- Increase timeout setting
- Optimize slow database queries
- Use async operations properly
- Parallelize independent tasks
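"Use async operations properly" and "parallelize independent tasks" combine naturally with asyncio.gather; `get_item` here is a hypothetical non-blocking downstream call:

```python
import asyncio

async def get_item(item_id):
    # Hypothetical non-blocking downstream call
    await asyncio.sleep(0.1)
    return {"id": item_id}

async def get_items(ids):
    # gather() awaits the calls concurrently: total latency is
    # roughly one call's, not the sum of all of them
    return await asyncio.gather(*(get_item(i) for i in ids))

results = asyncio.run(get_items(["a", "b", "c"]))
```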
Issue 2: Out of Memory
AWS Symptoms:
- "Process exited before completing request"
- Sudden termination in logs
AWS Debug:
import resource
usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Memory (KB):", usage_kb)
GCP Symptoms:
- "Error: resource exhausted"
- Function crashes with no error message
GCP Debug:
import resource
import threading
import time

def sample_memory(seconds=10):
    # Log peak RSS once per second from a background thread,
    # instead of blocking the function with an infinite loop
    for _ in range(seconds):
        usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"Memory: {round(usage_kb / 1024, 2)} MB")
        time.sleep(1)

threading.Thread(target=sample_memory, daemon=True).start()
Fix (Both):
- Stream large files instead of loading entire file in memory
- Release unused object references
- Increase memory allocation
- Use appropriate data structures (Set vs Array for millions of items)
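The data-structure point is easy to demonstrate: membership tests against a set are O(1), while a list scan is O(n), which matters once you probe it many times per invocation. A small timing sketch:

```python
import time

ids_list = [str(i) for i in range(100_000)]
ids_set = set(ids_list)

def time_lookups(container, probe, repeats=100):
    start = time.perf_counter()
    for _ in range(repeats):
        probe in container  # O(n) scan for a list, O(1) hash for a set
    return time.perf_counter() - start

slow = time_lookups(ids_list, "99999")  # worst case: last element
fast = time_lookups(ids_set, "99999")
print(f"list: {slow:.4f}s  set: {fast:.6f}s")
```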
Issue 3: Permission Denied
AWS Symptoms:
- "User: arn:aws:iam::... is not authorized to perform: s3:GetObject"
AWS Debug:
aws iam get-role-policy --role-name MyLambdaRole --policy-name S3Access
GCP Symptoms:
- "Error: permission denied on resource"
- "Cloud IAM says you don't have access to..."
GCP Debug:
gcloud functions describe my-function --format=json | grep serviceAccountEmail
# List the roles granted to that service account on the project
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:SERVICE_ACCOUNT"
Fix (AWS):
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/*"
}
Fix (GCP):
gcloud projects add-iam-policy-binding PROJECT_ID \
--member serviceAccount:SA_EMAIL \
--role roles/storage.objectViewer
Issue 4: Cold Start Delays
AWS Symptoms:
- First request takes 1-2 seconds
- CloudWatch shows high Duration for first invocation
GCP Symptoms:
- First request takes 0.5-2 seconds
- Subsequent requests are fast
Debug (Both): Check logs for first vs second invocation. Cold starts are normal but can be optimized.
Fix (Both):
- Reduce code bundle size (remove unused dependencies)
- Use lightweight runtimes (e.g. Python over Java)
- Minimize initialization code outside handler
- AWS: Use provisioned concurrency
- GCP: Use min-instances setting
Issue 5: Firestore/DynamoDB Not Found
AWS Symptoms:
- "ResourceNotFoundException: Requested resource not found"
AWS Debug:
aws dynamodb describe-table --table-name Items
aws iam get-role-policy --role-name LambdaRole --policy-name DynamoDB
GCP Symptoms:
- "Error: failed to get document"
- "PERMISSION_DENIED: permission denied"
GCP Debug:
gcloud firestore databases list
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:FUNCTION_SERVICE_ACCOUNT"
Fix (AWS):
- Check table name (case-sensitive)
- Verify Lambda IAM role has dynamodb:* permissions
- Verify table exists in same region
Fix (GCP):
- Check Firestore database is initialized
- Verify service account has the roles/datastore.user role
- Check database location is correct
AWS vs. GCP Debugging Tools
| Tool/Capability | AWS | GCP |
|---|---|---|
| Local testing | SAM CLI, docker-lambda | Functions Framework |
| IDE debugging | VS Code + SAM, IntelliJ | Cloud Code (VS Code) |
| Production debugging | X-Ray (distributed tracing) | Cloud Debugger, Cloud Profiler |
| Log querying | CloudWatch Logs Insights | Cloud Logging Log Explorer |
| Performance profiling | Lambda Insights | Cloud Profiler |
| Live breakpoints | VS Code + SAM (local only) | Cloud Debugger (deprecated) |
| Error tracking | CloudWatch + third-party | Error Reporting |
| Memory/CPU graphs | CloudWatch metrics | Cloud Monitoring |
Key Differences
- Local debugging: SAM offers more mature tooling; GCP uses Functions Framework (simpler)
- Remote debugging: AWS uses X-Ray for tracing; GCP uses Cloud Debugger for live breakpoints
- Log analysis: CloudWatch Insights uses custom query language; Cloud Logging uses simpler filter syntax
- Performance: AWS has Lambda Insights extension; GCP has Cloud Profiler
Best Practices (Both Platforms)
- Log liberally in development — Be conservative in production
- Use log levels — DEBUG, INFO, WARN, ERROR (structured logging)
- Include context — Request ID, user ID, timestamps in every log
- Test error paths — Don't just test happy path
- Keep previous versions — For quick rollback during intensive debugging
- Correlate traces — Use request/trace IDs to follow requests across services
- Monitor memory — Set alerts for high memory usage trends
- Profile in production — GCP Profiler and AWS X-Ray work on real traffic
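A minimal stdlib sketch of the "include context" and "correlate traces" practices: emit one JSON object per log line carrying a request ID, which either platform's log query tools can then filter on.

```python
import json
import time
import uuid

def log_event(request_id, severity, message, **fields):
    # One JSON object per line: queryable by requestId in
    # CloudWatch Logs Insights or the Cloud Logging Log Explorer
    entry = {
        "timestamp": time.time(),
        "severity": severity,
        "requestId": request_id,
        "message": message,
        **fields,
    }
    print(json.dumps(entry))
    return entry

request_id = str(uuid.uuid4())
log_event(request_id, "INFO", "Request received", path="/items")
log_event(request_id, "ERROR", "Lookup failed", itemId="123")
```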
Hands-On: Multi-Cloud Debugging
AWS Lambda
- Create function with intentional bug:
aws lambda create-function \
  --function-name debug-demo \
  --runtime python3.12 \
  --handler lambda_function.handler \
  --role arn:aws:iam::ACCOUNT_ID:role/LambdaExecutionRole \
  --zip-file fileb://function.zip
- Invoke and check logs:
aws lambda invoke \
  --function-name debug-demo \
  --cli-binary-format raw-in-base64-out \
  --payload '{"id": "123"}' \
  response.json
aws logs tail /aws/lambda/debug-demo --follow
- Add debug logging and redeploy:
aws lambda update-function-code \
--function-name debug-demo \
--zip-file fileb://function-v2.zip
Google Cloud
- Deploy function:
gcloud functions deploy debug-demo \
--runtime python312 \
--trigger-http \
--allow-unauthenticated
- View logs:
gcloud functions logs read debug-demo --limit 50
- Query with Cloud Logging:
gcloud logging read \
'resource.type="cloud_function" AND resource.labels.function_name="debug-demo"' \
--limit 50
Key Takeaway
Debugging serverless requires different tools and mindset than traditional development. Structured logging is your best friend—add request IDs, log at every decision point, and use your platform's querying tools to find patterns. Local simulation helps catch issues early; production debugging relies on logs, metrics, and trace IDs.