Error Handling & Retries (AWS & GCP)
Resilient systems don't ignore errors—they anticipate, handle, and learn from them. Both AWS Lambda and Google Cloud Functions provide mechanisms for retrying failed operations, routing errors to queues, and logging failures. Understanding these patterns helps you build systems that recover gracefully.
Simple Explanation
What it is
Error handling is how your system reacts when something fails. Retries are how it tries again safely.
Why we need it
Cloud services fail sometimes. Without a plan, a small failure turns into a full outage.
Benefits
- Higher reliability when transient failures occur.
- Safer recovery through retries and dead-letter queues.
- Clearer debugging because errors are captured and tracked.
Tradeoffs
- More complexity in control flow and testing.
- Risk of duplicate work if idempotency is not handled.
Real-world examples (architecture only)
- Payment failure -> Retry with backoff -> DLQ after max retries.
- File processing error -> Send to error queue for manual review.
Part 1: AWS Error Handling
Try-Catch Basics
Always wrap code that might fail:
import json

def handler(event, context):
    try:
        result = risky_operation()
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}
Error Types
Application Errors
Bug in your code:
# ❌ Forgot to parse
data = event.get("body")  # Still a raw JSON string
count = len(data) + 1  # Counts characters in the string, not items
# ✅ Parse first
data = json.loads(event.get("body") or "{}")
count = len(data.get("items", []))
Service Errors
AWS service temporarily unavailable:
# DynamoDB might be busy
try:
    result = ddb.get_item(**params)
except Exception as exc:
    code = getattr(exc, "response", {}).get("Error", {}).get("Code")
    if code == "ProvisionedThroughputExceededException":
        # Too many requests to DynamoDB
        # Retry with exponential backoff
        pass
Configuration Errors
Missing environment variables:
import os
# ❌ Fails silently if env var missing
table_name = os.environ.get("TABLE_NAME")
# ✅ Fail fast with clear error
table_name = os.environ.get("TABLE_NAME")
if not table_name:
    raise RuntimeError("TABLE_NAME environment variable not set")
Retry Strategy
Manual Retry
import time

def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = 2 ** attempt
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)

def handler(event, context):
    result = with_retry(lambda: ddb.get_item(**params))
    return result
Exponential Backoff
Wait longer between retries:
Attempt 1: Fail
Wait 2 seconds
Attempt 2: Fail
Wait 4 seconds
Attempt 3: Fail
Wait 8 seconds
Attempt 4: Fail
Give up
Reduces load on overwhelmed service.
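The waits above are simply the `2 ** attempt` values from the `with_retry` helper; a quick sketch of the schedule:

# Delay (in seconds) before each retry, per the 2 ** attempt formula above
delays = [2 ** attempt for attempt in range(1, 4)]
print(delays)  # [2, 4, 8]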
Jitter
Add randomness to prevent thundering herd:
# Without jitter: All instances retry at same time
# With jitter: Stagger retry times
import random
jitter = random.random()
delay = (2 ** attempt) + jitter
Lambda Retry Behavior
Sync vs. Async
Synchronous (API Gateway):
- You wait for response
- No automatic retry
- Handle errors yourself
Asynchronous (S3, SNS triggers):
- Fire and forget
- Lambda automatically retries twice
- Failed events can be routed to a DLQ or on-failure destination, if one is configured
Configure Async Behavior
OrderProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      MaximumEventAgeInSeconds: 3600  # Max age 1 hour
      MaximumRetryAttempts: 2
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt DeadLetterQueue.Arn
On failure after 2 retries, message goes to DLQ for investigation.
Error Response Formats
REST API Errors
return {
    "statusCode": 500,
    "body": json.dumps({
        "error": "Internal Server Error",
        "message": "Failed to process request",
        "requestId": context.aws_request_id,
    }),
}
Specific Error Codes
try:
    result = operation()
    return {"statusCode": 200, "body": result}
except Exception as exc:
    code = getattr(exc, "code", None)
    if code == "ValidationError":
        return {"statusCode": 400, "body": str(exc)}
    if code == "NotFound":
        return {"statusCode": 404, "body": str(exc)}
    return {"statusCode": 500, "body": "Server error"}
Transient vs. Permanent Errors
Transient (retry):
- Network timeout
- Service temporarily down
- Throttling
Permanent (don't retry):
- Invalid input (400)
- Authentication failed (401)
- S3 bucket doesn't exist (404)
def should_retry(error):
    retryable_statuses = {408, 429, 500, 502, 503, 504}
    return getattr(error, "statusCode", None) in retryable_statuses
Circuit Breaker Pattern
Stop retrying if service is broken:
import time

failure_count = 0
FAILURE_THRESHOLD = 5
circuit_open = False
circuit_until = 0

def handler(event, context):
    global failure_count, circuit_open, circuit_until
    if circuit_open and time.time() < circuit_until:
        return {"statusCode": 503, "body": "Service unavailable"}
    try:
        result = external_service_call()
        failure_count = 0
        circuit_open = False
        return result
    except Exception:
        failure_count += 1
        if failure_count > FAILURE_THRESHOLD:
            circuit_open = True
            circuit_until = time.time() + 60
        raise
When circuit opens, fail fast instead of hammering broken service.
Monitoring Errors
import json

def log_error(error, context):
    print(json.dumps({
        "level": "ERROR",
        "message": str(error),
        "requestId": context.aws_request_id,
        "functionName": context.function_name,
        "remainingMs": context.get_remaining_time_in_millis(),
    }))

def handler(event, context):
    try:
        # Your code
        pass
    except Exception as exc:
        log_error(exc, context)
        raise
Create alarm on error rate:
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda
    MetricName: Errors
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    ComparisonOperator: GreaterThanThreshold
    Threshold: 10
    TreatMissingData: notBreaching
Dead Letter Queues (DLQ)
Catch failures for later investigation:
ProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt FailedItems.Arn

FailedItems:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: failed-items-dlq
    MessageRetentionPeriod: 1209600  # 14 days
Later, process failed items manually or with a small retry ("redrive") function, as sketched below.
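A minimal redrive sketch, assuming the DLQ URL and the original function's name are provided via hypothetical FAILED_ITEMS_QUEUE_URL and TARGET_FUNCTION environment variables; adapt it to your stack:

import json
import os

import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

def redrive_handler(event, context):
    queue_url = os.environ["FAILED_ITEMS_QUEUE_URL"]
    target = os.environ["TARGET_FUNCTION"]
    messages = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
    ).get("Messages", [])
    for message in messages:
        record = json.loads(message["Body"])
        # OnFailure destinations wrap the original event in an invocation record;
        # fall back to the raw body if it is not wrapped.
        payload = record.get("requestPayload", record) if isinstance(record, dict) else record
        lambda_client.invoke(
            FunctionName=target,
            InvocationType="Event",  # async, so normal retry/DLQ rules apply again
            Payload=json.dumps(payload).encode("utf-8"),
        )
        # Remove from the DLQ only after the re-invoke call was accepted
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return {"redriven": len(messages)}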
Chaos Engineering
Intentionally inject failures to test resilience:
import random

def should_fail():
    return random.random() < 0.05  # 5% failure rate

def handler(event, context):
    if should_fail():
        raise RuntimeError("Simulated failure for testing")
    # Normal operation
Enable only in staging, verify system recovers gracefully.
Best Practices (AWS)
- Fail fast on permanent errors — Don't retry 404s, 401s, 400s
- Exponential backoff — Wait 1s, 2s, 4s, 8s between retries
- Jitter — Add randomness to prevent thundering herd
- Circuit breakers — Stop retrying if service is broken
- DLQs — Route failed async events to queues for investigation
- Logging — Include request ID, status, and error details
Part 2: GCP Error Handling
Try-Catch Basics
Cloud Functions also uses standard try-catch:
import functions_framework
@functions_framework.http
def handle_request(request):
try:
result = process_request(request.get_json(silent=True) or {})
return ({"success": True, "result": result}, 200)
except Exception as exc:
print("Error:", exc)
return ({"error": str(exc)}, 500)
Error Types
Application Errors:
# Invalid input
payload = request.get_json(silent=True) or {}
if not payload.get("id"):
    return ({"error": "Missing id"}, 400)
Service Errors:
# `firestore` here is a google.cloud.firestore.Client() instance
try:
    doc = firestore.collection("items").document(doc_id).get()
except Exception as exc:
    if getattr(exc, "code", None) == "UNAVAILABLE":
        return ({"error": "Service unavailable"}, 503)
    raise  # permanent errors propagate to the caller
Configuration Errors:
import os
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
    raise RuntimeError("GOOGLE_CLOUD_PROJECT not set")
Retry Strategy
Manual retry with exponential backoff:
import random
import time

def is_retryable(error):
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes

def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = (2 ** attempt) + random.random()
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)

@functions_framework.http
def my_function(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        return ({"error": str(exc)}, 500)
Cloud Tasks for Guaranteed Delivery
For critical operations that must succeed, use Cloud Tasks:
import json
import os
import time

import functions_framework
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

@functions_framework.http
def enqueue_order(request):
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
    queue = "order-processing"
    location = "us-central1"
    body = json.dumps(request.get_json(silent=True) or {}).encode("utf-8")
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://YOUR-FUNCTION-URL/processOrder",
            "headers": {"Content-Type": "application/json"},
            "body": body,
        },
        "schedule_time": {"seconds": int(time.time())},
    }
    parent = client.queue_path(project, location, queue)
    response = client.create_task(request={"parent": parent, "task": task})
    return ({"taskName": response.name}, 200)
Cloud Tasks automatically retries failed tasks with exponential backoff; the maximum number of attempts, backoff bounds, and total retry duration are configurable on the queue.
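The retry policy can be tuned with gcloud; the values below are illustrative rather than defaults:

gcloud tasks queues update order-processing \
  --max-attempts=5 \
  --min-backoff=2s \
  --max-backoff=300s \
  --max-doublings=4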
Pub/Sub Retry Policy
For event-driven workflows, use Pub/Sub with dead-letter topics:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("PROJECT_ID", "order-processing-sub")

def callback(message):
    try:
        process_order(message.data)
        message.ack()
    except Exception as exc:
        print("Error processing message:", exc)
        message.nack()

subscriber.subscribe(subscription_path, callback=callback)
Configure via gcloud:
gcloud pubsub subscriptions create order-processing-sub \
  --topic=orders \
  --dead-letter-topic=order-dlq \
  --max-delivery-attempts=5
Error Reporting (GCP Native)
Report errors to Cloud Error Reporting:
from google.cloud import error_reporting

client = error_reporting.Client()

@functions_framework.http
def my_function(request):
    try:
        risky_operation()
        return ("OK", 200)
    except Exception:
        client.report_exception()
        return ("Internal error", 500)
View aggregated errors and trends in Cloud Console.
AWS vs. GCP Error Handling
| Feature | AWS Lambda | Google Cloud Functions |
|---|---|---|
| Sync invocation failures | Return HTTP status code | Return HTTP status code |
| Async invocation failures | Automatic retry × 2 (configurable) | Retried by the event source (e.g., Pub/Sub redelivery until ack or dead-letter limit) |
| Failed event destination | DLQ (SQS/SNS) | Dead-letter topic (Pub/Sub) or GCS bucket |
| Retry backoff | Manual in code | Source handles (Pub/Sub exponential backoff) |
| Guaranteed delivery | SQS has built-in durability | Cloud Tasks, Pub/Sub |
| Error monitoring | CloudWatch alarms, X-Ray | Cloud Logging, Error Reporting, Cloud Alerting |
| Circuit breaker | Manual implementation | Manual implementation |
| Max function timeout | 15 minutes (900 seconds) | 540 seconds (1st gen); up to 60 minutes (2nd gen HTTP) |
| Max event age | Configurable (MaximumEventAgeInSeconds) | Configurable message retention per Pub/Sub subscription |
Key Differences
- Async retry behavior: AWS Lambda retries asynchronously-invoked functions automatically; GCP relies on the event source's retry policy
- Delivery guarantees: Cloud Tasks and SQS both provide at-least-once delivery; SQS FIFO queues add exactly-once processing
- DLQ routing: AWS uses SNS/SQS; GCP uses Pub/Sub topics or Cloud Storage
- Error visibility: CloudWatch requires you to set up alarms; Cloud Error Reporting aggregates errors automatically
Transient vs. Permanent Errors
Both platforms follow the same principle:
Transient (retry):
- Network timeout (408, 504)
- Service temporarily unavailable (429, 503)
- Deadline exceeded
- Temporary Firestore lock
Permanent (don't retry):
- Invalid input (400)
- Authentication failed (401)
- Authorization failed (403)
- Resource not found (404)
- Invalid message format (JSON parse error)
# Shared logic for both platforms
def should_retry(error):
    if getattr(error, "statusCode", None):
        retryable_codes = {408, 429, 500, 502, 503, 504}
        return error.statusCode in retryable_codes
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes
Circuit Breaker Pattern
Prevent hammering broken services:
# Works on both AWS and GCP
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.next_attempt = time.time()

    def execute(self, fn):
        if self.state == "OPEN":
            if time.time() < self.next_attempt:
                raise RuntimeError("Circuit breaker is OPEN")
            self.state = "HALF_OPEN"
        try:
            result = fn()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.next_attempt = time.time() + self.timeout

breaker = CircuitBreaker()

def aws_handler(event, context):
    try:
        result = breaker.execute(lambda: ddb.get_item(**params))
        return {"statusCode": 200, "body": result}
    except Exception:
        return {"statusCode": 503, "body": "Service unavailable"}

@functions_framework.http
def gcp_handler(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = breaker.execute(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception:
        return ({"error": "Service unavailable"}, 503)
Best Practices (Both Platforms)
- Fail fast on permanent errors — Retry only on transient failures
- Use exponential backoff — Base delay × 2^attempt with jitter
- Implement circuit breakers — Stop retrying broken services
- Route failures somewhere — DLQ (AWS), dead-letter topic (GCP)
- Monitor error rates — Alert on spikes
- Log with context — Request ID, user ID, attempt number
- Test error paths — Inject failures to verify recovery
- Document recovery — How to manually replay failed messages
Hands-On: Multi-Cloud Resilient API
AWS Lambda
import json
import os
import time

import boto3

ddb = boto3.client("dynamodb")

# should_retry() is the shared helper defined earlier in this chapter
def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)

def handler(event, context):
    try:
        result = with_retry(lambda: ddb.get_item(
            TableName=os.environ.get("TABLE_NAME"),
            Key={"id": {"S": event.get("id")}},
        ))
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Final error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}
Deploy with DLQ:
sam deploy --template-file template.yaml
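For reference, a minimal template.yaml sketch that wires the handler above to a DLQ, reusing the EventInvokeConfig pattern from Part 1; the resource names, code path, handler path, and table name are placeholders:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  GetOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/              # placeholder code location
      Handler: app.handler       # the handler shown above
      Runtime: python3.12
      Environment:
        Variables:
          TABLE_NAME: orders
      Policies:
        - DynamoDBReadPolicy:
            TableName: orders
      EventInvokeConfig:
        MaximumRetryAttempts: 2
        DestinationConfig:
          OnFailure:
            Type: SQS
            Destination: !GetAtt FailedItems.Arn
  FailedItems:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 days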
Google Cloud
import time

import functions_framework
from google.cloud import firestore

db = firestore.Client()

# should_retry() is the shared helper defined earlier in this chapter
def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)

@functions_framework.http
def get_order(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: db.collection("orders").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        print("Final error:", exc)
        return ({"error": str(exc)}, 500)
Deploy:
gcloud functions deploy getOrder \
  --runtime python312 \
  --entry-point get_order \
  --trigger-http \
  --allow-unauthenticated
Key Takeaway
Resilient systems expect failures and plan for them. Retry transient errors with exponential backoff, fail fast on permanent errors, use circuit breakers to stop hammering broken dependencies, and never lose messages. Both AWS and GCP provide the primitives; your job is to use them correctly.