Error Handling & Retries (AWS & GCP)
Resilient systems don't ignore errors—they anticipate, handle, and learn from them. Both AWS Lambda and Google Cloud Functions provide mechanisms for retrying failed operations, routing errors to queues, and logging failures. Understanding these patterns helps you build systems that recover gracefully.
Simple Explanation
What it is
Error handling is how your system reacts when something fails. Retries are how it tries again safely.
Why we need it
Cloud services fail sometimes. Without a plan, a small failure turns into a full outage.
Benefits
- Higher reliability when transient failures occur.
- Safer recovery through retries and dead-letter queues.
- Clearer debugging because errors are captured and tracked.
Tradeoffs
- More complexity in control flow and testing.
- Risk of duplicate work if idempotency is not handled.
Real-world examples (architecture only)
- Payment failure -> Retry with backoff -> DLQ after max retries.
- File processing error -> Send to error queue for manual review.
Part 1: AWS Error Handling
Try-Catch Basics
Always wrap code that might fail:
import json

def handler(event, context):
    try:
        result = risky_operation()
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}
Error Types
Application Errors
Bug in your code:
# ❌ Forgot to parse
data = event.get("body")  # Still a raw JSON string
count = len(data) + 1  # Counts characters in the string, not items
# ✅ Parse first
data = json.loads(event.get("body") or "{}")
count = len(data.get("items", []))
Service Errors
AWS service temporarily unavailable:
# DynamoDB might be busy
try:
    result = ddb.get_item(**params)
except Exception as exc:
    code = getattr(exc, "response", {}).get("Error", {}).get("Code")
    if code == "ProvisionedThroughputExceededException":
        # Too many requests to DynamoDB
        # Retry with exponential backoff
        pass
Configuration Errors
Missing environment variables:
import os
# ❌ Fails silently if env var missing
table_name = os.environ.get("TABLE_NAME")
# ✅ Fail fast with clear error
table_name = os.environ.get("TABLE_NAME")
if not table_name:
    raise RuntimeError("TABLE_NAME environment variable not set")
Retry Strategy
Manual Retry
import time

def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = 2 ** attempt
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)

def handler(event, context):
    result = with_retry(lambda: ddb.get_item(**params))
    return result
Exponential Backoff
Wait longer between retries:
Attempt 1: Fail
Wait 2 seconds
Attempt 2: Fail
Wait 4 seconds
Attempt 3: Fail
Wait 8 seconds
Attempt 4: Fail
Give up
Reduces load on overwhelmed service.
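The waits above are simply the `2 ** attempt` values from the `with_retry` helper; a quick sketch of the schedule:

# Delay (in seconds) before each retry, per the 2 ** attempt formula above
delays = [2 ** attempt for attempt in range(1, 4)]
print(delays)  # [2, 4, 8]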
Jitter
Add randomness to prevent thundering herd:
# Without jitter: All instances retry at same time
# With jitter: Stagger retry times
import random
jitter = random.random()
delay = (2 ** attempt) + jitter
Lambda Retry Behavior
Sync vs. Async
Synchronous (API Gateway):
- You wait for response
- No automatic retry
- Handle errors yourself
Asynchronous (S3, SNS triggers):
- Fire and forget
- Lambda automatically retries twice
- Failed events can be routed to a DLQ or on-failure destination, if one is configured
Configure Async Behavior
OrderProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      MaximumEventAgeInSeconds: 3600  # Max age 1 hour
      MaximumRetryAttempts: 2
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt DeadLetterQueue.Arn
On failure after 2 retries, message goes to DLQ for investigation.
Error Response Formats
REST API Errors
return {
    "statusCode": 500,
    "body": json.dumps({
        "error": "Internal Server Error",
        "message": "Failed to process request",
        "requestId": context.aws_request_id,
    }),
}
Specific Error Codes
try:
    result = operation()
    return {"statusCode": 200, "body": result}
except Exception as exc:
    code = getattr(exc, "code", None)
    if code == "ValidationError":
        return {"statusCode": 400, "body": str(exc)}
    if code == "NotFound":
        return {"statusCode": 404, "body": str(exc)}
    return {"statusCode": 500, "body": "Server error"}
Transient vs. Permanent Errors
Transient (retry):
- Network timeout
- Service temporarily down
- Throttling
Permanent (don't retry):
- Invalid input (400)
- Authentication failed (401)
- S3 bucket doesn't exist (404)
def should_retry(error):
    retryable_statuses = {408, 429, 500, 502, 503, 504}
    return getattr(error, "statusCode", None) in retryable_statuses
Circuit Breaker Pattern
Stop retrying if service is broken:
import time

failure_count = 0
FAILURE_THRESHOLD = 5
circuit_open = False
circuit_until = 0

def handler(event, context):
    global failure_count, circuit_open, circuit_until
    if circuit_open and time.time() < circuit_until:
        return {"statusCode": 503, "body": "Service unavailable"}
    try:
        result = external_service_call()
        failure_count = 0
        circuit_open = False
        return result
    except Exception:
        failure_count += 1
        if failure_count > FAILURE_THRESHOLD:
            circuit_open = True
            circuit_until = time.time() + 60
        raise
When circuit opens, fail fast instead of hammering broken service.
Monitoring Errors
import json

def log_error(error, context):
    print(json.dumps({
        "level": "ERROR",
        "message": str(error),
        "requestId": context.aws_request_id,
        "functionName": context.function_name,
        "remainingMs": context.get_remaining_time_in_millis(),
    }))

def handler(event, context):
    try:
        # Your code
        pass
    except Exception as exc:
        log_error(exc, context)
        raise
Create alarm on error rate:
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda
    MetricName: Errors
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    ComparisonOperator: GreaterThanThreshold
    Threshold: 10
    TreatMissingData: notBreaching
Dead Letter Queues (DLQ)
Catch failures for later investigation:
ProcessingFunction:
  Type: AWS::Serverless::Function
  Properties:
    EventInvokeConfig:
      DestinationConfig:
        OnFailure:
          Type: SQS
          Destination: !GetAtt FailedItems.Arn

FailedItems:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: failed-items-dlq
    MessageRetentionPeriod: 1209600  # 14 days
Later, process failed items manually or with a small retry ("redrive") function, as sketched below.
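A minimal redrive sketch, assuming the DLQ URL and the original function's name are provided via hypothetical FAILED_ITEMS_QUEUE_URL and TARGET_FUNCTION environment variables; adapt it to your stack:

import json
import os

import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

def redrive_handler(event, context):
    queue_url = os.environ["FAILED_ITEMS_QUEUE_URL"]
    target = os.environ["TARGET_FUNCTION"]
    messages = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
    ).get("Messages", [])
    for message in messages:
        record = json.loads(message["Body"])
        # OnFailure destinations wrap the original event in an invocation record;
        # fall back to the raw body if it is not wrapped.
        payload = record.get("requestPayload", record) if isinstance(record, dict) else record
        lambda_client.invoke(
            FunctionName=target,
            InvocationType="Event",  # async, so normal retry/DLQ rules apply again
            Payload=json.dumps(payload).encode("utf-8"),
        )
        # Remove from the DLQ only after the re-invoke call was accepted
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return {"redriven": len(messages)}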
Chaos Engineering
Intentionally inject failures to test resilience:
import random

def should_fail():
    return random.random() < 0.05  # 5% failure rate

def handler(event, context):
    if should_fail():
        raise RuntimeError("Simulated failure for testing")
    # Normal operation
Enable only in staging, verify system recovers gracefully.
Best Practices (AWS)
- Fail fast on permanent errors — Don't retry 404s, 401s, 400s
- Exponential backoff — Wait 1s, 2s, 4s, 8s between retries
- Jitter — Add randomness to prevent thundering herd
- Circuit breakers — Stop retrying if service is broken
- DLQs — Route failed async events to queues for investigation
- Logging — Include request ID, status, and error details
Part 2: GCP Error Handling
Try-Catch Basics
Cloud Functions also uses standard try-catch:
import functions_framework
@functions_framework.http
def handle_request(request):
try:
result = process_request(request.get_json(silent=True) or {})
return ({"success": True, "result": result}, 200)
except Exception as exc:
print("Error:", exc)
return ({"error": str(exc)}, 500)
Error Types
Application Errors:
# Invalid input
payload = request.get_json(silent=True) or {}
if not payload.get("id"):
    return ({"error": "Missing id"}, 400)
Service Errors:
# `firestore` here is a google.cloud.firestore.Client() instance
try:
    doc = firestore.collection("items").document(doc_id).get()
except Exception as exc:
    if getattr(exc, "code", None) == "UNAVAILABLE":
        return ({"error": "Service unavailable"}, 503)
    raise  # permanent errors propagate to the caller
Configuration Errors:
import os
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
    raise RuntimeError("GOOGLE_CLOUD_PROJECT not set")
Retry Strategy
Manual retry with exponential backoff:
import random
import time

def is_retryable(error):
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes

def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = (2 ** attempt) + random.random()
            print(f"Attempt {attempt} failed, retrying in {delay}s")
            time.sleep(delay)

@functions_framework.http
def my_function(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        return ({"error": str(exc)}, 500)
Cloud Tasks for Guaranteed Delivery
For critical operations that must succeed, use Cloud Tasks:
import json
import os
import time

import functions_framework
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

@functions_framework.http
def enqueue_order(request):
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
    queue = "order-processing"
    location = "us-central1"
    body = json.dumps(request.get_json(silent=True) or {}).encode("utf-8")
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://YOUR-FUNCTION-URL/processOrder",
            "headers": {"Content-Type": "application/json"},
            "body": body,
        },
        "schedule_time": {"seconds": int(time.time())},
    }
    parent = client.queue_path(project, location, queue)
    response = client.create_task(request={"parent": parent, "task": task})
    return ({"taskName": response.name}, 200)
Cloud Tasks automatically retries failed tasks with exponential backoff; the maximum number of attempts, backoff bounds, and total retry duration are configurable on the queue.
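The retry policy can be tuned with gcloud; the values below are illustrative rather than defaults:

gcloud tasks queues update order-processing \
  --max-attempts=5 \
  --min-backoff=2s \
  --max-backoff=300s \
  --max-doublings=4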
Pub/Sub Retry Policy
For event-driven workflows, use Pub/Sub with dead-letter topics:
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("PROJECT_ID", "order-processing-sub")

def callback(message):
    try:
        process_order(message.data)
        message.ack()
    except Exception as exc:
        print("Error processing message:", exc)
        message.nack()

subscriber.subscribe(subscription_path, callback=callback)
Configure via gcloud:
gcloud pubsub subscriptions create order-processing-sub \
  --topic=orders \
  --dead-letter-topic=order-dlq \
  --max-delivery-attempts=5
Error Reporting (GCP Native)
Report errors to Cloud Error Reporting:
from google.cloud import error_reporting

client = error_reporting.Client()

@functions_framework.http
def my_function(request):
    try:
        risky_operation()
        return ("OK", 200)
    except Exception:
        client.report_exception()
        return ("Internal error", 500)
View aggregated errors and trends in Cloud Console.
AWS vs. GCP Error Handling
| Feature | AWS Lambda | Google Cloud Functions |
|---|---|---|
| Sync invocation failures | Return HTTP status code | Return HTTP status code |
| Async invocation failures | Automatic retry × 2 (configurable) | Retried by the event source (e.g., Pub/Sub redelivery until ack or dead-letter limit) |
| Failed event destination | DLQ (SQS/SNS) | Dead-letter topic (Pub/Sub) or GCS bucket |
| Retry backoff | Manual in code | Source handles (Pub/Sub exponential backoff) |
| Guaranteed delivery | SQS has built-in durability | Cloud Tasks, Pub/Sub |
| Error monitoring | CloudWatch alarms, X-Ray | Cloud Logging, Error Reporting, Cloud Alerting |
| Circuit breaker | Manual implementation | Manual implementation |
| Max function timeout | 15 minutes (900 seconds) | 540 seconds (1st gen); up to 60 minutes (2nd gen HTTP) |
| Max event age | Configurable (MaximumEventAgeInSeconds) | Configurable message retention per Pub/Sub subscription |
Key Differences
- Async retry behavior: AWS Lambda retries asynchronously-invoked functions automatically; GCP relies on the event source's retry policy
- Delivery guarantees: Cloud Tasks and SQS both provide at-least-once delivery; SQS FIFO queues add exactly-once processing
- DLQ routing: AWS uses SNS/SQS; GCP uses Pub/Sub topics or Cloud Storage
- Error visibility: CloudWatch requires you to set up alarms; Cloud Error Reporting aggregates errors automatically
Transient vs. Permanent Errors
Both platforms follow the same principle:
Transient (retry):
- Network timeout (408, 504)
- Service temporarily unavailable (429, 503)
- Deadline exceeded
- Temporary Firestore lock
Permanent (don't retry):
- Invalid input (400)
- Authentication failed (401)
- Authorization failed (403)
- Resource not found (404)
- Invalid message format (JSON parse error)
# Shared logic for both platforms
def should_retry(error):
    if getattr(error, "statusCode", None):
        retryable_codes = {408, 429, 500, 502, 503, 504}
        return error.statusCode in retryable_codes
    retryable_codes = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"}
    return getattr(error, "code", None) in retryable_codes
Circuit Breaker Pattern
Prevent hammering broken services:
# Works on both AWS and GCP
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout
        self.state = "CLOSED"
        self.next_attempt = time.time()

    def execute(self, fn):
        if self.state == "OPEN":
            if time.time() < self.next_attempt:
                raise RuntimeError("Circuit breaker is OPEN")
            self.state = "HALF_OPEN"
        try:
            result = fn()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = "OPEN"
            self.next_attempt = time.time() + self.timeout

breaker = CircuitBreaker()

def aws_handler(event, context):
    try:
        result = breaker.execute(lambda: ddb.get_item(**params))
        return {"statusCode": 200, "body": result}
    except Exception:
        return {"statusCode": 503, "body": "Service unavailable"}

@functions_framework.http
def gcp_handler(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = breaker.execute(lambda: firestore.collection("items").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception:
        return ({"error": "Service unavailable"}, 503)
Best Practices (Both Platforms)
- Fail fast on permanent errors — Retry only on transient failures
- Use exponential backoff — Base delay × 2^attempt with jitter
- Implement circuit breakers — Stop retrying broken services
- Route failures somewhere — DLQ (AWS), dead-letter topic (GCP)
- Monitor error rates — Alert on spikes
- Log with context — Request ID, user ID, attempt number
- Test error paths — Inject failures to verify recovery
- Document recovery — How to manually replay failed messages
Hands-On: Multi-Cloud Resilient API
AWS Lambda
import json
import os
import time

import boto3

ddb = boto3.client("dynamodb")

# should_retry() is the shared helper defined earlier in this chapter
def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)

def handler(event, context):
    try:
        result = with_retry(lambda: ddb.get_item(
            TableName=os.environ.get("TABLE_NAME"),
            Key={"id": {"S": event.get("id")}},
        ))
        return {"statusCode": 200, "body": json.dumps(result)}
    except Exception as exc:
        print("Final error:", exc)
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}
Deploy with DLQ:
sam deploy --template-file template.yaml
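For reference, a minimal template.yaml sketch that wires the handler above to a DLQ, reusing the EventInvokeConfig pattern from Part 1; the resource names, code path, handler path, and table name are placeholders:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  GetOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/              # placeholder code location
      Handler: app.handler       # the handler shown above
      Runtime: python3.12
      Environment:
        Variables:
          TABLE_NAME: orders
      Policies:
        - DynamoDBReadPolicy:
            TableName: orders
      EventInvokeConfig:
        MaximumRetryAttempts: 2
        DestinationConfig:
          OnFailure:
            Type: SQS
            Destination: !GetAtt FailedItems.Arn
  FailedItems:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 days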
Google Cloud
import time

import functions_framework
from google.cloud import firestore

db = firestore.Client()

# should_retry() is the shared helper defined earlier in this chapter
def with_retry(fn, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if not should_retry(exc) or attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)

@functions_framework.http
def get_order(request):
    try:
        payload = request.get_json(silent=True) or {}
        doc = with_retry(lambda: db.collection("orders").document(payload["id"]).get())
        return ({"data": doc.to_dict()}, 200)
    except Exception as exc:
        print("Final error:", exc)
        return ({"error": str(exc)}, 500)
Deploy:
gcloud functions deploy getOrder \
  --runtime python312 \
  --entry-point get_order \
  --trigger-http \
  --allow-unauthenticated
Key Takeaway
Resilient systems expect failures and plan for them. Retry transient errors with exponential backoff, fail fast on permanent errors, use circuit breakers to stop hammering broken dependencies, and never lose messages. Both AWS and GCP provide the primitives; your job is to use them correctly.