Monitoring & Alarms: CloudWatch (AWS) & Cloud Monitoring (GCP)
Monitoring reveals system health in real time. While logging captures detailed events, metrics track trends: invocations, errors, latency, and resource usage. Both AWS and GCP provide dashboards and alerting to notify you when something goes wrong.
Simple Explanation
What it is
Monitoring turns raw activity into charts and alerts, so you can see if your system is healthy without reading every log line.
Why we need it
Logs tell you what happened. Monitoring tells you when something is going wrong right now so you can act fast.
Benefits
- Early warning when errors or latency rise.
- Clear trends over time for capacity planning.
- Automated alerts instead of manual checking.
Tradeoffs
- Alert fatigue if thresholds are noisy.
- Setup time to build useful dashboards.
Real-world examples (architecture only)
- Error rate > 5% -> Pager alert -> Rollback.
- Duration spikes -> Investigate slow dependency.
Part 1: AWS CloudWatch Metrics & Alarms
CloudWatch Metrics
Lambda automatically publishes these metrics to CloudWatch:
- Invocations: Total function executions
- Duration: Execution time (milliseconds)
- Errors: Failed invocations
- Throttles: Request rejections due to concurrency limits
- ConcurrentExecutions: Simultaneous running functions
- UnreservedConcurrentExecutions: Concurrency used by functions that do not have reserved concurrency
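These built-in metrics can also be read programmatically. A minimal boto3 sketch that pulls the last hour of error counts (the function name my-function is a placeholder):

# Read the last hour of Lambda error counts for one function.
# "my-function" is a placeholder; substitute your function name.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])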
View Metrics in Console
- Navigate to Lambda > Your Function
- Click the Monitor tab
- View graphs for Invocations, Errors, Duration, Throttles
Custom Metrics
Publish application-specific metrics to CloudWatch:
import time
from datetime import datetime
import boto3
cloudwatch = boto3.client("cloudwatch")
def handler(event, context):
    start_time = time.time()
    try:
        result = process_order(event)
        duration_ms = int((time.time() - start_time) * 1000)
        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderProcessingTime",
                    "Value": duration_ms,
                    "Unit": "Milliseconds",
                    "Timestamp": datetime.utcnow(),
                },
                {
                    "MetricName": "OrderAmount",
                    "Value": event.get("amount", 0),
                    "Unit": "None",
                    "Timestamp": datetime.utcnow(),
                },
            ],
        )
        return result
    except Exception:
        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderErrors",
                    "Value": 1,
                    "Unit": "Count",
                }
            ],
        )
        raise
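Calling put_metric_data inside the handler adds a synchronous API call to every invocation. An alternative worth knowing is the CloudWatch Embedded Metric Format (EMF), where the function prints a structured log line and CloudWatch Logs turns it into metrics asynchronously. A minimal sketch; the namespace and the Service dimension are illustrative:

import json
import time

def emit_order_metric(name, value, unit="Milliseconds"):
    # Embedded Metric Format: CloudWatch Logs extracts the metric from
    # this JSON log line, so no PutMetricData call is needed in the handler.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Orders",
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": name, "Unit": unit}],
            }],
        },
        "Service": "order-processing",  # dimension value
        name: value,                    # metric value
    }))

# Example: emit_order_metric("OrderProcessingTime", 123)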
CloudWatch Dashboards
Create custom dashboards via SAM:
MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: OrderProcessingDashboard
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", {"stat": "Sum"}],
                ["AWS/Lambda", "Errors", {"stat": "Sum"}],
                ["AWS/Lambda", "Duration", {"stat": "Average"}],
                ["MyApp/Orders", "OrderProcessingTime", {"stat": "Average"}]
              ],
              "period": 300,
              "stat": "Average",
              "region": "${AWS::Region}",
              "title": "Lambda Performance"
            }
          }
        ]
      }
CloudWatch Alarms
Create alarms to trigger notifications:
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighErrorRate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300  # 5 minutes
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic

HighDurationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighDuration
    MetricName: Duration
    Namespace: AWS/Lambda
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 10000  # 10 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref WarningTopic

ThrottlingAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Lambda-Throttling-Alert
    MetricName: Throttles
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic
Metric Math
Combine raw metrics into derived values, such as an error-rate percentage:
ErrorRateMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderErrorRate
    Metrics:
      - Id: m1
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Errors
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: m2
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Invocations
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: error_rate
        Expression: (m1/m2)*100
        ReturnData: true
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
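The same error-rate alarm can also be created outside a template with boto3's put_metric_alarm and its Metrics parameter; a minimal sketch (the SNS alarm action is left as a commented-out placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="OrderErrorRate",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=5,
    EvaluationPeriods=1,
    # AlarmActions=["arn:aws:sns:..."],  # placeholder topic ARN
    Metrics=[
        {
            "Id": "m1",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors"},
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "m2",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations"},
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "error_rate",
            "Expression": "(m1/m2)*100",
            "Label": "Lambda error rate (%)",
            "ReturnData": True,
        },
    ],
)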
Anomaly Detection
Automatically detect abnormal patterns:
DurationAnomalyDetector:
  Type: AWS::CloudWatch::AnomalyDetector
  Properties:
    MetricName: Duration
    Namespace: AWS/Lambda
    Stat: Average

DurationAnomalyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: DurationAnomaly
    Metrics:
      - Id: m1
        ReturnData: true
        MetricStat:
          Metric:
            MetricName: Duration
            Namespace: AWS/Lambda
          Period: 300
          Stat: Average
      - Id: ad1
        Expression: ANOMALY_DETECTION_BAND(m1, 2)  # band of 2 standard deviations
        ReturnData: true
    EvaluationPeriods: 1
    ThresholdMetricId: ad1
    ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
SNS Notifications
Route alarms to email, SMS, or Slack:
CriticalAlertTopic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: CriticalAlerts
    Subscription:
      - Endpoint: [email protected]
        Protocol: email
      - Endpoint: "+1234567890"
        Protocol: sms

SlackNotification:
  Type: AWS::CloudFormation::CustomResource
  Properties:
    ServiceToken: !GetAtt SlackBridgeLambda.Arn
    TopicArn: !Ref CriticalAlertTopic
    SlackWebhookUrl: !Sub '{{resolve:secretsmanager:slack-webhook-url:SecretString:url}}'
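The template above only references a SlackBridgeLambda; the forwarding itself is typically done by a small function subscribed to the alert topic that reposts each alarm message to the Slack webhook. A minimal, hypothetical sketch (the SLACK_WEBHOOK_URL environment variable and the subscription wiring are assumptions, not shown in the template):

import json
import os
import urllib.request

# Assumed to be injected via the function's environment (e.g. from Secrets Manager).
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    # SNS delivers one or more records; each Message is the CloudWatch alarm JSON.
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: {alarm.get('AlarmName', 'unknown alarm')} is "
            f"{alarm.get('NewStateValue', '?')}: {alarm.get('NewStateReason', '')}"
        )
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)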
Part 2: Google Cloud Monitoring & Cloud Alerting
Cloud Monitoring Metrics
Google Cloud automatically collects Cloud Functions metrics under the cloudfunctions.googleapis.com/function/ prefix:
- execution_count: Total invocations, with a status label distinguishing ok from error
- execution_times: Execution latency distribution
- user_memory_bytes: Memory usage distribution
- active_instances: Number of active function instances
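These built-in metrics can be queried through the Monitoring API. A minimal sketch that lists execution counts for the last hour (PROJECT_ID is a placeholder):

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/PROJECT_ID"  # placeholder project

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type="cloudfunctions.googleapis.com/function/execution_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels.get("function_name"), len(series.points))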
Custom Metrics
Send custom metrics using the Monitoring API:
import time
import os
import functions_framework
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
project_name = f"projects/{project_id}"
@functions_framework.http
def process_order(request):
    start_time = time.time()
    payload = request.get_json(silent=True) or {}
    try:
        result = process_order_logic(payload)
        duration_ms = (time.time() - start_time) * 1000

        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/order_processing_time"
        series.resource.type = "cloud_function"
        series.resource.labels["function_name"] = "processOrder"
        series.resource.labels["project_id"] = project_id
        series.resource.labels["region"] = "us-central1"  # match your deployment region
        interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
        point = monitoring_v3.Point({"interval": interval, "value": {"double_value": duration_ms}})
        series.points = [point]
        client.create_time_series(name=project_name, time_series=[series])

        return ({"success": True, "result": result}, 200)
    except Exception as exc:
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/order_errors"
        series.resource.type = "cloud_function"
        series.resource.labels["function_name"] = "processOrder"
        series.resource.labels["project_id"] = project_id
        series.resource.labels["region"] = "us-central1"
        interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
        point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})
        series.points = [point]
        client.create_time_series(name=project_name, time_series=[series])
        return ({"error": str(exc)}, 500)
Cloud Monitoring Dashboards
Create dashboards in the Cloud Console, with Terraform, or programmatically through the dashboards API:
from google.cloud import monitoring_dashboard_v1

dashboard = {
    'display_name': 'Order Processing Dashboard',
    'grid_layout': {
        'widgets': [
            {
                'title': 'Execution Count',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': 'metric.type="cloudfunctions.googleapis.com/function/execution_count" AND resource.type="cloud_function" AND resource.labels.function_name="processOrder"',
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_RATE'
                                    }
                                }
                            }
                        }
                    ]
                }
            },
            {
                'title': 'Execution Time (P99)',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': 'metric.type="cloudfunctions.googleapis.com/function/execution_times"',
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_PERCENTILE_99'
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

client = monitoring_dashboard_v1.DashboardsServiceClient()
dashboard = client.create_dashboard(
    request={'parent': f'projects/{PROJECT_ID}', 'dashboard': dashboard}
)
Cloud Alerting Policies
Create alert policies via Cloud Console or code:
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project = f'projects/{PROJECT_ID}'

# Alert on high error rate
policy = {
    'display_name': 'High Error Rate - Order Processing',
    'combiner': 'OR',
    'conditions': [
        {
            'display_name': 'Errors per minute > 5',
            'condition_threshold': {
                'filter': (
                    'resource.type="cloud_function" '
                    'AND resource.labels.function_name="processOrder" '
                    'AND metric.type="cloudfunctions.googleapis.com/function/execution_count" '
                    'AND metric.labels.status!="ok"'
                ),
                'comparison': monitoring_v3.ComparisonType.COMPARISON_GT,
                'threshold_value': 5,
                'duration': {'seconds': 300},
                'aggregations': [
                    {
                        'alignment_period': {'seconds': 60},
                        'per_series_aligner': monitoring_v3.Aggregation.Aligner.ALIGN_RATE
                    }
                ]
            }
        }
    ],
    # Channel IDs come from the Notification Channels section below
    'notification_channels': [webhook_channel_id],
    'alert_strategy': {
        'auto_close': {'duration': {'seconds': 1800}}  # Auto-close after 30 min
    }
}

alert_policy = client.create_alert_policy(
    name=project,
    alert_policy=policy
)
Notification Channels
Route alerts to email, SMS, Slack, PagerDuty:
channels_client = monitoring_v3.NotificationChannelServiceClient()
project = f'projects/{PROJECT_ID}'

# Email notification
email_channel = {
    'type': 'email',
    'display_name': 'Team Email',
    'labels': {
        'email_address': '[email protected]'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=email_channel
)

# Slack webhook (the Slack workspace is typically authorized for the project
# first via the Cloud Console notification channel setup)
slack_channel = {
    'type': 'slack',
    'display_name': 'Slack Alerts',
    'labels': {
        'channel_name': '#incidents'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=slack_channel
)
Uptime Checks
Monitor Cloud Function endpoints:
uptime_check = {
    'display_name': 'API Health Check',
    'monitored_resource': {
        'type': 'uptime_url',
        'labels': {
            'host': 'api.example.com'
        }
    },
    'http_check': {
        'path': '/health',
        'port': 443,
        'request_method': 'GET',
        'use_ssl': True,
        'validate_ssl': True
    },
    'period': {'seconds': 60},
    'timeout': {'seconds': 10},
    'selected_regions': ['USA', 'EUROPE', 'ASIA_PACIFIC']
}

uptime_client = monitoring_v3.UptimeCheckServiceClient()
uptime_client.create_uptime_check_config(
    request={
        'parent': f'projects/{PROJECT_ID}',
        'uptime_check_config': uptime_check
    }
)
CloudWatch (AWS) vs. Cloud Monitoring (GCP)
| Feature | CloudWatch | Cloud Monitoring |
|---|---|---|
| Default metrics | Invocations, Errors, Duration, Throttles | execution_count, execution_times, user_memory_bytes |
| Metric retention | 15 months (progressively downsampled) | 24 months for most metric types |
| Dashboard setup | Via Console or CloudFormation | Via Console or Terraform |
| Custom metrics | PutMetricData API (boto3 put_metric_data) | Cloud Monitoring API (google-cloud-monitoring) |
| Query language | CloudWatch Metrics Math | MQL (Monitoring Query Language) and PromQL |
| Alarms | Threshold-based, anomaly detection | Threshold-based, multi-condition |
| Alerting channels | SNS (email, SMS, SQS, Lambda) | Email, SMS, Slack, PagerDuty, webhook |
| Log-based metrics | CloudWatch Logs metric filters | Cloud Logging log-based metrics |
| Pricing | $0.30 per custom metric per month (first 10,000) | Built-in metrics free; custom metrics billed by ingested volume |
| Uptime monitoring | Route 53 health checks, CloudWatch Synthetics | Cloud Monitoring Uptime Checks |
Key Differences
- Retention: CloudWatch keeps up to 15 months of metrics (progressively downsampled); Cloud Monitoring retains most metric types for 24 months
- Pricing: CloudWatch charges per custom metric; Cloud Monitoring is free for built-in GCP metrics and bills custom metrics by ingested volume
- Query power: CloudWatch Metrics Math is simpler; Cloud Monitoring offers MQL and PromQL for complex queries
- Default coverage: Both auto-collect function metrics, but different names/labels
Best Practices (Both Platforms)
- Alert on impact, not noise — High error rate matters; single error usually doesn't
- Use multiple evaluation periods — Require 2-3 periods above threshold before alarming
- Set up escalation — Important alerts → immediate notification; warnings → next business day
- Document runbooks — Each alert should link to "What do I do?" documentation
- Test alerts in staging — Trigger alarms before relying on them in production
- Optimize metric ingestion — High-cardinality labels (user IDs) explode costs
- Correlate logs and metrics — Use trace IDs to connect log entries to metric spikes
- Monitor the monitoring — If dashboards are broken, you're flying blind
Hands-On: Multi-Cloud Monitoring
AWS CloudWatch
- Deploy function with custom metrics:
aws lambda create-function \
--function-name monitor-demo \
--runtime python3.12 \
--handler lambda_function.handler \
--role <lambda-execution-role-arn> \
--zip-file fileb://function.zip
- Create dashboard:
aws cloudwatch put-dashboard \
--dashboard-name OrderProcessing \
--dashboard-body file://dashboard.json
- Create alarm:
aws cloudwatch put-metric-alarm \
--alarm-name HighErrorRate \
--metric-name Errors \
--namespace AWS/Lambda \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 5 \
--comparison-operator GreaterThanOrEqualToThreshold
Google Cloud
- Deploy function:
gcloud functions deploy monitor-demo \
--runtime python312 \
--trigger-http \
--entry-point process_order \
--allow-unauthenticated
- Create dashboard:
gcloud monitoring dashboards create --config-from-file=dashboard.yaml
- Create alert policy:
gcloud alpha monitoring policies create \
--notification-channels=$CHANNEL_ID \
--display-name="High Error Rate" \
--combiner=OR \
--condition-display-name="Errors > 5/min" \
--condition-filter='resource.type="cloud_function" AND metric.type="cloudfunctions.googleapis.com/function/execution_count" AND metric.labels.status!="ok"' \
--if="> 5" \
--duration=300s
Key Takeaway
Good monitoring lets you detect problems before customers do. Metrics reveal trends, dashboards provide overview, and alerts wake you when action is needed. Both platforms automate the collection—your job is choosing what to watch and when to sound the alarm.