
Monitoring & Alarms: CloudWatch (AWS) & Cloud Monitoring (GCP)

Monitoring reveals system health in real time. While logging captures detailed events, metrics track trends: invocations, errors, latency, and resource usage. Both AWS and GCP provide dashboards and alerting to notify you when something goes wrong.


Simple Explanation

What it is

Monitoring turns raw activity into charts and alerts, so you can see if your system is healthy without reading every log line.

Why we need it

Logs tell you what happened. Monitoring tells you when something is going wrong right now so you can act fast.

Benefits

  • Early warning when errors or latency rise.
  • Clear trends over time for capacity planning.
  • Automated alerts instead of manual checking.

Tradeoffs

  • Alert fatigue if thresholds are noisy.
  • Setup time to build useful dashboards.

Real-world examples (architecture only)

  • Error rate > 5% -> Pager alert -> Rollback.
  • Duration spikes -> Investigate slow dependency.

Part 1: AWS CloudWatch Metrics & Alarms

CloudWatch Metrics

Lambda automatically publishes these metrics to CloudWatch:

  • Invocations: Total function executions
  • Duration: Execution time (milliseconds)
  • Errors: Failed invocations
  • Throttles: Request rejections due to concurrency limits
  • ConcurrentExecutions: Simultaneous running functions
  • UnreservedConcurrentExecutions: Account-level concurrency used by functions without reserved concurrency

View Metrics in Console

  1. Navigate to Lambda > Your Function
  2. Click the Monitor tab
  3. View graphs for Invocations, Errors, Duration, Throttles
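
These built-in metrics can also be pulled programmatically. A minimal boto3 sketch (the function name monitor-demo is just a placeholder) that sums Errors for one function over the last hour:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Sum of Errors for one function, in 5-minute buckets, over the last hour
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "monitor-demo"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])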

Custom Metrics

Publish application-specific metrics to CloudWatch:

import time
from datetime import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    start_time = time.time()

    try:
        result = process_order(event)
        duration_ms = int((time.time() - start_time) * 1000)

        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderProcessingTime",
                    "Value": duration_ms,
                    "Unit": "Milliseconds",
                    "Timestamp": datetime.utcnow(),
                },
                {
                    "MetricName": "OrderAmount",
                    "Value": event.get("amount", 0),
                    "Unit": "None",
                    "Timestamp": datetime.utcnow(),
                },
            ],
        )

        return result
    except Exception:
        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderErrors",
                    "Value": 1,
                    "Unit": "Count",
                }
            ],
        )
        raise

CloudWatch Dashboards

Create custom dashboards via SAM:

MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: OrderProcessingDashboard
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", {"stat": "Sum"}],
                ["AWS/Lambda", "Errors", {"stat": "Sum"}],
                ["AWS/Lambda", "Duration", {"stat": "Average"}],
                ["MyApp/Orders", "OrderProcessingTime", {"stat": "Average"}]
              ],
              "period": 300,
              "stat": "Average",
              "region": "${AWS::Region}",
              "title": "Lambda Performance"
            }
          }
        ]
      }

CloudWatch Alarms

Create alarms to trigger notifications:

HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighErrorRate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300 # 5 minutes
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic

HighDurationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighDuration
    MetricName: Duration
    Namespace: AWS/Lambda
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 10000 # 10 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref WarningTopic

ThrottlingAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Lambda-Throttling-Alert
    MetricName: Throttles
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic

Metric Math

Combine metrics into derived metrics:

ErrorRateMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderErrorRate
    Metrics:
      - Id: m1
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Errors
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: m2
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Invocations
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: error_rate
        Expression: (m1/m2)*100
        ReturnData: true   # only the evaluated expression may return data
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
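
The same expression can be previewed ad hoc before it is wired into an alarm. A minimal boto3 sketch using get_metric_data with the error-rate expression above:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Evaluate (errors / invocations) * 100 over the last 3 hours without creating an alarm
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors"},
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations"},
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {"Id": "error_rate", "Expression": "(m1/m2)*100"},
    ],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)
print(response["MetricDataResults"][0]["Values"])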

Anomaly Detection

Automatically detect abnormal patterns:

DurationAnomalyDetector:
  Type: AWS::CloudWatch::AnomalyDetector
  Properties:
    MetricName: Duration
    Namespace: AWS/Lambda
    Stat: Average

DurationAnomalyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: DurationAnomaly
    Metrics:
      - Id: m1
        ReturnData: true
        MetricStat:
          Metric:
            MetricName: Duration
            Namespace: AWS/Lambda
          Period: 300
          Stat: Average
      - Id: ad1
        Expression: ANOMALY_DETECTION_BAND(m1, 2) # 2 standard deviations
        ReturnData: true
    EvaluationPeriods: 1
    ThresholdMetricId: ad1
    ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold

SNS Notifications

Route alarms to email, SMS, or Slack:

CriticalAlertTopic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: CriticalAlerts
    Subscription:
      - Endpoint: [email protected]
        Protocol: email
      - Endpoint: "+1234567890"
        Protocol: sms

SlackNotification:
  Type: AWS::CloudFormation::CustomResource
  Properties:
    ServiceToken: !GetAtt SlackBridgeLambda.Arn
    TopicArn: !Ref CriticalAlertTopic
    SlackWebhookUrl: !Sub '{{resolve:secretsmanager:slack-webhook-url:SecretString:url}}'
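
The SlackBridgeLambda referenced above is assumed rather than defined here. A minimal sketch of such a bridge: a Lambda subscribed to the SNS topic that reposts each alarm notification to a Slack webhook (the webhook URL is assumed to arrive via a SLACK_WEBHOOK_URL environment variable):

import json
import os
import urllib.request


def handler(event, context):
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration
    for record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string in the SNS message
        message = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: {message.get('AlarmName', 'Unknown alarm')} is "
            f"{message.get('NewStateValue', 'UNKNOWN')}: "
            f"{message.get('NewStateReason', '')}"
        )
        req = urllib.request.Request(
            webhook_url,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
    return {"statusCode": 200}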

Part 2: Google Cloud Monitoring & Cloud Alerting

Cloud Monitoring Metrics

Google Cloud automatically collects these metrics for Cloud Functions:

  • execution_count: Total invocations, labeled by status (ok, error, timeout)
  • execution_times: Execution latency distribution (nanoseconds)
  • user_memory_bytes: Memory usage distribution
  • active_instances: Number of active function instances
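
These built-in metrics can be read back through the Monitoring API. A minimal sketch with the google-cloud-monitoring client (PROJECT_ID is a placeholder) that lists execution counts over the last hour:

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/PROJECT_ID"  # replace with your project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Execution counts for all Cloud Functions over the last hour
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type="cloudfunctions.googleapis.com/function/execution_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels.get("function_name"), len(series.points))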

Custom Metrics

Send custom metrics using the Monitoring API:

import os
import time

import functions_framework
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
project_name = f"projects/{project_id}"


def write_point(metric_type, value):
    """Write a single data point for a custom metric."""
    series = monitoring_v3.TimeSeries()
    series.metric.type = metric_type
    # The cloud_function monitored resource expects project_id, region, function_name labels
    series.resource.type = "cloud_function"
    series.resource.labels["function_name"] = "processOrder"
    series.resource.labels["region"] = os.environ.get("FUNCTION_REGION", "us-central1")
    series.resource.labels["project_id"] = project_id

    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
    )
    point = monitoring_v3.Point({"interval": interval, "value": {"double_value": value}})
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])


@functions_framework.http
def process_order(request):
    start_time = time.time()
    payload = request.get_json(silent=True) or {}

    try:
        result = process_order_logic(payload)
        duration_ms = (time.time() - start_time) * 1000
        write_point("custom.googleapis.com/order_processing_time", duration_ms)
        return ({"success": True, "result": result}, 200)
    except Exception as exc:
        write_point("custom.googleapis.com/order_errors", 1)
        return ({"error": str(exc)}, 500)

Cloud Monitoring Dashboards

Create dashboards in the Cloud Console, with Terraform, or programmatically with the dashboards client library:

import os

from google.cloud import monitoring_dashboard_v1

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")

dashboard = {
    'display_name': 'Order Processing Dashboard',
    'grid_layout': {
        'widgets': [
            {
                'title': 'Execution Count',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': (
                                        'metric.type="cloudfunctions.googleapis.com/function/execution_count" '
                                        'AND resource.type="cloud_function" '
                                        'AND resource.labels.function_name="processOrder"'
                                    ),
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_RATE'
                                    }
                                }
                            }
                        }
                    ]
                }
            },
            {
                'title': 'Execution Time (P99)',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': 'metric.type="cloudfunctions.googleapis.com/function/execution_times"',
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_PERCENTILE_99'
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

client = monitoring_dashboard_v1.DashboardsServiceClient()
dashboard = client.create_dashboard(
    request={'parent': f'projects/{PROJECT_ID}', 'dashboard': dashboard}
)

Cloud Alerting Policies

Create alert policies via Cloud Console or code:

import os

from google.cloud import monitoring_v3

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")

client = monitoring_v3.AlertPolicyServiceClient()
project = f'projects/{PROJECT_ID}'

# Alert on high error rate (webhook_channel_id comes from the
# notification-channel setup shown in the next section)
policy = {
    'display_name': 'High Error Rate - Order Processing',
    'combiner': monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    'conditions': [
        {
            'display_name': 'Errors per minute > 5',
            'condition_threshold': {
                'filter': (
                    'resource.type="cloud_function" '
                    'AND resource.labels.function_name="processOrder" '
                    'AND metric.type="cloudfunctions.googleapis.com/function/execution_count" '
                    'AND metric.labels.status != "ok"'
                ),
                'comparison': monitoring_v3.ComparisonType.COMPARISON_GT,
                'threshold_value': 5,
                'duration': {'seconds': 300},
                'aggregations': [
                    {
                        'alignment_period': {'seconds': 60},
                        'per_series_aligner': monitoring_v3.Aggregation.Aligner.ALIGN_RATE
                    }
                ]
            }
        }
    ],
    'notification_channels': [webhook_channel_id],
    'alert_strategy': {
        'auto_close': {'duration': {'seconds': 1800}}  # Auto-close after 30 min
    }
}

alert_policy = client.create_alert_policy(
    name=project,
    alert_policy=policy
)

Notification Channels

Route alerts to email, SMS, Slack, PagerDuty:

channels_client = monitoring_v3.NotificationChannelServiceClient()
project = f'projects/{PROJECT_ID}'

# Email notification
email_channel = {
    'type': 'email',
    'display_name': 'Team Email',
    'labels': {
        'email_address': '[email protected]'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=email_channel
)

# Slack channel (the Slack workspace typically must be authorized for
# Cloud Monitoring in the Cloud Console before channels can be created)
slack_channel = {
    'type': 'slack',
    'display_name': 'Slack Alerts',
    'labels': {
        'channel_name': '#incidents'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=slack_channel
)

Uptime Checks

Monitor Cloud Function endpoints:

uptime_check = {
    'display_name': 'API Health Check',
    'monitored_resource': {
        'type': 'uptime_url',
        'labels': {
            'host': 'api.example.com'
        }
    },
    'http_check': {
        'path': '/health',
        'port': 443,
        'request_method': 'GET',
        'use_ssl': True,
        'validate_ssl': True
    },
    'period': {'seconds': 60},
    'timeout': {'seconds': 10},
    'selected_regions': ['USA', 'EUROPE', 'ASIA_PACIFIC']
}

uptime_client = monitoring_v3.UptimeCheckServiceClient()
uptime_client.create_uptime_check_config(
    parent=f'projects/{PROJECT_ID}',
    uptime_check_config=uptime_check
)

CloudWatch (AWS) vs. Cloud Monitoring (GCP)

| Feature | CloudWatch | Cloud Monitoring |
| --- | --- | --- |
| Default metrics | Invocations, Errors, Duration, Throttles | execution_count, execution_times, active_instances |
| Metric retention | Up to 15 months (at decreasing resolution) | 6 weeks at full resolution |
| Dashboard setup | Console or CloudFormation | Console or Terraform |
| Custom metrics | PutMetricData API (e.g. boto3 put_metric_data) | google-cloud-monitoring client library |
| Query language | CloudWatch Metrics Math | Monitoring Query Language (MQL) |
| Alarms | Threshold-based, anomaly detection | Threshold-based, multi-condition |
| Alerting channels | SNS (email, SMS, SQS, Lambda) | Email, SMS, Slack, PagerDuty, webhook |
| Log-based metrics | CloudWatch Logs metric filters | Cloud Logging → Monitoring |
| Pricing | $0.30 per custom metric per month | Built-in metrics free; custom metrics billed by ingestion |
| Uptime monitoring | Route 53 Health Checks | Cloud Monitoring Uptime Checks |

Key Differences

  • Retention: CloudWatch keeps metrics for up to 15 months at decreasing resolution; Cloud Monitoring keeps about 6 weeks at full resolution by default
  • Pricing: CloudWatch charges per custom metric; Cloud Monitoring's built-in metrics are free (custom metrics are billed by ingested volume)
  • Query power: CloudWatch Math is simpler; Cloud Monitoring uses MQL for complex queries
  • Default coverage: Both auto-collect function metrics, but different names/labels

Best Practices (Both Platforms)

  1. Alert on impact, not noise — High error rate matters; single error usually doesn't
  2. Use multiple evaluation periods — Require 2-3 periods above threshold before alarming
  3. Set up escalation — Important alerts → immediate notification; warnings → next business day
  4. Document runbooks — Each alert should link to "What do I do?" documentation
  5. Test alerts in staging — Trigger alarms before relying on them in production (see the sketch after this list)
  6. Optimize metric ingestion — High-cardinality labels (user IDs) explode costs
  7. Correlate logs and metrics — Use trace IDs to connect log entries to metric spikes
  8. Monitor the monitoring — If dashboards are broken, you're flying blind
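
For practice 5, CloudWatch can force an alarm into the ALARM state so the notification path is exercised without waiting for real errors; a minimal boto3 sketch using the alarm name from the earlier template is below. On the GCP side, writing test data points to the underlying metric serves the same purpose.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Force the alarm into ALARM so the SNS -> email/Slack path can be verified
# end to end; the alarm returns to its real state on the next evaluation.
cloudwatch.set_alarm_state(
    AlarmName="OrderProcessing-HighErrorRate",
    StateValue="ALARM",
    StateReason="Manual test of notification routing",
)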

Hands-On: Multi-Cloud Monitoring

AWS CloudWatch

  1. Deploy a function with custom metrics:
aws lambda create-function \
  --function-name monitor-demo \
  --runtime python3.12 \
  --handler lambda_function.handler \
  --role <execution-role-arn> \
  --zip-file fileb://function.zip
  2. Create a dashboard:
aws cloudwatch put-dashboard \
  --dashboard-name OrderProcessing \
  --dashboard-body file://dashboard.json
  3. Create an alarm:
aws cloudwatch put-metric-alarm \
  --alarm-name HighErrorRate \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanOrEqualToThreshold

Google Cloud

  1. Deploy the function:
gcloud functions deploy monitor-demo \
  --runtime python312 \
  --trigger-http \
  --allow-unauthenticated
  2. Create a dashboard:
gcloud monitoring dashboards create --config-from-file=dashboard.yaml
  3. Create an alert policy:
gcloud alpha monitoring policies create \
  --notification-channels=$CHANNEL_ID \
  --display-name="High Error Rate" \
  --condition-display-name="Errors > 5/min" \
  --condition-filter='metric.type="cloudfunctions.googleapis.com/function/execution_count" AND resource.type="cloud_function"' \
  --if='> 5'

Key Takeaway

Good monitoring lets you detect problems before customers do. Metrics reveal trends, dashboards provide overview, and alerts wake you when action is needed. Both platforms automate the collection—your job is choosing what to watch and when to sound the alarm.