Monitoring & Alarms: CloudWatch (AWS) & Cloud Monitoring (GCP)
Monitoring reveals system health in real time. While logging captures detailed events, metrics track trends: invocations, errors, latency, and resource usage. Both AWS and GCP provide dashboards and alerting to notify you when something goes wrong.
Simple Explanation
What it is
Monitoring turns raw activity into charts and alerts, so you can see if your system is healthy without reading every log line.
Why we need it
Logs tell you what happened. Monitoring tells you when something is going wrong right now so you can act fast.
Benefits
- Early warning when errors or latency rise.
- Clear trends over time for capacity planning.
- Automated alerts instead of manual checking.
Tradeoffs
- Alert fatigue if thresholds are noisy.
- Setup time to build useful dashboards.
Real-world examples (architecture only)
- Error rate > 5% -> Pager alert -> Rollback.
- Duration spikes -> Investigate slow dependency.
Part 1: AWS CloudWatch Metrics & Alarms
CloudWatch Metrics
Lambda automatically publishes these metrics to CloudWatch:
- Invocations: Total function executions
- Duration: Execution time (milliseconds)
- Errors: Failed invocations
- Throttles: Request rejections due to concurrency limits
- ConcurrentExecutions: Simultaneous running functions
- UnreservedConcurrentExecutions: Concurrency used by functions that do not have reserved concurrency
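These built-in metrics can also be read programmatically. A minimal boto3 sketch that pulls the last hour of error counts (the function name my-function is a placeholder):

# Read the last hour of Lambda error counts for one function.
# "my-function" is a placeholder; substitute your function name.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])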
View Metrics in Console
- Navigate to Lambda > Your Function
- Click the Monitor tab
- View graphs for Invocations, Errors, Duration, Throttles
Custom Metrics
Publish application-specific metrics to CloudWatch:
import time
from datetime import datetime
import boto3
cloudwatch = boto3.client("cloudwatch")
def handler(event, context):
    start_time = time.time()
    try:
        result = process_order(event)
        duration_ms = int((time.time() - start_time) * 1000)
        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderProcessingTime",
                    "Value": duration_ms,
                    "Unit": "Milliseconds",
                    "Timestamp": datetime.utcnow(),
                },
                {
                    "MetricName": "OrderAmount",
                    "Value": event.get("amount", 0),
                    "Unit": "None",
                    "Timestamp": datetime.utcnow(),
                },
            ],
        )
        return result
    except Exception:
        cloudwatch.put_metric_data(
            Namespace="MyApp/Orders",
            MetricData=[
                {
                    "MetricName": "OrderErrors",
                    "Value": 1,
                    "Unit": "Count",
                }
            ],
        )
        raise
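Calling put_metric_data inside the handler adds a synchronous API call to every invocation. An alternative worth knowing is the CloudWatch Embedded Metric Format (EMF), where the function prints a structured log line and CloudWatch Logs turns it into metrics asynchronously. A minimal sketch; the namespace and the Service dimension are illustrative:

import json
import time

def emit_order_metric(name, value, unit="Milliseconds"):
    # Embedded Metric Format: CloudWatch Logs extracts the metric from
    # this JSON log line, so no PutMetricData call is needed in the handler.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "MyApp/Orders",
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": name, "Unit": unit}],
            }],
        },
        "Service": "order-processing",  # dimension value
        name: value,                    # metric value
    }))

# Example: emit_order_metric("OrderProcessingTime", 123)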
CloudWatch Dashboards
Create custom dashboards via SAM:
MonitoringDashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: OrderProcessingDashboard
    DashboardBody: !Sub |
      {
        "widgets": [
          {
            "type": "metric",
            "properties": {
              "metrics": [
                ["AWS/Lambda", "Invocations", {"stat": "Sum"}],
                ["AWS/Lambda", "Errors", {"stat": "Sum"}],
                ["AWS/Lambda", "Duration", {"stat": "Average"}],
                ["MyApp/Orders", "OrderProcessingTime", {"stat": "Average"}]
              ],
              "period": 300,
              "stat": "Average",
              "region": "${AWS::Region}",
              "title": "Lambda Performance"
            }
          }
        ]
      }
CloudWatch Alarms
Create alarms to trigger notifications:
HighErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighErrorRate
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300  # 5 minutes
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic

HighDurationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderProcessing-HighDuration
    MetricName: Duration
    Namespace: AWS/Lambda
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 10000  # 10 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref WarningTopic

ThrottlingAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: Lambda-Throttling-Alert
    MetricName: Throttles
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref CriticalAlertTopic
Metric Math
Combine raw metrics into derived values, such as an error-rate percentage:
ErrorRateMetric:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: OrderErrorRate
    Metrics:
      - Id: m1
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Errors
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: m2
        ReturnData: false
        MetricStat:
          Metric:
            MetricName: Invocations
            Namespace: AWS/Lambda
          Period: 300
          Stat: Sum
      - Id: error_rate
        Expression: (m1/m2)*100
        ReturnData: true
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
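The same error-rate alarm can also be created outside a template with boto3's put_metric_alarm and its Metrics parameter; a minimal sketch (the SNS alarm action is left as a commented-out placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="OrderErrorRate",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=5,
    EvaluationPeriods=1,
    # AlarmActions=["arn:aws:sns:..."],  # placeholder topic ARN
    Metrics=[
        {
            "Id": "m1",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Errors"},
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "m2",
            "ReturnData": False,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": "Invocations"},
                "Period": 300,
                "Stat": "Sum",
            },
        },
        {
            "Id": "error_rate",
            "Expression": "(m1/m2)*100",
            "Label": "Lambda error rate (%)",
            "ReturnData": True,
        },
    ],
)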
Anomaly Detection
Automatically detect abnormal patterns:
DurationAnomalyDetector:
  Type: AWS::CloudWatch::AnomalyDetector
  Properties:
    MetricName: Duration
    Namespace: AWS/Lambda
    Stat: Average

DurationAnomalyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: DurationAnomaly
    Metrics:
      - Id: m1
        ReturnData: true
        MetricStat:
          Metric:
            MetricName: Duration
            Namespace: AWS/Lambda
          Period: 300
          Stat: Average
      - Id: ad1
        Expression: ANOMALY_DETECTION_BAND(m1, 2)  # band of 2 standard deviations
        ReturnData: true
    EvaluationPeriods: 1
    ThresholdMetricId: ad1
    ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
SNS Notifications
Route alarms to email, SMS, or Slack:
CriticalAlertTopic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: CriticalAlerts
    Subscription:
      - Endpoint: [email protected]
        Protocol: email
      - Endpoint: "+1234567890"
        Protocol: sms

SlackNotification:
  Type: AWS::CloudFormation::CustomResource
  Properties:
    ServiceToken: !GetAtt SlackBridgeLambda.Arn
    TopicArn: !Ref CriticalAlertTopic
    SlackWebhookUrl: !Sub '{{resolve:secretsmanager:slack-webhook-url:SecretString:url}}'
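The template above only references a SlackBridgeLambda; the forwarding itself is typically done by a small function subscribed to the alert topic that reposts each alarm message to the Slack webhook. A minimal, hypothetical sketch (the SLACK_WEBHOOK_URL environment variable and the subscription wiring are assumptions, not shown in the template):

import json
import os
import urllib.request

# Assumed to be injected via the function's environment (e.g. from Secrets Manager).
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def handler(event, context):
    # SNS delivers one or more records; each Message is the CloudWatch alarm JSON.
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        text = (
            f":rotating_light: {alarm.get('AlarmName', 'unknown alarm')} is "
            f"{alarm.get('NewStateValue', '?')}: {alarm.get('NewStateReason', '')}"
        )
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)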
Part 2: Google Cloud Monitoring & Cloud Alerting
Cloud Monitoring Metrics
Google Cloud automatically collects Cloud Functions metrics under the cloudfunctions.googleapis.com/function/ prefix:
- execution_count: Total invocations, with a status label distinguishing ok from error
- execution_times: Execution latency distribution
- user_memory_bytes: Memory usage distribution
- active_instances: Number of active function instances
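These built-in metrics can be queried through the Monitoring API. A minimal sketch that lists execution counts for the last hour (PROJECT_ID is a placeholder):

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/PROJECT_ID"  # placeholder project

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},  # last hour
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type="cloudfunctions.googleapis.com/function/execution_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.resource.labels.get("function_name"), len(series.points))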
Custom Metrics
Send custom metrics using the Monitoring API:
import time
import os
import functions_framework
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "PROJECT_ID")
project_name = f"projects/{project_id}"
@functions_framework.http
def process_order(request):
    start_time = time.time()
    payload = request.get_json(silent=True) or {}
    try:
        result = process_order_logic(payload)
        duration_ms = (time.time() - start_time) * 1000

        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/order_processing_time"
        series.resource.type = "cloud_function"
        series.resource.labels["function_name"] = "processOrder"
        series.resource.labels["project_id"] = project_id
        series.resource.labels["region"] = "us-central1"  # match your deployment region
        interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
        point = monitoring_v3.Point({"interval": interval, "value": {"double_value": duration_ms}})
        series.points = [point]
        client.create_time_series(name=project_name, time_series=[series])

        return ({"success": True, "result": result}, 200)
    except Exception as exc:
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/order_errors"
        series.resource.type = "cloud_function"
        series.resource.labels["function_name"] = "processOrder"
        series.resource.labels["project_id"] = project_id
        series.resource.labels["region"] = "us-central1"
        interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
        point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})
        series.points = [point]
        client.create_time_series(name=project_name, time_series=[series])
        return ({"error": str(exc)}, 500)
Cloud Monitoring Dashboards
Create dashboards in the Cloud Console, with Terraform, or programmatically through the dashboards API:
from google.cloud import monitoring_dashboard_v1

dashboard = {
    'display_name': 'Order Processing Dashboard',
    'grid_layout': {
        'widgets': [
            {
                'title': 'Execution Count',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': 'metric.type="cloudfunctions.googleapis.com/function/execution_count" AND resource.type="cloud_function" AND resource.labels.function_name="processOrder"',
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_RATE'
                                    }
                                }
                            }
                        }
                    ]
                }
            },
            {
                'title': 'Execution Time (P99)',
                'xy_chart': {
                    'data_sets': [
                        {
                            'time_series_query': {
                                'time_series_filter': {
                                    'filter': 'metric.type="cloudfunctions.googleapis.com/function/execution_times"',
                                    'aggregation': {
                                        'alignment_period': {'seconds': 60},
                                        'per_series_aligner': 'ALIGN_PERCENTILE_99'
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

client = monitoring_dashboard_v1.DashboardsServiceClient()
dashboard = client.create_dashboard(
    request={'parent': f'projects/{PROJECT_ID}', 'dashboard': dashboard}
)
Cloud Alerting Policies
Create alert policies via Cloud Console or code:
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project = f'projects/{PROJECT_ID}'

# Alert on high error rate
policy = {
    'display_name': 'High Error Rate - Order Processing',
    'combiner': 'OR',
    'conditions': [
        {
            'display_name': 'Errors per minute > 5',
            'condition_threshold': {
                'filter': (
                    'resource.type="cloud_function" '
                    'AND resource.labels.function_name="processOrder" '
                    'AND metric.type="cloudfunctions.googleapis.com/function/execution_count" '
                    'AND metric.labels.status!="ok"'
                ),
                'comparison': monitoring_v3.ComparisonType.COMPARISON_GT,
                'threshold_value': 5,
                'duration': {'seconds': 300},
                'aggregations': [
                    {
                        'alignment_period': {'seconds': 60},
                        'per_series_aligner': monitoring_v3.Aggregation.Aligner.ALIGN_RATE
                    }
                ]
            }
        }
    ],
    # Channel IDs come from the Notification Channels section below
    'notification_channels': [webhook_channel_id],
    'alert_strategy': {
        'auto_close': {'duration': {'seconds': 1800}}  # Auto-close after 30 min
    }
}

alert_policy = client.create_alert_policy(
    name=project,
    alert_policy=policy
)
Notification Channels
Route alerts to email, SMS, Slack, PagerDuty:
channels_client = monitoring_v3.NotificationChannelServiceClient()
project = f'projects/{PROJECT_ID}'

# Email notification
email_channel = {
    'type': 'email',
    'display_name': 'Team Email',
    'labels': {
        'email_address': '[email protected]'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=email_channel
)

# Slack webhook (the Slack workspace is typically authorized for the project
# first via the Cloud Console notification channel setup)
slack_channel = {
    'type': 'slack',
    'display_name': 'Slack Alerts',
    'labels': {
        'channel_name': '#incidents'
    },
    'enabled': True
}

channel = channels_client.create_notification_channel(
    name=project,
    notification_channel=slack_channel
)
Uptime Checks
Monitor Cloud Function endpoints:
uptime_check = {
    'display_name': 'API Health Check',
    'monitored_resource': {
        'type': 'uptime_url',
        'labels': {
            'host': 'api.example.com'
        }
    },
    'http_check': {
        'path': '/health',
        'port': 443,
        'request_method': 'GET',
        'use_ssl': True,
        'validate_ssl': True
    },
    'period': {'seconds': 60},
    'timeout': {'seconds': 10},
    'selected_regions': ['USA', 'EUROPE', 'ASIA_PACIFIC']
}

uptime_client = monitoring_v3.UptimeCheckServiceClient()
uptime_client.create_uptime_check_config(
    request={
        'parent': f'projects/{PROJECT_ID}',
        'uptime_check_config': uptime_check
    }
)
CloudWatch (AWS) vs. Cloud Monitoring (GCP)
| Feature | CloudWatch | Cloud Monitoring |
|---|---|---|
| Default metrics | Invocations, Errors, Duration, Throttles | execution_count, execution_times, user_memory_bytes |
| Metric retention | 15 months (progressively downsampled) | 24 months for most metric types |
| Dashboard setup | Via Console or CloudFormation | Via Console or Terraform |
| Custom metrics | PutMetricData API (boto3 put_metric_data) | Cloud Monitoring API (google-cloud-monitoring) |
| Query language | CloudWatch Metrics Math | MQL (Monitoring Query Language) and PromQL |
| Alarms | Threshold-based, anomaly detection | Threshold-based, multi-condition |
| Alerting channels | SNS (email, SMS, SQS, Lambda) | Email, SMS, Slack, PagerDuty, webhook |
| Log-based metrics | CloudWatch Logs metric filters | Cloud Logging log-based metrics |
| Pricing | $0.30 per custom metric per month (first 10,000) | Built-in metrics free; custom metrics billed by ingested volume |
| Uptime monitoring | Route 53 health checks, CloudWatch Synthetics | Cloud Monitoring Uptime Checks |
Key Differences
- Retention: CloudWatch keeps up to 15 months of metrics (progressively downsampled); Cloud Monitoring retains most metric types for 24 months
- Pricing: CloudWatch charges per custom metric; Cloud Monitoring is free for built-in GCP metrics and bills custom metrics by ingested volume
- Query power: CloudWatch Metrics Math is simpler; Cloud Monitoring offers MQL and PromQL for complex queries
- Default coverage: Both auto-collect function metrics, but different names/labels
Best Practices (Both Platforms)
- Alert on impact, not noise — High error rate matters; single error usually doesn't
- Use multiple evaluation periods — Require 2-3 periods above threshold before alarming
- Set up escalation — Important alerts → immediate notification; warnings → next business day
- Document runbooks — Each alert should link to "What do I do?" documentation
- Test alerts in staging — Trigger alarms before relying on them in production
- Optimize metric ingestion — High-cardinality labels (user IDs) explode costs
- Correlate logs and metrics — Use trace IDs to connect log entries to metric spikes
- Monitor the monitoring — If dashboards are broken, you're flying blind
Hands-On: Multi-Cloud Monitoring
AWS CloudWatch
- Deploy function with custom metrics:
aws lambda create-function \
--function-name monitor-demo \
--runtime python3.12 \
--handler lambda_function.handler \
--role <lambda-execution-role-arn> \
--zip-file fileb://function.zip
- Create dashboard:
aws cloudwatch put-dashboard \
--dashboard-name OrderProcessing \
--dashboard-body file://dashboard.json
- Create alarm:
aws cloudwatch put-metric-alarm \
--alarm-name HighErrorRate \
--metric-name Errors \
--namespace AWS/Lambda \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 5 \
--comparison-operator GreaterThanOrEqualToThreshold
Google Cloud
- Deploy function:
gcloud functions deploy monitor-demo \
--runtime python312 \
--trigger-http \
--entry-point process_order \
--allow-unauthenticated
- Create dashboard:
gcloud monitoring dashboards create --config-from-file=dashboard.yaml
- Create alert policy:
gcloud alpha monitoring policies create \
--notification-channels=$CHANNEL_ID \
--display-name="High Error Rate" \
--combiner=OR \
--condition-display-name="Errors > 5/min" \
--condition-filter='resource.type="cloud_function" AND metric.type="cloudfunctions.googleapis.com/function/execution_count" AND metric.labels.status!="ok"' \
--if="> 5" \
--duration=300s
Key Takeaway
Good monitoring lets you detect problems before customers do. Metrics reveal trends, dashboards provide overview, and alerts wake you when action is needed. Both platforms automate the collection—your job is choosing what to watch and when to sound the alarm.