Concept
Cloud Comparisons

Observability: CloudWatch vs Cloud Monitoring vs Azure Monitor

What each catches and where they fall short.

Each cloud ships a first-party observability stack: CloudWatch on AWS, Cloud Monitoring (Stackdriver) on GCP, Azure Monitor on Azure. Each one gives you metrics, logs, and (usually) traces, integrated with the cloud’s own services and IAM. They are all serviceable; none of them dominate the dedicated observability market (Datadog, Honeycomb, New Relic, Grafana Cloud, Splunk). The question is rarely “which is best” but “first-party for the integration, third-party for the experience, or both.”

The three at a glance

flowchart TB
    subgraph CW["AWS CloudWatch"]
        direction LR
        C1[("Metrics, Logs, Alarms,<br/>X-Ray for traces")]:::server
        C2[("Logs Insights for log queries")]:::server
        C3[("automatic for AWS services;<br/>configurable for apps")]:::server
    end

    subgraph CM["GCP Cloud Monitoring + Logging"]
        direction LR
        G1[("Cloud Monitoring (metrics)<br/>Cloud Logging (logs)<br/>Cloud Trace (traces)")]:::server
        G2[("Log Explorer with structured queries")]:::server
        G3[("automatic for GCP services;<br/>OpenTelemetry-native")]:::server
    end

    subgraph AM["Azure Monitor"]
        direction LR
        A1[("Metrics, Log Analytics (KQL),<br/>Application Insights")]:::server
        A2[("KQL is unusually powerful<br/>(SQL-like, joins, summarise)")]:::server
        A3[("good integration with App Service / Functions")]:::server
    end

    classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px

What each is genuinely good at

  • CloudWatch. Metrics on every AWS service, automatically. Alarms wire cleanly into auto-scaling, EventBridge, SNS. Logs Insights is competent but the UI is dated.
  • Cloud Monitoring. OpenTelemetry-native, clean integration with GKE, BigQuery sink for log analytics. The query language for logs is more straightforward than CloudWatch’s.
  • Azure Monitor. KQL (Kusto Query Language) is genuinely best-in-class for log analytics — SQL-like with first-class time-series operators. Application Insights is the strongest of the three for application-level tracing of .NET workloads.

Where they all fall short

flowchart TB
    F1["Multi-cloud or hybrid<br/>None of the three handles non-native cloud well<br/>(cross-cloud aggregation is third-party territory)"]:::weak
    F2["Trace exploration<br/>Dedicated tools (Honeycomb, Datadog APM, Tempo)<br/>have richer high-cardinality trace UIs"]:::weak
    F3["Alert routing<br/>First-party alerting goes to email / SNS / PagerDuty<br/>Less flexible than dedicated tools (Datadog, Opsgenie)"]:::weak
    F4["Cost at scale<br/>All three charge per GB ingested + retention<br/>Log volumes can produce surprising bills;<br/>dedicated tools sometimes cheaper at scale"]:::weak
    F5["Dashboard UX<br/>All three lag dedicated tools (Grafana, Datadog dashboards)<br/>in expressiveness and shareability"]:::weak

    classDef weak fill:#fed7aa,stroke:#c2410c,color:#7c2d12,stroke-width:1.5px

The most common pattern in production: first-party for infrastructure metrics + audit logs + cloud-native alerting; third-party for application observability + dashboards + incident response.

The hybrid pattern most teams adopt

flowchart TB
    CLOUD[("Cloud-native (CloudWatch /<br/>Cloud Monitoring / Azure Monitor)")]:::server
    APP[("Application code")]:::server
    OTEL[["OpenTelemetry collector"]]:::infra
    THIRD[("Third-party observability<br/>Datadog, Grafana Cloud, Honeycomb,<br/>New Relic, Splunk")]:::store

    CLOUD -->|"infra metrics,<br/>audit logs"| OTEL
    APP -->|"app metrics, logs, traces"| OTEL
    OTEL -->|"emit everything in one place"| THIRD

    UI(["Dashboards + alerting<br/>+ incident response"]):::client
    THIRD --> UI

    classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
    classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
    classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
    classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px

This hybrid stack is now the dominant production pattern. OpenTelemetry makes the wiring portable: instrument once, ship metrics, logs, and traces to one or many destinations, swap backends without re-instrumenting.

When to stay all-cloud-native

  • Small team, single cloud, modest scale. The cloud’s first-party stack is the path of least resistance.
  • Compliance / data-residency requirements that complicate shipping data to a third-party SaaS.
  • Cost discipline: third-party observability bills can exceed compute bills surprisingly fast.

When to go third-party

  • Multi-cloud or hybrid. Dedicated tools were built for this.
  • Application observability matters more than infrastructure observability.
  • The team values UI / dashboard quality and incident-response tooling.
  • Per-engineer productivity gains from a better tool justify the price.

Pick within each cloud

flowchart TB
    Q1{"Which cloud?"}:::query
    A1["CloudWatch + X-Ray (or third-party APM).<br/>For AWS-native default observability."]:::strong
    A2["Cloud Monitoring + Cloud Trace.<br/>For GCP. OpenTelemetry-friendly."]:::strong
    A3["Azure Monitor + Application Insights.<br/>For Azure. KQL gives the best log analytics."]:::strong

    Q1 -->|"AWS"| A1
    Q1 -->|"GCP"| A2
    Q1 -->|"Azure"| A3

    classDef query fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
    classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px

Common mistakes

  • Cloud-native for application observability and dashboards. The UIs are competent but rarely loved. Most teams outgrow them.
  • No correlation across services. Without consistent trace IDs and request IDs, the three pillars do not connect during an incident.
  • Sending every log line. Log volume drives cost. Sample debug, keep errors and audit at 100%.
  • No retention policy. Logs and metrics retained forever at hot-tier prices. Tier to warm/cold; delete what is past compliance.
  • OpenTelemetry without a destination strategy. OTel is the wiring; the backend is the decision. Pick one or two; do not multicast everywhere.
  • Manual instrumentation forever. Auto-instrumentation libraries cover most frameworks today. Use them; only instrument by hand where they fall short.
  • Alerts on every metric. Alert fatigue is real. Alert on user-facing symptoms, not on every internal signal.

Quick recap

  • All three clouds offer competent first-party observability stacks.
  • KQL (Azure) is the best log query language of the three.
  • Most production teams use a hybrid: first-party for infra + audit, third-party for application observability.
  • OpenTelemetry is the portable instrumentation layer; pick the backend separately.
  • Retention, sampling, and alert hygiene are the universal cost and signal disciplines.

This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.