Traces

Trace correlation flow

Logs and traces checkpoint #

The secrets checkpoint gave the platform a real secret-delivery path and used it for central Prometheus metrics. The policies checkpoint then added a basic admission boundary. Now the platform can continue with observability.

Metrics answer the aggregate question: are requests happening, are stages rendering, and how long are they taking? The next operational question is narrower and more practical:

What happened to this request?

Metrics do not answer that by themselves. For one slow or broken render, I need the logs for that request and the trace that shows how it moved through AWS, GCP, and Azure.

So we add two more observability signals:

  • logs from Kubernetes pods
  • distributed traces from the Mandelbrot request path

The shape is still intentionally small. Each cluster keeps local fallbacks, and Grafana Cloud becomes the cross-cluster view.

GitOps ownership #

The same Argo CD observability application now owns the expanded stack. The base already had Prometheus and Grafana. This branch adds Loki, Promtail, Jaeger, and the OpenTelemetry Collector:

platform/observability/base/
  loki-config.yaml
  loki-deployment.yaml
  loki-service.yaml
  promtail-service-account.yaml
  promtail-rbac.yaml
  promtail-config.yaml
  promtail-daemonset.yaml
  jaeger-deployment.yaml
  jaeger-service.yaml
  otel-collector-config.yaml
  otel-collector-deployment.yaml
  otel-collector-service.yaml

The cloud overlays add the provider-backed credentials:

platform/observability/overlays/aws/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml

platform/observability/overlays/gcp/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml

platform/observability/overlays/azure/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml

That keeps the secrets model established earlier. Git declares the workload shape and the secret references. The cloud secret backend stores the actual Grafana Cloud endpoints and tokens. External Secrets materializes those values into Kubernetes.

Logs #

Loki is the local log store. It is deliberately small here: one in-cluster service that can answer recent log queries for the local cluster. Promtail is the log shipper. It runs as a DaemonSet, reads pod logs from the node, attaches Kubernetes labels, parses JSON log fields, and pushes the result to two places:

Promtail -> local Loki
Promtail -> Grafana Cloud Logs

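The Promtail clients show that split. The ${...} references are filled from the pod environment, which only works when Promtail runs with -config.expand-env=true: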

clients:
  - url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
  - url: ${GRAFANA_CLOUD_LOKI_URL}
    basic_auth:
      username: ${GRAFANA_CLOUD_LOKI_USERNAME}
      password: ${GRAFANA_CLOUD_LOKI_PASSWORD}

The useful part is the pipeline. Promtail first decodes the container runtime log format, then parses the Mandelbrot JSON payload:

pipeline_stages:
  - cri: {}
  - json:
      expressions:
        level:
        message:
        traceId:
        spanId:
        platform:
        service:
        cloud:
        region:
  - labels:
      level:
      platform:
      service:
      cloud:
      region:
  - output:
      source: message

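That gives log queries stable labels and keeps the visible message readable. One caveat: the output stage replaces the stored log line with the message field alone. Fields that later queries or Grafana derived fields read from the line itself, such as traceId, are only available if the full JSON line is kept (by dropping the output stage) or the field is attached some other way.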

The Mandelbrot service now writes structured JSON logs. A render completion log carries the request identifiers and platform context:

{
  "time": "2026-05-23T08:15:00.000Z",
  "level": "info",
  "message": "mandelbrot render completed",
  "platform": "trinity",
  "service": "mandelbrot",
  "cloud": "aws",
  "region": "us-east-1",
  "traceId": "f3d75d28dfae23d5abea1b500ac3431e",
  "spanId": "56c89d3fde9c645e",
  "jobId": "m_mp84cacn_5kgis0t",
  "status": "complete"
}
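A minimal sketch of a logger that would emit lines in this shape. The helper names (formatLogLine, logEvent) and the hard-coded context object are assumptions for illustration, not the service's actual code:

```javascript
// Hypothetical static context; the values mirror the example log above.
const platformContext = {
  platform: "trinity",
  service: "mandelbrot",
  cloud: "aws",
  region: "us-east-1",
};

// Build one JSON log line. `fields` carries per-request identifiers
// such as traceId, spanId, and jobId.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    time: now.toISOString(),
    level,
    message,
    ...platformContext,
    ...fields,
  });
}

// Write the line to stdout, where the container runtime picks it up.
function logEvent(level, message, fields) {
  process.stdout.write(formatLogLine(level, message, fields) + "\n");
}

logEvent("info", "mandelbrot render completed", {
  traceId: "f3d75d28dfae23d5abea1b500ac3431e",
  spanId: "56c89d3fde9c645e",
  jobId: "m_mp84cacn_5kgis0t",
  status: "complete",
});
```

Keeping every event on a single line matters here: Promtail treats each line as one log entry, so a pretty-printed object would be split into fragments.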

The first useful LogQL query is simple:

{namespace="mandelbrot", app="mandelbrot", platform="trinity"} | json | traceId != ""

That query says: show Mandelbrot logs, parse the JSON payload, and keep only lines that can be correlated to a trace.
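To answer "what happened to this request?", the same query can be narrowed to one trace ID. A small sketch of building that query string; the function name is hypothetical, and the stream selector is copied from the query above:

```javascript
// Build a LogQL query that returns the Mandelbrot logs for one trace.
function buildTraceLogQuery(traceId) {
  // B3 trace IDs are 16 or 32 lowercase hex characters. Rejecting
  // anything else also keeps quotes out of the query string.
  if (!/^(?:[a-f0-9]{16}|[a-f0-9]{32})$/.test(traceId)) {
    throw new Error(`not a valid trace ID: ${traceId}`);
  }
  const selector = '{namespace="mandelbrot", app="mandelbrot", platform="trinity"}';
  return `${selector} | json | traceId="${traceId}"`;
}

console.log(buildTraceLogQuery("f3d75d28dfae23d5abea1b500ac3431e"));
```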

Traces #

A trace is the end-to-end story of one request. A span is one timed operation inside that story. For Mandelbrot, the trace follows the render request as it fans out into stage renders:

POST /api/render
  render stage aws
    POST /internal/render-stage
      render tile
  render stage gcp
    POST /internal/render-stage
      render tile
  render stage azure
    POST /internal/render-stage
      render tile
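The parent and child relationships in that tree can be sketched in a few functions. This is an illustration under assumptions, not the service's code: the helper names are hypothetical, the callee creates a new child span rather than sharing the caller's span ID, and the header and span field names follow the B3 multi-header convention and the Zipkin v2 JSON format.

```javascript
// Random lowercase hex string of the given length. Math.random is enough
// for a sketch; trace IDs need to be unique, not secret.
function randomHex(length) {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Start a span. Without a parent this begins a new trace; with a parent
// it reuses the trace ID and records the parent's span ID.
function startSpan(name, parent = null) {
  return {
    traceId: parent ? parent.traceId : randomHex(32),
    id: randomHex(16),
    parentId: parent ? parent.id : undefined,
    name,
    startMs: Date.now(),
  };
}

// Headers the caller attaches to an outgoing stage request.
function b3Headers(span) {
  return {
    "x-b3-traceid": span.traceId,
    "x-b3-spanid": span.id,
    "x-b3-sampled": "1",
  };
}

// On the receiving side, continue the trace if B3 headers are present,
// otherwise start a new one.
function spanFromHeaders(name, headers) {
  const traceId = headers["x-b3-traceid"];
  const parentId = headers["x-b3-spanid"];
  if (!traceId || !parentId) return startSpan(name);
  return startSpan(name, { traceId, id: parentId });
}

// Convert to the Zipkin v2 JSON shape. Timestamps and durations are in
// microseconds; JSON.stringify drops the undefined parentId on root spans.
function toZipkinSpan(span, serviceName, endMs = Date.now()) {
  return {
    traceId: span.traceId,
    id: span.id,
    parentId: span.parentId,
    name: span.name,
    timestamp: span.startMs * 1000,
    duration: Math.max(1, (endMs - span.startMs) * 1000),
    localEndpoint: { serviceName },
  };
}

const root = startSpan("POST /api/render");
const stage = startSpan("render stage aws", root);
const remote = spanFromHeaders("POST /internal/render-stage", b3Headers(stage));
console.log(JSON.stringify([root, stage, remote].map((s) => toZipkinSpan(s, "mandelbrot"))));
```

All three spans share one trace ID, which is the value that later shows up as traceId in the logs.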

This branch keeps the application implementation small. It does not bring in a full OpenTelemetry SDK yet. Instead, the Node service creates Zipkin-format spans directly, propagates B3 headers across stage calls, and exports each span to two in-cluster endpoints:

env:
  - name: OTEL_SERVICE_NAME
    value: mandelbrot
  - name: ZIPKIN_ENDPOINTS
    value: http://jaeger.observability.svc.cluster.local:9411/api/v2/spans,http://otel-collector.observability.svc.cluster.local:9411/api/v2/spans
  - name: TRACE_SAMPLE_RATE
    value: "1"
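A sketch of how the service might read those two variables. The function names are hypothetical, and the fallback to a sample rate of 1 on missing or malformed input is an assumption of this sketch:

```javascript
// Split the comma-separated ZIPKIN_ENDPOINTS value into a list of URLs,
// ignoring blanks and surrounding whitespace.
function parseEndpoints(value = "") {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Parse TRACE_SAMPLE_RATE into a probability clamped to [0, 1].
function parseSampleRate(value) {
  if (value === undefined || value === "") return 1;
  const rate = Number(value);
  if (Number.isNaN(rate)) return 1;
  return Math.min(1, Math.max(0, rate));
}

// Decide once per trace whether to record it. The random source is a
// parameter so the decision can be tested deterministically.
function shouldSample(rate, random = Math.random) {
  return random() < rate;
}

const zipkinEndpoints = parseEndpoints(process.env.ZIPKIN_ENDPOINTS);
const sampleRate = parseSampleRate(process.env.TRACE_SAMPLE_RATE);
console.log({ zipkinEndpoints, sampleRate });
```

With TRACE_SAMPLE_RATE set to "1", every request is traced, which suits a demo with low traffic.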

The two endpoints have different jobs:

Jaeger:
  local trace fallback inside the cluster

OpenTelemetry Collector:
  receives Zipkin spans and exports them to Grafana Cloud Traces

The span export path in the app is explicit. The service posts a Zipkin span to every configured endpoint and logs export failures:

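for (const endpoint of zipkinEndpoints) {
  fetch(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify([zipkinSpan]),
  })
    // fetch rejects only on network errors, so treat non-2xx as a failure too.
    .then((res) => {
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
    })
    // console.error stands in for the service's structured logger.
    .catch((err) => console.error(`zipkin export to ${endpoint} failed: ${err.message}`));
}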

Silent telemetry failure is worse than no telemetry in a checkpoint like this, because it makes the platform look observable while data is being dropped.

Collector #

The OpenTelemetry Collector is the bridge between the local cluster and Grafana Cloud Traces. It listens for Zipkin spans on port 9411, batches them, and exports them over OTLP HTTP with basic auth:

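extensions:
  # Credentials come from the grafana-cloud-traces Secret. The two
  # variable names below are assumed, not taken from the repository.
  basicauth/grafana_cloud:
    client_auth:
      username: ${env:GRAFANA_CLOUD_TEMPO_USERNAME}
      password: ${env:GRAFANA_CLOUD_TEMPO_PASSWORD}
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
processors:
  batch: {}
exporters:
  otlphttp/grafana_cloud:
    endpoint: ${env:GRAFANA_CLOUD_TEMPO_OTLP_HTTP_ENDPOINT}
    auth:
      authenticator: basicauth/grafana_cloud
service:
  # The authenticator must be enabled here or the exporter cannot use it.
  extensions:
    - basicauth/grafana_cloud
  pipelines:
    traces:
      receivers:
        - zipkin
      processors:
        - batch
      exporters:
        - otlphttp/grafana_cloud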

The collector credentials come from the grafana-cloud-traces Kubernetes Secret. That secret is created by External Secrets from the provider backend.

The AWS overlay shows the shape:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-cloud-traces
  namespace: observability
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: grafana-cloud-traces
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: endpoint
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-endpoint-mg
    - secretKey: username
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-username-mg
    - secretKey: password
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-password-mg

The -mg suffix is not a new concept in the platform. It is the branch's current AWS backend-name suffix, used to avoid names that were blocked by AWS Secrets Manager scheduled deletion. The important part is that the ExternalSecret keys match the actual backend names created for that cluster.

Grafana correlation #

Grafana is still provisioned from Git. It now gets three in-cluster data sources:

Prometheus -> metrics
Loki       -> logs
Jaeger     -> traces

The Loki data source includes a derived field that extracts traceId from JSON logs and links it to Jaeger:

- name: Loki
  uid: Loki
  type: loki
  url: http://loki.observability.svc.cluster.local:3100
  jsonData:
    derivedFields:
      - datasourceUid: Jaeger
        matcherRegex: '"traceId":"([a-f0-9]{16,32})"'
        name: TraceID
        url: '$${__value.raw}'

That gives the local workflow:

dashboard -> logs -> trace
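The derived field only works when the regex matches the log line exactly as it is stored. A quick sketch for checking the pattern against sample lines before provisioning; the helper name is hypothetical, and it assumes the stored line is the compact JSON the service writes:

```javascript
// The same pattern as matcherRegex in the Loki data source above.
const matcherRegex = /"traceId":"([a-f0-9]{16,32})"/;

// Return the trace ID the pattern captures from a log line, or null.
function extractTraceId(logLine) {
  const match = matcherRegex.exec(logLine);
  return match ? match[1] : null;
}

// JSON.stringify emits no space after the colon, which is what the
// pattern expects.
const compact = JSON.stringify({
  message: "mandelbrot render completed",
  traceId: "f3d75d28dfae23d5abea1b500ac3431e",
});

// A pretty-printed line has a space after the colon and does not match.
const pretty = '{"traceId": "f3d75d28dfae23d5abea1b500ac3431e"}';

console.log(extractTraceId(compact)); // the 32-character trace ID
console.log(extractTraceId(pretty)); // null
```

If the log format ever changes its spacing or key name, the link to Jaeger disappears without any error, so the pattern is worth re-checking after logger changes.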

Grafana Cloud gives the same kind of workflow across all three clusters. Prometheus remote-writes metrics, Promtail pushes logs, and the OpenTelemetry Collector exports traces. The common traceId ties the signals together.

Validation #

After Argo CD syncs the observability application, first check the local components:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get deployment loki jaeger otel-collector
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get daemonset promtail
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get service loki jaeger otel-collector

Then check that External Secrets materialized the remote credentials:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get externalsecret grafana-cloud-logs grafana-cloud-traces
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get secret grafana-cloud-logs grafana-cloud-traces

Generate a few renders through Front Door and query Grafana Cloud Logs:

{namespace="mandelbrot", app="mandelbrot", platform="trinity"} | json | traceId != ""

The useful labels should include:

app=mandelbrot
container=mandelbrot
namespace=mandelbrot
platform=trinity

The useful messages are application events, not only platform noise:

request completed
mandelbrot render completed
mandelbrot stage rendered

Then query Grafana Cloud Traces for the mandelbrot service. The trace table should show recent POST /api/render spans with durations in the hundreds of milliseconds:

service      name              duration
mandelbrot   POST /api/render  155 ms
mandelbrot   POST /api/render  209 ms
mandelbrot   POST /api/render  274 ms
mandelbrot   POST /api/render  305 ms

The local Jaeger fallback is checked through port-forwarding:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/jaeger 16686:16686

Open http://localhost:16686, select the mandelbrot service, and search for recent traces.

Troubleshooting #

The first run exposed two useful failure modes.

Promtail came up cleanly, but Grafana Cloud rejected the first log batches. The first error was 405 Method Not Allowed. That was a bad Loki URL: the Grafana Cloud host was right, but Promtail must post to the full push endpoint:

/loki/api/v1/push
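This kind of mistake is cheap to catch before a rollout. A minimal sketch of a check on the configured URL; the function name and the example host are hypothetical:

```javascript
// Return true only if the value is a valid URL whose path is the Loki
// push endpoint. A bare host or any other path returns false.
function isLokiPushUrl(value) {
  let url;
  try {
    url = new URL(value);
  } catch {
    return false;
  }
  return url.pathname === "/loki/api/v1/push";
}

console.log(isLokiPushUrl("https://logs.example.invalid"));
console.log(isLokiPushUrl("https://logs.example.invalid/loki/api/v1/push"));
```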

After that, the error changed to 401 Unauthorized with the message "invalid scope requested". That was a token and username problem, not a Kubernetes problem. The logs username must be the Grafana Cloud Logs user ID, and the token needs write access to that Loki instance.

Traces had a different failure mode. The OpenTelemetry Collector was healthy, and a synthetic Zipkin span posted from inside the cluster was accepted and exported. Real Mandelbrot renders still did not move the collector counters. That narrowed the problem to the application pod.

The deployment had the new ZIPKIN_ENDPOINTS environment variable, but GCP was still serving the old pod with only the local Jaeger endpoint. The replacement pod was pending because the one-node GKE cluster did not have enough free CPU for a rolling-update surge. This branch changes the Mandelbrot deployment strategy to Recreate:

spec:
  replicas: 1
  strategy:
    type: Recreate

For a real service, I would usually scale capacity or tune rollout settings more carefully. For this single-replica demo app, Recreate is acceptable because the goal is to avoid a tiny node getting stuck with both old and new pods during the same rollout.

The collector counters are the fastest trace check:

KUBECONFIG=./kubeconfig.gcp.yaml \
  kubectl -n observability port-forward deployment/otel-collector 8888:8888

Then in another shell:

curl -s localhost:8888/metrics | grep -iE \
  'otelcol_receiver_.*spans|otelcol_exporter_.*spans'

The expected direction is:

otelcol_receiver_accepted_spans       increases
otelcol_exporter_sent_spans           increases
otelcol_exporter_send_failed_spans    stays at 0

If receiver counters stay at zero after a render, the app is not reaching the collector. If receiver counters increase but exporter failures increase too, the issue is between the collector and Grafana Cloud.
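That decision rule can be sketched as a small function over the /metrics output. It is a simplification: it looks at absolute counter values since the collector started rather than the change after one render. The function names are hypothetical, and accepting a `_total` suffix is an assumption to cover collector versions that append it.

```javascript
// Sum all samples of one counter in Prometheus text exposition output.
function sumMetric(metricsText, name) {
  let total = 0;
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#")) continue; // HELP and TYPE comments
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)/.exec(line);
    if (!match) continue;
    if (match[1] === name || match[1] === `${name}_total`) {
      total += Number(match[2]);
    }
  }
  return total;
}

// Map the three counters to the failure modes described above.
function diagnose(metricsText) {
  const accepted = sumMetric(metricsText, "otelcol_receiver_accepted_spans");
  const sent = sumMetric(metricsText, "otelcol_exporter_sent_spans");
  const failed = sumMetric(metricsText, "otelcol_exporter_send_failed_spans");
  if (accepted === 0) return "app is not reaching the collector";
  if (failed > 0) return "problem between the collector and Grafana Cloud";
  if (sent > 0) return "spans are flowing";
  return "spans received but not exported yet";
}

const sample = [
  "# TYPE otelcol_receiver_accepted_spans counter",
  'otelcol_receiver_accepted_spans{receiver="zipkin",transport="http"} 12',
  'otelcol_exporter_sent_spans{exporter="otlphttp/grafana_cloud"} 12',
  'otelcol_exporter_send_failed_spans{exporter="otlphttp/grafana_cloud"} 0',
].join("\n");

console.log(diagnose(sample));
```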

Exit #

This checkpoint completes the first full observability loop. The platform now has local metrics, logs, and traces in every cluster, plus a centralized Grafana Cloud view across AWS, GCP, and Azure.

It is still intentionally lightweight. Prometheus, Loki, and Jaeger use in-cluster ephemeral storage as local fallbacks. The durable platform view is the remote one: metrics are remote-written, logs are pushed by Promtail, and traces are exported by the OpenTelemetry Collector.

The next platform gap is release safety. Now that the platform has application delivery, traffic, secrets, policies, metrics, logs, and traces, the next checkpoint changes Mandelbrot from ordinary deployment to controlled progressive delivery.

Source code #

Reference implementation