Secrets
Secrets checkpoint #
The local observability checkpoint proved that each cluster could collect its own metrics. The next step is a shared backend, and that needs credentials. Grafana Cloud remote_write, like any real SaaS dependency, raises the same question: where does the secret live, and how does it reach Kubernetes without being committed to Git?
So the next checkpoint is secrets. A platform needs a repeatable path from a real secret backend into each cluster. Secrets cannot be stored in Git, and creating them by hand with kubectl create secret is not acceptable either.
External Secrets Operator is the right next move. It keeps the Kubernetes desired state in Git while keeping secret values in a real secret backend. Git declares that a Kubernetes Secret should exist. The operator reads the value from a provider such as AWS Secrets Manager, Google Secret Manager, or Azure Key Vault and materializes the Kubernetes Secret inside the cluster.
That gives the platform the split it needs:
Git:
  secret shape, target name, provider reference
Cloud secret backend:
  secret value
External Secrets Operator:
  reconciliation between the two
Operator first #
The first implementation step installs the operator in each cluster through Argo CD:
trinity-secrets-aws
trinity-secrets-gcp
trinity-secrets-azure
The AWS application is typical:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trinity-secrets-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: trinity
  source:
    repoURL: https://charts.external-secrets.io
    chart: external-secrets
    targetRevision: 2.4.1
    helm:
      releaseName: external-secrets
      values: |
        installCRDs: true
        serviceAccount:
          create: false
          name: external-secrets
  destination:
    server: https://kubernetes.default.svc
    namespace: external-secrets
Those applications use the upstream External Secrets Helm chart, pinned to a specific chart version, and deploy into the external-secrets namespace. Because External Secrets Operator is installed from an external Helm chart and creates cluster-wide Kubernetes resources, the Argo CD project had to allow both the chart repository and those cluster-scoped resource types.
The sync order also changes slightly:
wave -1: Argo CD project
wave 0: applications
wave 1: secrets operator
wave 2: secrets demo and observability
The operator should be present before any platform component starts asking for synced secrets. The observability application, for example, will later mount Grafana Cloud credentials that External Secrets provides.
The live check is intentionally small:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-secrets-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n external-secrets get deployment,pods
KUBECONFIG=./kubeconfig.aws.yaml kubectl get crd | grep external-secrets
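If a script needs to block until the operator is usable instead of checking by eye, a minimal sketch could wait for the CRDs to be established. The function name and default timeout are hypothetical. The CRD names are the ones I would expect the External Secrets chart to install, so confirm them against the grep output above.

```shell
# Sketch: block until the External Secrets CRDs are established in one cluster.
# Assumes kubectl is on PATH; the function name and timeout are illustrative.
wait_for_eso_crds() {
  kubeconfig="$1"
  timeout="${2:-120s}"
  for crd in \
    externalsecrets.external-secrets.io \
    secretstores.external-secrets.io \
    clustersecretstores.external-secrets.io
  do
    # kubectl wait returns non-zero if the condition is not met in time.
    kubectl --kubeconfig "$kubeconfig" wait \
      --for=condition=Established "crd/${crd}" --timeout="$timeout" || return 1
  done
}
```

For example: wait_for_eso_crds ./kubeconfig.aws.yaml 60s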
After syncing the clusters, the AWS root application showed the new secrets app alongside the existing workloads:
NAME                        SYNC STATUS   HEALTH STATUS
trinity-dev-aws-root        Synced        Healthy
trinity-hello-aws           Synced        Healthy
trinity-mandelbrot-aws      Synced        Healthy
trinity-observability-aws   Synced        Healthy
trinity-secrets-aws         Synced        Healthy
The same operator check passed in the other clusters.
But installing the operator is only half the checkpoint. The important part is whether the operator can reach a real provider backend without static credentials in Git or Kubernetes.
Provider-backed proof #
So the next slice adds one harmless test secret per cloud:
hello-from-aws
hello-from-gcp
hello-from-azure
Those are just proof values. The point is to prove the path:
cloud secret backend -> External Secrets Operator -> Kubernetes Secret
Pulumi owns the cloud-side identity and backend resources. Argo CD owns the Kubernetes declaration that asks for a secret to be synced.
The split looks like this:
- AWS: Secrets Manager, an EKS OIDC provider, an IAM role for service accounts, and an external-secrets service account annotated with the role ARN
- GCP: Google Secret Manager, GKE Workload Identity, a Google service account, an IAM binding from the Kubernetes service account to that Google service account, and Secret Manager accessor permissions
- Azure: Key Vault, AKS OIDC and workload identity, a user-assigned managed identity, a federated identity credential, and an external-secrets service account annotated with the Azure client and tenant IDs
The secret values come from Pulumi secret config:
pulumi -C infra/pulumi/aws config set --secret trinity:secretsDemoValue hello-from-aws
pulumi -C infra/pulumi/gcp config set --secret trinity:secretsDemoValue hello-from-gcp
pulumi -C infra/pulumi/azure config set --secret trinity:secretsDemoValue hello-from-azure
That keeps the repository clean. Git contains the desired shape of the sync, not the value.
Cloud identity #
The GCP leg exposed a useful platform detail. The deploy identity could create the cluster, but the secrets checkpoint needed more project-level permissions: service account administration, Secret Manager administration, and service usage administration. Secret Manager also needs the relevant project APIs enabled before the stack can create the backend resources. That is a cloud-project bootstrap boundary rather than part of the cluster deployment.
It is also the common pattern in this checkpoint. The Kubernetes manifest is small, but the cloud identity work behind it is provider-specific. Each cluster needs a native way for the external-secrets service account to read from its cloud secret backend without a long-lived credential:
AWS:
  Kubernetes service account
  -> projected service account token
  -> EKS OIDC provider
  -> IAM role
  -> AWS Secrets Manager
GCP:
  Kubernetes service account
  -> GKE Workload Identity
  -> Google service account
  -> Google Secret Manager
Azure:
  Kubernetes service account
  -> AKS OIDC issuer
  -> federated identity credential
  -> user-assigned managed identity
  -> Azure Key Vault
The important part is that each provider uses short-lived, identity-based access instead of static credentials copied into the cluster.
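One quick way to confirm the first hop of each chain is to look at the annotations on the external-secrets service account. The sketch below is illustrative only and its function names are hypothetical. The AWS and Azure keys are the standard annotations for IAM roles for service accounts and Azure workload identity. The GCP key is the standard GKE Workload Identity annotation and is an assumption here, since the text above only mentions the IAM binding.

```shell
# Sketch: list the workload-identity annotation keys expected on the
# external-secrets service account for each cloud.
identity_annotation_keys() {
  case "$1" in
    aws)   echo "eks.amazonaws.com/role-arn" ;;
    gcp)   echo "iam.gke.io/gcp-service-account" ;;  # assumption, see lead-in
    azure) echo "azure.workload.identity/client-id azure.workload.identity/tenant-id" ;;
    *)     echo "unknown cloud: $1" >&2; return 1 ;;
  esac
}

# Print the expected keys, then the live annotations for comparison.
show_identity_annotations() {
  cloud="$1"
  keys="$(identity_annotation_keys "$cloud")" || return 1
  echo "expected annotation keys: ${keys}"
  kubectl --kubeconfig "./kubeconfig.${cloud}.yaml" -n external-secrets \
    get serviceaccount external-secrets -o jsonpath='{.metadata.annotations}'
}
```

The annotations only name an identity (a role ARN, a service account email, a client ID), so printing them does not expose a credential.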
GitOps shape #
The GitOps side lives under platform/secrets-demo:
platform/secrets-demo/
  base/
    namespace.yaml
  overlays/
    aws/
    gcp/
    azure/
Each overlay adds a provider-specific ClusterSecretStore and one ExternalSecret. The AWS store, for example, uses the service account token to assume the IAM role:
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets
The ExternalSecret is deliberately small:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: provider-test-secret
  namespace: secrets-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: provider-test-secret
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: message
      remoteRef:
        key: trinity-dev-aws-secrets-demo
        conversionStrategy: Default
        decodingStrategy: None
        metadataPolicy: None
        nullBytePolicy: Ignore
The default-looking fields at the bottom are intentional. External Secrets defaults them into the live object:
conversionStrategy: Default
decodingStrategy: None
metadataPolicy: None
nullBytePolicy: Ignore
Without pinning those fields in Git, Argo CD keeps seeing the ExternalSecret as OutOfSync even though the secret has synced. That is the kind of small reconciler mismatch that is easy to dismiss, but it matters in a GitOps platform.
Healthy but permanently out-of-sync resources train operators to ignore drift.
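To see which fields were defaulted before deciding what to pin, it can help to print the remoteRef exactly as the cluster stores it and compare it with the manifest in Git. This is a small sketch with a hypothetical function name; it only reads the first data entry of the demo ExternalSecret shown above.

```shell
# Sketch: print the remoteRef of the first data entry as stored in the cluster,
# including any fields that were defaulted into the live object.
live_remote_ref() {
  cloud="$1"
  kubectl --kubeconfig "./kubeconfig.${cloud}.yaml" -n secrets-demo \
    get externalsecret provider-test-secret \
    -o jsonpath='{.spec.data[0].remoteRef}'
}
```

Any key that appears in this output but not in the Git manifest is a candidate for the kind of permanent diff described above.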
The root Argo CD application now creates one backend checkpoint application per cloud:
trinity-secrets-demo-aws
trinity-secrets-demo-gcp
trinity-secrets-demo-azure
Validation #
After the stacks and Argo CD applications reconciled, each cluster reported the same state:
for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n secrets-demo get externalsecret provider-test-secret
done
NAME                   STORETYPE            STORE                 REFRESH INTERVAL   STATUS         READY
provider-test-secret   ClusterSecretStore   aws-secrets-manager   1h                 SecretSynced   True
provider-test-secret   ClusterSecretStore   gcp-secret-manager    1h                 SecretSynced   True
provider-test-secret   ClusterSecretStore   azure-key-vault       1h                 SecretSynced   True
And the synced Kubernetes secrets contained the expected harmless values:
for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n secrets-demo get secret provider-test-secret \
    -o jsonpath='{.data.message}' | base64 -d
  echo
done
hello-from-aws
hello-from-gcp
hello-from-azure
That completes the secrets backend checkpoint. The platform now has a working provider-backed path from AWS Secrets Manager, Google Secret Manager, and Azure Key Vault into Kubernetes without committing secret values to Git.
The final validation pass still showed the provider-backed test secret synced in every cluster:
cloud   store                 external secret   kubernetes secret
aws     aws-secrets-manager   SecretSynced      present
gcp     gcp-secret-manager    SecretSynced      present
azure   azure-key-vault       SecretSynced      present
Central metrics credentials #
The secrets path is useful on its own, but the real payoff is using it for platform credentials. The observability checkpoint left each cluster with its own local Prometheus and Grafana. That proved collection and dashboard provisioning, but it did not answer the cross-cluster question. To see the platform as one system, the three Prometheus instances need to send metrics to a shared backend.
For this exercise I used Grafana Cloud as that backend. Each cluster still runs its local Prometheus. The difference is that Prometheus now remote-writes the same metrics to Grafana Cloud with external labels attached:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cloud: aws
    cluster: trinity-dev-aws
    platform: trinity
remote_write:
  - url: __GRAFANA_CLOUD_REMOTE_WRITE_URL__
    basic_auth:
      username_file: /etc/grafana-cloud/username
      password_file: /etc/grafana-cloud/password
The labels are the important part for the shared view. Grafana Cloud receives samples from AWS, GCP, and Azure into one metrics backend, and the queries can group or filter by cloud, cluster, and platform.
The credentials are deliberately not stored in the Prometheus ConfigMap. Grafana Cloud gives three pieces of information for Prometheus remote write:
- remote-write URL
- Prometheus username or instance ID
- access policy token with metrics:write
Those values are set as Pulumi secret config for each cloud stack:
for cloud in aws gcp azure; do
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudRemoteWriteUrl "https://<grafana-cloud-prometheus-host>/api/prom/push"
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudPrometheusUsername "<prometheus-user-id>"
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudPrometheusPassword "<grafana-cloud-access-policy-token>"
done
Pulumi then writes them into the cloud secret backend for that cluster:
trinity-dev-aws-grafana-cloud-remote-write-url
trinity-dev-aws-grafana-cloud-remote-write-username
trinity-dev-aws-grafana-cloud-remote-write-password
The same naming pattern exists for GCP and Azure. External Secrets syncs those backend values into the observability namespace as one Kubernetes Secret:
grafana-cloud-remote-write
Prometheus reads the username and password from mounted files. The URL needs a small extra step because Prometheus supports username_file and password_file, but not a matching url_file. The deployment uses a tiny init container to render the final prometheus.yml from a checked-in template and the synced URL:
remote_write_url="$(cat /etc/grafana-cloud/url)"
sed "s#__GRAFANA_CLOUD_REMOTE_WRITE_URL__#${remote_write_url}#g" \
  /etc/prometheus-template/prometheus.yml.template \
  > /etc/prometheus-generated/prometheus.yml
That keeps the full remote-write configuration out of Git while still letting Argo CD own the workload shape.
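The render step can be made a little more defensive. In a sed replacement, an ampersand, a backslash, or the # delimiter inside the URL would be interpreted rather than copied, and an empty URL would silently produce a broken config. The sketch below is a hypothetical variant of the init container script, not the deployed one. It takes the three paths as arguments so it can be tried outside the pod.

```shell
# Sketch: render prometheus.yml from a template, escaping characters that are
# special in a sed replacement (backslash, ampersand, and the # delimiter).
render_prometheus_config() {
  url_file="$1"
  template="$2"
  output="$3"
  remote_write_url="$(cat "$url_file")" || return 1
  # Refuse to render a config with an empty remote-write URL.
  [ -n "$remote_write_url" ] || { echo "empty remote-write URL" >&2; return 1; }
  # Prefix each special character with a backslash so sed copies it literally.
  escaped_url="$(printf '%s' "$remote_write_url" | sed 's/[\\&#]/\\&/g')"
  sed "s#__GRAFANA_CLOUD_REMOTE_WRITE_URL__#${escaped_url}#g" \
    "$template" > "$output"
}
```

With the paths from the init container this would be called as: render_prometheus_config /etc/grafana-cloud/url /etc/prometheus-template/prometheus.yml.template /etc/prometheus-generated/prometheus.yml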
The validation path has three layers. First, check that External Secrets has materialized the Grafana Cloud secret:
for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n observability get externalsecret grafana-cloud-remote-write
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n observability get secret grafana-cloud-remote-write
done
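That first layer only shows that the Secret object exists. A slightly stricter sketch checks that every key Prometheus mounts is non-empty, without printing any value. The key names url, username, and password are assumptions inferred from the file paths under /etc/grafana-cloud, and the function name is hypothetical.

```shell
# Sketch: confirm the synced Secret carries every key the Prometheus pod mounts.
# Key names are assumed from the mounted file paths; values are never printed.
check_grafana_cloud_secret() {
  cloud="$1"
  for key in url username password; do
    value="$(kubectl --kubeconfig "./kubeconfig.${cloud}.yaml" -n observability \
      get secret grafana-cloud-remote-write -o "jsonpath={.data.${key}}")" || return 1
    if [ -z "$value" ]; then
      echo "${cloud}: key ${key} is missing or empty" >&2
      return 1
    fi
  done
  echo "${cloud}: all keys present"
}
```

It can be run in the same loop as the commands above, once per cloud.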
Then check that Prometheus has started with the rendered configuration:
KUBECONFIG=./kubeconfig.aws.yaml \
  kubectl -n observability logs deployment/prometheus -c render-prometheus-config
KUBECONFIG=./kubeconfig.aws.yaml \
  kubectl -n observability logs deployment/prometheus -c prometheus --tail=80
Finally, generate Mandelbrot renders and query Grafana Cloud. These are the first useful cross-cluster expressions:
mandelbrot_render_requests_total{platform="trinity"}
sum by (cloud, cluster) (
  mandelbrot_render_requests_total{platform="trinity"}
)
sum by (cloud, cluster) (
  rate(mandelbrot_stage_renders_total{platform="trinity"}[5m])
)
The combined Grafana Cloud view now shows metrics from all three clusters in one place. The local Grafana instances are still useful for cluster-local inspection, but Grafana Cloud is the platform view.
The final reconciliation pass found two practical issues.
First, External Secrets can fail for different reasons that look similar from the app list. A missing ClusterSecretStore was only a sync-ordering effect: the operator and store were still being applied. A "Secret does not exist" error after the store was ready meant the cloud backend value was actually missing. Checking the ExternalSecret events and the cloud secret versions was the fastest way to separate those cases.
Second, the original GCP cluster was too small for the full platform. With Argo CD, External Secrets, observability, and Mandelbrot all running, Prometheus stayed Pending with "Insufficient cpu" on the single GKE node. The follow-up infrastructure fix is to raise the GCP node pool to two nodes. That change is later than the first secrets branch snapshot, but it is the practical fix for the run described here. After that, AWS, GCP, and Azure all settled with the same shape: the secret apps healthy, the synced Kubernetes secrets present, and every observability pod running.
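For the first of those two issues, the conditions and events of the ExternalSecret are usually enough to tell ordering problems from a missing backend value. The helper below is a hypothetical sketch that gathers both with standard kubectl output options. It assumes the kubeconfig naming used throughout this post.

```shell
# Sketch: print the status conditions and recent events for one ExternalSecret.
explain_external_secret() {
  cloud="$1"
  namespace="$2"
  name="$3"
  # One line per condition: type=status reason: message
  kubectl --kubeconfig "./kubeconfig.${cloud}.yaml" -n "$namespace" \
    get externalsecret "$name" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}: {.message}{"\n"}{end}'
  # Events recorded against this ExternalSecret.
  kubectl --kubeconfig "./kubeconfig.${cloud}.yaml" -n "$namespace" \
    get events \
    --field-selector "involvedObject.kind=ExternalSecret,involvedObject.name=${name}"
}
```

For example: explain_external_secret gcp observability grafana-cloud-remote-write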
This closes the centralized metrics part of the observability goal. Before adding more observability signals, the platform needs one more boundary: admission control. The next checkpoint is about what the cluster should reject before a workload is allowed to run.
Lifecycle boundary #
There was one important correction after the first end-to-end run. This landed later than the first secrets branch snapshot, but it belongs with the secrets checkpoint because it changes the operational boundary. I initially treated secret backend creation as part of the normal infrastructure deploy. That works until a backend object is deleted.
Azure Key Vault soft-delete keeps the vault name reserved, and AWS Secrets Manager keeps a deleted secret name in a scheduled-deletion state. Re-running the normal CI deployment then fails with "already exists in deleted state" or "scheduled for deletion" errors.
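Before re-running a deploy, it is possible to check whether a name is still held by a deleted object. The sketch below uses the aws and az CLIs and assumes both are installed and authenticated. The function names are hypothetical, and the secret and vault names are whatever the stack uses.

```shell
# Sketch: report whether a backend name is still held by a deleted object.
# Each function returns 0 if tombstoned, 1 if not, 2 if the CLI call failed.
aws_secret_pending_deletion() {
  # DeletedDate is only set while the secret is scheduled for deletion;
  # with text output an absent field is printed as "None".
  deleted="$(aws secretsmanager describe-secret --secret-id "$1" \
    --query DeletedDate --output text)" || return 2
  [ -n "$deleted" ] && [ "$deleted" != "None" ]
}

azure_vault_soft_deleted() {
  # Lists soft-deleted vaults and filters by name.
  found="$(az keyvault list-deleted --query "[?name=='$1'].name" --output tsv)" || return 2
  [ -n "$found" ]
}
```

A check like this only reports state. Restoring or purging the object is a separate, deliberate step.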
The fix was to make secret backend lifecycle explicit. Normal cluster deployment now grants access to existing backend names, but it does not purge or recreate tombstoned secrets as a side effect. A separate manually triggered GitHub Actions workflow manages the secret backends:
Secret Backends
  apply  -> create/update backend containers and secret versions
  delete -> remove them only with an explicit confirmation input
That keeps routine deploys from doing destructive secret cleanup while still making the secret lifecycle repeatable from CI.
It also made a real bug visible in the helper script: pulumi config get --show-secrets is not a valid command for the Pulumi CLI version I was using. The script now reads stack config with pulumi config --show-secrets --json and parses the values from JSON. Before that fix, GCP Secret Manager had secret containers but no enabled versions, so External Secrets reported SecretSyncedError even though the names existed.
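The JSON-based read can be sketched as follows. This is not the actual helper script: the function name is hypothetical, jq is an assumed dependency, and the input shape (each config key mapping to an object with a value field) is my understanding of what pulumi config --show-secrets --json prints. The function fails loudly on a missing or empty value so that a backend secret is never created without a usable version.

```shell
# Sketch: extract one config value from `pulumi config --show-secrets --json`
# output read on stdin. Fails if the key is missing or the value is empty.
pulumi_secret_value() {
  # jq -e exits non-zero when the filter produces no output (missing key).
  value="$(jq -er --arg key "$1" '.[$key].value // empty')" || {
    echo "config key not found: $1" >&2
    return 1
  }
  [ -n "$value" ] || { echo "config value is empty: $1" >&2; return 1; }
  printf '%s\n' "$value"
}
```

For example: pulumi -C infra/pulumi/aws config --show-secrets --json | pulumi_secret_value trinity:secretsDemoValue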
That lesson is bigger than this demo secret. Secrets have lifecycle semantics that are different from ordinary infrastructure. A deleted secret may still have recovery behavior, reserved names, disabled versions, or purge windows. The normal deploy path should grant access to the current backend objects and sync the Kubernetes shape. Creating, deleting, and purging backend secrets deserves a separate, explicit operational path.
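The explicit confirmation on the delete path can be as small as a guard at the top of the workflow script. The sketch below is hypothetical: the action names mirror the apply and delete actions described above, but the confirmation phrase and function name are made up for illustration.

```shell
# Sketch: refuse destructive actions unless an explicit confirmation matches.
# The confirmation phrase is a placeholder, not the one the workflow uses.
require_confirmation() {
  action="$1"
  confirm="$2"
  case "$action" in
    apply)
      return 0 ;;
    delete)
      if [ "$confirm" = "delete-secret-backends" ]; then
        return 0
      fi
      echo "delete requires confirm=delete-secret-backends" >&2
      return 1 ;;
    *)
      echo "unknown action: $action" >&2
      return 1 ;;
  esac
}
```

A workflow step would call it with its two inputs and stop on a non-zero exit before touching any backend.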