The Trinity

It's been a while since I stretched my infra muscles. I was looking for some exercise and that got me thinking.
Application code is easy to revisit in small slices. Infrastructure is not. Leave Kubernetes, cloud networking, GitOps, observability, and delivery machinery alone for long enough, and the surface area moves under your feet. Managed services change, defaults change, and patterns that once felt sensible begin to look suspiciously dated.
I wanted an exercise with enough weight to be interesting and grounded in realistic platform work. One that reflects the kind of problem an experienced infrastructure engineer should be able to reason about: reproducibility, traffic management, policy, secrets, observability, and failure.
So I wrote one: an exercise that covers all of these points while leaving the implementation choices open.
Multi-Cloud Platform Engineering Exercise Spec
1. Purpose
This exercise is designed to test modern infrastructure and platform engineering skills at a level comparable to “Kubernetes across three cloud providers,” but in a way that is more realistic, more useful, and more aligned with how production platforms are commonly built.
Instead of stretching one Kubernetes cluster across multiple clouds, the candidate will build three separate clusters—one in AWS, one in GCP, and one in Azure—and operate them as a coherent platform using GitOps, infrastructure as code, policy controls, observability, and traffic failover.
This exercise is intended to evaluate practical judgment as much as technical execution.
2. Summary
Design and implement a multi-cloud, multi-cluster application platform spanning:
- AWS: 1 Kubernetes cluster
- GCP: 1 Kubernetes cluster
- Azure: 1 Kubernetes cluster
A sample application must be deployed consistently to all three clusters. The platform must support:
- Declarative infrastructure provisioning
- GitOps-based application and platform delivery
- Centralized observability
- Secret management
- Global traffic routing and failover
- Progressive delivery or controlled rollout
- Basic policy enforcement
- Operational documentation
3. Scenario
You are building the infrastructure platform for a fictional SaaS product with the following requirements:
- It serves users globally.
- It must continue operating if one cloud provider becomes unavailable.
- Platform changes must be auditable and reproducible.
- Application teams should be able to deploy safely without manually editing clusters.
- Operators need visibility across all environments.
- Secrets and configuration should be handled securely.
- The design should be extensible toward a future internal developer platform.
You must propose and implement a platform architecture that satisfies these goals.
4. What This Exercise Is Testing
This exercise is intended to test:
- Infrastructure architecture judgment
- Kubernetes operations across multiple environments
- CI/CD and GitOps maturity
- Cloud networking fundamentals
- Traffic management and reliability design
- Secrets and identity strategy
- Observability design
- Policy and governance
- Failure handling and operational thinking
- Documentation quality
It is not intended to reward unnecessary complexity, novelty for its own sake, or brittle “hero architecture.”
5. Logical Architecture
               +----------------------+
               |      Git Repos       |
               |  infra / platform /  |
               |     application      |
               +----------+-----------+
                          |
                          v
               +----------------------+
               |   CI / Validation    |
               | lint, preview, test, |
               | policy, image build  |
               +----------+-----------+
                          |
                          v
              +-----------------------+
              | GitOps Control Layer  |
              |   CD desired state    |
              |    reconciliation     |
              +-----+---------------+-+
                    |               |
        ------------+---------------+-------------+
        |                    |                    |
        v                    v                    v
+----------------+   +----------------+   +----------------+
|   AWS / EKS    |   |   GCP / GKE    |   |  Azure / AKS   |
| app + ingress  |   | app + ingress  |   | app + ingress  |
|  metrics/logs  |   |  metrics/logs  |   |  metrics/logs  |
+-------+--------+   +-------+--------+   +-------+--------+
          \                  |                  /
           \                 |                 /
            \                |                /
        +----------------------------------------+
        | Global DNS / Traffic Steering / Health |
        +----------------------------------------+
6. Mandatory Requirements
The solution must include all of the following.
Infrastructure as Code
All cloud infrastructure must be provisioned declaratively or through a reproducible infrastructure-as-code workflow.
Minimum scope:
- Kubernetes clusters
- Networking required for cluster access
- DNS or traffic infrastructure
- Secret backends or secret integration
- Observability infrastructure if applicable
Three Clusters
Create one Kubernetes cluster in each of:
- AWS
- GCP
- Azure
Managed services are acceptable and encouraged:
- EKS
- GKE
- AKS
GitOps
A GitOps operator must continuously reconcile at least:
- namespaces
- core platform components
- application manifests or Helm releases
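To make "continuously reconcile" concrete, here is a minimal Python sketch of the diff a GitOps operator computes between the desired state in Git and the actual state in a cluster. It is an illustration only, not how any particular operator works: the `Plan` type, the `plan_reconcile` and `apply_plan` helpers, and the `Kind/namespace/name` keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Actions needed to move a cluster from its actual state to the desired state."""
    create: tuple
    update: tuple
    delete: tuple


def plan_reconcile(desired: dict, actual: dict) -> Plan:
    """Diff two {resource_key: spec} mappings (hypothetical in-memory model)."""
    create = tuple(sorted(k for k in desired if k not in actual))
    update = tuple(sorted(k for k in desired if k in actual and desired[k] != actual[k]))
    # Resources that exist in the cluster but not in Git are pruned.
    delete = tuple(sorted(k for k in actual if k not in desired))
    return Plan(create, update, delete)


def apply_plan(desired: dict, actual: dict, plan: Plan) -> dict:
    """Return the cluster state after applying the plan (pure function, no I/O)."""
    new_state = {k: v for k, v in actual.items() if k not in plan.delete}
    for key in plan.create + plan.update:
        new_state[key] = desired[key]
    return new_state
```

Running `plan_reconcile` again on the result of `apply_plan` yields an empty plan, which is the property a reconciliation loop relies on: repeated runs converge and then do nothing.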
Sample Application
Deploy a non-trivial sample application to all clusters.
Minimum:
- frontend
- API
- health endpoints
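As one possible shape for the health endpoints, the sketch below separates liveness from readiness using only the Python standard library. The `/healthz` and `/readyz` paths, the `health_response` helper, and the `ready` flag are assumptions chosen for illustration; the spec does not prescribe them.

```python
import json
from http.server import BaseHTTPRequestHandler


def health_response(path: str, ready: bool) -> tuple[int, str]:
    """Map a request path to (HTTP status, JSON body)."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200, json.dumps({"status": "ok"})
    if path == "/readyz":
        # Readiness: report 503 while dependencies are unavailable.
        return (200 if ready else 503), json.dumps({"ready": ready})
    return 404, json.dumps({"error": "not found"})


class HealthHandler(BaseHTTPRequestHandler):
    # Assumed to be flipped by the application when its dependencies change state.
    ready = True

    def do_GET(self):
        status, body = health_response(self.path, HealthHandler.ready)
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To serve it, pass the handler to `http.server.HTTPServer`, for example `HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()`. Keeping the decision logic in a plain function makes it testable without opening a socket.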
Global Traffic Strategy
Users must reach the application through a single public entry point.
The implementation must support one of:
- weighted routing
- latency-based routing
- active/passive failover
- health-based DNS failover
You must document:
- how routing decisions work
- how health is determined
- how failover is tested
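The three documentation points can be thought through with a small model. The Python sketch below combines weighted routing with health-based failover: an endpoint is marked unhealthy after a number of consecutive failed probes, and its share of traffic is redistributed to the rest. The `Endpoint` type, the threshold of 3, and the fail-open behaviour are assumptions for illustration, not the behaviour of any specific DNS or traffic service. Weights are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    weight: int        # assumed positive
    failures: int = 0  # consecutive failed health probes

    def record_probe(self, ok: bool) -> None:
        """A successful probe resets the counter; a failed one increments it."""
        self.failures = 0 if ok else self.failures + 1

    def healthy(self, threshold: int = 3) -> bool:
        return self.failures < threshold


def effective_weights(endpoints, threshold: int = 3) -> dict:
    """Return the traffic share per endpoint, skipping unhealthy ones."""
    healthy = [e for e in endpoints if e.healthy(threshold)]
    # Fail open: if every endpoint looks unhealthy, keep routing to all of them
    # rather than returning no answer at all.
    pool = healthy or list(endpoints)
    total = sum(e.weight for e in pool)
    return {e.name: e.weight / total for e in pool}
```

A failover test in this model is simply a sequence of failed probes against one endpoint followed by a check that its share dropped to zero. In a real setup the equivalent test would take a cluster's ingress offline and observe the public entry point.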
Observability
Provide centralized or federated observability across clusters.
Minimum:
- application metrics
- infrastructure metrics
- logs
- distributed tracing or trace-ready instrumentation
Secrets Management
Secrets must not be hardcoded in manifests or repos.
You must explain:
- where secrets live
- how they are synced into clusters
- how rotation would work
Basic Policy Enforcement
Implement at least two policy controls.
Examples:
- require CPU/memory requests and limits
- block privileged containers
- require approved image registries
- require labels/ownership metadata
- block latest image tags
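To show what such controls check, here is a Python sketch that evaluates three of the example policies against a pod spec represented as a plain dictionary. In a cluster these rules would normally be enforced by an admission controller or policy engine rather than a script; the `violations` function and its messages are hypothetical and exist only to make the rules explicit.

```python
def violations(pod_spec: dict) -> list[str]:
    """Return human-readable policy violations for a pod spec dictionary."""
    found = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Policy: block the "latest" tag (an image with no tag defaults to latest).
        # Only the last path segment is inspected, so a registry port such as
        # "host:5000/app" is not mistaken for a tag.
        last_segment = container.get("image", "").rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        pinned_by_digest = "@" in last_segment
        if not pinned_by_digest and tag in ("", "latest"):
            found.append(f"{name}: image must pin a tag other than latest")

        # Policy: block privileged containers.
        if container.get("securityContext", {}).get("privileged", False):
            found.append(f"{name}: privileged containers are not allowed")

        # Policy: require CPU and memory requests and limits.
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(section, {}):
                    found.append(f"{name}: missing resources.{section}.{resource}")
    return found
```

An empty list means the spec passes all three checks. Running the same rules in CI against rendered manifests gives earlier feedback than waiting for the cluster to reject a deployment.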
Reliability Demonstration
Demonstrate at least one of:
- traffic failover when one cluster is unavailable
- controlled rollout / canary
- rollback after a failed deployment
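The controlled rollout and rollback options share one core decision: shift traffic in steps and abort when a metric crosses a threshold. The Python sketch below models that gate. The step weights, the 1% error-rate threshold, and the `error_rate_at` callback (standing in for a metrics query) are assumptions for illustration, not the behaviour of a specific progressive delivery tool.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
):
    """Walk through traffic weights; roll back on the first step that fails the gate.

    Returns (outcome, history), where outcome is "promoted" or "rolled_back"
    and history lists the (weight, observed_error_rate) pairs that were evaluated.
    """
    history = []
    for weight in steps:
        observed = error_rate_at(weight)  # stand-in for querying real metrics
        history.append((weight, observed))
        if observed > max_error_rate:
            return "rolled_back", history
    return "promoted", history
```

For example, `run_canary([5, 25, 50, 100], lambda w: 0.001)` promotes the release, while a callback that reports a 20% error rate at the 50% step stops the rollout there and never reaches 100%.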
Documentation
You must provide:
- architecture overview
- deployment instructions
- repo structure explanation
- operational runbook
- known tradeoffs
- future improvements
7. Nice-to-Have Requirements
These are optional, but valuable.
- service mesh with mTLS
- workload identity / IRSA / federated identity
- SLOs and alerting
- per-team namespaces and RBAC model
- reusable Pulumi components
- reusable Helm chart or Kustomize base
- Backstage integration
- cost controls / autoscaling strategy
- chaos testing
- DR strategy notes
Solution
Here is my final solution with all the findings and gotchas: