Policies

Policy enforcement flow

Policy checkpoint #

The platform can now provision clusters, reconcile workloads, route traffic, sync secrets, and publish central metrics. That is enough to run the system, but it still leaves one important boundary open: every manifest that reaches the Kubernetes API is trusted.

That is not a realistic platform shape. Application teams should get fast feedback when a workload misses basic operational fields or asks for unsafe privileges. The cluster should reject that workload before it runs.

This checkpoint adds that first admission boundary with Kyverno. Kyverno fits well because policies are Kubernetes resources. Argo CD can reconcile them from Git like the rest of the platform state.

The baseline is deliberately small:

  • require CPU and memory requests and limits
  • block privileged containers
  • block latest image tags

That gives the platform real enforcement without pretending this is a complete governance program.

GitOps ownership #

Each cluster gets two Argo CD applications:

trinity-policy-engine-aws
trinity-policies-aws
trinity-policy-engine-gcp
trinity-policies-gcp
trinity-policy-engine-azure
trinity-policies-azure

The split is intentional. The policy engine application installs Kyverno. The policies application applies the Trinity baseline policies.

The AWS policy engine application is representative:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trinity-policy-engine-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: trinity
  source:
    repoURL: https://kyverno.github.io/kyverno/
    chart: kyverno
    targetRevision: 3.6.1
    helm:
      releaseName: kyverno
  destination:
    server: https://kubernetes.default.svc
    namespace: kyverno
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
      - ServerSideApply=true

The new policy set requires resource requests and limits, so the real manifest also pins them for the Kyverno controllers. The policy engine should meet the same operational standard it is about to enforce elsewhere.

The policies application syncs after the engine:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trinity-policies-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  project: trinity
  source:
    repoURL: https://github.com/maxgherman/trinity.git
    targetRevision: main
    path: platform/policy/overlays/aws
  destination:
    server: https://kubernetes.default.svc
    namespace: kyverno

That gives the root application a clear order:

wave -1: Argo CD project
wave  0: application declarations
wave  1: policy engine
wave  2: policy resources
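
The ordering above can be sketched in a few lines of JavaScript. This is only an illustration of how the sync-wave annotation sorts resources, not Argo CD's implementation, which also considers sync phase, kind, and name. The helper names and the wave 0 application name are hypothetical.

```javascript
// Hypothetical helper: order resources by their Argo CD sync-wave annotation.
const WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave";

function syncWave(resource) {
  const raw = resource.metadata?.annotations?.[WAVE_ANNOTATION];
  // Resources without the annotation are treated as wave 0.
  const wave = Number.parseInt(raw ?? "0", 10);
  return Number.isNaN(wave) ? 0 : wave;
}

function inSyncOrder(resources) {
  // Copy before sorting so the caller's array is left untouched.
  return [...resources].sort((a, b) => syncWave(a) - syncWave(b));
}

// Illustrative objects only; the wave 0 application name is made up.
const waveExamples = [
  { kind: "Application", metadata: { name: "trinity-policies-aws", annotations: { [WAVE_ANNOTATION]: "2" } } },
  { kind: "AppProject", metadata: { name: "trinity", annotations: { [WAVE_ANNOTATION]: "-1" } } },
  { kind: "Application", metadata: { name: "trinity-policy-engine-aws", annotations: { [WAVE_ANNOTATION]: "1" } } },
  { kind: "Application", metadata: { name: "example-wave-0-app" } },
];

const orderedNames = inSyncOrder(waveExamples).map((r) => r.metadata.name);
console.log(orderedNames.join(" -> "));
```

Lower waves sort first, so the project precedes the application declarations, and the policy engine precedes the policies that depend on its CRDs.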

The Argo CD project also needed to allow the Kyverno chart repository and Kyverno cluster resources:

sourceRepos:
  - https://github.com/maxgherman/trinity.git
  - https://charts.external-secrets.io
  - https://kyverno.github.io/kyverno/
clusterResourceWhitelist:
  - group: kyverno.io
    kind: ClusterPolicy

Kyverno itself also installs cluster-scoped Kubernetes resources such as CRDs, cluster roles, cluster role bindings, and admission webhooks. Those were already part of the project allowance from the previous operator checkpoints.

Scope #

The first Trinity policy set is scoped to workload namespaces:

hello
mandelbrot
observability
secrets-demo

That scope is a conscious tradeoff. It enforces real platform workloads while leaving Argo CD, Kyverno, and system namespaces outside the first blast radius.

Fail-closed admission policy is powerful. It can also make a cluster painful to recover if the first policy pass is too broad. For this checkpoint, workload namespaces are enough.

Baseline Policies #

The policies live under platform/policy:

platform/policy/
  base/
    disallow-latest-image-tag.yaml
    disallow-privileged-containers.yaml
    require-resources.yaml
  overlays/
    aws/
    gcp/
    azure/

The first policy requires CPU and memory requests and limits on both containers and init containers:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: trinity-require-container-resources
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-requests-and-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - hello
                - mandelbrot
                - observability
                - secrets-demo
      validate:
        message: All containers and init containers must declare CPU and memory requests and limits.
        pattern:
          spec:
            =(initContainers):
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"

The =(initContainers) syntax is important. It means "if init containers exist, validate them too." Without that conditional anchor, the policy would require every pod to have init containers.
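
The rule's logic can be restated in plain JavaScript. The sketch below is not how Kyverno evaluates patterns; the function names are hypothetical and only mirror the behavior described above, including the conditional handling of init containers.

```javascript
// Plain JavaScript restatement of the rule's logic; not Kyverno's implementation.
const REQUIRED_FIELDS = [
  ["requests", "cpu"],
  ["requests", "memory"],
  ["limits", "cpu"],
  ["limits", "memory"],
];

// Mirrors the "?*" wildcard: the value must be present with at least one character.
function isSet(value) {
  return value !== undefined && value !== null && String(value).length > 0;
}

function containerHasResources(container) {
  return REQUIRED_FIELDS.every(([section, name]) =>
    isSet(container.resources?.[section]?.[name])
  );
}

function podHasResources(pod) {
  const containers = pod.spec?.containers ?? [];
  // =(initContainers): checked only when the field exists, so a missing list passes.
  const initContainers = pod.spec?.initContainers ?? [];
  // containers carries no anchor, so the list must exist and every entry must pass.
  return (
    containers.length > 0 &&
    containers.every(containerHasResources) &&
    initContainers.every(containerHasResources)
  );
}

// Illustrative values only.
const fullResources = {
  requests: { cpu: "100m", memory: "128Mi" },
  limits: { cpu: "500m", memory: "256Mi" },
};
const podWithoutInit = { spec: { containers: [{ name: "app", resources: fullResources }] } };
const podWithBareInit = {
  spec: {
    initContainers: [{ name: "setup", resources: { requests: { cpu: "50m" } } }],
    containers: [{ name: "app", resources: fullResources }],
  },
};

console.log(podHasResources(podWithoutInit));
console.log(podHasResources(podWithBareInit));
```

The first pod passes even though it has no init containers. The second fails because its init container exists but declares only a CPU request.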

The second policy rejects latest tags:

validate:
  message: Container images must use explicit non-latest tags.
  deny:
    conditions:
      any:
        - key: "{{ images.containers.*.tag || '' }}"
          operator: AnyIn
          value:
            - latest
        - key: "{{ images.initContainers.*.tag || '' }}"
          operator: AnyIn
          value:
            - latest
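
As a rough illustration of what the deny condition checks, the sketch below extracts the tag from each image reference and looks for latest. The parser is hypothetical and simplified; Kyverno uses its own image reference parsing. It assumes an image with neither tag nor digest resolves to latest, as container runtimes do.

```javascript
// Hypothetical parser for illustration only.
function imageTag(image) {
  // Separate an optional digest suffix such as "@sha256:...".
  const [name, digest] = image.split("@");
  // Only the last path segment can carry a tag; earlier colons belong to a registry port.
  const lastSegment = name.slice(name.lastIndexOf("/") + 1);
  const colon = lastSegment.indexOf(":");
  if (colon !== -1) return lastSegment.slice(colon + 1);
  // Assumption: no tag and no digest means "latest"; a digest-only reference has no tag.
  return digest ? "" : "latest";
}

function usesLatestTag(pod) {
  const all = [
    ...(pod.spec?.containers ?? []),
    ...(pod.spec?.initContainers ?? []),
  ];
  return all.some((container) => imageTag(container.image) === "latest");
}

// Illustrative pods only.
const pinnedPod = { spec: { containers: [{ name: "web", image: "nginx:1.27" }] } };
const floatingPod = {
  spec: {
    initContainers: [{ name: "setup", image: "localhost:5000/tools" }],
    containers: [{ name: "web", image: "nginx:1.27" }],
  },
};

console.log(usesLatestTag(pinnedPod));
console.log(usesLatestTag(floatingPod));
```

Under that assumption, an untagged image is treated the same as an explicit latest tag, so omitting the tag does not bypass the rule in this sketch.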

The third policy rejects privileged containers:

validate:
  message: Privileged containers are not allowed in Trinity workload namespaces.
  pattern:
    spec:
      =(initContainers):
        - =(securityContext):
            =(privileged): "false"
      containers:
        - =(securityContext):
            =(privileged): "false"
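
The conditional anchors in this pattern mean that a missing securityContext or a missing privileged field passes, and only an explicit privileged: true is rejected. A minimal JavaScript sketch of that logic, with hypothetical function names and not Kyverno's own evaluation:

```javascript
// Plain JavaScript restatement of the pattern; not Kyverno's implementation.
function isPrivileged(container) {
  // =(securityContext) and =(privileged) are conditional, so absent fields pass.
  return container.securityContext?.privileged === true;
}

function podAvoidsPrivilegedMode(pod) {
  const all = [
    ...(pod.spec?.containers ?? []),
    ...(pod.spec?.initContainers ?? []),
  ];
  return !all.some(isPrivileged);
}

// Illustrative pods only.
const defaultPod = { spec: { containers: [{ name: "app" }] } };
const explicitFalsePod = {
  spec: { containers: [{ name: "app", securityContext: { privileged: false } }] },
};
const privilegedInitPod = {
  spec: {
    initContainers: [{ name: "setup", securityContext: { privileged: true } }],
    containers: [{ name: "app" }],
  },
};

console.log(podAvoidsPrivilegedMode(defaultPod));
console.log(podAvoidsPrivilegedMode(explicitFalsePod));
console.log(podAvoidsPrivilegedMode(privilegedInitPod));
```

The third pod is rejected because of its init container, even though the main container is unprivileged.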

The current workloads already meet this baseline. The hello, mandelbrot, Prometheus, and Grafana containers, along with the Prometheus config-render init container, all declare explicit resources, use pinned image tags, and run without privileged mode.

CI check #

CI had to understand the new platform slice before Argo CD could rely on it. The manifest checker now allows ClusterPolicy:

const requiredKinds = new Set([
  "Application",
  "AppProject",
  "ConfigMap",
  "ClusterSecretStore",
  "ClusterPolicy",
  "Deployment",
  "ExternalSecret",
  "Kustomization",
  "Namespace",
  "Service",
]);
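
A kind check against a set like this could look roughly as follows. This is a sketch under assumptions, not the repository's actual checker: the function name is hypothetical, the set is a shortened illustrative copy, and the manifests are assumed to be already parsed from YAML into objects.

```javascript
// Hypothetical allow-list check; the repository's actual checker may differ.
const exampleAllowedKinds = new Set(["Application", "ClusterPolicy", "Deployment"]);

function findUnsupportedKinds(manifests, allowedKinds) {
  const unsupported = new Set();
  for (const manifest of manifests) {
    const kind = manifest?.kind;
    if (typeof kind !== "string" || !allowedKinds.has(kind)) {
      // Report a placeholder when the manifest has no kind at all.
      unsupported.add(String(kind ?? "(missing kind)"));
    }
  }
  // Sorted so CI output is stable between runs.
  return [...unsupported].sort();
}

// Illustrative parsed manifests only.
const parsedManifests = [
  { kind: "ClusterPolicy", metadata: { name: "trinity-require-container-resources" } },
  { kind: "DaemonSet", metadata: { name: "example" } },
  { metadata: { name: "no-kind" } },
];

const unsupportedKinds = findUnsupportedKinds(parsedManifests, exampleAllowedKinds);
console.log(unsupportedKinds);
```

A non-empty result would fail the CI step before Argo CD ever sees the manifest.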

The workflows also render the policy overlays:

kubectl kustomize platform/policy/overlays/aws
kubectl kustomize platform/policy/overlays/gcp
kubectl kustomize platform/policy/overlays/azure

That catches simple failures before the policy controller sees them: invalid YAML, manifest kinds the checker does not allow, or a broken Kustomize overlay.

Validation #

The live validation path starts with the two Argo CD applications:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-policy-engine-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-policies-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n kyverno get deployment,pods
KUBECONFIG=./kubeconfig.aws.yaml kubectl get clusterpolicy

Then test the denial path with a pod that breaks the baseline:

for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml kubectl -n hello run policy-denied \
    --image=nginx:latest \
    --restart=Never
done

All three clusters rejected the pod before creation. The denial named both policies that should catch it:

trinity-disallow-latest-image-tag:
  disallow-latest-image-tag: Container images must use explicit non-latest tags.
trinity-require-container-resources:
  require-requests-and-limits: All containers and init containers must declare CPU and memory requests and limits.

That is the useful platform behavior. The developer gets a direct admission error with the policy names and the remediation signal.

The final check showed the same three enforced cluster policies ready in AWS, GCP, and Azure:

policy                                   aws  gcp  azure
trinity-disallow-latest-image-tag        ok   ok   ok
trinity-disallow-privileged-containers   ok   ok   ok
trinity-require-container-resources      ok   ok   ok

Argo CD Drift #

One Argo CD issue showed up. Kyverno records generated policy state under ClusterPolicy.status, including pod-controller autogen rules. That is runtime controller state, not desired state. Without an ignore rule, Argo CD can mark the policies application OutOfSync even though the policy spec is correct. The policies application now enables server-side diff and ignores /status for Kyverno ClusterPolicy resources:

metadata:
  annotations:
    argocd.argoproj.io/compare-options: ServerSideDiff=true
spec:
  ignoreDifferences:
    - group: kyverno.io
      kind: ClusterPolicy
      jsonPointers:
        - /status

That keeps Argo CD focused on the desired policy spec and leaves controller-owned status alone.
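
The effect of ignoring a JSON pointer can be illustrated with a small comparison sketch. Argo CD's real diff normalizes far more than this; the function names are hypothetical, and the comparison assumes both objects list their keys in the same order.

```javascript
// Hypothetical illustration of ignoring JSON pointers during a comparison.
function withoutPointer(resource, pointer) {
  // "/status" -> ["status"]; unescape the ~1 and ~0 sequences JSON Pointer defines.
  const keys = pointer
    .split("/")
    .slice(1)
    .map((key) => key.replace(/~1/g, "/").replace(/~0/g, "~"));
  // Deep copy so the caller's object is not modified.
  const copy = JSON.parse(JSON.stringify(resource));
  let parent = copy;
  for (const key of keys.slice(0, -1)) {
    if (parent === null || typeof parent !== "object") return copy;
    parent = parent[key];
  }
  if (parent !== null && typeof parent === "object") {
    delete parent[keys[keys.length - 1]];
  }
  return copy;
}

function sameIgnoring(desired, live, pointers) {
  const strip = (resource) => pointers.reduce(withoutPointer, resource);
  // String comparison is enough for a sketch; it depends on matching key order.
  return JSON.stringify(strip(desired)) === JSON.stringify(strip(live));
}

// Illustrative objects only; the status content is made up.
const desiredPolicy = {
  apiVersion: "kyverno.io/v1",
  kind: "ClusterPolicy",
  metadata: { name: "trinity-require-container-resources" },
  spec: { background: true },
};
const livePolicy = { ...desiredPolicy, status: { ready: true } };

console.log(sameIgnoring(desiredPolicy, livePolicy, []));
console.log(sameIgnoring(desiredPolicy, livePolicy, ["/status"]));
```

Without the ignore list the two objects differ only by controller-written status; with /status ignored they compare as equal.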

There was one app-of-apps operational note too. The AWS root picked up the policy applications after a hard refresh. GCP and Azure needed an explicit root sync before the new child applications appeared:

for cloud in gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml kubectl -n argocd patch application trinity-dev-${cloud}-root \
    --type merge \
    -p '{"operation":{"sync":{"syncStrategy":{"hook":{}}}}}'
done

That is worth remembering for this repo. The root application owns the child Application resources. If the Git commit is correct but a new child app does not appear, sync the root first.

Exit #

This closes the basic policy enforcement requirement. The platform now has a small admission baseline across AWS, GCP, and Azure. It is still intentionally narrow. A stronger version would add policy tests in CI, ownership label rules, namespace onboarding rules, policy reports, and a clearer emergency recovery runbook.

With a basic admission boundary in place, the next checkpoint returns to observability. Metrics show the aggregate shape of the system, but they do not explain what happened to one slow or broken render. That needs logs and traces.

Source code #

Reference implementation