
RFC-DEPLOY-0001                                              Section 3
Category: Standards Track                          System Architecture

3. System Architecture



This section defines the macro-architecture of the system: layers, phases, authority domains, and trust boundaries. No component-level detail is described here.


3.1 Architectural Overview

The deployment orchestration system is designed as a layered, phase-driven architecture.

Rather than treating deployment as a single operation, the architecture explicitly models:

  • which system is authoritative at each point in the lifecycle,
  • what transitions occur between phases,
  • and how authority is handed off between layers.

This avoids the most common failure mode in deployment systems: assuming that all components already exist and are healthy during initial deployment.

At a high level, the platform consists of four conceptual layers:

┌─────────────────────────────────────────────────────────────────────────────┐
│                       BOOTSTRAP LAYER                                        │
│  Imperative setup of cluster prerequisites and orchestration primitives     │
│  Authority: Day 0-1 operations                                              │
│  Technology recommendation: Ansible with kubernetes.core collection         │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATION LAYER                                       │
│  DAG-based dependency resolution and cross-application sequencing           │
│  Authority: Deployment ordering and health-gated progression                │
│  Technology: Argo Workflows                                                 │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│                     DEPLOYMENT LAYER                                         │
│  Declarative application state management and continuous reconciliation     │
│  Authority: Application-level sync, drift detection, self-healing           │
│  Technology: ArgoCD                                                         │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│                      PROMOTION LAYER                                         │
│  Environment-to-environment promotion with verification gates               │
│  Authority: Stage progression and verification                              │
│  Technology: Kargo, Argo Rollouts                                           │
└─────────────────────────────────────────────────────────────────────────────┘

Each layer activates in sequence, and authority transfers explicitly at defined boundaries.


3.2 Phase Model Overview

The system progresses through five distinct phases.

These phases are logical constructs, not temporal guarantees: the system MAY remain in any phase indefinitely during failure or recovery.


Phase 0 — Pre-Bootstrap (Prerequisites)

Phase 0 defines the exact contract that MUST be satisfied before bootstrap can begin. These are hard requirements, not recommendations.


3.2.1 Required CLI Tools

The following tools MUST be installed and accessible on the execution host:

  Tool              Minimum Version  Purpose                Verification Command
  ----------------  ---------------  ---------------------  --------------------------
  ansible-playbook  2.15+            Playbook execution     ansible-playbook --version
  kubectl           1.28+            Kubernetes API access  kubectl version --client
  helm              3.12+            Chart installation     helm version --short
  kubeseal          0.24+            Secret encryption      kubeseal --version
  jq                1.6+             JSON processing        jq --version
  yq                4.0+             YAML processing        yq --version

The executor MUST fail with exit code 3 (Invalid Configuration) if any required tool is missing.
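
For illustration, a non-normative Ansible sketch of this pre-flight check follows; mapping a failed play to exit code 3 is assumed to be handled by the wrapper that invokes ansible-playbook.

  # Fail fast if any required CLI tool is missing from the execution
  # host. Minimum-version parsing is omitted for brevity.
  - name: Verify required CLI tools are installed
    ansible.builtin.command: "{{ item }}"
    loop:
      - ansible-playbook --version
      - kubectl version --client
      - helm version --short
      - kubeseal --version
      - jq --version
      - yq --version
    changed_when: false   # read-only probes; never report a change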


3.2.2 Required Python Dependencies

Ansible requires the following Python packages:

  Package     Purpose
  ----------  --------------------------
  kubernetes  kubernetes.core collection
  openshift   OpenShift integration
  pyyaml      YAML parsing
  jsonpatch   JSON patch operations

These MUST be installed in the Ansible execution environment.
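
They can be installed with a single pip task; a non-normative sketch:

  - name: Install Python dependencies for Kubernetes automation
    ansible.builtin.pip:
      name:
        - kubernetes
        - openshift
        - pyyaml
        - jsonpatch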


3.2.3 Required Ansible Collections

The following Ansible collections MUST be installed:

  Collection         Minimum Version  Purpose
  -----------------  ---------------  -------------------------
  kubernetes.core    4.0.0            Kubernetes module support
  community.general  4.0.0            General modules
  ansible.posix      1.0.0            POSIX operations
  ansible.utils      5.0.0            Utility filters
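
A non-normative requirements.yml expressing these minimums, installed with ansible-galaxy collection install -r requirements.yml:

  collections:
    - name: kubernetes.core
      version: ">=4.0.0"
    - name: community.general
      version: ">=4.0.0"
    - name: ansible.posix
      version: ">=1.0.0"
    - name: ansible.utils
      version: ">=5.0.0"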

3.2.4 Required Cluster State

The target Kubernetes cluster MUST satisfy:

  Requirement                    Verification                                     Failure Mode
  -----------------------------  -----------------------------------------------  --------------
  API server accessible          kubectl cluster-info returns without error       Cannot proceed
  Authentication valid           kubectl auth can-i create namespace returns yes  Cannot proceed
  cluster-admin or equivalent    RBAC check for required permissions              Cannot proceed
  Container runtime operational  kubectl get nodes shows Ready nodes              Cannot proceed
  DNS resolution working         CoreDNS pods running in kube-system              Cannot proceed
  At least 3 worker nodes        kubectl get nodes count                          Warning only
  Sufficient resources           Node capacity check                              Warning only
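
These requirements map directly onto pre-flight tasks. A non-normative sketch of the first two checks:

  - name: Verify API server is accessible
    ansible.builtin.command: kubectl cluster-info
    changed_when: false

  - name: Verify authentication can create namespaces
    ansible.builtin.command: kubectl auth can-i create namespace
    register: can_i
    changed_when: false
    failed_when: "'yes' not in can_i.stdout"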

3.2.5 Required External Connectivity

The cluster MUST be able to reach:

  Endpoint              Purpose          Protocol
  --------------------  ---------------  -----------
  Container registries  Image pull       HTTPS (443)
  Git repository        Manifest source  HTTPS (443)
  Helm repositories     Chart download   HTTPS (443)

Air-gapped deployments MUST pre-populate an internal registry and configure appropriate mirror settings.
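
A non-normative reachability check; the endpoint URLs below are placeholders for the deployment's actual registry, Git host, and chart repositories.

  - name: Verify external endpoint reachability
    ansible.builtin.uri:
      url: "{{ item }}"
      method: HEAD
      status_code: [200, 301, 302, 401]   # any HTTP response proves reachability
    loop:
      - https://registry.example.com/v2/
      - https://git.example.com
      - https://charts.example.com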


3.2.6 Required Credentials

The following credentials MUST be available:

  Credential                Format                  Source
  ------------------------  ----------------------  ----------------------
  Kubeconfig                File or base64-encoded  Mounted or environment
  Git repository access     SSH key or token        Mounted secret
  Container registry auth   Docker config JSON      Mounted secret
  DNS provider credentials  Provider-specific       Mounted secret

3.2.7 What Does NOT Exist

At Phase 0, the following MUST be absent for a clean install:

  • ArgoCD namespace and resources
  • Argo Workflows namespace and resources
  • Argo Events namespace and resources
  • Platform application namespaces
  • Sealed-secrets controller (unless pre-existing key must be preserved)

If these exist from a previous installation, teardown MUST be executed first.
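
A non-normative absence check; the namespace names are assumptions and depend on the installation layout.

  - name: Assert namespaces from a previous installation are absent
    kubernetes.core.k8s_info:
      kind: Namespace
      name: "{{ item }}"
    loop:
      - argocd
      - argo
      - argo-events
    register: ns
    failed_when: ns.resources | length > 0   # non-empty means leftover state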


3.2.8 Phase 0 Outputs

Phase 0 produces:

  Output                   Description
  -----------------------  --------------------------------------
  Validation report        Pass/fail status for each prerequisite
  Cluster inventory        Node count, versions, capacity
  Connectivity report      External endpoint reachability
  Credential verification  Authentication success confirmation

Phase 0 MUST NOT modify any cluster state. It is purely diagnostic.
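
This RFC does not mandate a report format; one plausible, non-normative YAML rendering:

  validation:
    cli_tools: pass
    python_dependencies: pass
    collections: pass
    cluster_state: pass
    connectivity: fail        # e.g., Helm repository unreachable
    credentials: pass
  cluster_inventory:
    nodes: 5
    kubernetes_version: v1.28.4
  result: fail
  exit_code: 3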


Phase 1 — Bootstrap (Day 0-1)

What exists

  • Kubernetes cluster
  • Bootstrap credentials and configuration

What is created

  • ArgoCD and required CRDs
  • Argo Workflows and ClusterWorkflowTemplates
  • Argo Events (EventBus, EventSources, Sensors)
  • Automation credentials for cross-system integration
  • Root Application defining the GitOps handoff point (sketched below)

Authority

Bootstrap layer (Ansible recommended)

Purpose

  • Establish orchestration primitives
  • Create foundation for declarative management
  • Define the point where GitOps takes over
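
A non-normative sketch of the root Application (app-of-apps pattern); the repoURL and path are placeholders, not values defined by this RFC.

  apiVersion: argoproj.io/v1alpha1
  kind: Application
  metadata:
    name: root
    namespace: argocd
  spec:
    project: default
    source:
      repoURL: https://git.example.com/platform/manifests.git   # placeholder
      targetRevision: main
      path: apps
    destination:
      server: https://kubernetes.default.svc
      namespace: argocd
    syncPolicy:
      automated:
        prune: true       # remove resources deleted from Git
        selfHeal: true    # revert out-of-band changes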

Phase 2 — Orchestration (DAG Execution)

What exists

  • All orchestration primitives from Phase 1
  • Dependency DAG specification

What is created

  • Platform stacks deployed in dependency order
  • All ArgoCD Applications created and synced
  • Health verification completed for each node

Authority

Argo Workflows

Purpose

  • Execute deployment DAG in topological order (sketched below)
  • Gate progression on health verification
  • Handle transient failures with retry logic
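
A non-normative Workflow fragment illustrating a two-node, health-gated DAG; deploy-and-verify is an invented name standing in for the real ClusterWorkflowTemplate.

  apiVersion: argoproj.io/v1alpha1
  kind: Workflow
  metadata:
    generateName: platform-deploy-
  spec:
    entrypoint: deploy
    templates:
      - name: deploy
        dag:
          tasks:
            - name: cert-manager
              templateRef:
                name: deploy-and-verify   # invented template name
                template: main
                clusterScope: true
              arguments:
                parameters:
                  - name: app
                    value: cert-manager
            - name: ingress
              depends: cert-manager       # gates on cert-manager health
              templateRef:
                name: deploy-and-verify
                template: main
                clusterScope: true
              arguments:
                parameters:
                  - name: app
                    value: ingress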

Phase 3 — Steady-State (Day 2+)

What exists

  • All platform applications deployed and healthy
  • ArgoCD managing continuous reconciliation
  • Promotion pipelines operational

What happens

  • ArgoCD detects and corrects drift
  • Kargo promotes changes through environments
  • Argo Rollouts manages progressive delivery
  • Argo Events triggers automation workflows

Authority

  • ArgoCD for application state
  • Kargo for promotion
  • Argo Rollouts for delivery strategies

Purpose

  • Maintain platform health without human intervention
  • Enable GitOps-driven changes
  • Support ongoing operations

Phase 4 — Teardown (Optional)

What exists

  • Running platform from Phase 3

What is removed

  • Applications in reverse dependency order
  • Orchestration primitives
  • Platform namespaces and resources

Authority

Argo Workflows (or bootstrap layer for final cleanup)

Purpose

  • Gracefully remove platform components
  • Ensure no orphaned resources
  • Return cluster to clean state

3.3 Authority Domains

A core design principle is that authority MUST be singular and explicit at every phase.

At no point MAY two systems be equally authoritative for the same concern.


Authority by Layer

  Layer          Authority Domain
  -------------  ---------------------------------------------------------
  Bootstrap      Cluster prerequisites, orchestration primitive install
  Orchestration  Cross-application dependency resolution, DAG execution
  Deployment     Application sync, drift detection, self-healing
  Promotion      Environment progression, verification, delivery strategy

Authority by Component

  Component       Is Authoritative For                 Is NOT Authoritative For
  --------------  -----------------------------------  ---------------------------
  Ansible         Day 0-1 setup, ArgoCD installation   Application state, ordering
  Argo Workflows  Deployment sequence, health gates    Application configuration
  ArgoCD          Application state reconciliation     Cross-app ordering
  Kargo           Promotion between environments       Application deployment
  Argo Rollouts   Progressive delivery strategies      Deployment ordering
  Argo Events     Event-driven triggers                Deployment logic

Authority Transitions

Authority transfers occur at explicit handoff points:

  1. Bootstrap → Orchestration: when ArgoCD and Argo Workflows are operational, the bootstrap layer completes and triggers the orchestration workflow.

  2. Orchestration → Deployment: when DAG execution completes successfully, Argo Workflows yields authority to ArgoCD for ongoing reconciliation.

  3. Deployment → Promotion: when applications are stable, Kargo takes authority for promoting changes across environments.


3.4 Trust Boundaries

The architecture defines explicit trust boundaries between systems.


Bootstrap Trust Boundary

┌─────────────────────────────────────────────────────────────────────────────┐
│                          EXTERNAL TRUST ZONE                                 │
│  - Operator workstation or automation system                                │
│  - Configuration repository                                                 │
│  - Container registry                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

                           kubeconfig, credentials
                                      ▼

┌─────────────────────────────────────────────────────────────────────────────┐
│                          CLUSTER TRUST ZONE                                  │
│  - Kubernetes API                                                           │
│  - Bootstrap controller                                                     │
└─────────────────────────────────────────────────────────────────────────────┘

The bootstrap layer bridges external and cluster trust zones using credentials that MUST be rotated after initial setup.


Orchestration Trust Boundary

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ORCHESTRATION TRUST ZONE                                  │
│  - Argo Workflows controller                                                │
│  - ClusterWorkflowTemplates                                                 │
│  - Automation service accounts                                              │
└─────────────────────────────────────────────────────────────────────────────┘

                           ArgoCD API, health queries
                                      ▼

┌─────────────────────────────────────────────────────────────────────────────┐
│                     DEPLOYMENT TRUST ZONE                                    │
│  - ArgoCD Application Controller                                            │
│  - Application resources                                                    │
│  - Kubernetes API                                                           │
└─────────────────────────────────────────────────────────────────────────────┘

Argo Workflows communicates with ArgoCD through its API, not by directly manipulating Kubernetes resources managed by ArgoCD.
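
A non-normative sketch of such a step, driving ArgoCD through its CLI; the server address and image tag are placeholders.

  - name: sync-and-wait
    inputs:
      parameters:
        - name: app
    container:
      image: quay.io/argoproj/argocd:latest   # placeholder tag
      command: [sh, -c]
      args:
        - >-
          argocd app sync {{inputs.parameters.app}}
          --server argocd-server.argocd.svc &&
          argocd app wait {{inputs.parameters.app}}
          --health --timeout 600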


External Trust Boundary

┌─────────────────────────────────────────────────────────────────────────────┐
│                         EXTERNAL SYSTEMS                                     │
│  - Git repositories (source of intent)                                      │
│  - Container registries (image artifacts)                                   │
│  - Secret stores (external credentials)                                     │
└─────────────────────────────────────────────────────────────────────────────┘

                          pull only, validated sources
                                      ▼

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CLUSTER SYSTEMS                                      │
│  - ArgoCD (consumes Git)                                                    │
│  - Kubernetes (consumes images)                                             │
│  - External Secrets Operator (consumes secrets)                             │
└─────────────────────────────────────────────────────────────────────────────┘

External systems provide inputs to the cluster. The cluster MUST NOT have write access to external sources of truth.


3.5 High-Level Control Flow

The system operates as a one-way progression:

  1. Bootstrap establishes primitives
  2. Orchestration executes DAG
  3. Deployment maintains state
  4. Promotion advances environments

High-Level Sequence

  Bootstrap ──▶ Orchestration ──▶ Deployment ──▶ Promotion

3.6 Failure Domains and Recovery

The phase model enables bounded failure domains.


Bootstrap Failure

  • Impact: Orchestration cannot start
  • Recovery: Re-run bootstrap with same inputs (idempotent)
  • Steady-state impact: None (not yet reached)

Orchestration Failure

  • Impact: Dependent applications not deployed
  • Recovery: Resume workflow from failure point
  • Steady-state impact: Partial platform, explicitly known

Deployment Failure (Steady-State)

  • Impact: Individual application degraded
  • Recovery: ArgoCD self-healing or manual sync
  • Platform impact: Isolated to failing application

Promotion Failure

  • Impact: Environment progression blocked
  • Recovery: Re-run promotion or manual intervention
  • Platform impact: Current environment unaffected

Cluster Loss

  • Recovery from:
    • Git repository (source of intent)
    • Container registry (application images)
    • External secret store (credentials)
  • Process: Re-execute from Phase 0

3.7 Runtime Dependency Failure Semantics

This section addresses a critical operational question:

If Application B depends on Application A, and A becomes unhealthy after B was successfully deployed, what happens to B?


3.7.1 The Answer: Nothing Automatic

Application B is NOT automatically removed, deleted, or modified.

Dependencies are enforced at deployment time, not at runtime.

The orchestration system guarantees:

  • B will not begin deployment until A is Healthy
  • B will not complete deployment until B itself is Healthy

The orchestration system does NOT guarantee:

  • B will remain healthy if A becomes unhealthy
  • B will be removed if A is removed
  • B will be notified if A degrades

3.7.2 Rationale

This is an intentional design choice:

  1. Cascading deletions are dangerous. Automatically deleting B because A is unhealthy could cascade through the entire platform, causing a total outage from a single transient failure.

  2. Applications should be resilient. Well-designed applications handle dependency unavailability gracefully (retry, circuit breaker, degraded mode); forcing deletion removes this possibility.

  3. Recovery is easier with B still deployed. When A recovers, B can immediately resume normal operation. If B were deleted, it would need to be re-deployed.

  4. Operational visibility is preserved. Operators can see that B is unhealthy because A is unavailable; automatic deletion would hide this relationship.


3.7.3 What DOES Happen

When A becomes unhealthy:

  Component      Behavior
  -------------  ------------------------------------------------------------
  Application B  Continues running; may become unhealthy due to dependency
  ArgoCD         Reports A as Degraded; B's health depends on its own probes
  Prometheus     Alerts fire for A's degraded status
  Operators      Notified of A's failure; investigate root cause

3.7.4 Explicit Teardown for Dependency Removal

If A must be removed:

  1. Teardown workflow executes in reverse dependency order
  2. B is deleted first (because B depends on A)
  3. A is deleted only after B is confirmed removed
  4. This is explicit, not automatic

3.7.5 Steady-State Health Monitoring

During steady-state (Phase 3), health is monitored but not enforced:

Dependency Enforcement:  DEPLOYMENT-TIME ONLY
Health Monitoring:       CONTINUOUS
Automatic Remediation:   NONE (except ArgoCD drift correction)

This is consistent with Kubernetes' own behavior: a Pod is not deleted because a Service it depends on becomes unavailable.


3.8 Bootstrap Timing Analysis

This section addresses deployment duration and scaling characteristics.


3.8.1 Platform Bootstrap Duration

For the current platform (43 applications across 7 dependency layers):

  Phase                    Duration Range  Primary Bottleneck
  -----------------------  --------------  ----------------------
  Phase 0 (Pre-Bootstrap)  ~30 seconds     Connectivity checks
  Phase 1 (Bootstrap)      5-10 minutes    ArgoCD pod readiness
  Phase 2 (Orchestration)  30-60 minutes   Ceph cluster formation
  Total                    35-70 minutes   Storage initialization

3.8.2 Layer-by-Layer Timing

  Layer                          Applications                               Typical Duration  Parallelism
  -----------------------------  -----------------------------------------  ----------------  -----------
  0: Foundation                  3 (MetalLB, cert-manager, sealed-secrets)  2-3 min           Full
  1: Storage                     5 (Ceph op, cluster, PG op, Redis op)      15-25 min         Limited
  2: Core Infrastructure         2 (ingress, external-dns)                  2-3 min           Full
  3: Security                    5 (Vault, ESO, Keycloak, Kyverno)          5-8 min           Partial
  4: Data                        3 (PG clusters, Redis clusters)            5-10 min          Partial
  5: Platform Services           4 (Prometheus, Grafana, Loki, Tempo)       5-8 min           Partial
  6: Developer Platform          5 (Harbor, Backstage, Tekton)              5-10 min          Partial
  7: Application Infrastructure  3 (Temporal, Kargo, Rollouts)              3-5 min           Partial

Critical path: Layer 1 (Ceph) dominates total duration.


3.8.3 Scaling Characteristics

Linear scaling with caveat:

  Application Count  Estimated Duration  Notes
  -----------------  ------------------  -----------------------------------
  43 (current)       35-70 min           Production baseline
  60                 45-85 min           +30% if new apps in existing layers
  100                60-120 min          +70% if new layers added
  100+               90-180 min          Depends on dependency depth

Why not worse?

  • Parallelism within layers bounds per-layer time
  • Execution is sequential across layers, not within them
  • Adding apps to existing layers adds minimal time
  • Adding new dependency layers adds significant time

3.8.4 Timing Optimization Strategies

What reduces duration:

  • Pre-pulling images to nodes
  • SSD-backed storage for Ceph OSD formation
  • Increased Argo Workflows parallelism
  • Larger node instance types

What does NOT reduce duration:

  • More applications in parallel (already maximized)
  • More Argo Workflows controller replicas
  • Skipping health checks (violates invariants)

3.8.5 Timeout Configuration

Default timeouts per application type:

  Application Type       Sync Timeout  Health Timeout
  ---------------------  ------------  --------------
  Simple deployment      5 min         5 min
  Operator installation  10 min        10 min
  Ceph cluster           30 min        30 min
  Database cluster       15 min        15 min
  Vault (with unseal)    15 min        15 min

These are configurable per-node in the DAG specification.
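
A non-normative sketch of such a node entry; the field names are illustrative, as the concrete DAG schema is defined elsewhere in this RFC.

  nodes:
    - name: ceph-cluster
      dependsOn:
        - ceph-operator
      timeouts:
        sync: 30m     # per-node override of the 5 min default
        health: 30m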


3.9 Environment Scope

3.9.1 Single Production Environment

This RFC addresses single-cluster, single-environment deployment.

The platform deploys to production only. There is no:

  • Development environment promotion
  • Staging environment progression
  • Multi-environment Kargo pipelines

3.9.2 Implications

Kargo Usage: Limited to application workload promotion, not platform infrastructure promotion.

Promotion Layer: Effectively unused for platform deployment. The four-layer architecture simplifies to three active layers during platform bootstrap:

  • Bootstrap Layer
  • Orchestration Layer
  • Deployment Layer

3.9.3 Multi-Environment Deferral

Multi-environment platform deployment is explicitly deferred to Section 9: Future Considerations.

When multi-environment is implemented:

  • Each environment will have its own cluster
  • Platform deployment will execute independently per cluster
  • No cross-cluster orchestration in initial design

3.10 Summary

This high-level architecture establishes:

  • A clear four-layer model with explicit responsibilities
  • Five deployment phases with defined transitions
  • Singular authority at every point
  • Explicit trust boundaries between systems
  • Bounded failure domains with recovery paths

It intentionally delays component-level detail until the system's shape and rules are unambiguous.




End of Section 3
