3. System Architecture
This section defines the macro-architecture of the system: layers, phases, authority domains, and trust boundaries. No component-level detail is described here.
3.1 Architectural Overview
The deployment orchestration system is designed as a layered, phase-driven architecture.
Rather than treating deployment as a single operation, the architecture explicitly models:
- which system is authoritative at each point in the lifecycle,
- what transitions occur between phases,
- and how authority is handed off between layers.
This avoids the most common failure mode in deployment systems:
assuming all components exist and are healthy during initial deployment.
At a high level, the platform consists of four conceptual layers:
- Bootstrap: cluster prerequisites and orchestration primitive installation
- Orchestration: cross-application dependency resolution and DAG execution
- Deployment: application sync, drift detection, and self-healing
- Promotion: environment progression, verification, and delivery strategy
Each layer activates in sequence, and authority transfers explicitly at defined boundaries.
3.2 Phase Model Overview
The system progresses through five distinct phases.
These phases are logical, not temporal guarantees — the system MAY remain in any phase indefinitely during failure or recovery.
Phase 0 — Pre-Bootstrap (Prerequisites)
Phase 0 defines the exact contract that MUST be satisfied before bootstrap can begin. These are hard requirements, not recommendations.
3.2.1 Required CLI Tools
The following tools MUST be installed and accessible on the execution host:
| Tool | Minimum Version | Purpose | Verification Command |
|---|---|---|---|
| ansible-playbook | 2.15+ | Playbook execution | ansible-playbook --version |
| kubectl | 1.28+ | Kubernetes API access | kubectl version --client |
| helm | 3.12+ | Chart installation | helm version --short |
| kubeseal | 0.24+ | Secret encryption | kubeseal --version |
| jq | 1.6+ | JSON processing | jq --version |
| yq | 4.0+ | YAML processing | yq --version |
If any required tool is missing, the executor MUST fail with exit code 3 (Invalid Configuration).
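As a non-normative illustration, the tool check could be expressed as a pre-flight Ansible task along these lines (the task structure, and the playbook wrapper that maps failure to exit code 3, are assumptions rather than part of this contract):

```yaml
# Sketch of a pre-flight check: run each tool's verification command and
# fail if any is absent. The surrounding playbook wrapper (not shown) is
# assumed to translate this failure into exit code 3.
- name: Verify required CLI tools are installed
  ansible.builtin.command: "{{ item }}"
  loop:
    - ansible-playbook --version
    - kubectl version --client
    - helm version --short
    - kubeseal --version
    - jq --version
    - yq --version
  changed_when: false
```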
3.2.2 Required Python Dependencies
Ansible requires the following Python packages:
| Package | Purpose |
|---|---|
| kubernetes | kubernetes.core collection |
| openshift | OpenShift integration |
| pyyaml | YAML parsing |
| jsonpatch | JSON patch operations |
These MUST be installed in the Ansible execution environment.
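A minimal sketch of installing these with Ansible's own pip module (the choice of interpreter or virtualenv is left to the operator):

```yaml
# Install the Python packages that kubernetes.core and related modules need.
- name: Install required Python dependencies
  ansible.builtin.pip:
    name:
      - kubernetes
      - openshift
      - pyyaml
      - jsonpatch
```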
3.2.3 Required Ansible Collections
The following Ansible collections MUST be installed:
| Collection | Minimum Version | Purpose |
|---|---|---|
| kubernetes.core | 4.0.0 | Kubernetes module support |
| community.general | 4.0.0 | General modules |
| ansible.posix | 1.0.0 | POSIX operations |
| ansible.utils | 5.0.0 | Utility filters |
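These map directly onto a standard requirements.yml, installable with ansible-galaxy collection install -r requirements.yml (a sketch; the exact pinning strategy is up to the operator):

```yaml
# requirements.yml — collection pins mirroring the table above.
collections:
  - name: kubernetes.core
    version: ">=4.0.0"
  - name: community.general
    version: ">=4.0.0"
  - name: ansible.posix
    version: ">=1.0.0"
  - name: ansible.utils
    version: ">=5.0.0"
```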
3.2.4 Required Cluster State
The target Kubernetes cluster MUST satisfy:
| Requirement | Verification | Failure Mode |
|---|---|---|
| API server accessible | kubectl cluster-info returns without error | Cannot proceed |
| Authentication valid | kubectl auth can-i create namespace returns yes | Cannot proceed |
| cluster-admin or equivalent | RBAC check for required permissions | Cannot proceed |
| Container runtime operational | kubectl get nodes shows Ready nodes | Cannot proceed |
| DNS resolution working | CoreDNS pods running in kube-system | Cannot proceed |
| At least 3 worker nodes | kubectl get nodes count | Warning only |
| Sufficient resources | Node capacity check | Warning only |
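The hard requirements above could be verified with tasks such as the following sketch, which reuses the table's own verification commands:

```yaml
# Fail fast if the API server is unreachable or the credentials lack the
# permissions needed for bootstrap.
- name: Verify API server is reachable
  ansible.builtin.command: kubectl cluster-info
  changed_when: false

- name: Verify authentication can create namespaces
  ansible.builtin.command: kubectl auth can-i create namespace
  register: can_create_ns
  changed_when: false
  failed_when: "'yes' not in can_create_ns.stdout"
```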
3.2.5 Required External Connectivity
The cluster MUST be able to reach:
| Endpoint | Purpose | Protocol |
|---|---|---|
| Container registries | Image pull | HTTPS (443) |
| Git repository | Manifest source | HTTPS (443) |
| Helm repositories | Chart download | HTTPS (443) |
Air-gapped deployments MUST pre-populate an internal registry and configure appropriate mirror settings.
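Reachability might be probed with Ansible's uri module; the endpoints below are placeholders, and any HTTP-level response (including a 401 from an authenticated registry) counts as reachable:

```yaml
# Probe each external endpoint. Any HTTP response demonstrates
# connectivity, so common auth and redirect codes are accepted.
- name: Verify external endpoint reachability
  ansible.builtin.uri:
    url: "{{ item }}"
    status_code: [200, 301, 302, 401, 403]
  loop:
    - https://registry.example.com   # placeholder container registry
    - https://git.example.com        # placeholder Git host
    - https://charts.example.com     # placeholder Helm repository
```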
3.2.6 Required Credentials
The following credentials MUST be available:
| Credential | Format | Source |
|---|---|---|
| Kubeconfig | File or base64-encoded | Mounted or environment |
| Git repository access | SSH key or token | Mounted secret |
| Container registry auth | Docker config JSON | Mounted secret |
| DNS provider credentials | Provider-specific | Mounted secret |
3.2.7 What Does NOT Exist
For a clean install, the following MUST NOT exist at Phase 0:
- ArgoCD namespace and resources
- Argo Workflows namespace and resources
- Argo Events namespace and resources
- Platform application namespaces
- Sealed-secrets controller (unless pre-existing key must be preserved)
If these exist from a previous installation, teardown MUST be executed first.
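A clean-install guard might look like this sketch (the namespace names are assumptions about where these systems would live):

```yaml
# Fail if namespaces from a previous installation are still present.
- name: Check for leftover installation namespaces
  kubernetes.core.k8s_info:
    kind: Namespace
    name: "{{ item }}"
  register: ns_check
  failed_when: ns_check.resources | length > 0
  loop:
    - argocd        # assumed ArgoCD namespace
    - argo          # assumed Argo Workflows namespace
    - argo-events   # assumed Argo Events namespace
```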
3.2.8 Phase 0 Outputs
Phase 0 produces:
| Output | Description |
|---|---|
| Validation report | Pass/fail status for each prerequisite |
| Cluster inventory | Node count, versions, capacity |
| Connectivity report | External endpoint reachability |
| Credential verification | Authentication success confirmation |
Phase 0 MUST NOT modify any cluster state. It is purely diagnostic.
Phase 1 — Bootstrap (Day 0-1)
What exists
- Kubernetes cluster
- Bootstrap credentials and configuration
What is created
- ArgoCD and required CRDs
- Argo Workflows and ClusterWorkflowTemplates
- Argo Events (EventBus, EventSources, Sensors)
- Automation credentials for cross-system integration
- Root Application defining GitOps handoff point
Authority
Bootstrap layer (Ansible is the recommended implementation)
Purpose
- Establish orchestration primitives
- Create foundation for declarative management
- Define the point where GitOps takes over
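The root Application created in this phase, which marks that handoff point, might look like the following sketch (the name, repository URL, and path are placeholders, not prescribed values):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                    # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git  # placeholder
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # GitOps owns state from this point onward
      selfHeal: true
```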
Phase 2 — Orchestration (DAG Execution)
What exists
- All orchestration primitives from Phase 1
- Dependency DAG specification
What is created
- Platform stacks deployed in dependency order
- All ArgoCD Applications created and synced
- Health verification completed for each node
Authority
Argo Workflows
Purpose
- Execute deployment DAG in topological order
- Gate progression on health verification
- Handle transient failures with retry logic
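As an illustration of how deployment-time ordering, health gating, and retry policy combine in Argo Workflows, a template fragment might look like this (the template names, image, and script are hypothetical):

```yaml
templates:
  - name: sync-and-verify             # hypothetical reusable step
    retryStrategy:                    # retried on transient failure
      limit: "3"
      retryPolicy: OnFailure
      backoff:
        duration: "30s"
        factor: "2"
    container:
      image: example/argocd-sync:latest             # placeholder image
      command: [/scripts/sync-and-wait-healthy.sh]  # hypothetical script
  - name: platform-dag
    dag:
      tasks:
        - name: deploy-metallb
          template: sync-and-verify
        - name: deploy-cert-manager       # gated on MetalLB being Healthy
          template: sync-and-verify
          dependencies: [deploy-metallb]
```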
Phase 3 — Steady-State (Day 2+)
What exists
- All platform applications deployed and healthy
- ArgoCD managing continuous reconciliation
- Promotion pipelines operational
What happens
- ArgoCD detects and corrects drift
- Kargo promotes changes through environments
- Argo Rollouts manages progressive delivery
- Argo Events triggers automation workflows
Authority
- ArgoCD for application state
- Kargo for promotion
- Argo Rollouts for delivery strategies
Purpose
- Maintain platform health without human intervention
- Enable GitOps-driven changes
- Support ongoing operations
Phase 4 — Teardown (Optional)
What exists
- Running platform from Phase 3
What is removed
- Applications in reverse dependency order
- Orchestration primitives
- Platform namespaces and resources
Authority
Argo Workflows (or bootstrap layer for final cleanup)
Purpose
- Gracefully remove platform components
- Ensure no orphaned resources
- Return cluster to clean state
3.3 Authority Domains
A core design principle is that authority MUST be singular and explicit at every phase.
At no point MAY two systems be equally authoritative for the same concern.
Authority by Layer
| Layer | Authority Domain |
|---|---|
| Bootstrap | Cluster prerequisites, orchestration primitive install |
| Orchestration | Cross-application dependency resolution, DAG execution |
| Deployment | Application sync, drift detection, self-healing |
| Promotion | Environment progression, verification, delivery strategy |
Authority by Component
| Component | Is Authoritative For | Is NOT Authoritative For |
|---|---|---|
| Ansible | Day 0-1 setup, ArgoCD installation | Application state, ordering |
| Argo Workflows | Deployment sequence, health gates | Application configuration |
| ArgoCD | Application state reconciliation | Cross-app ordering |
| Kargo | Promotion between environments | Application deployment |
| Argo Rollouts | Progressive delivery strategies | Deployment ordering |
| Argo Events | Event-driven triggers | Deployment logic |
Authority Transitions
Authority transfers occur at explicit handoff points:
- Bootstrap → Orchestration: when ArgoCD and Argo Workflows are operational, the bootstrap layer completes and triggers the orchestration workflow.
- Orchestration → Deployment: when DAG execution completes successfully, Argo Workflows yields authority to ArgoCD for ongoing reconciliation.
- Deployment → Promotion: when applications are stable, Kargo takes authority for promoting changes across environments.
3.4 Trust Boundaries
The architecture defines explicit trust boundaries between systems.
Bootstrap Trust Boundary
The bootstrap layer bridges external and cluster trust zones using credentials that MUST be rotated after initial setup.
Orchestration Trust Boundary
Argo Workflows communicates with ArgoCD through its API, not by directly manipulating Kubernetes resources managed by ArgoCD.
External Trust Boundary
External systems provide inputs to the cluster. The cluster MUST NOT have write access to external sources of truth.
3.5 High-Level Control Flow
The system operates as a one-way progression:
- Bootstrap establishes primitives
- Orchestration executes DAG
- Deployment maintains state
- Promotion advances environments
High-Level Sequence: Bootstrap → Orchestration → Deployment → Promotion, with an explicit authority transfer at each arrow.
3.6 Failure Domains and Recovery
The phase model enables bounded failure domains.
Bootstrap Failure
- Impact: Orchestration cannot start
- Recovery: Re-run bootstrap with same inputs (idempotent)
- Steady-state impact: None (not yet reached)
Orchestration Failure
- Impact: Dependent applications not deployed
- Recovery: Resume workflow from failure point
- Steady-state impact: Partial platform, explicitly known
Deployment Failure (Steady-State)
- Impact: Individual application degraded
- Recovery: ArgoCD self-healing or manual sync
- Platform impact: Isolated to failing application
Promotion Failure
- Impact: Environment progression blocked
- Recovery: Re-run promotion or manual intervention
- Platform impact: Current environment unaffected
Cluster Loss
- Recovery from:
  - Git repository (source of intent)
  - Container registry (application images)
  - External secret store (credentials)
- Process: Re-execute from Phase 0
3.7 Runtime Dependency Failure Semantics
This section addresses a critical operational question:
If Application B depends on Application A, and A becomes unhealthy after B was successfully deployed, what happens to B?
3.7.1 The Answer: Nothing Automatic
Application B is NOT automatically removed, deleted, or modified.
Dependencies are enforced at deployment time, not at runtime.
The orchestration system guarantees:
- B will not begin deployment until A is Healthy
- B will not complete deployment until B itself is Healthy
The orchestration system does NOT guarantee:
- B will remain healthy if A becomes unhealthy
- B will be removed if A is removed
- B will be notified if A degrades
3.7.2 Rationale
This is an intentional design choice:
- Cascading deletions are dangerous. Automatically deleting B because A is unhealthy could cascade through the entire platform, causing a total outage from a single transient failure.
- Applications should be resilient. Well-designed applications handle dependency unavailability gracefully (retry, circuit breaker, degraded mode). Forcing deletion removes this possibility.
- Recovery is easier with B still deployed. When A recovers, B can immediately resume normal operation. If B were deleted, it would need to be re-deployed.
- Operational visibility is preserved. Operators can see that B is unhealthy because A is unavailable. Automatic deletion would hide this relationship.
3.7.3 What DOES Happen
When A becomes unhealthy:
| Component | Behavior |
|---|---|
| Application B | Continues running, may become unhealthy due to dependency |
| ArgoCD | Reports A as Degraded, B health depends on its own probes |
| Prometheus | Alerts fire for A's degraded status |
| Operators | Notified of A's failure, investigate root cause |
3.7.4 Explicit Teardown for Dependency Removal
If A must be removed:
- Teardown workflow executes in reverse dependency order
- B is deleted first (because B depends on A)
- A is deleted only after B is confirmed removed
- This is explicit, not automatic
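In DAG terms, teardown simply inverts the deployment edge, as in this sketch (task and template names are hypothetical):

```yaml
# Teardown DAG fragment: B is removed first, and A's deletion is gated
# on confirmation that B is gone, reversing the deployment edge.
dag:
  tasks:
    - name: delete-app-b
      template: delete-and-confirm   # hypothetical template
    - name: delete-app-a
      template: delete-and-confirm
      dependencies: [delete-app-b]   # A goes only after B is confirmed gone
```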
3.7.5 Steady-State Health Monitoring
During steady-state (Phase 3), health is monitored but not enforced.
This is consistent with Kubernetes' own behavior: a Pod is not deleted because a Service it depends on becomes unavailable.
3.8 Bootstrap Timing Analysis
This section addresses deployment duration and scaling characteristics.
3.8.1 Platform Bootstrap Duration
For the current platform (43 applications across 7 dependency layers):
| Phase | Duration Range | Primary Bottleneck |
|---|---|---|
| Phase 0 (Pre-Bootstrap) | ~30 seconds | Connectivity checks |
| Phase 1 (Bootstrap) | 5-10 minutes | ArgoCD pod readiness |
| Phase 2 (Orchestration) | 30-60 minutes | Ceph cluster formation |
| Total | 35-70 minutes | Storage initialization |
3.8.2 Layer-by-Layer Timing
| Layer | Applications | Typical Duration | Parallelism |
|---|---|---|---|
| 0: Foundation | 3 (MetalLB, cert-manager, sealed-secrets) | 2-3 min | Full |
| 1: Storage | 5 (Ceph op, cluster, PG op, Redis op) | 15-25 min | Limited |
| 2: Core Infrastructure | 2 (ingress, external-dns) | 2-3 min | Full |
| 3: Security | 5 (Vault, ESO, Keycloak, Kyverno) | 5-8 min | Partial |
| 4: Data | 3 (PG clusters, Redis clusters) | 5-10 min | Partial |
| 5: Platform Services | 4 (Prometheus, Grafana, Loki, Tempo) | 5-8 min | Partial |
| 6: Developer Platform | 5 (Harbor, Backstage, Tekton) | 5-10 min | Partial |
| 7: Application Infrastructure | 3 (Temporal, Kargo, Rollouts) | 3-5 min | Partial |
Critical path: Layer 1 (Ceph) dominates total duration.
3.8.3 Scaling Characteristics
Scaling is roughly linear, with one caveat:
| Application Count | Estimated Duration | Notes |
|---|---|---|
| 43 (current) | 35-70 min | Production baseline |
| 60 | 45-85 min | +30% if new apps in existing layers |
| 100 | 60-120 min | +70% if new layers added |
| 100+ | 90-180 min | Depends on dependency depth |
Why not worse?
- Parallelism within layers bounds per-layer time
- Only sequential across layers, not within
- Adding apps to existing layers adds minimal time
- Adding new dependency layers adds significant time
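To a first approximation, assuming full parallelism within each layer, total duration is the sum of each layer's slowest application rather than a function of total application count:

$$T_{\text{total}} \approx \sum_{l=0}^{L} \max_{a \in \text{layer}_l} t_a$$

Widening an existing layer only matters when a new application becomes that layer's maximum; each additional layer adds a full term.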
3.8.4 Timing Optimization Strategies
What reduces duration:
- Pre-pulling images to nodes
- SSD-backed storage for Ceph OSD formation
- Increased Argo Workflows parallelism
- Larger node instance types
What does NOT reduce duration:
- More applications in parallel (already maximized)
- More Argo Workflows controller replicas
- Skipping health checks (violates invariants)
3.8.5 Timeout Configuration
Default timeouts per application type:
| Application Type | Sync Timeout | Health Timeout |
|---|---|---|
| Simple deployment | 5 min | 5 min |
| Operator installation | 10 min | 10 min |
| Ceph cluster | 30 min | 30 min |
| Database cluster | 15 min | 15 min |
| Vault (with unseal) | 15 min | 15 min |
These are configurable per-node in the DAG specification.
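For example, a node entry in the DAG specification might override the defaults like this (the field names are illustrative assumptions about the specification format):

```yaml
# Hypothetical DAG node with per-node timeout overrides.
nodes:
  - name: ceph-cluster
    dependsOn: [ceph-operator]
    timeouts:
      sync: 30m      # matches the Ceph cluster row above
      health: 30m
```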
3.9 Environment Scope
3.9.1 Single Production Environment
This RFC addresses single-cluster, single-environment deployment.
The platform deploys to production only. There is no:
- Development environment promotion
- Staging environment progression
- Multi-environment Kargo pipelines
3.9.2 Implications
Kargo Usage: Limited to application workload promotion, not platform infrastructure promotion.
Promotion Layer: Effectively unused for platform deployment. The four-layer architecture simplifies to three active layers during platform bootstrap:
- Bootstrap Layer
- Orchestration Layer
- Deployment Layer
3.9.3 Multi-Environment Deferral
Multi-environment platform deployment is explicitly deferred to Section 9: Future Considerations.
When multi-environment is implemented:
- Each environment will have its own cluster
- Platform deployment will execute independently per cluster
- No cross-cluster orchestration in initial design
3.10 Summary
This high-level architecture establishes:
- A clear four-layer model with explicit responsibilities
- Five deployment phases with defined transitions
- Singular authority at every point
- Explicit trust boundaries between systems
- Bounded failure domains with recovery paths
It intentionally delays component-level detail until the system's shape and rules are unambiguous.
Document Navigation
| Previous | Index | Next |
|---|---|---|
| ← 2. Requirements | Table of Contents | 4. Components → |
End of Section 3