Deployment
Icepack is deployed via Terraform using a helm_release resource that manages
the charts/icepack/ Helm chart. All changes flow through terraform apply —
never run helm install, helm upgrade, or helm uninstall directly.
Deployment flow
Every deployment follows the same sequence:
- Modify the chart or application code under charts/icepack/.
- Bump the version in charts/icepack/Chart.yaml. Terraform tracks the chart by path + version + values. If you change template files without bumping the version, terraform plan will show no diff and your changes will not deploy.
- Build and push the Docker image to ECR (if application code changed).
- Plan and apply from the target environment directory:

      cd terraform/icepack-api/env/<environment>  # dev, stage, preprod, or prod
      terraform plan
      terraform apply

Terraform compares the new chart version and values against its state, generates a Helm upgrade under the hood, and records the result. The entire cycle takes roughly 60-90 seconds for a clean apply.
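The plan-and-apply step can be wrapped in a small helper that rejects mistyped environment names before Terraform runs. This is a minimal sketch, not part of the Icepack repository: the function names env_dir and deploy and the plan file name tfplan are assumptions made for illustration. It should be run from the repository root.

```shell
# env_dir ENVIRONMENT: print the Terraform root for a known environment,
# or fail for anything else (guards against typos such as "production").
env_dir() {
  case "$1" in
    dev|stage|preprod|prod)
      printf 'terraform/icepack-api/env/%s\n' "$1" ;;
    *)
      printf 'unknown environment: %s\n' "$1" >&2
      return 1 ;;
  esac
}

# deploy ENVIRONMENT: write the plan to a file, then apply exactly that plan.
# "tfplan" is an arbitrary file name chosen for this sketch.
deploy() {
  deploy_dir=$(env_dir "$1") || return 1
  ( cd "$deploy_dir" && terraform plan -out=tfplan && terraform apply tfplan )
}
```

Calling deploy dev plans and applies in terraform/icepack-api/env/dev/. Applying a saved plan file means the change that was reviewed is the change that is applied.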
Contract or schema migrations
For changes that alter API contracts, cached payload tables, or scheduler decisioning, prefer a short maintenance window over mixed-version compatibility. Pause the orchestrator and health-sync CronJobs before the deploy; drain mode can be enabled as an additional guard against manual maintenance submissions, but it does not stop health-sync or other read-side cache writers. Bring the system back in a known-good order:
- Pause scheduler/orchestrator submissions and health-sync.
- Apply schema changes through the normal application startup path or explicit migration for the target release.
- Deploy API, UI, worker, and orchestrator changes together via Terraform.
- Resume health-sync and verify table_status_snapshots and table_health_snapshots are being populated.
- Verify the split read contracts for a known table:

      curl -s "$ICEPACK_API_URL/tables/<database>/<table>/status" | jq keys
      curl -s "$ICEPACK_API_URL/tables/<database>/<table>/health" | jq keys
      curl -s "$ICEPACK_API_URL/tables/<database>/<table>/maintenance/recommendation" | jq keys

- Resume scheduler/orchestrator runs.
Expected top-level keys:
| Endpoint | Keys |
|---|---|
| /status | collection, collected_at, derived, files, last_error, manifests, partitions, snapshot, source, stale, table |
| /health | assessed_at, collection, error, health_status, issues, status_collected_at, table |
| /maintenance/recommendation | error, evaluated_at, evidence, history, policy, recommended_actions, skip_reasons, status_collected_at, table |
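Comparing key lists by eye is error-prone, so the check can be scripted. The sketch below is an illustration under assumptions: normalize_keys, keys_match, and the three variable names are hypothetical helpers, not part of Icepack. The comparison ignores order and spacing, because jq sorts keys and the table is not guaranteed to list them in the same order.

```shell
# Expected top-level keys, copied from the table above.
STATUS_KEYS="collection, collected_at, derived, files, last_error, manifests, partitions, snapshot, source, stale, table"
HEALTH_KEYS="assessed_at, collection, error, health_status, issues, status_collected_at, table"
RECOMMENDATION_KEYS="error, evaluated_at, evidence, history, policy, recommended_actions, skip_reasons, status_collected_at, table"

# normalize_keys LIST: turn "a, b,c" into one trimmed, sorted key per line.
normalize_keys() {
  printf '%s\n' "$1" | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
    | sort
}

# keys_match EXPECTED ACTUAL: succeed when both lists hold the same keys.
keys_match() {
  [ "$(normalize_keys "$1")" = "$(normalize_keys "$2")" ]
}

# Usage (needs curl and jq; <database> and <table> are placeholders):
#   actual=$(curl -s "$ICEPACK_API_URL/tables/<database>/<table>/status" | jq -r 'keys | join(",")')
#   keys_match "$STATUS_KEYS" "$actual" || echo "unexpected /status keys: $actual" >&2
```

The same pattern applies to /health with HEALTH_KEYS and to /maintenance/recommendation with RECOMMENDATION_KEYS.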
Environments
Icepack is deployed to four environments, each in its own AWS account and EKS cluster:
| Environment | Account | Terraform root | Values file |
|---|---|---|---|
| dev | 411706531007 | terraform/icepack-api/env/dev/ | charts/icepack/values-dev.yaml |
| stage | 174558682303 | terraform/icepack-api/env/stage/ | charts/icepack/values-stage.yaml |
| preprod | 391322830928 | terraform/icepack-api/env/preprod/ | charts/icepack/values-preprod.yaml |
| prod | 857967394368 | terraform/icepack-api/env/prod/ | charts/icepack/values-prod.yaml |
Each environment has its own IRSA role, Polaris service principal, Postgres
instance, and Secrets Manager entries. The CI/CD workflow
(.github/workflows/deploy-icepack.yml) supports all four environments via
workflow_dispatch with an environment selector; pushes to main auto-deploy
to dev only. Prod deploys require manual approval via the main-deploy
GitHub environment gate.
Why Terraform only
Terraform owns the Helm release via the helm_release resource. Read-only Helm
commands such as helm status are fine for inspection, but any mutating Helm
command creates state drift. After a direct helm upgrade, the next
terraform apply will see a version mismatch and either fail or force a
destructive re-deploy.
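One way to spot such drift without mutating anything is to compare the chart version in the repository with what Helm reports as deployed. The sketch below rests on several assumptions: chart_version is a hypothetical helper, Chart.yaml is assumed to hold a plain top-level version: line with no trailing comment, and the release is assumed to be the first entry that helm list returns for the namespace.

```shell
# chart_version FILE: print the top-level "version:" value from a Chart.yaml.
# appVersion is not matched because the pattern is anchored at line start.
chart_version() {
  sed -n 's/^version:[[:space:]]*//p' "$1" | head -n 1 | tr -d "\"'"
}

# Usage (read-only; needs helm and jq):
#   expected="icepack-$(chart_version charts/icepack/Chart.yaml)"
#   deployed=$(helm list -n icepack -o json | jq -r '.[0].chart')
#   [ "$expected" = "$deployed" ] || echo "drift: repo=$expected cluster=$deployed" >&2
```

A mismatch here does not prove someone ran Helm by hand; it also appears when a version bump has been committed but not yet applied.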
If you need to inspect what is deployed:
    # Safe read-only Helm commands
    helm list -n icepack
    helm get values icepack -n icepack

Chart versioning rule
Bump version in charts/icepack/Chart.yaml whenever you modify any file
under charts/icepack/templates/. For template edits, the version is the only
signal Terraform has that the chart changed. A common mistake is modifying a
template and forgetting the version bump: terraform plan shows "No changes"
and the new template never deploys.
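A pre-plan guard can catch the forgotten bump. The following is a rough sketch, not an existing Icepack script: the function name is hypothetical, and it only checks whether Chart.yaml was touched at all, which is an approximation of "the version was bumped".

```shell
# templates_changed_without_bump: read changed paths (one per line) on stdin.
# Succeeds (exit 0) when a file under templates/ changed but Chart.yaml did
# not, i.e. when the change would be invisible to terraform plan.
templates_changed_without_bump() {
  changed_paths=$(cat)
  printf '%s\n' "$changed_paths" | grep -q '^charts/icepack/templates/' || return 1
  if printf '%s\n' "$changed_paths" | grep -qx 'charts/icepack/Chart.yaml'; then
    return 1
  fi
  return 0
}

# Usage (the comparison base origin/main is an assumption):
#   git diff --name-only origin/main...HEAD | templates_changed_without_bump \
#     && echo "templates changed: bump version in charts/icepack/Chart.yaml" >&2
```

A stricter variant would diff the version: line itself, since Chart.yaml can change for other reasons.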
Infrastructure inventory
| Resource | Kind | Key settings |
|---|---|---|
| API | Deployment (2 replicas, arm64) | Port 8000; defaults 250m/256Mi requests and 500m/512Mi limits; dev overrides to 500m/1Gi requests and 1 CPU/8Gi limits |
| NLB Service | LoadBalancer | ACM TLS termination, external-dns hostname |
| Worker | KEDA ScaledJob | postgresql trigger, 30s polling, max 5 replicas (default) / 3 (dev, stage, preprod) |
| Orchestrator | CronJob | Hourly at :30 (dev, stage, preprod), every 2h (default chart), concurrencyPolicy: Forbid |
| Health Sync | CronJob | Every 15 minutes, concurrencyPolicy: Forbid |
| Postgres | Deployment (1 replica when postgres.internal.enabled=true) | 17-alpine, PVC-backed (gp3), Recreate strategy |
| IRSA ServiceAccount | ServiceAccount | eks.amazonaws.com/role-arn annotation, OIDC-bound |
All resources live in the icepack namespace. Nodes are selected by
kubernetes.io/arch: arm64.
Rollback tiers
Icepack has three rollback tiers depending on the severity of the issue.
Tier 1: Application bug
Flip the image tag back to the last known-good version and apply:
    # In terraform.tfvars or the helm_release set block
    image_tag = "abc123-previous"

    terraform apply

This is the fastest rollback (~30 seconds). Only the pod image changes; the chart, values, and infrastructure remain the same.
Tier 2: Chart regression
If a template or values change caused the problem, revert the
charts/icepack/Chart.yaml version to the previous value and apply:
    git revert <commit-that-broke-the-chart>
    terraform apply

Terraform detects the version change and performs a Helm rollback internally. This takes slightly longer because Kubernetes must reconcile the full chart diff.
Tier 3: Postgres corruption
If the Postgres data is corrupted or schema migrations went wrong:
- PVC-backed internal Postgres: Delete the Postgres deployment, delete the PVC, and re-apply. Icepack auto-creates tables on startup via CREATE TABLE IF NOT EXISTS DDL, so a fresh database starts cleanly. Historical job data will be lost.
- RDS-backed external Postgres: Restore from an RDS snapshot to a point-in-time before the corruption. Re-run any pending schema migrations after restore.
In both cases, verify recovery by checking GET /readyz returns 200 and
submitting a test dry-run job.
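The API may take a moment to come back after a restore, so the readiness check is easier to run in a retry loop. This is a minimal sketch under assumptions: retry and readyz_ok are hypothetical helpers, and the attempt count and delay in the usage line are arbitrary. The dry-run job submission is not covered here.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# sleeping between attempts; fail if every attempt fails.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# readyz_ok: succeed only when GET /readyz answers with HTTP 200.
readyz_ok() {
  readyz_code=$(curl -s -o /dev/null -w '%{http_code}' "$ICEPACK_API_URL/readyz")
  [ "$readyz_code" = "200" ]
}

# Usage: up to 10 probes, 3 seconds apart.
#   retry 10 3 readyz_ok || echo "API not ready" >&2
```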