Deployment

Icepack is deployed via Terraform using a helm_release resource that manages the charts/icepack/ Helm chart. All changes flow through terraform apply. Never run helm install, helm upgrade, or helm uninstall directly.

Deployment flow

Every deployment follows the same sequence:

1. Modify the chart or application code under charts/icepack/.
2. Bump the version in charts/icepack/Chart.yaml. Terraform tracks the chart by path + version + values. If you change template files without bumping the version, terraform plan will show no diff and your changes will not deploy.
3. Build and push the Docker image to ECR (if application code changed).
4. Plan and apply from the target environment directory:

cd terraform/icepack-api/env/<environment>   # dev, stage, preprod, or prod
terraform plan
terraform apply

Terraform compares the new chart version and values against its state, generates a Helm upgrade under the hood, and records the result. The entire cycle takes roughly 60-90 seconds for a clean apply.
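
The plan and apply step can be scripted so that the apply uses exactly the plan that was reviewed. This is a minimal sketch, not the project's tooling: the function names and the icepack.tfplan file name are made up for illustration, and it assumes it runs from the repository root with the directory layout shown above.

```shell
# Environments named in this document.
is_valid_env() {
  case "$1" in
    dev|stage|preprod|prod) return 0 ;;
    *) return 1 ;;
  esac
}

# Plan to a file, then apply that exact plan.
# icepack.tfplan is an illustrative file name.
deploy_env() {
  env_name="$1"
  if ! is_valid_env "$env_name"; then
    echo "unknown environment: $env_name" >&2
    return 2
  fi
  (
    cd "terraform/icepack-api/env/$env_name" || exit 1
    terraform plan -out=icepack.tfplan && terraform apply icepack.tfplan
  )
}

# Usage: deploy_env dev
```

An unknown environment name is rejected before Terraform runs, which guards against a typo landing in the wrong directory.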

Contract or schema migrations

For changes that alter API contracts, cached payload tables, or scheduler decisioning, prefer a short maintenance window over mixed-version compatibility. Pause the orchestrator and health-sync CronJobs before the deploy. Drain mode can be enabled as an additional guard against manual maintenance submissions, but it does not stop health-sync or other read-side cache writers. Work through the window in a known-good order:

1. Pause scheduler/orchestrator submissions and health-sync.
2. Apply schema changes through the normal application startup path or explicit migration for the target release.
3. Deploy API, UI, worker, and orchestrator changes together via Terraform.
4. Resume health-sync and verify table_status_snapshots and table_health_snapshots are being populated.
5. Verify the split read contracts for a known table:

   curl -s "$ICEPACK_API_URL/tables/<database>/<table>/status" | jq keys
   curl -s "$ICEPACK_API_URL/tables/<database>/<table>/health" | jq keys
   curl -s "$ICEPACK_API_URL/tables/<database>/<table>/maintenance/recommendation" | jq keys

6. Resume scheduler/orchestrator runs.
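
This section does not say how the CronJobs are paused. One common approach is to set spec.suspend on each CronJob with kubectl, sketched below. The CronJob names in the comments are hypothetical (list the real ones with kubectl get cronjobs -n icepack). If the chart exposes a values flag for suspension, setting it through Terraform would fit the Terraform-only rule better.

```shell
# Build the patch body that sets spec.suspend on a CronJob.
suspend_patch() {
  case "$1" in
    true|false) printf '{"spec":{"suspend":%s}}' "$1" ;;
    *) echo "expected true or false, got: $1" >&2; return 2 ;;
  esac
}

# Usage: set_cronjob_suspended <cronjob-name> <true|false>
set_cronjob_suspended() {
  patch="$(suspend_patch "$2")" || return 2
  kubectl patch cronjob "$1" -n icepack -p "$patch"
}

# Order taken from the steps above (CronJob names are placeholders):
#   set_cronjob_suspended icepack-orchestrator true
#   set_cronjob_suspended icepack-health-sync true
#   ... terraform apply ...
#   set_cronjob_suspended icepack-health-sync false
#   ... verify snapshots and read contracts ...
#   set_cronjob_suspended icepack-orchestrator false
```

Suspending one CronJob at a time keeps the resume order explicit: health-sync first, the orchestrator only after the read contracts check out.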

Expected top-level keys:

Endpoint	Keys
`/status`	`collected_at`, `collection`, `derived`, `files`, `last_error`, `manifests`, `partitions`, `snapshot`, `source`, `stale`, `table`
`/health`	`assessed_at`, `collection`, `error`, `health_status`, `issues`, `status_collected_at`, `table`
`/maintenance/recommendation`	`error`, `evaluated_at`, `evidence`, `history`, `policy`, `recommended_actions`, `skip_reasons`, `status_collected_at`, `table`
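
Comparing the key lists by eye is error-prone, so the check can be scripted. The sketch below hard-codes the expected keys from the table above and compares them as a set, ignoring order. The function names are illustrative, and check_endpoint assumes curl, jq, and ICEPACK_API_URL are available as in the commands above.

```shell
# Expected top-level keys per endpoint, copied from the table above.
expected_keys() {
  case "$1" in
    status) echo "collection collected_at derived files last_error manifests partitions snapshot source stale table" ;;
    health) echo "assessed_at collection error health_status issues status_collected_at table" ;;
    maintenance/recommendation) echo "error evaluated_at evidence history policy recommended_actions skip_reasons status_collected_at table" ;;
    *) return 2 ;;
  esac
}

# Sort a whitespace-separated key list so the comparison ignores order.
normalize_keys() {
  printf '%s\n' $* | LC_ALL=C sort | tr '\n' ' ' | sed 's/ *$//'
}

# Usage: keys_match <endpoint> <keys...>; exit 0 only on an exact set match.
keys_match() {
  endpoint="$1"; shift
  want="$(expected_keys "$endpoint")" || return 2
  [ "$(normalize_keys "$@")" = "$(normalize_keys "$want")" ]
}

# Usage: check_endpoint <database> <table> <endpoint>
check_endpoint() {
  keys="$(curl -sf "$ICEPACK_API_URL/tables/$1/$2/$3" | jq -r 'keys[]')"
  keys_match "$3" "$keys"
}
```

A failed request yields an empty key list, so check_endpoint reports a mismatch rather than a false pass.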

Environments

Icepack is deployed to four environments, each in its own AWS account and EKS cluster:

Environment	Account	Terraform root	Values file
dev	`411706531007`	`terraform/icepack-api/env/dev/`	`charts/icepack/values-dev.yaml`
stage	`174558682303`	`terraform/icepack-api/env/stage/`	`charts/icepack/values-stage.yaml`
preprod	`391322830928`	`terraform/icepack-api/env/preprod/`	`charts/icepack/values-preprod.yaml`
prod	`857967394368`	`terraform/icepack-api/env/prod/`	`charts/icepack/values-prod.yaml`

Each environment has its own IRSA role, Polaris service principal, Postgres instance, and Secrets Manager entries. The CI/CD workflow (.github/workflows/deploy-icepack.yml) supports all four environments via workflow_dispatch with an environment selector; pushes to main auto-deploy to dev only. Prod deploys require manual approval via the main-deploy GitHub environment gate.

Why Terraform only

Terraform owns the Helm release via the helm_release resource. Read-only Helm commands such as helm status are fine for inspection, but any mutating Helm command creates state drift. After a direct helm upgrade, the next terraform apply will see a version mismatch and either fail or force a destructive re-deploy.

If you need to inspect what is deployed:

# Safe read-only Helm commands
helm list -n icepack
helm get values icepack -n icepack
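
The same read-only commands can be used to compare the deployed chart version with the one in the repository. This is a sketch under assumptions: the chart name is taken to be icepack (inferred from the charts/icepack/ path), Chart.yaml is assumed to have a top-level version: line, and jq is assumed to be installed.

```shell
# Print the version from a Chart.yaml file (first top-level "version:" line).
chart_yaml_version() {
  sed -n 's/^version:[[:space:]]*//p' "$1" | tr -d "\"'" | head -n 1
}

# helm list reports the chart as "<name>-<version>", e.g. "icepack-1.4.2".
deployed_chart_version() {
  printf '%s\n' "${1#icepack-}"
}

# Read-only comparison of the cluster against the working tree.
check_chart_version() {
  chart="$(helm list -n icepack -o json | jq -r '.[] | select(.name == "icepack") | .chart')"
  have="$(deployed_chart_version "$chart")"
  want="$(chart_yaml_version charts/icepack/Chart.yaml)"
  if [ "$have" = "$want" ]; then
    echo "deployed chart matches repo: $have"
  else
    echo "deployed=$have repo=$want"
    return 1
  fi
}
```

A mismatch only means the cluster and the working tree differ, which is expected between a version bump and the next terraform apply.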

Chart versioning rule

Bump version in charts/icepack/Chart.yaml whenever you modify any file under charts/icepack/templates/. The version is the only signal Terraform has that a template changed (values changes are detected separately). A common mistake is modifying a template and forgetting the version bump: terraform plan shows “No changes” and the new template never deploys.
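
A forgotten bump can be caught before terraform plan with a small pre-commit or CI check. The sketch below is illustrative only: the function names and the origin/main base ref are assumptions, and it only checks that the version line changed, not that it increased.

```shell
# True if a unified diff on stdin adds a top-level "version:" line.
version_line_added() {
  grep -q '^+version:'
}

# Fail when templates differ from the base ref but the Chart.yaml version does not.
check_chart_bump() {
  base="${1:-origin/main}"   # assumed base ref
  if git diff --name-only "$base" -- charts/icepack/templates | grep -q .; then
    if ! git diff "$base" -- charts/icepack/Chart.yaml | version_line_added; then
      echo "templates changed but Chart.yaml version was not bumped" >&2
      return 1
    fi
  fi
  return 0
}

# Usage: check_chart_bump            (compare against origin/main)
#        check_chart_bump <base-ref>
```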

Infrastructure inventory

Resource	Kind	Key settings
API	Deployment (2 replicas, arm64)	Port 8000; defaults 250m/256Mi requests and 500m/512Mi limits; dev overrides to 500m/1Gi requests and 1 CPU/8Gi limits
NLB Service	LoadBalancer	ACM TLS termination, external-dns hostname
Worker	KEDA ScaledJob	postgresql trigger, 30s polling, max 5 replicas (default) / 3 (dev, stage, preprod)
Orchestrator	CronJob	Hourly at :30 (dev, stage, preprod), every 2h (default chart), concurrencyPolicy: Forbid
Health Sync	CronJob	Every 15 minutes, concurrencyPolicy: Forbid
Postgres	Deployment (1 replica when `postgres.internal.enabled=true`)	17-alpine, PVC-backed (gp3), Recreate strategy
IRSA ServiceAccount	ServiceAccount	eks.amazonaws.com/role-arn annotation, OIDC-bound

All resources live in the icepack namespace. Nodes are selected by kubernetes.io/arch: arm64.

Rollback tiers

Icepack has three rollback tiers depending on the severity of the issue.

Tier 1: Application bug

Flip the image tag back to the last known-good version and apply:

# In terraform.tfvars or the helm_release set block
image_tag = "abc123-previous"

terraform apply

This is the fastest rollback (~30 seconds). Only the pod image changes; the chart, values, and infrastructure remain the same.
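
If the tag lives in terraform.tfvars, the edit can be scripted so the rollback is repeatable and shows up in version control. This is a sketch that assumes a single image_tag = "..." line in the file. The helper names are made up, and the tag check is deliberately conservative.

```shell
# Accept only characters that are valid in a Docker image tag.
valid_tag() {
  case "$1" in
    ''|*[!A-Za-z0-9_.-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage: set_image_tag <tfvars-file> <tag>
set_image_tag() {
  file="$1"; tag="$2"
  valid_tag "$tag" || { echo "refusing suspicious tag: $tag" >&2; return 2; }
  grep -q '^image_tag[[:space:]]*=' "$file" || { echo "no image_tag line in $file" >&2; return 1; }
  sed "s|^image_tag[[:space:]]*=.*|image_tag = \"$tag\"|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage: set_image_tag terraform.tfvars abc123-previous && terraform apply
```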

Tier 2: Chart regression

If a template or values change caused the problem, revert the offending commit and apply. The revert restores the previous templates and values along with the previous charts/icepack/Chart.yaml version:

git revert <commit-that-broke-the-chart>
terraform apply

Terraform detects the version change and upgrades the release to the restored chart, which has the effect of a rollback. This takes slightly longer than Tier 1 because Kubernetes must reconcile the full chart diff.

Tier 3: Postgres corruption

If the Postgres data is corrupted or schema migrations went wrong:

PVC-backed internal Postgres: Delete the Postgres deployment, delete the PVC, and re-apply. Icepack auto-creates tables on startup via CREATE TABLE IF NOT EXISTS DDL, so a fresh database starts cleanly. Historical job data will be lost.
RDS-backed external Postgres: Restore from an RDS snapshot to a point in time before the corruption. Re-run any pending schema migrations after the restore.

In both cases, verify recovery by checking that GET /readyz returns 200 and by submitting a test dry-run job.
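
The readiness check may need a few attempts while pods restart, so a small retry loop helps. This is a sketch: the retry and readyz_ok names are illustrative, and it assumes /readyz is served under the same ICEPACK_API_URL used for the table endpoints. The dry-run submission is left out because its request format is not described here.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    n=$((n + 1))
    if [ "$n" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeeds only when GET /readyz answers with HTTP 200.
readyz_ok() {
  code="$(curl -s -o /dev/null -w '%{http_code}' "$ICEPACK_API_URL/readyz")"
  [ "$code" = "200" ]
}

# Usage: retry 30 5 readyz_ok   (up to 30 attempts, 5 seconds apart)
```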