Observability

Icepack exposes metrics via OpenTelemetry, emits structured logs via structlog, and owns operational dashboard and alert assets in the repository.

Metrics

The API mounts Prometheus-formatted metrics at /metrics on the same FastAPI service port as the application (8000 in the container). Worker and orchestrator pods are ephemeral jobs: they record operational metrics while executing, and can push OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT is configured, but they do not serve long-lived scrape endpoints.

/metrics is operational telemetry for Icepack itself. Raw Iceberg table state is exposed by /tables/{database}/{table}/status; table health assessments are exposed by /tables/{database}/{table}/health.
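
As a small illustration of the two table endpoints above, the following sketch builds their URLs and fetches the response. The base URL, the helper names, and the assumption that both endpoints answer a plain GET with a JSON object are illustrative and not taken from the Icepack source.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical base URL; 8000 is the container port mentioned above.
BASE_URL = "http://localhost:8000"


def table_url(database: str, table: str, view: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a table's 'status' or 'health' endpoint."""
    if view not in ("status", "health"):
        raise ValueError(f"unknown view: {view!r}")
    # Percent-encode each path segment so unusual names cannot break the path.
    db = quote(database, safe="")
    tbl = quote(table, safe="")
    return f"{base_url}/tables/{db}/{tbl}/{view}"


def fetch_table_view(database: str, table: str, view: str, base_url: str = BASE_URL) -> dict:
    """GET the endpoint and decode the body, assuming it is a JSON object."""
    with urlopen(table_url(database, table, view, base_url), timeout=10) as resp:
        return json.load(resp)
```

Calling fetch_table_view("analytics", "events", "health") would request /tables/analytics/events/health on the assumed base URL.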

Metric inventory

Metric	Type	Labels	Description
`icepack.job.duration`	Histogram	`database`, `table_name`, `status`, `environment`	End-to-end job wall-clock time
`icepack.job.total`	Counter	`database`, `table_name`, `status`, `environment`	Total jobs by terminal status
`icepack.action.duration`	Histogram	`database`, `table_name`, `action`, `status`, `environment`	Per-action wall-clock time
`icepack.action.total`	Counter	`database`, `table_name`, `action`, `status`, `environment`	Total action executions
`icepack.queue.depth`	Observable Gauge	`environment`	Number of unclaimed, visible jobs in the queue
`icepack.workers.active`	Observable Gauge	`environment`	Count of distinct `claimed_by` values (active workers)
`icepack.tables.discovered`	Gauge		Number of tables in the table cache
`icepack.orchestrator.run.duration`	Histogram	`environment`, `status`, `dry_run`	Wall-clock time for one orchestrator run
`icepack.orchestrator.run.total`	Counter	`environment`, `status`, `dry_run`	Total orchestrator runs by terminal status
`icepack.orchestrator.tables.assessed`	Histogram	`environment`, `status`, `dry_run`	Tables assessed by one orchestrator run
`icepack.orchestrator.tables.skipped`	Histogram	`environment`, `status`, `dry_run`	Tables skipped by one orchestrator run
`icepack.orchestrator.tables.healthy`	Histogram	`environment`, `status`, `dry_run`	Tables reported healthy by one orchestrator run
`icepack.orchestrator.tables.candidates`	Histogram	`environment`, `status`, `dry_run`	Tables selected as maintenance candidates by one orchestrator run
`icepack.orchestrator.jobs.submitted`	Histogram	`environment`, `status`, `dry_run`	Maintenance jobs submitted by one orchestrator run
`icepack.orchestrator.jobs.completed`	Histogram	`environment`, `status`, `dry_run`	Maintenance jobs completed during one orchestrator run
`icepack.orchestrator.jobs.failed`	Histogram	`environment`, `status`, `dry_run`	Maintenance jobs failed during one orchestrator run
`icepack.orchestrator.recommendation.check.total`	Counter	`environment`, `status`	Recommendation endpoint checks by outcome
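
The inventory lists OpenTelemetry instrument names, which contain dots, while /metrics serves the Prometheus text format, whose classic metric names do not allow dots. The sketch below shows only the basic character sanitization. It deliberately ignores unit and type suffixes (for example _total, or the _bucket, _sum, and _count series of a histogram), which vary by exporter version and configuration. The function name is hypothetical; confirm actual series names against /metrics.

```python
import re

# Classic Prometheus metric names match [a-zA-Z_:][a-zA-Z0-9_:]*.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def prometheus_base_name(otel_name: str) -> str:
    """Replace characters that are invalid in a classic Prometheus metric name."""
    name = _INVALID_CHARS.sub("_", otel_name)
    # A metric name may not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name


for otel_name in ("icepack.job.duration", "icepack.queue.depth"):
    print(otel_name, "->", prometheus_base_name(otel_name))
```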

Observable gauges

icepack.queue.depth and icepack.workers.active are observable gauges: their callbacks re-query Postgres on every scrape rather than maintaining in-memory counters. This avoids state drift across API replicas, because every scrape returns the current value from the database regardless of which replica serves the request.
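
A minimal sketch of the scrape-time pattern described above, using sqlite3 as a stand-in for Postgres so that it runs without a database server. The jobs table, its visible_at column, and the function names are assumptions for illustration, not Icepack's actual schema. In an OpenTelemetry setup, functions like these would be called from the observable gauge callbacks.

```python
import sqlite3
import time


def queue_depth(conn: sqlite3.Connection, now: float) -> int:
    """Count jobs that are unclaimed and already visible."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible_at <= ?",
        (now,),
    ).fetchone()
    return row[0]


def active_workers(conn: sqlite3.Connection) -> int:
    """Count distinct claimed_by values; COUNT(DISTINCT ...) ignores NULLs."""
    row = conn.execute("SELECT COUNT(DISTINCT claimed_by) FROM jobs").fetchone()
    return row[0]


def observe(conn: sqlite3.Connection, environment: str) -> list:
    """What a scrape-time callback would report: fresh values, no cached state."""
    labels = {"environment": environment}
    return [
        ("icepack.queue.depth", queue_depth(conn, time.time()), labels),
        ("icepack.workers.active", active_workers(conn), labels),
    ]


# Hypothetical schema and rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible_at REAL)"
)
conn.executemany(
    "INSERT INTO jobs (claimed_by, visible_at) VALUES (?, ?)",
    [
        (None, 0.0),          # unclaimed and visible
        (None, 0.0),          # unclaimed and visible
        (None, 4e9),          # unclaimed but not visible until far in the future
        ("worker-a", 0.0),
        ("worker-a", 0.0),
        ("worker-b", 0.0),
    ],
)
print(observe(conn, "dev"))
```

Because each call reads the table again, a job claimed between two scrapes changes the next reading without any replica having to update a local counter.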

The health-sync path collects table status, derives health, persists both payloads to Postgres, and can push Iceberg table health gauges to Mimir through OTLP when ICEPACK_MIMIR_ENDPOINT is set. Those gauges use the iceberg.table.* namespace and are separate from the API /metrics endpoint.
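
The order of operations in the health-sync path can be sketched as follows. Every callable here (collect_status, derive_health, persist, push) is a hypothetical placeholder passed in by the caller, and the per-table loop is an assumption; only the step order and the ICEPACK_MIMIR_ENDPOINT gate come from the description above.

```python
import os
from typing import Any, Callable, Iterable, Mapping


def run_health_sync(
    tables: Iterable[str],
    collect_status: Callable[[str], Any],
    derive_health: Callable[[Any], Any],
    persist: Callable[[str, Any, Any], None],
    push: Callable[[str, str, Any], None],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Collect, derive, persist, and optionally push; return the number of pushes."""
    # The push step is enabled only when the endpoint variable is set and non-empty.
    endpoint = env.get("ICEPACK_MIMIR_ENDPOINT")
    pushed = 0
    for table in tables:
        status = collect_status(table)
        health = derive_health(status)
        # Both payloads are persisted, whether or not a push follows.
        persist(table, status, health)
        if endpoint:
            push(endpoint, table, health)
            pushed += 1
    return pushed
```

Passing env explicitly keeps the sketch easy to exercise with and without the endpoint configured.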

Structured logging

All log output uses structlog with JSON formatting. Key structured events emitted across the system:

Event	Emitted by	Description
`job_started`	Worker	Job claim succeeded, execution beginning
`job_completed`	Worker	All actions finished successfully
`job_failed`	Worker	Job reached terminal failure
`job_crashed`	Worker	Unhandled exception during execution
`table_cache_synced`	API / TableCacheSyncWorker	Table cache refreshed from catalog
`executor_started`	Worker	Spark query engine connected
`health_sync_started`	Health Sync	Health sync cycle beginning
`health_sync_discovered`	Health Sync	Tables discovered from catalog
`health_sync_collected`	Health Sync	Status and health data collected for tables
`health_sync_pg_written`	Health Sync	Status snapshots and health assessments persisted to Postgres
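
Because the logs are JSON, they are straightforward to process downstream. The sketch below tallies events from a stream of log lines. It assumes one JSON object per line with the event name under the "event" key, which is structlog's default; the job_id field and the sample lines are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Tally structured log events by name, skipping lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray traceback or plain-text line
        if isinstance(record, dict) and isinstance(record.get("event"), str):
            counts[record["event"]] += 1
    return counts


# Illustrative log lines; real records will carry more fields.
sample = [
    '{"event": "job_started", "job_id": "j-1"}',
    '{"event": "job_completed", "job_id": "j-1"}',
    '{"event": "job_started", "job_id": "j-2"}',
    '{"event": "job_failed", "job_id": "j-2"}',
    "not json",
]
print(count_events(sample))
```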

Grafana dashboard and alerts

Icepack owns its operational dashboard source in monitoring/dashboards/icepack-operational-overview.json. Grafana Git Sync publishes that dashboard from the repository; dashboard publication is separate from alert rule publication. Grafana Git Sync is folder-scoped in this stack, so the repository’s spec.title creates a dedicated provisioned folder named Data Platform (git sync).

Icepack also owns the Grafana Git Sync repository resource in monitoring/grafana/icepack-repository.yaml. The resource points Grafana at the Icepack repository’s monitoring/dashboards path and references the existing Grafana GitHub App connection. It is published by .github/workflows/publish-grafana-monitoring.yml using the fetch-rewards/grafana-resource-provisioner-action wrapper around gcx.

Icepack owns OaaSis alert rules in monitoring/oaasis/monitoring.yml. The manifest is uploaded by .github/workflows/publish-oaasis-monitoring.yml using the explicit OaaSis path-to-monitoring-manifest input because the manifest is kept under monitoring/ rather than at the repository root.

The initial DL-505 alerts notify Slack only, in #lakehouse-alerts, and filter on environment="prod".