Observability
Icepack exposes metrics via OpenTelemetry, emits structured logs via structlog, and owns operational dashboard and alert assets in the repository.
Metrics
The API mounts Prometheus-formatted metrics at /metrics on the same FastAPI
service port as the application (8000 in the container). Worker and
orchestrator pods are ephemeral jobs: they record operational metrics while
executing, and can push OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT is
configured, but they do not serve long-lived scrape endpoints.
/metrics is operational telemetry for Icepack itself. Raw Iceberg table state
is exposed by /tables/{database}/{table}/status; table health assessments are
exposed by /tables/{database}/{table}/health.
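As a rough illustration of the route layout above, the following sketch builds the status and health URLs for one table. The helper name `table_endpoint`, the host `localhost`, and the identifiers `analytics` and `events` are assumptions made for the example; only the paths and port 8000 come from this page.

```python
from urllib.parse import quote

METRICS_PATH = "/metrics"  # operational telemetry for Icepack itself


def table_endpoint(base_url: str, database: str, table: str, kind: str) -> str:
    """Build the URL for a table's raw status or its derived health.

    Hypothetical helper: only the path shapes are taken from the docs.
    """
    if kind not in ("status", "health"):
        raise ValueError(f"kind must be 'status' or 'health', got {kind!r}")
    # Percent-encode each segment so unusual identifiers cannot alter the path.
    db_segment = quote(database, safe="")
    table_segment = quote(table, safe="")
    return f"{base_url.rstrip('/')}/tables/{db_segment}/{table_segment}/{kind}"


if __name__ == "__main__":
    base = "http://localhost:8000"  # assumed host; 8000 is the container port
    print(base + METRICS_PATH)
    print(table_endpoint(base, "analytics", "events", "status"))
    print(table_endpoint(base, "analytics", "events", "health"))
```

Keeping the three paths apart in client code mirrors the split described above: /metrics answers questions about Icepack, while the two table routes answer questions about Iceberg tables.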
Metric inventory
| Metric | Type | Labels | Description |
|---|---|---|---|
| icepack.job.duration | Histogram | database, table_name, status, environment | End-to-end job wall-clock time |
| icepack.job.total | Counter | database, table_name, status, environment | Total jobs by terminal status |
| icepack.action.duration | Histogram | database, table_name, action, status, environment | Per-action wall-clock time |
| icepack.action.total | Counter | database, table_name, action, status, environment | Total action executions |
| icepack.queue.depth | Observable Gauge | environment | Number of unclaimed, visible jobs in the queue |
| icepack.workers.active | Observable Gauge | environment | Count of distinct claimed_by values (active workers) |
| icepack.tables.discovered | Gauge | | Number of tables in the table cache |
| icepack.orchestrator.run.duration | Histogram | environment, status, dry_run | Wall-clock time for one orchestrator run |
| icepack.orchestrator.run.total | Counter | environment, status, dry_run | Total orchestrator runs by terminal status |
| icepack.orchestrator.tables.assessed | Histogram | environment, status, dry_run | Tables assessed by one orchestrator run |
| icepack.orchestrator.tables.skipped | Histogram | environment, status, dry_run | Tables skipped by one orchestrator run |
| icepack.orchestrator.tables.healthy | Histogram | environment, status, dry_run | Tables reported healthy by one orchestrator run |
| icepack.orchestrator.tables.candidates | Histogram | environment, status, dry_run | Tables selected as maintenance candidates by one orchestrator run |
| icepack.orchestrator.jobs.submitted | Histogram | environment, status, dry_run | Maintenance jobs submitted by one orchestrator run |
| icepack.orchestrator.jobs.completed | Histogram | environment, status, dry_run | Maintenance jobs completed during one orchestrator run |
| icepack.orchestrator.jobs.failed | Histogram | environment, status, dry_run | Maintenance jobs failed during one orchestrator run |
| icepack.orchestrator.recommendation.check.total | Counter | environment, status | Recommendation endpoint checks by outcome |
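The names in the inventory use OpenTelemetry's dotted form, while Prometheus metric names may only contain letters, digits, underscores, and colons. The sketch below shows the basic character substitution that Prometheus-format exporters apply, which is useful when writing dashboard queries. The function name `prometheus_base_name` is hypothetical, and the sketch deliberately stops at the base name: depending on the exporter version and the instrument's unit, the exposed series may carry extra suffixes (for example a unit, `_total`, or `_bucket`/`_sum`/`_count` for histograms), so the actual /metrics output remains the authority.

```python
import re

# Characters that are not allowed in a Prometheus metric name.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def prometheus_base_name(otel_name: str) -> str:
    """Map a dotted OpenTelemetry instrument name to a Prometheus-safe base name.

    Hypothetical helper: it performs only the character substitution and does
    not add any unit or type suffix an exporter might append.
    """
    name = _INVALID_CHARS.sub("_", otel_name)
    # Prometheus names must not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name


if __name__ == "__main__":
    for otel_name in ("icepack.job.duration", "icepack.queue.depth"):
        print(otel_name, "->", prometheus_base_name(otel_name))
```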
Observable gauges
icepack.queue.depth and icepack.workers.active are observable gauges: they
re-query Postgres on every scrape instead of maintaining in-memory counters.
Because no replica holds a count of its own, there is no state to drift across
API replicas, and every scrape returns the current value from the database
regardless of which replica serves the request.
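A minimal sketch of the query-on-read idea follows. It uses an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere, and the `jobs` table with its `claimed_by` and `visible_at` columns is an assumed schema, not Icepack's actual one. In the real service, functions like these would presumably be invoked from the observable-gauge callbacks each time metrics are collected.

```python
import sqlite3

# Assumed schema: a job is "unclaimed" when claimed_by IS NULL and "visible"
# once its visible_at timestamp has passed.
QUEUE_DEPTH_SQL = """
SELECT COUNT(*) FROM jobs
WHERE claimed_by IS NULL AND visible_at <= :now
"""

WORKERS_ACTIVE_SQL = """
SELECT COUNT(DISTINCT claimed_by) FROM jobs
WHERE claimed_by IS NOT NULL
"""


def queue_depth(conn: sqlite3.Connection, now: float) -> int:
    """Count unclaimed, visible jobs by asking the database directly."""
    return conn.execute(QUEUE_DEPTH_SQL, {"now": now}).fetchone()[0]


def workers_active(conn: sqlite3.Connection) -> int:
    """Count distinct workers that currently hold a claim."""
    return conn.execute(WORKERS_ACTIVE_SQL).fetchone()[0]


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory queue for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible_at REAL)"
    )
    conn.executemany(
        "INSERT INTO jobs (claimed_by, visible_at) VALUES (?, ?)",
        [
            (None, 10.0),        # unclaimed and already visible
            (None, 500.0),       # unclaimed but not yet visible
            ("worker-a", 10.0),  # claimed
            ("worker-a", 20.0),  # claimed by the same worker
            ("worker-b", 30.0),  # claimed by a second worker
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(queue_depth(conn, now=100.0), workers_active(conn))
```

Since neither function keeps state between calls, two processes reading the same database report the same values, which is the property the observable gauges rely on.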
The health-sync path collects table status, derives health, persists both
payloads to Postgres, and can push Iceberg table health gauges to Mimir through
OTLP when ICEPACK_MIMIR_ENDPOINT is set. Those gauges use the
iceberg.table.* namespace and are separate from the API /metrics endpoint.
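The health-sync steps above can be sketched as a small control-flow outline. Everything here apart from the ICEPACK_MIMIR_ENDPOINT variable name is hypothetical: the function `run_health_sync`, its callable parameters, and the per-table push are illustrative stand-ins, and the real implementation may batch or order the work differently.

```python
import os
from typing import Any, Callable, Iterable, Mapping


def run_health_sync(
    tables: Iterable[str],
    collect_status: Callable[[str], Any],
    derive_health: Callable[[Any], Any],
    persist: Callable[[str, Any, Any], None],
    push_gauges: Callable[[str, str, Any], None],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Collect status, derive health, persist both, and optionally push gauges.

    Returns the number of tables processed. All callables are placeholders
    for whatever the real health-sync path uses.
    """
    endpoint = env.get("ICEPACK_MIMIR_ENDPOINT")
    synced = 0
    for table in tables:
        status = collect_status(table)
        health = derive_health(status)
        # Both payloads are always written to the database.
        persist(table, status, health)
        # The gauge push is optional and only happens when an endpoint is set.
        if endpoint:
            push_gauges(endpoint, table, health)
        synced += 1
    return synced
```

Passing `env` explicitly keeps the sketch easy to exercise with and without the endpoint configured.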
Structured logging
All log output uses structlog with JSON formatting. Key structured events emitted across the system:
| Event | Emitted by | Description |
|---|---|---|
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Status and health data collected for tables |
| health_sync_pg_written | Health Sync | Status snapshots and health assessments persisted to Postgres |
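Because each log line is a JSON object, the events above are straightforward to tally when inspecting captured output. The sketch below counts events per name using only the standard library. It assumes the event name is stored under the `event` key, which is structlog's default; the function name `count_events` and the extra fields in the sample lines (`job_id`, `error`) are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Tally structured log events by name, ignoring lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray traceback line mixed into the stream
        if isinstance(record, dict) and "event" in record:
            counts[record["event"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"event": "job_started", "job_id": "j-1"}',
        '{"event": "job_completed", "job_id": "j-1"}',
        '{"event": "job_started", "job_id": "j-2"}',
        '{"event": "job_failed", "job_id": "j-2", "error": "timeout"}',
        "not json",
    ]
    print(count_events(sample))
```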
Grafana dashboard and alerts
Icepack owns its operational dashboard source in
monitoring/dashboards/icepack-operational-overview.json. Grafana Git Sync
publishes that dashboard from the repository; dashboard publication is separate
from alert rule publication. Grafana Git Sync is folder-scoped in this stack,
so the repository’s spec.title creates a dedicated provisioned folder named
Data Platform (git sync).
Icepack also owns the Grafana Git Sync repository resource in
monitoring/grafana/icepack-repository.yaml. The resource points Grafana at
the Icepack repository’s monitoring/dashboards path and references the
existing Grafana GitHub App connection. It is published by
.github/workflows/publish-grafana-monitoring.yml using the
fetch-rewards/grafana-resource-provisioner-action wrapper around gcx.
Icepack owns OaaSis alert rules in monitoring/oaasis/monitoring.yml. The
manifest is uploaded by .github/workflows/publish-oaasis-monitoring.yml using
the explicit OaaSis path-to-monitoring-manifest input because the manifest is
kept under monitoring/ rather than at the repository root.
The initial DL-505 alerts notify Slack only (#lakehouse-alerts) and filter on
environment="prod".