Observability

Icepack exposes metrics via OpenTelemetry, emits structured logs via structlog, and owns operational dashboard and alert assets in the repository.

Metrics

The API mounts Prometheus-formatted metrics at /metrics on the same FastAPI service port as the application (8000 in the container). Worker and orchestrator pods are ephemeral jobs: they record operational metrics while executing, and can push OTLP metrics when OTEL_EXPORTER_OTLP_ENDPOINT is configured, but they do not serve long-lived scrape endpoints.

/metrics is operational telemetry for Icepack itself. Raw Iceberg table state is exposed by /tables/{database}/{table}/status; table health assessments are exposed by /tables/{database}/{table}/health.
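As a small illustration of the two table endpoints, the sketch below builds their paths for a given database and table. The helper name `table_paths` is hypothetical, and percent-encoding the path segments is an assumption about how a client would call the API rather than documented behavior.

```python
from urllib.parse import quote


def table_paths(database, table):
    """Build the status and health paths for one table (hypothetical helper)."""
    # Assumption: path segments are percent-encoded by the client.
    db = quote(database, safe="")
    tbl = quote(table, safe="")
    base = f"/tables/{db}/{tbl}"
    return {"status": f"{base}/status", "health": f"{base}/health"}


paths = table_paths("sales", "orders")
print(paths["status"])
print(paths["health"])
```

The `status` path returns raw Iceberg table state and the `health` path returns the derived assessment; neither overlaps with what /metrics reports.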

Metric inventory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| icepack.job.duration | Histogram | database, table_name, status, environment | End-to-end job wall-clock time |
| icepack.job.total | Counter | database, table_name, status, environment | Total jobs by terminal status |
| icepack.action.duration | Histogram | database, table_name, action, status, environment | Per-action wall-clock time |
| icepack.action.total | Counter | database, table_name, action, status, environment | Total action executions |
| icepack.queue.depth | Observable Gauge | environment | Number of unclaimed, visible jobs in the queue |
| icepack.workers.active | Observable Gauge | environment | Count of distinct claimed_by values (active workers) |
| icepack.tables.discovered | Gauge | | Number of tables in the table cache |
| icepack.orchestrator.run.duration | Histogram | environment, status, dry_run | Wall-clock time for one orchestrator run |
| icepack.orchestrator.run.total | Counter | environment, status, dry_run | Total orchestrator runs by terminal status |
| icepack.orchestrator.tables.assessed | Histogram | environment, status, dry_run | Tables assessed by one orchestrator run |
| icepack.orchestrator.tables.skipped | Histogram | environment, status, dry_run | Tables skipped by one orchestrator run |
| icepack.orchestrator.tables.healthy | Histogram | environment, status, dry_run | Tables reported healthy by one orchestrator run |
| icepack.orchestrator.tables.candidates | Histogram | environment, status, dry_run | Tables selected as maintenance candidates by one orchestrator run |
| icepack.orchestrator.jobs.submitted | Histogram | environment, status, dry_run | Maintenance jobs submitted by one orchestrator run |
| icepack.orchestrator.jobs.completed | Histogram | environment, status, dry_run | Maintenance jobs completed during one orchestrator run |
| icepack.orchestrator.jobs.failed | Histogram | environment, status, dry_run | Maintenance jobs failed during one orchestrator run |
| icepack.orchestrator.recommendation.check.total | Counter | environment, status | Recommendation endpoint checks by outcome |
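To show how the inventory might be consumed from a /metrics scrape, here is a minimal sketch that parses Prometheus text exposition and sums a counter by one label. It assumes the exporter follows the common OpenTelemetry-to-Prometheus convention of replacing dots with underscores, so icepack.job.total would appear as icepack_job_total; check the real endpoint output before relying on that name. The sample text and its label values are invented for illustration.

```python
import re
from collections import defaultdict

# Illustrative exposition text; names and values are assumptions, not real output.
SAMPLE = '''\
# HELP icepack_job_total Total jobs by terminal status
# TYPE icepack_job_total counter
icepack_job_total{database="sales",table_name="orders",status="completed",environment="prod"} 12
icepack_job_total{database="sales",table_name="orders",status="failed",environment="prod"} 3
icepack_job_total{database="ops",table_name="events",status="completed",environment="prod"} 7
icepack_queue_depth{environment="prod"} 4
'''

SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def sum_by_label(samples, metric, label):
    """Sum the values of one metric, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


jobs_by_status = sum_by_label(parse_samples(SAMPLE), "icepack_job_total", "status")
print(jobs_by_status)
```

In practice a Prometheus or Mimir query would do this aggregation; the parser is only meant to make the label structure in the table concrete.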

Observable gauges

icepack.queue.depth and icepack.workers.active are observable gauges: instead of maintaining in-memory counters, they re-query Postgres on every scrape. This avoids state drift across API replicas, because every scrape returns the current value from the database regardless of which replica serves the request.
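The query-on-scrape idea can be sketched without any metrics library. In the example below an in-memory SQLite database stands in for Postgres, and the `jobs` table with its `claimed_by` and `visible` columns is a hypothetical schema chosen to match the metric descriptions, not Icepack's actual one. In a real OpenTelemetry setup, functions like these would be registered as observable gauge callbacks.

```python
import sqlite3

# SQLite stands in for Postgres; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT, visible INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO jobs (claimed_by, visible) VALUES (?, ?)",
    [(None, 1), (None, 1), (None, 0), ("worker-a", 1), ("worker-a", 1), ("worker-b", 1)],
)


def queue_depth(db):
    """Count unclaimed, visible jobs at the moment of the call."""
    row = db.execute(
        "SELECT COUNT(*) FROM jobs WHERE claimed_by IS NULL AND visible = 1"
    ).fetchone()
    return row[0]


def workers_active(db):
    """Count distinct workers that currently hold a claim."""
    row = db.execute(
        "SELECT COUNT(DISTINCT claimed_by) FROM jobs WHERE claimed_by IS NOT NULL"
    ).fetchone()
    return row[0]


def scrape(db):
    """Simulate one scrape: each gauge value comes from a fresh query."""
    return {
        "icepack.queue.depth": queue_depth(db),
        "icepack.workers.active": workers_active(db),
    }


first = scrape(conn)
# A new worker claims job 1 between scrapes; no in-process counter is updated.
conn.execute("UPDATE jobs SET claimed_by = 'worker-c' WHERE id = 1")
second = scrape(conn)
print(first)
print(second)
```

Because the values are derived from the database at read time, any replica running the same queries reports the same numbers. The trade-off is one query per gauge per scrape.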

The health-sync path collects table status, derives health, persists both payloads to Postgres, and can push Iceberg table health gauges to Mimir through OTLP when ICEPACK_MIMIR_ENDPOINT is set. Those gauges use the iceberg.table.* namespace and are separate from the API /metrics endpoint.

Structured logging

All log output uses structlog with JSON formatting. Key structured events emitted across the system:

| Event | Emitted by | Description |
| --- | --- | --- |
| job_started | Worker | Job claim succeeded, execution beginning |
| job_completed | Worker | All actions finished successfully |
| job_failed | Worker | Job reached terminal failure |
| job_crashed | Worker | Unhandled exception during execution |
| table_cache_synced | API / TableCacheSyncWorker | Table cache refreshed from catalog |
| executor_started | Worker | Spark query engine connected |
| health_sync_started | Health Sync | Health sync cycle beginning |
| health_sync_discovered | Health Sync | Tables discovered from catalog |
| health_sync_collected | Health Sync | Status and health data collected for tables |
| health_sync_pg_written | Health Sync | Status snapshots and health assessments persisted to Postgres |
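Because the logs are JSON, the events above can be filtered mechanically. The sketch below picks out worker failure events from a list of log lines, assuming the event name is stored under structlog's default `event` key. Every other field shown (`job_id`, `error`, `timestamp`) is hypothetical, as are the sample lines themselves.

```python
import json

# Illustrative log lines; only the event names come from the table above.
LOG_LINES = [
    '{"event": "job_started", "job_id": "j-1", "timestamp": "2024-01-01T00:00:00Z"}',
    '{"event": "job_failed", "job_id": "j-1", "error": "commit conflict"}',
    "not json at all",
    '{"event": "job_crashed", "job_id": "j-2", "error": "OOM"}',
    '{"event": "job_completed", "job_id": "j-3"}',
]

FAILURE_EVENTS = {"job_failed", "job_crashed"}


def failure_events(lines):
    """Return parsed records whose event is a terminal failure or a crash."""
    found = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log line.
            continue
        if isinstance(record, dict) and record.get("event") in FAILURE_EVENTS:
            found.append(record)
    return found


failures = failure_events(LOG_LINES)
print([(r["event"], r["job_id"]) for r in failures])
```

A log pipeline would normally apply the same filter as a query on the `event` field rather than in application code.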

Grafana dashboard and alerts

Icepack owns its operational dashboard source in monitoring/dashboards/icepack-operational-overview.json. Grafana Git Sync publishes that dashboard from the repository; dashboard publication is separate from alert rule publication. Grafana Git Sync is folder-scoped in this stack, so the repository’s spec.title creates a dedicated provisioned folder named Data Platform (git sync).

Icepack also owns the Grafana Git Sync repository resource in monitoring/grafana/icepack-repository.yaml. The resource points Grafana at the Icepack repository’s monitoring/dashboards path and references the existing Grafana GitHub App connection. It is published by .github/workflows/publish-grafana-monitoring.yml using the fetch-rewards/grafana-resource-provisioner-action wrapper around gcx.

Icepack owns OaaSis alert rules in monitoring/oaasis/monitoring.yml. The manifest is uploaded by .github/workflows/publish-oaasis-monitoring.yml using the explicit OaaSis path-to-monitoring-manifest input because the manifest is kept under monitoring/ rather than at the repository root.

The initial DL-505 alerts notify Slack only (#lakehouse-alerts) and filter on environment="prod".