Troubleshooting

This page covers the most common failure modes, their symptoms, and how to resolve them.

Workers not scaling

Symptom: The job_queue table has unclaimed rows, but no worker pods are spawning. kubectl get pods -n icepack shows no worker pods.

Cause: The KEDA ScaledJob is not triggering. This usually means the postgresql trigger cannot connect to the database, or the trigger query is not returning the expected result.

Diagnosis:

Check the ScaledJob status:
```
kubectl get scaledjob -n icepack
```
Look at the READY column. If it shows False, KEDA cannot evaluate the trigger.

Check KEDA operator logs for connection errors:
```
kubectl logs -n keda -l app=keda-operator --tail=50
```

Verify the job_queue has unclaimed work:

```
SELECT COUNT(*) FROM job_queue
WHERE claimed_at IS NULL AND visible_at <= NOW();
```

Verify activationTargetQueryValue is set in the ScaledJob trigger metadata. KEDA 2.12+ requires activationTargetQueryValue (not activationLagCount) to activate from zero replicas. Without it, KEDA will not scale up from zero even when there is work in the queue.

Resolution: Fix the KEDA trigger authentication (check the Secret referenced by TriggerAuthentication), verify Postgres connectivity from the KEDA namespace, and confirm the activationTargetQueryValue is present.

Jobs stuck in pending

Symptom: Jobs show status: pending for longer than expected. Workers may or may not be running.

Cause: Several possible causes:

The KEDA polling interval is 30 seconds, so up to 30 seconds can pass before KEDA notices a newly queued job and starts a worker pod.
The visible_at timestamp on the queue row may be in the future (retry backoff).
A stale claim from a crashed worker may be blocking the row. The reclaim_stale sweeper releases claims older than 30 minutes, but this requires the API process to be running.
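The three causes above can be told apart from the two timestamps on a queue row. The following is a minimal sketch, not Icepack code: the function name and the returned labels are hypothetical, and only the 30-minute stale-claim threshold comes from this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from this page: claims older than 30 minutes count as stale.
STALE_CLAIM_AFTER = timedelta(minutes=30)


def classify_queue_row(
    visible_at: datetime,
    claimed_at: Optional[datetime],
    now: datetime,
) -> str:
    """Return a hypothetical label explaining the state of one job_queue row."""
    if claimed_at is not None:
        if now - claimed_at > STALE_CLAIM_AFTER:
            return "stale_claim"  # waiting for the reclaim_stale sweeper
        return "claimed"          # a worker currently owns the row
    if visible_at > now:
        return "backoff"          # retry backoff: the row is not visible yet
    return "ready"                # unclaimed and visible: waits for the next KEDA poll


now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
print(classify_queue_row(now - timedelta(minutes=1), None, now))
print(classify_queue_row(now + timedelta(minutes=5), None, now))
print(classify_queue_row(now - timedelta(hours=1), now - timedelta(minutes=45), now))
```

A row labelled ready that stays pending for much longer than one polling interval points back at the KEDA trigger rather than at the queue.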

Diagnosis:

Check queue row timestamps:

```
SELECT job_id, visible_at, claimed_at
FROM job_queue
WHERE claimed_at IS NULL
ORDER BY visible_at;
```

Check for stale claims (claimed but not progressing):

```
SELECT job_id, claimed_at
FROM job_queue
WHERE claimed_at IS NOT NULL
  AND claimed_at < NOW() - INTERVAL '30 minutes';
```

Verify the API is running (the reclaim_stale sweeper runs inside the API process):
```
kubectl get pods -n icepack -l app.kubernetes.io/component=api
```

Resolution: If stale claims exist and the API is running, the sweeper will reclaim them within 30 seconds. If the API is not running, fix the API first — the sweeper cannot run without it. For jobs stuck behind a future visible_at, wait for the backoff window to expire.

Health sync OOM

Symptom: The health-sync CronJob pod is OOMKilled. kubectl describe pod shows the container exceeded its memory limit.

Cause: PyIceberg loads table metadata into memory. With high concurrency and many large tables, the combined memory footprint exceeds the pod’s limit. This was tracked in DL-278.

Diagnosis:

```
kubectl get pods -n icepack -l app.kubernetes.io/component=health-sync --sort-by=.status.startTime
kubectl describe pod <oom-killed-pod> -n icepack
```

Look for Last State: Terminated with Reason: OOMKilled and check the memory limit in the container spec.

Resolution: Reduce the ICEPACK_HEALTH_SYNC_CONCURRENCY setting. For the dev environment the concurrency is set to 2 (down from the default 10). In the Helm values:

```
healthSync:
  concurrency: 2
  resources:
    limits:
      memory: 768Mi
```

If the problem persists even at concurrency 2, increase the memory limit rather than lowering concurrency further — at concurrency 1 the sync window may exceed the 15-minute CronJob interval.
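The trade-off between concurrency, memory, and the sync window can be estimated with a rough model. This is a back-of-the-envelope sketch only: it assumes tables are synced in waves of `concurrency` tables, and the per-table time, per-table memory, baseline memory, and table count are all hypothetical inputs that you would need to measure in your own environment.

```python
import math

CRON_INTERVAL_SECONDS = 15 * 60  # CronJob interval mentioned on this page


def estimate_health_sync(
    table_count: int,
    concurrency: int,
    seconds_per_table: float,   # hypothetical average sync time per table
    peak_mib_per_table: float,  # hypothetical peak metadata footprint per table
    baseline_mib: float = 0.0,  # hypothetical process overhead
) -> dict:
    """Crude model: tables are processed in waves of `concurrency` tables."""
    waves = math.ceil(table_count / concurrency)
    duration = waves * seconds_per_table
    peak_memory = baseline_mib + concurrency * peak_mib_per_table
    return {
        "duration_seconds": duration,
        "peak_memory_mib": peak_memory,
        "fits_interval": duration <= CRON_INTERVAL_SECONDS,
    }


# Illustrative numbers only: 140 tables, 10 s and 150 MiB per table, 200 MiB baseline.
for concurrency in (1, 2, 10):
    print(concurrency, estimate_health_sync(140, concurrency, 10.0, 150.0, 200.0))
```

With these made-up inputs, concurrency 1 overruns the 15-minute interval and concurrency 10 needs far more than 768Mi, which is the shape of the trade-off described above.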

For deployments using ICEPACK_METADATA_INSPECTOR=iceberg-go, DL-605 removes the helper’s PlanFiles() health path. The helper streams live entries from the current snapshot’s manifests instead of materializing a full file-task list, so high-file-count tables should not require small-file estimation to avoid OOM. If an iceberg-go health-sync pod is still OOMKilled after DL-605, capture the table name, data-file count, manifest count, pod memory limit, and helper stderr before increasing resources.

Table not appearing in orchestrator

Symptom: A table you expect to be maintained never has maintenance submitted for it.

Cause: In opt-out mode (the default) the orchestrator processes a table only when all of these hold:

1. The table appears in the API table cache (GET /tables).
2. The table’s database is listed in the databases value.
3. The table has not opted out: icepack.maintenance_enabled is not false and compaction_skip is not true.

(In opt-in mode, condition 3 is stricter: the table must have icepack.maintenance_enabled = true.) If any condition is not met, the orchestrator skips the table silently.
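The conditions above can be written as a single predicate. This is a sketch of the rule as described on this page, not the orchestrator's actual code: the function and parameter names are hypothetical, property values are assumed to be the strings true and false, and it assumes that compaction_skip = true still opts a table out in opt-in mode.

```python
from typing import Mapping, Optional, Sequence


def _prop(props: Mapping[str, str], key: str) -> Optional[str]:
    """Normalized property value, or None when the property is unset."""
    raw = props.get(key)
    return None if raw is None else str(raw).strip().lower()


def is_maintained(
    database: str,
    in_table_cache: bool,
    databases: Sequence[str],
    props: Mapping[str, str],
    opt_in: bool = False,
) -> bool:
    """Return True when all conditions for orchestrator processing hold."""
    if not in_table_cache:          # the table must be in the API table cache
        return False
    if database not in databases:   # its database must be in the databases value
        return False
    if _prop(props, "compaction_skip") == "true":
        return False                # explicit opt-out
    enabled = _prop(props, "icepack.maintenance_enabled")
    if opt_in:
        return enabled == "true"    # opt-in mode: must be explicitly enabled
    return enabled != "false"       # opt-out mode: anything but false is fine


dbs = ["offer_service"]
print(is_maintained("offer_service", True, dbs, {}))
print(is_maintained("offer_service", True, dbs, {}, opt_in=True))
print(is_maintained("offer_service", True, dbs, {"compaction_skip": "true"}))
```

Because a failed check returns silently, walking the conditions in this order during diagnosis mirrors how the table is filtered out.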

Diagnosis:

Verify the table appears in the table cache and check its allowlist flag:
```
curl -s "https://<icepack-host>/tables?database=<database>" | jq '.[] | {table_name, in_maintenance_allowlist}'
```
If in_maintenance_allowlist is false, the database is not in the databases value. In the web UI, such tables appear greyed out and disabled. If the table is not in the response at all, the API TableCacheSyncWorker has not discovered it yet or the catalog cannot list it.

Check the table’s icepack.maintenance_enabled and compaction_skip properties from Spark/Kyuubi or the Iceberg catalog. maintenance_enabled = false or compaction_skip = true opts the table out.

Check the databases value (the single list that drives both maintenance and health sync):
```
helm get values icepack -n icepack | grep -A1 '^databases'
```
The table’s database must be in this list. A database absent from it is neither maintained nor health-synced (cached status and health lookups would 404 until a live check persists snapshots).

Check the table’s recommendation:
```
curl -s https://<icepack-host>/tables/<database>/<table>/maintenance/recommendation | jq '{
  recommended_actions,
  skip_reasons,
  policy,
  history,
  error
}'
```
skip_reasons explains policy and cadence skips such as operation_disabled, recent_success, table_not_in_allowlist, or status_unavailable.

Resolution: Ensure the database is in the databases value, and that the table has not been opted out. Then run terraform apply.

Position-delete rows stay high after maintenance

Symptom: A table still has a large number of position-delete rows after a maintenance job that included rewrite_position_delete_files. The table may also still show many delete files, small files, or old snapshots in the health page after the job completes.

Cause: rewrite_position_delete_files compacts delete files, but it does not apply live position deletes into data files by itself. To materially reduce live position-delete rows, Icepack also needs rewrite_data_files to select the delete-heavy data files. The default delete-pressure options are conservative:

| Property | Default | Meaning |
| --- | --- | --- |
| `icepack.rewrite_data_delete_file_threshold` | `2` | Rewrite data files with at least this many associated delete files. |
| `icepack.rewrite_data_delete_ratio_threshold` | `0.10` | Rewrite data files when deleted rows are at least this fraction of the data file. |

For a table with broad position-delete pressure, those defaults may skip data files that only have one associated delete file or less than 10% deleted rows. Lowering snapshot age only removes old metadata references after rewrites make files unreferenced; it does not remove live position deletes.
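To see why the defaults can leave delete-heavy files untouched, the selection rule can be sketched as follows. This is an illustration under stated assumptions, not Icepack's implementation: the `DataFileStats` type and function name are hypothetical, the two thresholds are assumed to be combined with OR, and other selection criteria such as file size are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    """Hypothetical per-data-file statistics used for this illustration."""
    path: str
    record_count: int
    deleted_rows: int
    delete_file_count: int


def selected_for_rewrite(
    f: DataFileStats,
    delete_file_threshold: int = 2,        # default from the table above
    delete_ratio_threshold: float = 0.10,  # default from the table above
) -> bool:
    """Assumed rule: rewrite when either delete-pressure threshold is met."""
    ratio = f.deleted_rows / f.record_count if f.record_count else 0.0
    return (
        f.delete_file_count >= delete_file_threshold
        or ratio >= delete_ratio_threshold
    )


# One associated delete file and 6% deleted rows.
f = DataFileStats("data/part-0001.parquet", 1000, 60, 1)
print(selected_for_rewrite(f))           # defaults
print(selected_for_rewrite(f, 1, 0.05))  # more aggressive overrides
```

Under these assumptions the example file is skipped with the defaults and selected once the thresholds are lowered to 1 and 0.05.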

Diagnosis:

Check raw table status and policy-aware recommendation separately:

```
curl -s https://<icepack-host>/tables/<database>/<table>/status | jq '{
  files,
  snapshot,
  derived,
  manifests
}'
```

```
curl -s https://<icepack-host>/tables/<database>/<table>/maintenance/recommendation | jq '{
  recommended_actions,
  policy,
  evidence,
  skip_reasons
}'
```

Confirm the recommended actions include both rewrite_data_files and rewrite_position_delete_files. If only position-delete rewrite runs, the job may compact delete files without applying deletes into rewritten data files.
Check whether the table has table-level delete-pressure overrides in policy.icepack_config.

Resolution: For offer_service.offer_eligibility_stage, use more aggressive delete-pressure selection while keeping rewrite-all disabled for normal operation:

```
ALTER TABLE lakehouse_dev.offer_service.offer_eligibility_stage
  SET TBLPROPERTIES (
    'icepack.maintenance_enabled' = 'true',
    'icepack.maintenance_cadence_hours' = '24',
    'icepack.max_snapshot_age_days' = '1',
    'icepack.rewrite_data_delete_file_threshold' = '1',
    'icepack.rewrite_data_delete_ratio_threshold' = '0.05',
    'icepack.rewrite_data_rewrite_all' = 'false'
  );
```

Then run maintenance with expire_snapshots, remove_orphan_files, rewrite_data_files, and rewrite_position_delete_files. Re-check health after the job completes and after the next health-sync pass.

If position-delete rows still do not drop materially after the targeted rewrite, use rewrite-all as a one-time remediation:

```
ALTER TABLE lakehouse_dev.offer_service.offer_eligibility_stage
  SET TBLPROPERTIES (
    'icepack.rewrite_data_rewrite_all' = 'true'
  );
```

Run the same maintenance actions once, then turn it back off:

```
ALTER TABLE lakehouse_dev.offer_service.offer_eligibility_stage
  SET TBLPROPERTIES (
    'icepack.rewrite_data_rewrite_all' = 'false'
  );
```

Do not leave icepack.rewrite_data_rewrite_all = true as a steady-state setting. It can force unnecessary full data-file rewrites on future jobs.

Maintenance fails when Spark cannot broadcast a large table

Symptom: A maintenance action fails with a Spark broadcast-size error:

```
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 9.0 GiB
```

Cause: Spark planned a broadcast join for an Iceberg maintenance procedure, but the broadcast side exceeded Spark’s broadcast-size limit. For large maintenance inputs, this can happen during actions such as remove_orphan_files, where Spark may choose an adaptive broadcast plan that is not viable for the table.

Diagnosis:

Check the failed action error:

```
curl -s https://<icepack-host>/jobs/<job-id> \
  | jq '.results[]? | select(.success == false) | {action, error}'
```

Inspect the table’s current Spark SQL override properties:

```
curl -s https://<icepack-host>/tables/<database>/<table>/maintenance/recommendation \
  | jq '.policy.icepack_config | with_entries(select(.key | startswith("spark.sql.")))'
```

Resolution: Consider steering Icepack maintenance away from adaptive broadcast joins for the affected table:

```
ALTER TABLE lakehouse_<env>.<database>.<table>
  SET TBLPROPERTIES (
    'icepack.spark.sql.autoBroadcastJoinThreshold' = '-1',
    'icepack.spark.sql.adaptive.autoBroadcastJoinThreshold' = '-1',
    'icepack.spark.sql.join.preferSortMergeJoin' = 'false',
    'icepack.spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold' = '2g'
  );
```

See the table properties reference for the full allowlist and rollback SQL.

Maintenance fails with invalid Spark SQL override

Symptom: A maintenance action fails immediately with an error referencing an icepack.spark.sql.* property. The Iceberg procedure never executes.

Cause: The table has an icepack.spark.sql.* property with a malformed value. Icepack validates allowlisted Spark SQL overrides before running the maintenance statement, so an invalid value fails the action early.

Diagnosis:

Check the failed action error:
```
curl -s https://<icepack-host>/jobs/<job-id> \
  | jq '.results[]? | select(.success == false) | {action, error}'
```
The error names the offending property and the validation rule it violated.

Inspect the table’s current Spark SQL override properties:

```
curl -s https://<icepack-host>/tables/<database>/<table>/maintenance/recommendation \
  | jq '.policy.icepack_config | with_entries(select(.key | startswith("spark.sql.")))'
```

Resolution: Fix or remove the malformed table property via Spark SQL:

```
ALTER TABLE lakehouse_<env>.<database>.<table>
  SET TBLPROPERTIES ('icepack.spark.sql.<key>' = '<valid-value>');
```

Valid value shapes: -1 (disable sentinel, only for autoBroadcastJoinThreshold keys), byte sizes like 512m or 2g, and true/false for boolean keys. See the table properties reference for the full allowlist.
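A candidate value can be checked against these shapes before it is written to the table. The sketch below is not Icepack's validator. The key list is a hypothetical subset taken from the broadcast example on this page (the real allowlist lives in the table properties reference), and the byte-size grammar of digits followed by k, m, g, or t is an assumption generalized from 512m and 2g.

```python
import re
from typing import Optional

PREFIX = "icepack."

# Hypothetical subset of the allowlist, taken from the broadcast example on this page.
KEY_KINDS = {
    "spark.sql.autoBroadcastJoinThreshold": "bytes_or_disable",
    "spark.sql.adaptive.autoBroadcastJoinThreshold": "bytes_or_disable",
    "spark.sql.join.preferSortMergeJoin": "boolean",
    "spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold": "bytes",
}

# Assumed byte-size grammar: digits followed by a unit, for example 512m or 2g.
BYTE_SIZE = re.compile(r"^\d+[kmgt]$")


def validate_override(prop: str, value: str) -> Optional[str]:
    """Return None when the value looks valid, else a short error message."""
    key = prop[len(PREFIX):] if prop.startswith(PREFIX) else prop
    kind = KEY_KINDS.get(key)
    if kind is None:
        return f"{prop}: key is not in this illustrative allowlist"
    v = value.strip().lower()
    if kind == "boolean":
        if v in ("true", "false"):
            return None
        return f"{prop}: expected true or false, got {value!r}"
    if kind == "bytes_or_disable" and v == "-1":
        return None  # -1 is only accepted for autoBroadcastJoinThreshold keys
    if BYTE_SIZE.match(v):
        return None
    return f"{prop}: expected a byte size such as 512m or 2g, got {value!r}"


print(validate_override("icepack.spark.sql.autoBroadcastJoinThreshold", "-1"))
print(validate_override("icepack.spark.sql.join.preferSortMergeJoin", "yes"))
```

The first call returns None and the second returns an error string, which is the same early failure the maintenance action reports.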

409 Conflict on maintenance submit

Symptom: POST /tables/{db}/{table}/maintenance returns 409 Conflict with the message “Maintenance already in progress for {db}.{table}”.

Cause: Another job currently holds the lock for this table. Icepack uses a table_locks table to ensure only one maintenance job runs per table at a time. The lock is acquired when a job is submitted and released when the job completes, fails, or is cancelled.

Diagnosis:

Check who holds the lock:

```
SELECT table_key, holder, acquired_at, expires_at
FROM table_locks
WHERE table_key = '<database>.<table>';
```

Check the status of the holding job:

```
curl -s https://<icepack-host>/jobs/<holder-job-id> | jq .status
```

Resolution: If the holding job is still running, wait for it to complete or cancel it with POST /jobs/{id}/cancel. If the lock has expired, the next submission for the same table can take it over atomically. If a terminal job still appears to hold a non-expired lock, inspect the worker/API logs before manually deleting the row.
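The resolution steps amount to a small decision table over the holder job's status and the lock's expires_at. The sketch below restates them in code. The status strings and the returned labels are hypothetical, since this page only says that a job completes, fails, or is cancelled.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names for the terminal job states described above.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def next_step(holder_status: str, expires_at: datetime, now: datetime) -> str:
    """Suggest what to do about a 409, following the order of the text above."""
    if holder_status not in TERMINAL_STATUSES:
        return "wait_or_cancel"  # the job is still running: wait, or cancel it
    if expires_at <= now:
        return "resubmit"        # an expired lock is taken over by the next submission
    return "inspect_logs"        # terminal job with a live lock: check worker/API logs


now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
print(next_step("running", now + timedelta(minutes=10), now))
print(next_step("failed", now - timedelta(minutes=1), now))
print(next_step("completed", now + timedelta(minutes=10), now))
```

Only the last case calls for manual intervention, and even then the logs come before deleting the row.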

Stale table cache

Symptom: The API returns outdated table lists, or newly onboarded tables are not appearing in API responses.

Cause: The table cache is populated by the API’s TableCacheSyncWorker, which refreshes on ICEPACK_TABLE_CACHE_REFRESH_SECONDS (300 seconds by default). If the worker cannot list the catalog or cannot write to Postgres, the cache may become stale and /readyz will fail once the staleness window is exceeded.

Diagnosis:

Check the cache status endpoint for the last sync timestamp:

```
curl -s https://<icepack-host>/tables/cache-status | jq .
```

The response includes:

```
{
  "last_synced": "2026-04-25T12:15:00+00:00",
  "table_count": 142
}
```

If last_synced is older than the configured staleness window, the API table-cache sync is failing or blocked.
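The comparison can be scripted against the cache-status response. This is a minimal sketch: the function name is hypothetical, and the 900-second window in the example is an illustrative value only, so substitute the staleness setting configured for your deployment.

```python
from datetime import datetime, timedelta, timezone


def cache_is_stale(last_synced: str, staleness_seconds: int, now: datetime) -> bool:
    """Return True when last_synced is older than the staleness window."""
    synced_at = datetime.fromisoformat(last_synced)  # ISO 8601 with UTC offset
    return now - synced_at > timedelta(seconds=staleness_seconds)


last_synced = "2026-04-25T12:15:00+00:00"  # value from the sample response above
print(cache_is_stale(last_synced, 900, datetime(2026, 4, 25, 12, 20, tzinfo=timezone.utc)))
print(cache_is_stale(last_synced, 900, datetime(2026, 4, 25, 12, 45, tzinfo=timezone.utc)))
```

In practice `now` would be `datetime.now(timezone.utc)`; fixed timestamps are used here so the example is reproducible.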

Resolution:

Check API readiness and logs:

```
kubectl get pods -n icepack -l app.kubernetes.io/component=api
kubectl logs -n icepack -l app.kubernetes.io/component=api --tail=100
```

Check for table_cache_sync_failed events, catalog authentication failures, and Postgres connection errors.

Confirm the configured cache refresh and staleness settings:

```
helm get values icepack -n icepack | grep -E 'tableCacheRefreshSeconds|tableCacheStalenessSeconds'
```

Common causes include Polaris/Glue API throttling, catalog credential failures, or Postgres connection failures. Fix the underlying issue and the next API sync cycle will repopulate the cache.