Operator Runbook¶

This runbook is for operators running django_ray_worker in staging/production.

It focuses on:

fast incident triage,
safe manual recovery actions,
expected metrics and alert signals.

Scope¶

This covers the django-ray library runtime:

RayTaskExecution task lifecycle rows,
TaskInputPayload durable-input registry and cleanup tombstones,
normalized workflow-progress topology/detail retention,
TaskWorkerLease worker heartbeat/coordination rows,
django_ray_worker process behavior.

It does not cover custom business task logic internals.

Quick Triage Checklist¶

Check worker process health (django_ray_worker logs, pod/process status).
Check task state distribution (QUEUED, RUNNING, FAILED, LOST, CANCELLING).
Check active worker leases and heartbeat freshness.
Check Ray connectivity from workers.
Check whether failures are retrying or already terminal.
For input failures, verify the configured backend, object access, and registry state.
For workflow-progress cleanup failures, inspect the bounded cleanup_error code on the retained run or pending manifest without copying progress payloads into logs.

Primary Signals¶

From /api/metrics in the example project:

django_ray_tasks_total{state="..."}
django_ray_tasks_queued
django_ray_tasks_running
django_ray_queue_depth{queue="..."}
django_ray_queue_wait_seconds_*
django_ray_claim_latency_seconds_*
django_ray_execution_duration_seconds_*
django_ray_retries_recorded
django_ray_failures_recorded
django_ray_timeouts_recorded
django_ray_worker_leases{status="..."}

The example endpoint requires its bearer token. Production deployments should use a dedicated authenticated scrape identity or a network-restricted adapter. Queue series appear only for the application's explicit allowlist.

Expected behavior:

steady state: queued remains near baseline while tasks complete.
incident signal: queued rises while running stays near zero.
incident signal: running grows and does not drain.
incident signal: failed/lost rises quickly with the same callable path.
incident signal: stale leases rise while claim latency and queue depth increase.
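
As a rough illustration of the first incident signal, the sketch below compares two scrapes of the metrics text and flags a backlog that grows while nothing is running. It assumes django_ray_tasks_queued and django_ray_tasks_running are exposed as unlabelled samples in the Prometheus text format. The function names and the thresholds (min_growth, running_floor) are hypothetical and would need tuning per deployment.

```python
import re

# One sample line in the Prometheus text format: name, optional {labels}, value.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_gauges(text):
    """Return {metric_name: value} for unlabelled samples; comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        # Labelled series (for example per-queue depth) are ignored in this sketch.
        if match and match.group(2) is None:
            values[match.group(1)] = float(match.group(3))
    return values


def backlog_without_progress(previous, current, *, min_growth=10.0, running_floor=1.0):
    """True when queued grew by at least min_growth and running is below running_floor."""
    queued_delta = current.get("django_ray_tasks_queued", 0.0) - previous.get(
        "django_ray_tasks_queued", 0.0
    )
    running = current.get("django_ray_tasks_running", 0.0)
    return queued_delta >= min_growth and running < running_floor
```

In practice this comparison usually lives in an alert rule in the monitoring system; the Python form is only meant to make the condition explicit.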

Safety Model¶

Treat production tasks as at-least-once by default. django-ray can retry after app exceptions, lost worker ownership, Ray connection loss, and unknown completion state. That protects throughput and recovery, but it cannot prove that a crashed or disconnected attempt made no external side effects before disappearing.

Before enabling automatic retries on side-effecting callables, confirm the task has an idempotency key or operation table that makes duplicate execution harmless. For payments, emails, webhooks, or external writes, prefer a deduplicated commit record over relying on task attempt counts.
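
A minimal sketch of a deduplicated commit record follows. It uses an in-memory SQLite table only to stay self-contained; the table name, the key format, and the run_once helper are hypothetical and are not part of django-ray.

```python
import sqlite3


def open_operation_table(path=":memory:"):
    """Create the (hypothetical) commit-record table keyed by idempotency key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS committed_operation ("
        " idempotency_key TEXT PRIMARY KEY,"
        " result TEXT NOT NULL)"
    )
    return conn


def run_once(conn, idempotency_key, side_effect):
    """Run side_effect unless a commit record for this key already exists."""
    row = conn.execute(
        "SELECT result FROM committed_operation WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is not None:
        # A previous attempt already committed; return its stored result.
        return row[0]
    result = side_effect()
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO committed_operation (idempotency_key, result) VALUES (?, ?)",
            (idempotency_key, result),
        )
    return result
```

A retried attempt that calls run_once with the same key skips the side effect once the record exists. A crash between the side effect and the commit can still lead to a duplicate, so where the remote system accepts an idempotency key, pass the same key to it as well.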

Useful Queries¶

-- Task counts by state
SELECT state, COUNT(*) AS count
FROM django_ray_raytaskexecution
GROUP BY state
ORDER BY state;

-- Long-running tasks (example threshold: 10 minutes)
SELECT id, task_id, callable_path, queue_name, claimed_by_worker, started_at
FROM django_ray_raytaskexecution
WHERE state = 'RUNNING'
  AND started_at < NOW() - INTERVAL '10 minutes'
ORDER BY started_at ASC;

-- Worker leases ordered by oldest heartbeat
SELECT worker_id, hostname, pid, queue_name, is_active, last_heartbeat_at
FROM django_ray_taskworkerlease
ORDER BY last_heartbeat_at ASC;

Incident Playbooks¶

1) Queue Backlog Keeps Growing¶

Symptoms:

django_ray_tasks_queued rising for several minutes.
few or no new RUNNING tasks.

Checks:

Verify at least one worker is active and healthy.
Verify workers are polling the intended queue(s).
Verify Ray connectivity when workers run in --local or --cluster mode.

Recovery:

Restart unhealthy worker processes/pods.
Confirm queue flags/settings match enqueue queue names.
If tasks are delayed by retries, inspect run_after timestamps before forcing retries.
Confirm the values of WORKER_POLL_INTERVAL_SECONDS and WORKER_POLL_MAX_INTERVAL_SECONDS. An idle worker may take up to the configured maximum polling delay to observe newly enqueued work; this is not an end-to-end task-start bound. Sustained activity resets the polling delay to the base interval.

If database load is high while queues are empty, run the PostgreSQL polling benchmark from the performance guide before tuning. Increase the maximum gradually and compare idle queries per worker-second with p95 claim latency. Do not lengthen heartbeat, timeout, or cancellation settings to reduce claim-query load; those schedules are independent safety controls.
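
The growth policy between the base and maximum polling interval is not described here. The sketch below assumes simple doubling, purely to illustrate why an idle worker can take up to the maximum delay to notice new work and why activity brings the delay back to the base. next_poll_delay is a hypothetical helper, not a django-ray function.

```python
def next_poll_delay(current, base, maximum, found_work):
    """Return the next polling delay in seconds.

    Assumption for illustration: the delay doubles after each empty poll,
    is capped at `maximum`, and returns to `base` as soon as work is found.
    """
    if found_work:
        return base
    return min(maximum, max(base, current * 2))
```

With a base of 1 second and a maximum of 10 seconds, successive empty polls under this assumption give 2, 4, 8, 10, 10, and a poll that finds work returns the delay to 1.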

2) Tasks Stuck In RUNNING¶

Symptoms:

many RUNNING rows with stale started_at/last_heartbeat_at.

Checks:

Confirm owning workers (claimed_by_worker) still have active leases.
Confirm STUCK_TASK_TIMEOUT_SECONDS and timeout_seconds are appropriate.

Recovery:

Let normal recovery run first: detect_stuck_tasks handles orphaned ownership, and Ray Job reconciliation can adopt persisted jobs from inactive workers.
If needed, requeue only clearly orphaned/failed tasks using admin retry actions.

Notes:

A fresh last_heartbeat_at can mean either the owning worker is healthy or another worker is still actively monitoring/reconciling the task.
UNKNOWN Ray Job states are intentionally allowed to age into stuck-task recovery once monitor heartbeats stop advancing.

Optional targeted SQL (use carefully):

-- Example: inspect orphan candidates first (do not update blindly)
SELECT id, task_id, callable_path, claimed_by_worker, started_at
FROM django_ray_raytaskexecution
WHERE state = 'RUNNING'
ORDER BY started_at ASC;
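
To narrow the RUNNING rows down to orphan candidates, the checks and notes above can be combined as in the following sketch. It treats a row as a candidate only when its owner has no active lease and its heartbeat is stale, since a fresh heartbeat may come from another worker that is still reconciling the task. The function, the dict-shaped rows, and the heartbeat_max_age threshold are hypothetical; the field names follow the columns mentioned in this runbook.

```python
from datetime import datetime, timedelta, timezone


def orphan_candidates(running_tasks, active_worker_ids, now, heartbeat_max_age):
    """Return ids of RUNNING rows whose owner is inactive AND whose heartbeat is stale."""
    candidates = []
    for task in running_tasks:
        owner_active = task["claimed_by_worker"] in active_worker_ids
        heartbeat = task["last_heartbeat_at"]
        heartbeat_stale = heartbeat is None or now - heartbeat > heartbeat_max_age
        if not owner_active and heartbeat_stale:
            candidates.append(task["id"])
    return candidates
```

The result is a list to review, not a list to update: normal recovery should still get the first chance to handle these rows.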

3) Retry Storm / Repeated Failures¶

Symptoms:

fast growth in FAILED + LOST.
same callable repeatedly retried.

Checks:

Identify dominant failing callable_path.
Inspect latest error_message and error_traceback.
Confirm denylist policy (RETRY_EXCEPTION_DENYLIST) for non-retryable failures.

Recovery:

Stop or scale down workers if failures are harmful/high-volume.
Fix configuration or task code root cause.
Requeue failed tasks in controlled batches after fix.
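
The first check and the last recovery step can be sketched as two small helpers: one counts failures per callable_path to find the dominant offender, the other splits a list of task ids into fixed-size batches for a controlled requeue. Both helper names and the dict-shaped rows are hypothetical; how the rows are loaded and how each batch is requeued is left to the admin actions or shell script in use.

```python
from collections import Counter
from itertools import islice


def dominant_callables(failed_rows, top=3):
    """Return the most frequent callable_path values as (path, count) pairs."""
    return Counter(row["callable_path"] for row in failed_rows).most_common(top)


def batches(ids, size):
    """Yield successive lists of at most `size` ids."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while chunk := list(islice(iterator, size)):
        yield chunk
```

Pausing between batches and checking the failure metrics before continuing keeps a bad fix from recreating the storm.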

4) Durable Input Retrieval or Cleanup Failure¶

Symptoms:

task errors report a missing, malformed, unauthorized, or corrupt input reference;
the task fails before application logs appear;
django_ray_purge_inputs records cleanup_error or exits nonzero.

Checks:

Inspect input_reference and the matching TaskInputPayload state without copying the payload into tickets or logs.
Confirm every enqueueing and worker process uses the same input backend, filesystem root, bucket, prefix, and credentials.
Verify the object exists and its access policy has not changed.
If cleanup failed, inspect cleanup_error and fix storage access before rerunning.

Recovery:

Restore the exact content-addressed object or correct storage configuration.
Use a controlled manual retry only after retrieval succeeds. Validation failures do not auto-retry; storage retrieval failures may already follow normal retry policy.
Preview retention with django_ray_purge_inputs --retention-days=30 before using --delete. Increase retention when historical manual retry or audit access is needed.

Do not edit input_reference, digest metadata, or JSON placeholders by hand. Successful cleanup retains a PURGED tombstone and execution references for audit.
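
Before retrying against a restored object, it can help to confirm that its bytes match the recorded digest. The sketch below assumes the digest is a hex-encoded SHA-256 of the raw object bytes, which this runbook does not state for inputs; check the registry's digest metadata before relying on it. The helper names are hypothetical.

```python
import hashlib
import hmac


def sha256_chunks(chunks):
    """Return the hex SHA-256 of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path, chunk_size=1024 * 1024):
    """Yield a file's contents in chunks so large objects are not loaded at once."""
    with open(path, "rb") as handle:
        while block := handle.read(chunk_size):
            yield block


def matches_digest(chunks, expected_hex):
    """Compare the computed digest with the expected one, ignoring hex case."""
    return hmac.compare_digest(sha256_chunks(chunks), expected_hex.strip().lower())
```

For a restored file, matches_digest(file_chunks(path), expected_hex) returns False on any byte difference, in which case the object is not the exact content-addressed original and a retry would be expected to fail again.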

5) Workflow Progress Retention or Orphan Cleanup Failure¶

Symptoms:

expired normalized detail continues to consume database storage;
unpublished topology candidates or unreferenced pages remain for more than one hour;
inactive run-storage shells remain after their last candidate or orphan page is gone;
django_ray_cleanup_workflow_progress --delete exits nonzero and a retained run or pending manifest has cleanup_error.

Checks:

Preview one bounded pass with django_ray_cleanup_workflow_progress.
Confirm terminal detail has detail_retention_days and detail_expires_at recorded; cleanup does not invent or override the lifecycle-owned retention deadline.
Treat an exact active task identity as protected even if a malformed or manually altered expiry appears to be in the past.
Inspect the bounded cleanup code and exception type. Exception messages are intentionally redacted; use database and application logs under the deployment's normal secret-handling policy for deeper diagnosis.

Recovery:

Correct the database, lock-timeout, or integrity condition reported by operations.
Rerun the dry run, then run django_ray_cleanup_workflow_progress --batch-size=100 --delete.
Repeat bounded passes until zero eligible items remain. One failed item does not prevent later candidates in a pass, and clean candidates are processed before retry rows so a permanent oldest failure cannot starve later work. Any failure still makes that command invocation exit nonzero.

Cleanup removes expired run-owned detail or old unpublished orphans only. It does not rewrite the task-row or TaskAttempt summary, task state, or terminal outcome. Do not manually delete current manifests or referenced pages; their references are part of the atomic publication and integrity contract.

Apply migration 0013_workflow_progress_detail_storage before scheduling the command. It is safe to establish the cleanup schedule while the schema-v3 runtime writer remains disabled for the reader-first rollout, #79 live-ingestion bound, and #142 composite preparation. Start with dry-run monitoring, then use bounded --delete passes at a cadence appropriate to database growth and the configured retention window. Alert on a nonzero exit and on an eligible count that does not fall across repeated successful passes.
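
The second alert condition, an eligible count that does not fall across repeated successful passes, can be expressed as a small check over a history of counts. How the counts are collected (for example from dry-run monitoring) is deployment-specific and not shown; cleanup_is_stalled and its window parameter are hypothetical.

```python
def cleanup_is_stalled(eligible_counts, window=3):
    """True when the last `window` successful passes show no net decrease.

    `eligible_counts` is the eligible-item count recorded after each
    successful pass, oldest first. A count of zero is never stalled.
    """
    if window < 2 or len(eligible_counts) < window:
        return False
    recent = eligible_counts[-window:]
    return recent[-1] > 0 and recent[-1] >= recent[0]
```

A nonzero exit should still alert on its own; this check only covers passes that succeed without making progress.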

A manifest, digest, aggregate, or node-key integrity error during publication is a fail-closed storage fault. Do not manually promote the pending manifest or advance the task summary pointer. Preserve the bounded diagnostic and relevant database logs, stop the affected producer if it is repeatedly retrying, and investigate the exact run identity before allowing a fresh publication.

Schedule django_ray_audit_workflow_progress for exact runs selected by operational policy, and run it before attempting manual storage recovery. The audit is read-only: it locks the task, then the exact run, verifies the current topology and every bounded latest-state row, and exits nonzero without changing the corrupt evidence. Supply all four identity fields from the summary or retained attempt record; do not substitute only the current task primary key when auditing an older retained run. An audit reads no more than the protocol limit plus one detail row and is intentionally separate from the sparse publication path.

6) Ray Connection Loss¶

Symptoms:

worker logs contain reconnect warnings/errors.
tasks fail with Ray connection errors.

Checks:

Verify RAY_ADDRESS and network reachability.
Verify Ray head/dashboard/cluster health.
Verify worker mode (--local, --cluster, default runner mode).

Recovery:

Restore Ray cluster/network.
Restart workers after Ray is healthy.
Verify pending tasks move back to QUEUED/RUNNING.
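
For the reachability check, a plain TCP connection attempt from the worker host is often enough to separate network problems from Ray-level problems. The sketch below assumes RAY_ADDRESS has the form host:port, optionally with a scheme prefix such as ray://; values such as auto, and bracketed IPv6 addresses, are not handled. Both helper names are hypothetical.

```python
import socket


def parse_host_port(address):
    """Split 'host:port' or 'scheme://host:port' into (host, port)."""
    _, separator, rest = address.partition("://")
    host_port = rest if separator else address
    host, _, port = host_port.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connection only shows that the port is reachable; it does not prove the Ray cluster behind it is healthy, so the cluster health check above still applies.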

7) Cancellation Stuck In CANCELLING¶

Symptoms:

rows remain CANCELLING for too long.

Checks:

Confirm the owning worker is still running.
Confirm cancellation processing runs in the worker loop.

Recovery:

Restart worker if cancellation loop is stalled.
After restart, verify CANCELLING -> CANCELLED transitions complete.

8) Oversized Results¶

Symptoms:

successful tasks with result_data = NULL and populated result_reference.

Interpretation:

this is expected when result payload exceeds MAX_RESULT_SIZE_BYTES.
the reference format indicates the backend:
  oversize://sha256/... -> digest-only pointer (no external payload)
  resultfs://sha256/... -> filesystem-backed payload reference
  s3://... -> S3/object-storage payload reference
  gs://... -> GCS payload reference

Recovery:

If this is unexpected, reduce result payload size in task design.
If retrieval is required, configure a retrievable backend:
  RESULT_STORAGE_BACKEND="filesystem" with RESULT_STORAGE_FILESYSTEM_PATH=<shared path>
  RESULT_STORAGE_BACKEND="s3" with bucket/config and working credentials
  RESULT_STORAGE_BACKEND="gcs" with bucket/config and working credentials
RayTaskBackend.get_result() can rehydrate resultfs://..., s3://..., and gs://... references when the reader has matching storage configuration. oversize://... digest references remain metadata-only.
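
When triaging many rows, the scheme prefix alone tells whether a payload can be fetched at all. The sketch below maps the four reference formats listed above to a backend label and a retrievable flag. The table and function names are hypothetical, and the function only inspects the string; it does not contact any storage.

```python
# Scheme -> (backend label, payload retrievable?), following the list above.
REFERENCE_BACKENDS = {
    "oversize": ("digest-only pointer", False),
    "resultfs": ("filesystem", True),
    "s3": ("S3/object storage", True),
    "gs": ("GCS", True),
}


def describe_reference(result_reference):
    """Return (backend, retrievable) for a result_reference string."""
    scheme, separator, _ = result_reference.partition("://")
    if not separator or scheme not in REFERENCE_BACKENDS:
        raise ValueError(f"unrecognized result reference scheme: {scheme!r}")
    return REFERENCE_BACKENDS[scheme]
```

A retrievable flag of True still depends on the reader having matching storage configuration, as noted above.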

Safe Manual Actions¶

Prefer Django admin actions for retries/cancellations before direct SQL updates.

If scripting is necessary, use Django shell and narrow filters:

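# Run inside the Django shell. Requeues at most 100 FAILED tasks from one queue.
from django_ray.models import RayTaskExecution, TaskState

# Order by primary key so the slice selects a stable, repeatable batch.
qs = RayTaskExecution.objects.filter(
    state=TaskState.FAILED, queue_name="default"
).order_by("id")[:100]
for task in qs:
    # Clear timing, error, and ownership fields so a worker can claim the row again.
    task.state = TaskState.QUEUED
    task.started_at = None
    task.finished_at = None
    task.error_message = None
    task.error_traceback = None
    task.claimed_by_worker = None
    task.ray_job_id = None
    task.save()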

Escalation Guidance¶

Escalate to the development team when:

repeated failures persist after configuration/network fixes,
retry policy behavior differs across worker modes,
state transitions violate expected lifecycle semantics,
manual recovery requires broad database edits.
