Operator Runbook
This runbook is for operators running django_ray_worker in staging/production.
It focuses on:
- fast incident triage,
- safe manual recovery actions,
- expected metrics and alert signals.
Scope
This covers the django-ray library runtime:
- RayTaskExecution task lifecycle rows,
- TaskWorkerLease worker heartbeat/coordination rows,
- django_ray_worker process behavior.
It does not cover custom business task logic internals.
Quick Triage Checklist
- Check worker process health (django_ray_worker logs, pod/process status).
- Check task state distribution (QUEUED, RUNNING, FAILED, LOST, CANCELLING).
- Check active worker leases and heartbeat freshness.
- Check Ray connectivity from workers.
- Check whether failures are retrying or already terminal.
Primary Signals
From /api/metrics in the example project:
- django_ray_tasks_total{state="..."}
- django_ray_tasks_queued
- django_ray_tasks_running
- django_ray_queue_depth{queue="..."}
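For scripted checks, these metrics can be read from the text the endpoint returns. The sketch below assumes /api/metrics serves the Prometheus text exposition format, which this runbook does not confirm. parse_metrics and the sample payload are illustrative names, and the sample values are made up.

```python
import re

# Matches `name{label="value",...} number` or `name number`.
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return {(name, frozenset of (label, value) pairs): float} per sample line."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, labels, value = match.groups()
        key = (name, frozenset(_LABEL.findall(labels or "")))
        samples[key] = float(value)
    return samples


# Made-up sample payload, for illustration only.
SAMPLE = """\
# TYPE django_ray_tasks_queued gauge
django_ray_tasks_queued 42
django_ray_tasks_running 0
django_ray_queue_depth{queue="default"} 40
django_ray_tasks_total{state="FAILED"} 7
"""

metrics = parse_metrics(SAMPLE)
queued = metrics[("django_ray_tasks_queued", frozenset())]
default_depth = metrics[("django_ray_queue_depth", frozenset({("queue", "default")}))]
```

Keying on a frozenset of label pairs keeps the lookup independent of label order. Sample lines that carry a trailing timestamp are skipped by this parser.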
Expected behavior:
- steady state: queued remains near baseline while tasks complete.
- incident signal: queued rises while running stays near zero.
- incident signal: running grows and does not drain.
- incident signal: failed/lost rises quickly with the same callable path.
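The queue-related signals above can be turned into a rough check over a short window of gauge samples. This is a minimal sketch: classify_queue_signal, its labels, and both thresholds are assumptions to tune per deployment, not django-ray defaults.

```python
def classify_queue_signal(queued, running, baseline=10, near_zero=1):
    """Classify a window of samples (oldest first) of the queued and running gauges.

    `baseline` and `near_zero` are placeholder thresholds (assumptions).
    """
    if len(queued) < 2 or len(running) < 2:
        return "insufficient-data"

    # Queue is above its usual level and higher than at the start of the window.
    queued_rising = queued[-1] > queued[0] and queued[-1] > baseline
    # Workers picked up (almost) nothing during the whole window.
    running_idle = max(running) <= near_zero
    # Running count never dropped and ended higher than it started.
    running_growing = (
        all(later >= earlier for earlier, later in zip(running, running[1:]))
        and running[-1] > running[0]
    )

    if queued_rising and running_idle:
        return "backlog"
    if running_growing:
        return "running-not-draining"
    return "steady"
```

For example, classify_queue_signal([5, 20, 60], [0, 0, 1]) reports "backlog", which maps to the first incident playbook.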
Safety Model
Treat production tasks as at-least-once by default. django-ray can retry after app exceptions,
lost worker ownership, Ray connection loss, and unknown completion state. That protects throughput
and recovery, but it cannot prove that a crashed or disconnected attempt made no external side
effects before disappearing.
Before enabling automatic retries on side-effecting callables, confirm the task has an idempotency key or operation table that makes duplicate execution harmless. For payments, emails, webhooks, or external writes, prefer a deduplicated commit record over relying on task attempt counts.
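One way to make a duplicate attempt harmless is to record each operation under a unique key and skip the side effect when the key already exists. The sketch below uses an in-memory SQLite table to stay self-contained. The completed_operation table, send_once, and the sent list (which stands in for an external provider) are hypothetical names, not part of django-ray.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE completed_operation ("
    " idempotency_key TEXT PRIMARY KEY,"
    " status TEXT NOT NULL)"
)

sent = []  # stands in for the external side effect (e.g. an email provider)


def send_once(idempotency_key, recipient):
    """Perform the side effect only if this key has not been recorded yet."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO completed_operation VALUES (?, ?)",
                (idempotency_key, "done"),
            )
            sent.append(recipient)  # only reached when the key was new
    except sqlite3.IntegrityError:
        return "duplicate"  # an earlier attempt already recorded this key
    return "sent"
```

A retried attempt that calls send_once with the same key gets "duplicate" and performs no second send. A crash between the side effect and the commit can still lead to a repeat, so where the external system accepts an idempotency key, passing the same key through to it closes that gap.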
Useful Queries
-- Task counts by state
SELECT state, COUNT(*) AS count
FROM django_ray_raytaskexecution
GROUP BY state
ORDER BY state;
-- Long-running tasks (example threshold: 10 minutes)
SELECT id, task_id, callable_path, queue_name, claimed_by_worker, started_at
FROM django_ray_raytaskexecution
WHERE state = 'RUNNING'
AND started_at < NOW() - INTERVAL '10 minutes'
ORDER BY started_at ASC;
-- Worker leases ordered by oldest heartbeat
SELECT worker_id, hostname, pid, queue_name, is_active, last_heartbeat_at
FROM django_ray_taskworkerlease
ORDER BY last_heartbeat_at ASC;
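Rows returned by the lease query can be screened for stale heartbeats in a few lines. This sketch takes each row as a dict keyed by the column names used above. stale_leases is an illustrative helper, and the 60-second max_age is an assumed threshold that should match your heartbeat settings.

```python
from datetime import datetime, timedelta, timezone


def stale_leases(leases, now, max_age=timedelta(seconds=60)):
    """Return worker_ids of active leases whose last heartbeat is older than max_age."""
    return [
        lease["worker_id"]
        for lease in leases
        if lease["is_active"] and now - lease["last_heartbeat_at"] > max_age
    ]


# Made-up rows, for illustration only.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    {"worker_id": "w-1", "is_active": True, "last_heartbeat_at": now - timedelta(seconds=5)},
    {"worker_id": "w-2", "is_active": True, "last_heartbeat_at": now - timedelta(minutes=10)},
    {"worker_id": "w-3", "is_active": False, "last_heartbeat_at": now - timedelta(hours=2)},
]
stale = stale_leases(rows, now)
```

Here only w-2 is reported: w-1 is fresh, and w-3 is already marked inactive.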
Incident Playbooks
1) Queue Backlog Keeps Growing
Symptoms:
- django_ray_tasks_queued rising for several minutes.
- few or no new RUNNING tasks.
Checks:
- Verify at least one worker is active and healthy.
- Verify workers are polling the intended queue(s).
- Verify Ray connectivity when running in --local or --cluster mode.
Recovery:
- Restart unhealthy worker processes/pods.
- Confirm queue flags/settings match enqueue queue names.
- If tasks are delayed by retries, inspect run_after timestamps before forcing retries.
2) Tasks Stuck In RUNNING
Symptoms:
- many RUNNING rows with stale started_at/last_heartbeat_at.
Checks:
- Confirm owning workers (claimed_by_worker) still have active leases.
- Confirm STUCK_TASK_TIMEOUT_SECONDS and timeout_seconds are appropriate.
Recovery:
- Let normal recovery run first (detect_stuck_tasks handles orphaned ownership, and Ray Job reconciliation can adopt persisted jobs from inactive workers).
- If needed, requeue only clearly orphaned/failed tasks using admin retry actions.
Notes:
- A fresh last_heartbeat_at can mean either the owning worker is healthy or another worker is still actively monitoring/reconciling the task.
- UNKNOWN Ray Job states are intentionally allowed to age into stuck-task recovery once monitor heartbeats stop advancing.
Optional targeted SQL (use carefully):
-- Example: inspect orphan candidates first (do not update blindly)
SELECT id, task_id, callable_path, claimed_by_worker, started_at
FROM django_ray_raytaskexecution
WHERE state = 'RUNNING'
ORDER BY started_at ASC;
3) Retry Storm / Repeated Failures
Symptoms:
- fast growth in FAILED + LOST.
- same callable repeatedly retried.
Checks:
- Identify dominant failing callable_path.
- Inspect latest error_message and error_traceback.
- Confirm denylist policy (RETRY_EXCEPTION_DENYLIST) for non-retryable failures.
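Finding the dominant failing callable is a simple count over FAILED and LOST rows. This sketch assumes the rows have already been fetched as dicts with state and callable_path keys. dominant_callables is an illustrative helper, and the callable paths in the sample are made up.

```python
from collections import Counter


def dominant_callables(rows, top=3):
    """Return (callable_path, count, share) for the most frequent failing callables."""
    counts = Counter(
        row["callable_path"] for row in rows if row["state"] in {"FAILED", "LOST"}
    )
    total = sum(counts.values())
    # most_common() is empty when nothing failed, so total is never used as a zero divisor.
    return [(path, n, n / total) for path, n in counts.most_common(top)]


# Made-up rows, for illustration only.
rows = [
    {"state": "FAILED", "callable_path": "app.tasks.send_email"},
    {"state": "FAILED", "callable_path": "app.tasks.send_email"},
    {"state": "LOST", "callable_path": "app.tasks.send_email"},
    {"state": "FAILED", "callable_path": "app.tasks.resize_image"},
    {"state": "QUEUED", "callable_path": "app.tasks.resize_image"},
]
ranking = dominant_callables(rows)
```

A single callable holding most of the share points at a task-level or configuration fault rather than a platform-wide one.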
Recovery:
- Stop or scale down workers if failures are harmful/high-volume.
- Fix configuration or task code root cause.
- Requeue failed tasks in controlled batches after fix.
4) Ray Connection Loss
Symptoms:
- worker logs contain reconnect warnings/errors.
- tasks fail with Ray connection errors.
Checks:
- Verify RAY_ADDRESS and network reachability.
- Verify Ray head/dashboard/cluster health.
- Verify worker mode (--local, --cluster, default runner mode).
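A plain TCP connect from the worker host is a quick first reachability test. The sketch below assumes RAY_ADDRESS is written as host:port or scheme://host:port. Other values, such as auto, are rejected. ray_endpoint and can_connect are illustrative helpers, and a successful connect only shows the port is open, not that Ray is healthy.

```python
import socket
from urllib.parse import urlsplit


def ray_endpoint(address):
    """Split 'scheme://host:port' or 'host:port' into (host, port)."""
    # Prefix bare host:port with '//' so urlsplit treats it as a network location.
    parts = urlsplit(address if "://" in address else f"//{address}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"cannot extract host and port from {address!r}")
    return parts.hostname, parts.port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False
```

Typical use from a worker host is can_connect(*ray_endpoint(os.environ["RAY_ADDRESS"])), run before and after restoring the cluster.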
Recovery:
- Restore Ray cluster/network.
- Restart workers after Ray is healthy.
- Verify pending tasks move back to QUEUED/RUNNING.
5) Cancellation Stuck In CANCELLING
Symptoms:
- rows remain CANCELLING for too long.
Checks:
- Confirm owning worker is still running.
- Confirm cancellation processing runs in worker loop.
Recovery:
- Restart worker if cancellation loop is stalled.
- After restart, verify CANCELLING -> CANCELLED transitions complete.
6) Oversized Results
Symptoms:
- successful tasks with result_data = NULL and populated result_reference.
Interpretation:
- this is expected when result payload exceeds MAX_RESULT_SIZE_BYTES.
- reference format indicates backend:
    - oversize://sha256/... -> digest-only pointer (no external payload)
    - resultfs://sha256/... -> filesystem-backed payload reference
    - s3://... -> S3/object-storage payload reference
    - gs://... -> GCS payload reference
Recovery:
- If this is unexpected, reduce result payload size in task design.
- If retrieval is required, configure a retrievable backend:
    - RESULT_STORAGE_BACKEND="filesystem" with RESULT_STORAGE_FILESYSTEM_PATH=<shared path>
    - RESULT_STORAGE_BACKEND="s3" with bucket/config and working credentials
    - RESULT_STORAGE_BACKEND="gcs" with bucket/config and working credentials
- RayTaskBackend.get_result() can rehydrate resultfs://..., s3://..., and gs://... references when the reader has matching storage configuration.
- oversize://... digest references remain metadata-only.
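When auditing many rows, the scheme prefix of result_reference is enough to tell whether a payload can be fetched at all. The mapping below restates the reference formats listed in this section. describe_reference and its return shape are illustrative, and the sample digests and paths are made up.

```python
# scheme -> (backend label, whether a payload exists to retrieve)
REFERENCE_SCHEMES = {
    "oversize": ("digest-only", False),
    "resultfs": ("filesystem", True),
    "s3": ("s3", True),
    "gs": ("gcs", True),
}


def describe_reference(result_reference):
    """Return (backend, retrievable) based on the reference's scheme prefix."""
    scheme, separator, _rest = result_reference.partition("://")
    if not separator:
        return ("unknown", False)  # no scheme present at all
    return REFERENCE_SCHEMES.get(scheme, ("unknown", False))


# Made-up references, for illustration only.
examples = [
    "oversize://sha256/abc123",
    "resultfs://sha256/abc123",
    "s3://example-bucket/results/abc123",
]
described = [describe_reference(ref) for ref in examples]
```

Rows reported as not retrievable carry only a digest, so recovering their payload means rerunning the task with a retrievable backend configured.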
Safe Manual Actions
Prefer Django admin actions for retries/cancellations before direct SQL updates.
If scripting is necessary, use Django shell and narrow filters:
from django_ray.models import RayTaskExecution, TaskState

# Keep the filter narrow (one state, one queue) and cap the batch at 100 rows.
qs = RayTaskExecution.objects.filter(state=TaskState.FAILED, queue_name="default")[:100]
for task in qs:
    # Reset execution bookkeeping so a worker can claim the task again.
    task.state = TaskState.QUEUED
    task.started_at = None
    task.finished_at = None
    task.error_message = None
    task.error_traceback = None
    task.claimed_by_worker = None
    task.ray_job_id = None
    task.save()
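When more than one pass is needed, working through primary keys in fixed-size chunks keeps each pass small and leaves room to check metrics in between. This is a generic sketch: chunked is an illustrative helper, and the batch size is an assumption.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up primary keys, for illustration only.
failed_ids = list(range(1, 8))
batches = list(chunked(failed_ids, 3))

# In a Django shell, each batch could then narrow the queryset, for example:
#   RayTaskExecution.objects.filter(pk__in=batch, state=TaskState.FAILED)
```

Re-checking the state filter inside each batch avoids touching rows that changed state after the id list was collected.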
Escalation Guidance
Escalate to the development team when:
- repeated failures persist after configuration/network fixes,
- retry policy behavior differs across worker modes,
- state transitions violate expected lifecycle semantics,
- manual recovery requires broad database edits.