Architecture¶
This document describes the runtime architecture of django-ray and how work moves from Django to Ray and back.
System Overview¶
django-ray integrates Django Tasks with Ray using a database-backed control plane:
- Django app code enqueues tasks through Django's Tasks API.
django-raypersists execution metadata in the database.django_ray_workerclaims work, runs it (sync/Ray Core/Ray Job), and reconciles status.- Ray executes task callables in local or cluster compute environments.
Request-To-Execution Flow¶
- App code calls
.enqueue(...)on a Django task. - Backend stores a
RayTaskExecutionrow inQUEUEDstate. - Worker claims eligible rows and marks them
RUNNING. - Worker submits task execution in the selected mode.
- Worker reconciles completion and stores success/failure details.
- Retry policy may requeue
FAILEDorLOSTtasks until attempts are exhausted.
Runtime Components¶
Django Application Processes¶
- Enqueue tasks.
- Read task status/results via Django Tasks and admin/API views.
Worker Process (django_ray_worker)¶
- Claims tasks from DB.
- Maintains worker lease heartbeats.
- Maintains task-monitor heartbeats for in-flight work it is actively reconciling.
- Submits and reconciles execution in sync/Ray Core/Ray Job modes.
- Applies retry policy and stuck-task/orphan recovery.
Ray Runtime¶
- Executes submitted functions.
- Returns completion state and result/error payloads.
Database¶
- Canonical source of truth for task lifecycle state.
- Stores worker leases for cross-worker coordination.
Data Model¶
RayTaskExecution¶
Primary execution record for one task attempt chain.
| Field | Notes |
|---|---|
id |
BigAutoField primary key |
task_id |
Django task identifier |
callable_path |
Dotted import path for callable |
queue_name |
Queue used for claim/execution |
state |
QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED, CANCELLING, LOST |
attempt_number |
Current attempt counter |
args_json, kwargs_json |
Serialized call arguments |
result_data |
Inline JSON result when under size limit |
result_reference |
Pointer used when result exceeds MAX_RESULT_SIZE_BYTES (digest, filesystem, s3, gcs) |
error_message, error_traceback |
Failure metadata |
ray_job_id, ray_address |
Runner-specific execution handle metadata |
claimed_by_worker |
Worker lease owner that currently owns the task |
run_after |
Delayed/retry scheduling timestamp |
timeout_seconds |
Per-task timeout override |
created_at, started_at, finished_at, last_heartbeat_at |
Lifecycle timestamps |
TaskWorkerLease¶
Worker coordination record used to detect dead/inactive workers.
| Field | Notes |
|---|---|
worker_id |
Primary key identifier for worker process |
hostname, pid |
Worker identity details |
queue_name |
Informational queue assignment |
started_at, last_heartbeat_at, stopped_at |
Lease timing |
is_active |
Active/inactive lease state |
Task State Model¶
QUEUED -> RUNNING -> SUCCEEDED
QUEUED -> CANCELLED
RUNNING -> CANCELLING -> CANCELLED
RUNNING -> FAILED
RUNNING -> LOST
FAILED/LOST -> QUEUED (if retry policy allows)
Notes:
- Retries increment
attempt_numberand setrun_afterbackoff. - Terminal failure happens after retry policy exhaustion.
Delivery Semantics¶
django-ray provides at-least-once execution semantics for retryable work. A task can be
executed more than once when a worker, Ray worker, Ray head, network connection, or process dies
after user code has performed side effects but before django-ray records the successful result.
For side-effecting tasks, use an application-level idempotency key such as the Django task id, an
order id, or another operation id guarded by a unique constraint in the system being changed. Keep
external effects such as payments, email sends, webhooks, and third-party mutations idempotent or
split them into a deduplicated commit step. SUCCEEDED means the final observed outcome succeeded;
it does not prove that every earlier execution attempt had no side effects.
Worker Loop¶
while running:
renew_worker_lease()
claim_due_queued_tasks()
submit_claimed_tasks()
reconcile_in_flight_tasks()
detect_stuck_and_orphaned_running_tasks()
sleep(poll_interval)
Execution Adapters¶
Sync mode¶
- Executes callable in worker process.
- Useful for development/testing.
Ray Core mode¶
- Uses Ray remote execution directly.
- Lower submission overhead.
Ray Job mode¶
- Uses Ray Job Submission API.
- Worker submits a payload transport command:
python -m django_ray.runtime.entrypoint --payload-b64 <...>- Payload is URL-safe base64 JSON containing callable path and serialized args/kwargs.
- Workers can adopt orphaned persisted Ray Job handles from inactive workers and continue reconciliation instead of immediately retrying duplicate work.
Entrypoint Contract¶
django_ray.runtime.entrypoint:
- Bootstraps Django in Ray runtime.
- Decodes task payload.
- Imports callable and executes it.
- Returns JSON result envelope:
successresulterrortracebackexception_type
Reliability Controls¶
- Unified retry policy with denylist support (short and fully-qualified exception names).
- Worker lease heartbeat + cross-worker orphan recovery.
- Task monitor heartbeats for active reconciliation paths.
- Stuck/timeout detection with loss handling and retry path.
- Startup settings validation fail-fast by default, with migration/bootstrap bypass controls.
- Result size enforcement with configurable oversized-result backends (
digest,filesystem,s3,gcs). - Backend result retrieval rehydrates
result_referencepayloads for retrievable backends.
Observability Surfaces¶
- Django admin for task/lease inspection and operations.
- Worker logs for claim/submit/reconcile/retry events.
- Optional metrics/API surfaces in
testprojectexample app.