Operational Records for Long-Running Experiments

Long runs need an operating record

Long-running experiments usually fail in ordinary ways. A script is missing on the worker, a checkpoint path is wrong, the environment differs from the developer machine, or a worker disappears after the run has already started. After the failure, hours can disappear into reconstructing what happened.

The control layer turns each run into an operating record. The record starts before launch, follows the job through placement and execution, and remains useful after the result has moved into a report or artifact store.

Scheduler ownership

Experiment logic belongs in the experiment script. The scheduler has a narrower job: accept well-formed work, reject broken requests early, place work on a capable target, track state, and preserve the evidence needed to review the result.

Validate before launch. Check scripts, configuration shape, checkpoint paths, hardware requirements, and target environment before compute is committed.

Place work deliberately. Match CPU and GPU jobs to workers by memory profile, accelerator class, runtime tags, software dependencies, and data locality.

Preserve traceability. Record code revision, config, dataset references, environment, worker, artifacts, and terminal state.

Fail explicitly. Lost workers, blocked validation, exhausted retries, and operator cancels land in named states with readable reasons.

Validation before launch

Submit-time validation catches a large class of avoidable failures. The worker image can be checked for required files, the config can be parsed, the checkpoint can be resolved, and the requested runtime can be matched against worker capabilities before a job starts.

This changes how operators use the system. They can submit exploratory work knowing the queue will reject a malformed run with a specific reason, instead of letting it occupy hardware and fail later with a stack trace buried in logs.
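The submit-time checks described above can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual interface: `JobRequest`, `validate_request`, the field names, and the choice of JSON for the config are all assumptions made for the example.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class JobRequest:
    # Hypothetical request shape; field names are illustrative only.
    script: Path
    config_path: Path
    checkpoint: Optional[Path] = None
    required_tags: set[str] = field(default_factory=set)


def validate_request(req: JobRequest, worker_tags: set[str]) -> list[str]:
    """Return readable rejection reasons; an empty list means launchable."""
    reasons = []

    # The script must exist where the worker will look for it.
    if not req.script.is_file():
        reasons.append(f"script not found: {req.script}")

    # The config must be readable and parse into a mapping.
    try:
        config = json.loads(req.config_path.read_text(encoding="utf-8"))
        if not isinstance(config, dict):
            reasons.append("config must be a JSON object")
    except (OSError, ValueError) as exc:
        reasons.append(f"config unusable: {exc}")

    # A requested checkpoint must resolve before compute is committed.
    if req.checkpoint is not None and not req.checkpoint.exists():
        reasons.append(f"checkpoint not found: {req.checkpoint}")

    # The requested runtime must match the worker's capabilities.
    missing = req.required_tags - worker_tags
    if missing:
        reasons.append(f"worker lacks runtime tags: {sorted(missing)}")

    return reasons
```

Returning every reason at once, rather than stopping at the first, lets an operator correct a request in a single pass.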

Placement and capacity

Placement is explicit enough that an operator can review it later. The record shows why a job landed where it did: accelerator class, memory headroom, runtime tags, software environment, queue state, and any locality constraint that mattered.

Specialized hardware needs an explicit reservation policy. If lightweight work requests a scarce target, the scheduler either records the placement as a visible policy decision or refuses it until the request is corrected.
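One way to make placement reviewable is to return the decision trail together with the chosen worker. The sketch below assumes hypothetical `Worker` and `Job` shapes, an `allow_scarce` flag standing in for the reservation policy, and a best-fit rule (smallest memory headroom) that is only one of several reasonable choices.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    accelerator: str          # accelerator class label, e.g. "cpu" or "gpu"
    free_memory_gb: float
    tags: frozenset[str]
    scarce: bool = False      # hypothetical marker for specialized hardware


@dataclass(frozen=True)
class Job:
    accelerator: str
    memory_gb: float
    tags: frozenset[str]
    allow_scarce: bool = False  # explicit reservation of a scarce target


def place(job: Job, workers: list[Worker]) -> tuple[Worker | None, list[str]]:
    """Pick a worker and return the reasons for every accept or reject."""
    trail = []
    candidates = []
    for w in workers:
        if w.accelerator != job.accelerator:
            trail.append(f"{w.name}: skipped, accelerator class {w.accelerator}")
        elif w.free_memory_gb < job.memory_gb:
            trail.append(f"{w.name}: skipped, only {w.free_memory_gb:.1f} GB free")
        elif not job.tags <= w.tags:
            trail.append(f"{w.name}: skipped, missing tags {sorted(job.tags - w.tags)}")
        elif w.scarce and not job.allow_scarce:
            trail.append(f"{w.name}: refused, scarce target needs explicit reservation")
        else:
            candidates.append(w)

    if not candidates:
        return None, trail

    # Best fit: leave the largest workers free for jobs that need them.
    chosen = min(candidates, key=lambda w: w.free_memory_gb - job.memory_gb)
    headroom = chosen.free_memory_gb - job.memory_gb
    trail.append(f"{chosen.name}: chosen, {headroom:.1f} GB headroom")
    return chosen, trail
```

Storing the returned trail in the run record is what lets an operator later see why a job landed where it did.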

Traceability

Every run carries the code revision it executed, the resolved configuration, the dataset references, the environment, the worker identity, and the artifacts produced. Write the record incrementally; waiting until the end loses the most important context when a run fails halfway through.
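Incremental writing can be as simple as an append-only file with one JSON object per line. The sketch below assumes that layout; the `RunRecord` class, its event names, and its fields are hypothetical.

```python
import json
import os
import time
from pathlib import Path


class RunRecord:
    """Append-only run record: one JSON object per line."""

    def __init__(self, path: Path, run_id: str):
        self.path = path
        self.run_id = run_id

    def append(self, event: str, **fields) -> None:
        entry = {"run_id": self.run_id, "event": event, "time": time.time(), **fields}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry, sort_keys=True) + "\n")
            # Flush to disk now so the entry survives a crash mid-run.
            f.flush()
            os.fsync(f.fileno())

    def read(self) -> list:
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

A run would call `append("submitted", code_revision=...)` at submit time, then `append("placed", worker=...)`, and so on, so a job that dies halfway still leaves everything recorded up to that point.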

A result that leaves the system keeps its run reference. A figure in a report leads back to the job, the job leads back to the source and config, and the source and config explain the code path that produced the result.

Local and remote workers

Local and remote targets share the same job states and record format, even when their transport details differ. Remote workers add handoff, liveness, and log-collection concerns, and those concerns belong in the normal execution path.

Heartbeat failure creates a clean terminal state or a controlled retry. The job leaves the running list with a reason attached, and the operator sees whether the worker disappeared, validation blocked launch, the command exited, or a retry policy stopped further attempts.
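The heartbeat rule can be sketched as a single function that moves a silent job out of the running state and attaches a reason. The timeout, the attempt limit, the `RunningJob` shape, and the non-terminal `retry_pending` state are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: str
    last_heartbeat: float   # timestamp of the most recent heartbeat, in seconds
    attempts: int = 1
    state: str = "running"
    reason: str = ""


def check_heartbeat(job, now, timeout_s=60.0, retry_allowed=True, max_attempts=3):
    """Move a silent job out of 'running' with a readable reason attached."""
    silence = now - job.last_heartbeat
    if job.state != "running" or silence <= timeout_s:
        return job  # still healthy, or already handled

    if not retry_allowed:
        job.state = "worker_lost"
        job.reason = f"no heartbeat for {silence:.0f}s"
    elif job.attempts < max_attempts:
        # Controlled retry: the job goes back to the queue, not into limbo.
        job.state = "retry_pending"
        job.reason = (
            f"worker lost after {silence:.0f}s of silence; "
            f"attempt {job.attempts} of {max_attempts}"
        )
    else:
        job.state = "retry_exhausted"
        job.reason = f"worker lost on final attempt {job.attempts} of {max_attempts}"
    return job
```

Passing `now` in explicitly keeps the function easy to test; a supervisor loop would typically supply a monotonic clock reading.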

Recovery

Queues and worker state need restart-safe storage with inspectable contents. After a supervisor restart, the system reconciles intended work with live processes from a known point. A recovery path that relies on operator inference from raw logs has already failed.
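Reconciliation after a supervisor restart amounts to comparing the stored intent with what is actually alive. The sketch below assumes the stored state is a mapping from job id to state name and that live processes can be listed by job id; the action names are hypothetical.

```python
from __future__ import annotations


def reconcile(intended: dict[str, str], live_job_ids: set[str]) -> dict[str, str]:
    """Compare stored job states with live processes and name an action for each."""
    actions = {}
    for job_id, state in sorted(intended.items()):
        if state == "running" and job_id in live_job_ids:
            actions[job_id] = "reattach"          # process survived the restart
        elif state == "running":
            actions[job_id] = "mark_worker_lost"  # recorded as running, nothing alive
        elif state == "queued":
            actions[job_id] = "keep_queued"
        # Terminal states need no action.

    # Live processes with no stored record are surfaced, never silently adopted.
    for job_id in sorted(live_job_ids - set(intended)):
        actions[job_id] = "flag_orphan"
    return actions
```

Because the output is a plain mapping, it can be written into the operating record before any action is taken, which keeps the recovery itself reviewable.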

Recovery is an operator action. Stop, restart, retry, cancel, and review all run through documented command paths, and the system reports what changed after each action.

Failure semantics

Failure type determines the next step. A validation rejection needs a corrected request. A transport error can warrant a retry with backoff. A missing file points to packaging or sync. An out-of-memory error needs to be surfaced to the operator, because another automatic attempt may repeat the same failure.

Named terminal states make triage faster: completed, canceled, validation_failed, worker_lost, command_failed, retry_exhausted. The exact names matter less than the discipline of keeping them distinct.
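The states listed above, and the mapping from failure type to next step, can be written down directly. In this sketch the state values come from the text, while the failure-kind keys, the `NEXT_STEP` table, and the backoff parameters are illustrative assumptions.

```python
from enum import Enum


class TerminalState(str, Enum):
    COMPLETED = "completed"
    CANCELED = "canceled"
    VALIDATION_FAILED = "validation_failed"
    WORKER_LOST = "worker_lost"
    COMMAND_FAILED = "command_failed"
    RETRY_EXHAUSTED = "retry_exhausted"


# Hypothetical failure kinds mapped to the next step each one calls for.
NEXT_STEP = {
    "validation_rejected": "correct the request and resubmit",
    "transport_error": "retry with backoff",
    "missing_file": "check packaging or sync",
    "out_of_memory": "surface to the operator; do not retry automatically",
}


def backoff_delay(attempt: int, base_s: float = 2.0, cap_s: float = 300.0) -> float:
    """Exponential backoff: base_s, 2*base_s, 4*base_s, ... capped at cap_s."""
    return min(cap_s, base_s * 2 ** (attempt - 1))
```

Using an enum rather than free-form strings means a misspelled state fails when it is constructed, not later when someone searches the records for it.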

What the record provides

Good experiment operations leave behind more than finished jobs: a body of work that remains searchable, comparable, reviewable, and reusable away from the original run context. That record survives model swaps, hardware changes, and turnover in the people operating the system.