Separate control research from flight evidence
Classical controllers remain the default for good reasons: they are inspectable, fast to tune inside a known envelope, and easy to bound. Learned controllers become interesting when the operating envelope is wider than a hand-tuned controller can comfortably cover.
A policy can look competent in simulation while the evidence package remains weak. Review starts with the record: what was trained, what was tested, what assumptions the simulator made, and what interface the flight stack will receive.
The pipeline pieces
Simulation environment. Dynamics, actuator limits, weather, sensor noise, bias, timing, and randomized initial conditions.
Training pipeline. Baseline controller data, reward definitions, curriculum changes, seeds, and training configuration.
Evaluation gates. Fixed tests on a held-out distribution with clear pass, fail, and blocked states.
Export pipeline. Observation contracts, action bounds, runtime checks, manifests, and flight-stack integration tests.
Each piece needs its own record. Training curves show optimizer progress. Simulator records show the surface the candidate saw. Export records show the runtime contract. Evaluation records tie those views together and state the validated envelope.
Simulation assumptions
A useful simulator states its limits. It names the actuator model, motor saturation behavior, atmosphere assumptions, sensor rates, noise profiles, latency assumptions, and wind model used during training and evaluation.
The evaluation distribution stresses those assumptions before hardware work begins. Narrow simulation produces narrow evidence. Broad randomization helps, but the report still needs to say where the model is expected to hold and which conditions remain untested.
The uncomfortable cases belong in the evaluation plan early: sensor dropout, actuator saturation, gust response, delayed observations, payload changes, and recovery from awkward initial attitudes. A simulator missing those cases may still support research, but flight handoff then needs additional evidence from other tests.
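One way to keep those cases in the plan is to sample them from a seeded generator, so every evaluation batch can be regenerated exactly. The sketch below is illustrative only: the `Scenario` fields, the ranges, and the dropout probability are assumed values, not ones taken from any particular simulator.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    wind_gust_mps: float
    obs_delay_steps: int
    payload_kg: float
    initial_roll_deg: float
    sensor_dropout: bool


# Hypothetical ranges; a real plan would take these from the simulator record.
GUST_RANGE_MPS = (0.0, 12.0)
PAYLOAD_RANGE_KG = (0.0, 1.5)
ROLL_RANGE_DEG = (-60.0, 60.0)
MAX_DELAY_STEPS = 4
DROPOUT_PROBABILITY = 0.2


def sample_scenario(rng: random.Random) -> Scenario:
    """Draw one scenario from the assumed ranges."""
    return Scenario(
        wind_gust_mps=rng.uniform(*GUST_RANGE_MPS),
        obs_delay_steps=rng.randint(0, MAX_DELAY_STEPS),
        payload_kg=rng.uniform(*PAYLOAD_RANGE_KG),
        initial_roll_deg=rng.uniform(*ROLL_RANGE_DEG),
        sensor_dropout=rng.random() < DROPOUT_PROBABILITY,
    )


def sample_batch(seed: int, count: int) -> List[Scenario]:
    """Same seed and count always produce the same batch."""
    rng = random.Random(seed)
    return [sample_scenario(rng) for _ in range(count)]
```

Recording the seed alongside the ranges lets a reviewer rebuild the exact scenarios a candidate was scored on.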
Warm starts and baselines
A tuned PID or cascaded controller can provide baseline behavior for behavior cloning. The learned policy starts from recorded control behavior, with stability already represented in the data, before reinforcement learning begins.
The baseline remains useful after the warm start. Compare the learned policy against the controller it learned from on the held-out distribution, and let a regression against that baseline block promotion until the cause is understood.
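The comparison against the baseline could be sketched as a per-scenario check. This assumes each scenario yields a single tracking-error number where lower is better; the function name, the dictionary layout, and the tolerance are hypothetical.

```python
from typing import Dict, List


def regressions_vs_baseline(
    policy_errors: Dict[str, float],
    baseline_errors: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return scenario ids where the policy is worse than the baseline.

    Errors are assumed to be "lower is better". A scenario the policy was
    never scored on is treated as an error, not as a pass.
    """
    missing = sorted(set(baseline_errors) - set(policy_errors))
    if missing:
        raise ValueError("policy not scored on: " + ", ".join(missing))
    return sorted(
        scenario_id
        for scenario_id, baseline in baseline_errors.items()
        if policy_errors[scenario_id] > baseline + tolerance
    )
```

A non-empty result is the signal to hold promotion until the regression is explained.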
Reward changes need records
Reward shaping is part of the experiment. Changes to reward components can make a policy look better while moving it away from the behavior the evaluation gate is meant to protect.
Each candidate records reward components, hyperparameters, training distribution, curriculum stage, seeds, and checkpoint identity. When a candidate passes, the record is detailed enough for another operator to reproduce the training conditions and rerun the evaluation.
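A candidate record of this kind might look like the following sketch. The field names mirror the list above, but the class itself and the fingerprint scheme (SHA-256 over sorted-key JSON) are assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CandidateRecord:
    checkpoint_id: str
    reward_components: Dict[str, float]
    hyperparameters: Dict[str, float]
    training_distribution: str
    curriculum_stage: int
    seeds: List[int]

    def fingerprint(self) -> str:
        """Stable digest of the record; any changed field changes the digest."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
```

Storing the fingerprint next to the evaluation result makes a silent reward or seed change visible as a different digest.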
Evaluation gates
Candidate policies run against tests that measure tracking, stability, control effort, disturbance response, and task completion under held-out conditions. The gate result stays simple: pass, fail, or blocked for missing evidence.
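The three-state result can be sketched as a small function. It assumes every metric is "lower is better" and has a fixed limit; the metric names used in the usage note are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class GateResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    BLOCKED = "blocked"


def evaluate_gate(
    metrics: Dict[str, Optional[float]],
    limits: Dict[str, float],
) -> Tuple[GateResult, List[str]]:
    """Return the gate state and the metric names that caused it.

    Missing evidence is checked first, so an incomplete run can never
    be reported as a pass or a fail.
    """
    missing = sorted(name for name in limits if metrics.get(name) is None)
    if missing:
        return GateResult.BLOCKED, missing
    failed = sorted(name for name, limit in limits.items() if metrics[name] > limit)
    if failed:
        return GateResult.FAIL, failed
    return GateResult.PASS, []
```

For example, a run that reports `tracking_error` but omits `control_effort` comes back as blocked with `control_effort` named, which keeps "not measured" distinct from "measured and failed".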
The held-out distribution is a protected asset. Keep it versioned, owned outside the training loop, and out of the training data. If training and evaluation drift apart, the report identifies the drift before new candidates are promoted.
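A minimal sketch of protecting the held-out set is shown below, assuming scenarios carry stable string ids and the set carries a version label. Both conventions, and the function name, are hypothetical.

```python
from typing import Iterable, List


def held_out_problems(
    training_ids: Iterable[str],
    held_out_ids: Iterable[str],
    pinned_version: str,
    evaluated_version: str,
) -> List[str]:
    """List reasons the held-out set can no longer be trusted as held out."""
    problems = []
    if pinned_version != evaluated_version:
        problems.append(
            f"version drift: pinned {pinned_version}, evaluated {evaluated_version}"
        )
    leaked = sorted(set(training_ids) & set(held_out_ids))
    if leaked:
        problems.append("held-out scenarios found in training data: " + ", ".join(leaked))
    return problems
```

An empty list means the check found nothing. It does not prove the two distributions are statistically separate, only that no id is shared and the version matches.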
Flight-stack handoff
Export checks validate observation dimensions, field order, control frequency, normalization values, action bounds, and model architecture. A mismatch stops the export with a named reason.
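An export check with named reasons could be sketched as follows. The `PolicyContract` fields cover only part of the list above (dimensions, field order, control frequency, action bounds), and the class names and reason strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


class ExportMismatch(Exception):
    """Raised with a named reason when the export must stop."""


@dataclass(frozen=True)
class PolicyContract:
    observation_fields: Tuple[str, ...]
    control_hz: int
    action_low: Tuple[float, ...]
    action_high: Tuple[float, ...]


def check_export(trained: PolicyContract, runtime: PolicyContract) -> None:
    """Compare the training-side contract with what the runtime expects."""
    if len(trained.observation_fields) != len(runtime.observation_fields):
        raise ExportMismatch("observation_dimension")
    if trained.observation_fields != runtime.observation_fields:
        raise ExportMismatch("field_order_or_names")
    if trained.control_hz != runtime.control_hz:
        raise ExportMismatch("control_frequency")
    if (trained.action_low, trained.action_high) != (
        runtime.action_low,
        runtime.action_high,
    ):
        raise ExportMismatch("action_bounds")
```

Checking dimension before order keeps the reported reason specific: two contracts with swapped fields fail on order, not on a generic inequality.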
Software-in-the-loop testing exercises the exported runtime bindings and the training framework as separate surfaces, so agreement between them is checked instead of assumed. The flight stack keeps hard guarantees such as geofencing, attitude limits, battery failsafes, and arm/disarm logic, and the learned policy operates inside those controls.
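To illustrate a policy operating inside flight-stack limits, the sketch below guards each policy output before it reaches the actuators. The function, the fallback command, and the limit values are hypothetical, and a real flight stack would enforce its guarantees in its own code.

```python
import math
from typing import List, Sequence, Tuple


def guard_action(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    fallback: Sequence[float],
) -> Tuple[List[float], bool]:
    """Return (command, accepted).

    A malformed action (wrong length, NaN, or infinity) is rejected and the
    fallback command is used. A well-formed action is clamped to the limits.
    """
    if len(action) != len(low) or any(not math.isfinite(a) for a in action):
        return list(fallback), False
    clamped = [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]
    return clamped, True
```

The boolean lets the caller log every rejection, so malformed outputs show up in the flight record.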
The handoff contract is deliberately plain. It says what observations arrive, how often they arrive, how they are normalized, what actions can be emitted, and what runtime checks will reject an unsafe or malformed artifact. Plain contracts are easier to test, and they leave less room for a policy that worked in training to become a different system at integration time.
Operating envelope
A learned policy is cleared for the conditions covered by its training, evaluation, and handoff records. Wider use requires new evidence, and the release package states the validated envelope, known gaps, and required follow-on tests.
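A validated envelope can be expressed as data and checked before a flight is planned. The sketch below assumes the envelope is a set of named numeric ranges; the condition names and limits in the usage note are illustrative.

```python
from typing import Dict, List, Tuple


def outside_envelope(
    conditions: Dict[str, float],
    envelope: Dict[str, Tuple[float, float]],
) -> List[str]:
    """List every reason the planned conditions are not covered.

    A condition that was never reported counts as not covered, because the
    envelope only clears what the records actually show.
    """
    problems = []
    for name, (low, high) in sorted(envelope.items()):
        if name not in conditions:
            problems.append(f"{name}: not reported")
        elif not low <= conditions[name] <= high:
            problems.append(f"{name}: {conditions[name]} outside [{low}, {high}]")
    return problems
```

With an envelope of `{"wind_mps": (0.0, 8.0), "payload_kg": (0.0, 1.0)}`, a plan reporting only a 9.5 m/s wind would be flagged twice: once for the wind and once for the unreported payload.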
Flight logs close the loop between simulator and hardware. Passes and failures feed the next training distribution as labeled evidence that can be searched, reviewed, and reused.