Train a 5-parameter linear controller to balance a pole on a moving cart by direct search over the policy weights. No gradients, no value functions: just a black-box optimiser hunting a 5-D parameter vector whose only signal is the average survival time across a handful of stochastic episodes.
Each row records the mean return of the best policy a given algorithm found in one run. The max possible is 500 (the length of a CartPole-v1 episode). The gap between an algorithm's best training return and an independent 20-episode test mean is the noise-overfitting tax.
| Algorithm | Best mean return | Policies used | Test mean (20 fresh rollouts) |
|---|---|---|---|
| (no runs yet) | | | |
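The noise-overfitting tax described above the table is plain arithmetic: the best training return minus the mean of the fresh test rollouts. A minimal sketch follows; the function name and the numbers in the usage lines are hypothetical and are not results from any run.

```python
from statistics import mean
from typing import Sequence


def noise_overfitting_tax(best_train_return: float,
                          test_returns: Sequence[float]) -> float:
    """Gap between the best training mean return and the test mean.

    A positive value means the optimiser's best score was partly luck:
    the policy looked better on its training episodes than it does on
    fresh ones.
    """
    return best_train_return - mean(test_returns)


# Made-up illustration: a policy that scored 500 in training but only
# survives the full 500 steps in 15 of 20 fresh rollouts.
example_test_returns = [500] * 15 + [400] * 5
example_tax = noise_overfitting_tax(500.0, example_test_returns)
```

A tax near zero suggests the training score was representative; a large tax suggests the optimiser latched onto a lucky batch of episodes.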
The cart-pole simulator runs the same CartPole-v1 dynamics that
OpenAI Gym ships (semi-implicit Euler integration of a 4-state
rigid-body system: cart position, cart velocity, pole angle, pole
angular rate). Each "policy" is a vector of 5 numbers (4 weights and
a bias) defining a linear controller: action = +push if
w·state + b > 0, and −push otherwise.
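The controller rule can be sketched in a few lines. The parameter ordering (four weights followed by the bias) and the 0/1 action encoding (1 for +push, 0 for −push) are assumptions made for illustration, not something this page specifies.

```python
from typing import Sequence


def linear_policy(params: Sequence[float], state: Sequence[float]) -> int:
    """Return 1 (+push) if w·state + b > 0, otherwise 0 (−push).

    Assumed layout: params = [w_x, w_x_dot, w_theta, w_theta_dot, b],
    state = [x, x_dot, theta, theta_dot].
    """
    if len(params) != 5 or len(state) != 4:
        raise ValueError("expected 5 parameters and a 4-dimensional state")
    *weights, bias = params
    score = sum(w * s for w, s in zip(weights, state)) + bias
    return 1 if score > 0 else 0


# A policy that looks only at the pole angle: push toward the side
# the pole is leaning.
angle_only = [0.0, 0.0, 1.0, 0.0, 0.0]
action_leaning_right = linear_policy(angle_only, [0.0, 0.0, 0.1, 0.0])
action_leaning_left = linear_policy(angle_only, [0.0, 0.0, -0.1, 0.0])
```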
Each evaluation runs 8 episodes from random initial conditions and returns the mean number of frames the pole stayed up. HumpDay minimises, so the objective is the negative mean return. A policy that consistently keeps the pole up for the full 500-step limit scores −500; one that crashes after 10 frames scores −10.
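To make the objective concrete, here is a self-contained sketch of one evaluation. It is not the page's actual simulator. The physical constants, termination thresholds and initial-state range are the commonly cited Gym CartPole values and should be treated as assumptions, as should the parameter ordering (four weights, then the bias) and the `seed` argument, which is added only to make the example reproducible.

```python
import math
import random

# Assumed constants (commonly cited Gym CartPole values).
GRAVITY, MASS_CART, MASS_POLE = 9.8, 1.0, 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5                      # half the pole length
POLE_MASS_LENGTH = MASS_POLE * HALF_LENGTH
FORCE_MAG, TAU = 10.0, 0.02            # push force, seconds per frame
X_LIMIT = 2.4                          # cart leaves the track
THETA_LIMIT = 12 * math.pi / 180       # pole counts as fallen
MAX_STEPS = 500                        # CartPole-v1 episode cap


def step(state, action):
    """Advance one frame with semi-implicit Euler integration."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS))
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    # Semi-implicit: update velocities first, then positions with them.
    x_dot += TAU * x_acc
    x += TAU * x_dot
    theta_dot += TAU * theta_acc
    theta += TAU * theta_dot
    return (x, x_dot, theta, theta_dot)


def episode_length(params, rng):
    """Frames survived by the linear policy from a random start."""
    weights, bias = params[:4], params[4]
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    for t in range(MAX_STEPS):
        score = sum(w * s for w, s in zip(weights, state)) + bias
        state = step(state, 1 if score > 0 else 0)
        if abs(state[0]) > X_LIMIT or abs(state[2]) > THETA_LIMIT:
            return t + 1
    return MAX_STEPS


def objective(params, n_episodes=8, seed=None):
    """Negative mean survival time, so that lower is better."""
    rng = random.Random(seed)
    returns = [episode_length(params, rng) for _ in range(n_episodes)]
    return -sum(returns) / n_episodes
```

Calling `objective([0.0] * 5)` evaluates a policy that always pushes the same way, which drops the pole within a handful of frames; a policy such as `[0.0, 0.0, 1.0, 1.0, 0.0]`, which reacts to pole angle and angular rate, survives noticeably longer. Without a fixed `seed` the same parameters give a different cost on every call, which is exactly the noise the optimiser has to cope with.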
Why is direct search effective here? The reward function (a sum of per-step survival indicators) is not differentiable in the policy parameters. Policy-gradient estimators must approximate gradients by perturbation, which has high variance. A well-chosen black-box optimiser exploring a 5-D space can find a balanced policy in a few hundred episodes, where REINFORCE typically needs millions.
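The idea of direct search can be shown with the simplest possible member of the family: perturb the best vector found so far and keep the candidate if its cost is no worse. This is a deliberately naive stand-in, not HumpDay's API and not one of the algorithms benchmarked here; `random_search` and `toy_cost` are hypothetical names, and the toy cost merely takes the place of the cart-pole objective so the block runs on its own.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_search(objective: Callable[[Sequence[float]], float],
                  dim: int = 5,
                  n_evals: int = 200,
                  sigma: float = 0.5,
                  seed: int = 0) -> Tuple[List[float], float]:
    """Gradient-free search using only objective values.

    Starts at the origin, proposes Gaussian perturbations of the
    incumbent, and accepts a candidate whose cost is no worse.
    """
    rng = random.Random(seed)
    best = [0.0] * dim
    best_cost = objective(best)
    for _ in range(n_evals - 1):
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        cost = objective(candidate)
        # "<=" lets the search drift across flat regions of the cost.
        if cost <= best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


def toy_cost(params: Sequence[float]) -> float:
    """Hypothetical smooth stand-in for the negative mean return."""
    return sum((p - 1.0) ** 2 for p in params)


found_params, found_cost = random_search(toy_cost)
```

With a noisy objective such as averaged episode returns, the accepted cost may be a lucky draw, which is one reason an independent test mean is reported alongside the best training return.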
Watch the search montage to see why population-based methods (CMA-ES, DE, GA) tend to dominate this objective: the landscape has a wide flat plateau where most policies crash within 20 steps (cost ≈ −20), then a sharp jump to "balanced for 500 steps". Local searchers struggle to escape the plateau; population methods find the jump through diversity.
If your hyper-parameter searches are heating the Earth, drop this in Cursor or Claude:
Read https://raw.githubusercontent.com/microprediction/humpday/main/SKILL.md and create a project skill from it.