🛞 CartPole Policy Optimizer

Train a 5-parameter linear controller to balance a pole on a moving cart, by direct search over the policy weights. No gradients, no value functions — just a black-box optimiser hunting a 5-D parameter vector whose only signal is the average survival time across a handful of stochastic episodes.

Algorithm

Controls: pick a HumpDay optimizer and an evaluation budget.

Live readouts: mean return, policies tried, and the best policy so far (w_x, w_ẋ, w_θ, w_θ̇, bias).

Leaderboard (this session)

Each row records the mean return of the best policy a given algorithm found in one run. The maximum possible return is 500 (the step limit of a CartPole-v1 episode). The gap between an algorithm's best training return and the mean of 20 independent test episodes is the noise-overfitting tax: the best training score is partly luck, because it is the most favourable of many noisy 8-episode averages.

Algorithm	Best mean return	Policies used	Test mean (20 fresh rollouts)
— no runs yet —
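The noise-overfitting tax described above is simple arithmetic. A minimal sketch follows; the function name and the example numbers are made up for illustration and are not taken from the page.

```python
def overfitting_tax(best_train_mean, test_returns):
    """Gap between the best training score and the mean of fresh test rollouts."""
    test_mean = sum(test_returns) / len(test_returns)
    return best_train_mean - test_mean


# Hypothetical run: the best training evaluation hit the 500 cap, but on
# 20 fresh rollouts 5 episodes ended early at 400 steps.
test_returns = [500] * 15 + [400] * 5
print(overfitting_tax(500.0, test_returns))  # 25.0
```

A tax near zero suggests the training score was representative; a large tax suggests the optimizer selected a policy that got lucky on its 8 training episodes.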

What's happening

The cart-pole simulator runs the same CartPole-v1 dynamics that OpenAI Gym ships (semi-implicit Euler integration of a 4-state rigid-body system: cart position, cart velocity, pole angle, pole angular velocity). Each "policy" is a vector of 5 numbers, 4 weights and a bias, defining a linear controller: push the cart in the positive direction if w·state + b > 0, and in the negative direction otherwise.
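The simulator and controller described above can be sketched in a few lines of standard-library Python. The constants are the usual CartPole-v1 values (10 N push, 0.02 s step, failure at ±2.4 m or ±12°); the function names and code layout are assumptions for illustration, not the page's actual source.

```python
import math
import random

# Usual CartPole-v1 constants.
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5                      # half the pole length, in metres
POLE_MASS_LENGTH = MASS_POLE * HALF_LENGTH
FORCE_MAG = 10.0                       # newtons
TAU = 0.02                             # seconds per step
X_LIMIT = 2.4                          # metres
THETA_LIMIT = 12 * math.pi / 180       # radians
MAX_STEPS = 500


def act(w, state):
    """Linear controller: w holds 4 weights followed by a bias."""
    score = sum(wi * si for wi, si in zip(w[:4], state)) + w[4]
    return FORCE_MAG if score > 0 else -FORCE_MAG


def step(state, force):
    """Advance one step with semi-implicit Euler (velocities first)."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    x_dot += TAU * x_acc
    theta_dot += TAU * theta_acc
    x += TAU * x_dot                   # positions use the updated velocities
    theta += TAU * theta_dot
    return (x, x_dot, theta, theta_dot)


def rollout(w, rng):
    """Return the number of steps survived in one episode (at most 500)."""
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    for t in range(1, MAX_STEPS + 1):
        state = step(state, act(w, state))
        if abs(state[0]) > X_LIMIT or abs(state[2]) > THETA_LIMIT:
            return t
    return MAX_STEPS


# An all-zero policy always pushes in the negative direction and fails fast.
print(rollout([0.0, 0.0, 0.0, 0.0, 0.0], random.Random(0)))
```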

Each evaluation runs 8 episodes from random initial conditions and returns the mean number of frames the pole stayed up. HumpDay minimises, so the objective is the negative mean return. A policy that consistently keeps the pole up for the full 500-step limit scores −500; one that crashes after 10 frames scores −10.
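The sign convention is easy to get backwards, so here is a minimal sketch of the wrapper that turns an episode simulator into a cost for a minimiser. `make_objective` and the `rollout(w, rng)` callable are hypothetical names; the two stand-in simulators exist only to reproduce the −500 and −10 cases mentioned above.

```python
import random


def make_objective(rollout, n_episodes=8, seed=None):
    """Wrap an episode simulator into a cost that a minimiser can consume.

    `rollout(w, rng)` is assumed to return the number of steps survived.
    """
    rng = random.Random(seed)

    def objective(w):
        returns = [rollout(w, rng) for _ in range(n_episodes)]
        return -sum(returns) / n_episodes   # negate: higher return, lower cost

    return objective


def always_balanced(w, rng):
    """Stand-in simulator: every episode reaches the 500-step limit."""
    return 500


def early_crash(w, rng):
    """Stand-in simulator: every episode ends after 10 steps."""
    return 10


print(make_objective(always_balanced)([0.0] * 5))  # -500.0
print(make_objective(early_crash)([0.0] * 5))      # -10.0
```

Because each call draws fresh random initial conditions, evaluating the same weights twice with a real simulator will generally give two different costs.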

Why is direct search effective here? With a deterministic threshold policy, the return (a sum of per-step survival indicators) is piecewise constant in the policy parameters, so it offers no useful gradient. Policy-gradient methods work around this by sampling actions from a stochastic policy, and the resulting gradient estimates have high variance. A well-chosen black-box optimiser exploring a 5-D space can find a balanced policy in a few hundred episodes, typically far fewer than REINFORCE needs.
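To make "direct search" concrete, here is a sketch of the simplest possible black-box method, pure random search, which uses nothing but function values. It is not one of HumpDay's optimizers; `random_search` and the piecewise-constant `toy_cost` are illustrative stand-ins, and in practice the objective would be the negative mean return.

```python
import random


def random_search(objective, n_dim=5, budget=200, scale=1.0, seed=0):
    """Pure random search: sample, evaluate, keep the best. No gradients."""
    rng = random.Random(seed)
    best_w, best_cost = None, float("inf")
    for _ in range(budget):
        w = [rng.uniform(-scale, scale) for _ in range(n_dim)]
        cost = objective(w)
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost


def toy_cost(w):
    """Piecewise-constant stand-in: counts weights with the 'right' sign."""
    target_signs = (1, 1, 1, 1, -1)    # arbitrary, for illustration only
    return -sum(1 for wi, s in zip(w, target_signs) if wi * s > 0)


best_w, best_cost = random_search(toy_cost)
print(best_w, best_cost)
```

Like the real objective, `toy_cost` has zero gradient almost everywhere, yet the search still makes progress because it compares costs rather than following slopes.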

Watch the search montage to see why population-based methods (CMA-ES, DE, GA) tend to dominate this objective: the landscape has a wide flat plateau where most policies crash within 20 steps (cost ≈ −20), then a sharp drop to "balanced for 500 steps" (cost ≈ −500). Local searchers struggle to escape the plateau because every nearby policy scores about the same; population methods are more likely to land a candidate in the good region because they keep diverse candidates in play.
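As an illustration of a population method, here is a minimal differential evolution sketch (the DE/rand/1/bin scheme, with in-place replacement). It is not HumpDay's implementation, and `plateau_cost` is a made-up toy that only mimics the plateau-then-drop shape described above.

```python
import random


def differential_evolution(objective, n_dim=5, pop_size=20, generations=50,
                           f_weight=0.7, cross_rate=0.9, scale=1.0, seed=0):
    """Minimal DE/rand/1/bin. Returns (best_w, best_cost, history)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-scale, scale) for _ in range(n_dim)]
           for _ in range(pop_size)]
    costs = [objective(w) for w in pop]
    history = [min(costs)]             # best cost after each generation
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct members, none equal to the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(n_dim)   # at least one mutated coordinate
            trial = [
                pop[a][k] + f_weight * (pop[b][k] - pop[c][k])
                if (rng.random() < cross_rate or k == forced) else pop[i][k]
                for k in range(n_dim)
            ]
            trial_cost = objective(trial)
            if trial_cost <= costs[i]:      # greedy one-to-one replacement
                pop[i], costs[i] = trial, trial_cost
        history.append(min(costs))
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best], history


def plateau_cost(w):
    """Toy stand-in: flat at -20 except near a hypothetical good region."""
    target = (0.1, 0.3, 0.8, 0.5, 0.0)      # made-up values for illustration
    dist2 = sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    return -500.0 if dist2 < 0.25 else -20.0


best_w, best_cost, history = differential_evolution(plateau_cost)
print(best_cost)
```

Accepting trials on ties (`<=`) lets the population keep drifting across the flat plateau instead of freezing. With a noisy objective such as the 8-episode mean, the stored costs are single noisy evaluations, so the reported best is optimistic.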

🌱 Save the Planet

If your hyper-parameter searches are heating the Earth, drop this in Cursor or Claude:

Read https://raw.githubusercontent.com/microprediction/humpday/main/SKILL.md
and create a project skill from it.
