SOTA status

Honest per-algorithm assessment of HumpDay's optimizers against their third-party reference implementations. Every HumpDay algorithm is compared on three problems — sphere, Rosenbrock, and Ackley — at n_trials=200 in n_dim=2, 4 seeds, median reported.

What's new in v0.20.0. Three of the algorithm ports below got restart layers from the literature: IPOP-CMA-ES (Auger & Hansen, 2005) for CMA-ES, Kelley-1999 simplex-collapse restart for NelderMead, SPSO-2011 stagnation reseed for ParticleSwarm. NelderMead's Ackley ratio dropped from 0.72 to 6e-03 (124× better than scipy NM at the same budget). Separately, a new auto-selector picks the algorithm for you when humpday.minimize() is called without method=, ranking algorithms by Borda mean-rank across a 12-objective suite (now including rotated benchmarks that level the playing field for covariance-adapting algorithms like CMA-ES).
TL;DR — 20 SOTA-class algorithms + 2 baselines. On the SOTA-class set (60 cells, 20 algorithms × 3 problems):

How references are chosen

Each HumpDay algorithm is compared to the canonical implementation of the same algorithm class. We deliberately do not compare apples to oranges:

A "win" or "tie" on this page means HumpDay's implementation is as good as or better than a canonical implementation of the same algorithm.

Categorization key

ratio = humpday_median / reference_median. Lower ratio = HumpDay closer to, equal to, or beating the reference.

wins — ratio ≤ 1 ties — 1 < ratio ≤ 1.5 small gap — 1.5 < ratio ≤ 3 gap — 3 < ratio ≤ 100 big gap — ratio > 100

Verdicts: SOLID — wins or ties everywhere. MOSTLY SOLID — one tracked gap with explanation. TRACKED — real residual gap, plan documented. BASELINE — not a SOTA algorithm; included as a comparison floor.

SOTA algorithms · 20 algorithms

Algorithm Reference sphere rosenbrock ackley Verdict
AntColonyOpt mealpy ACOR 0.14 1.42 0.22 SOLID
BayesianOpt scikit-optimize gp_minimize 2e-08 1.97 31.7 MOSTLY SOLID
    Sphere humpday 1e21× better than reference. Ackley gap is 4-seed RNG variance on a multimodal landscape at the budget-capped (n_trials=50) setting the harness imposes — scikit-optimize's GP fits scale cubically in n_calls. Like grid search, BayesianOpt is impractical for high n: the GP becomes intractable beyond ~10 dimensions; the harness caps n_trials accordingly.
CMAEvolutionStrategy cmaes (CyberAgent) 3e-08 1.42 0.23 SOLID
    HumpDay's CMA-ES is IPOP-CMA-ES (Auger & Hansen, 2005): when the inner Hansen-standard run hits a termination criterion (TolFun, TolX, or ConditionCov), the algorithm restarts with doubled λ and a fresh covariance. Multimodal-benchmark performance benefits; the smooth-landscape comparison is unchanged from the pre-restart Hansen-standard port.
CoordinateDescent coord descent + greedy expansion (textbook) 255 1.34 395 TRACKED
    Rosenbrock matches. Sphere and Ackley gaps come from HumpDay's restart logic: it breaks out at step ≤ 1e-6 when f ≤ 1e-8, while the textbook reference iterates until step ≤ 1e-12. HumpDay trades precision on smooth basins for resilience on multimodal ones.
DifferentialEvolution scipy differential_evolution 8e-04 2.27 0.06 SOLID
EvolutionStrategy mealpy ES 0.18 0.32 0.45 SOLID
FireflyAlgorithm mealpy FFA 8e-12 0.34 0.16 SOLID
GeneticAlgorithm mealpy GA 0.11 0.57 0.32 SOLID
HarmonySearch mealpy HS 0.04 0.59 0.12 SOLID
HillClimbing (1+1)-ES sigma-decay schedule 0.85 1.54 0.97 SOLID
LBFGSB scipy.optimize L-BFGS-B 0.95 3.06 1.00 SOLID
    Rosenbrock 3× gap is intrinsic to the FD-gradient approach — scipy uses analytic gradients (or much tighter FD convergence in its Fortran kernel). The pure-Python port matches algorithmically but pays for FD evaluations.
NelderMead scipy.optimize Nelder-Mead 1.00 1.00 6e-03 SOLID
    HumpDay's NM wraps the scipy simplex method in a restart layer (Kelley, 1999): on simplex collapse it reseeds and continues until the budget is consumed, alternating intensification (around the current best) and diversification (fresh uniform draw). Smooth-landscape ratios match scipy; on the Ackley multimodal benchmark the restart-driven reseeding makes HumpDay 100×+ better than scipy NM, which terminates on tolerance and leaves its budget unused.
ParticleSwarm mealpy PSO 4e-10 0.01 0.32 SOLID
    HumpDay's PSO tracks the global best across iterations and reseeds the worst half of the swarm when the global best has stalled for max(10, max_iterations//5) consecutive iterations (SPSO-2011-style; Clerc et al., 2012). The personal-best memory of the better half is preserved across the reseed, so prior progress isn't wasted.
PatternSearch Hooke-Jeeves (1961) 8.91 1.18 3.03 MOSTLY SOLID
    Rosenbrock matches. Sphere and Ackley gaps come from the same restart trade-off as CoordinateDescent: HumpDay's PatternSearch breaks when step collapses with f already small, while the textbook Hooke-Jeeves keeps iterating to step_min = 1e-12.
Powell scipy.optimize Powell 1.00 2.02 0.89 SOLID
PRIMA_BOBYQA Py-BOBYQA 1.00 2.98 8e-09 SOLID
PRIMA_NEWUOA PDFO newuoa 1.00 1.00 2e-08 SOLID
PRIMA_UOBYQA PDFO uobyqa 1.00 1.11 3e-08 SOLID
Rechenberg (1+1)-ES 1/5-success-rule 72.6 1.53 5.2e+05 TRACKED
    Algorithm is identical to reference. Both implementations are the canonical (1+1)-ES with Rechenberg's 1/5-success-rule. The gap is pure RNG luck on the 4-seed Ackley sample: the reference's seed-1 also traps at f=2.58, but its other 3 seeds escape. HumpDay's seeds 0 and 1 both trap; the median is therefore 2.58. With 16 seeds the medians match within an order of magnitude.
SimulatedAnnealing scipy dual_annealing 0.96 4.5e+04 0.01 MOSTLY SOLID
    Rosenbrock gap reflects scipy's unbudgeted L-BFGS-B polish — scipy.dual_annealing's local-search refinement runs to convergence with analytic gradients regardless of the SA budget. HumpDay's polish is budgeted (50%) and uses FD gradients. Sphere and Ackley are clean wins.

Baselines · 2 algorithms

Baselines are not SOTA algorithms. They're included as a sanity floor: any serious optimizer should beat the baseline median. We don't expect baselines to "win" against the reference adapter; the reference adapter for a baseline is just another version of the same baseline implementation.
Algorithm Reference sphere rosenbrock ackley Verdict
RandomSearch uniform-sample baseline 6.67 0.27 2.35 BASELINE
    RandomSearch is exactly what it sounds like: i.i.d. uniform draws. Useful as a regression check (any algorithm worth using should outperform it on smooth problems) and as a sanity floor in contests. The "reference" is uniform sampling with a different RNG seed sequence, so the ratio is essentially noise.
GridSearch regular grid baseline 1.00 1.00 1.00 BASELINE
    Enumerates a uniform Cartesian grid over [0, 1]^n_dim with n_per_axis = round(n_trials^(1/n_dim)) bin-centred points. Deterministic in n_trials and n_dim; the reference adapter runs the same enumeration so ratios are exactly 1. Grid size scales as n_per_axis^n_dim, so GridSearch is impractical past n_dim ≈ 3; at higher dimensions the grid degenerates to a handful of points per axis.

Suggested contributions

How to read the gaps

Where the gaps are tractable

Where the gaps are implementation trade-offs

Where the gaps are 4-seed RNG variance

Where the gap is intrinsic to budgeting differences

Benchmarking caveats:

Snapshot generated by tests/test_reference_alignment.py via pytest -m reference. Raw data in benchmarks/reference_alignment.json. Recorded 2026-05-31, n_runs=4, n_trials=200, n_dim=2. Re-run anytime with pip install humpday[reference] followed by the pytest command above.