undefined

points

[-]

Engines like Stockfish might have over 100 "search parameters" that need to be tuned, to my best knowledge SPSA is preferred because the computational cost typically does not depend on the number of parameters.

Or, if attempting to use SPSA to say, perform a final post-training tune to the last layers of a neural network, this could be thousands of parameters or more.

by LPisGood56 minutes ago|

parent|

[-]

Are there any other well known examples of SPSA being used in practice?

by mpolson644 hours ago|

parent|

prev|

[-]

The concern about the dimensionality of the search space is real, especially once things cross over into the 100s -- BO would certainly not be useful post-training the way the blog post talks about using SPSA.

That being said, it still seems possible to be that using a different black box optimization technique for a fairly constrained set of related magic numbers (say, fewer than 50) might lead to some real performance improvements in these systems, could be worth reaching out to the lc0 or stockfish development communities.

by LPisGood4 hours ago|

prev|

[-]

One thing I wonder is why design of experiments (DOE) methodology is so seldom used for these things.

Statisticians and operations researchers have spent a hundred years deciding how to do as few experiments as possible to tweak parameters in the ways that give the highest impact with statistical basis that the selections are good.

In the language of information and decision trees, these experiments are trying to in some sense “branch” on the entropy minimizing variables.

by agalunar4 hours ago|

parent|

[-]

SPRT is used religiously in engine development today. There is enormous incentive to test efficiently.

https://github.com/official-stockfish/fishtest/wiki/Fishtest...

by mpolson644 hours ago|

parent|

prev|

[-]

DOE is still very useful in many contexts, but when it's possible do use a sequential design these adaptive techniques really start to pull away in terms of optimization quality.

There's simply a lot of sample efficiency to gain by adapting the experiment to incoming data in a regime where one can repeatedly design n candidates, observe their effects, and repeat m times compared to a setting where one must design a fixed experiment with n*m samples.