Sample size was 1000 jobs per prompt/model. We also rerun them once a month to detect regressions.
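For anyone wondering what "detect regression" could look like with 1000 jobs per run, here is a minimal sketch, assuming each job is scored pass/fail and that you compare this month's pass count against last month's with a one-sided two-proportion z-test. The function names, the pass/fail scoring, and the -1.645 threshold (roughly a 5% one-sided level) are my assumptions, not the poster's actual setup.

```python
import math


def regression_z(pass_prev: int, pass_curr: int, n: int = 1000) -> float:
    """Two-proportion z-statistic for current vs. previous pass rate.

    Assumes both runs used the same number of jobs, n.
    Negative values mean the current run did worse.
    """
    p_prev = pass_prev / n
    p_curr = pass_curr / n
    # Pooled pass rate under the null hypothesis of "no change".
    pooled = (pass_prev + pass_curr) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return 0.0
    return (p_curr - p_prev) / se


def is_regression(pass_prev: int, pass_curr: int, n: int = 1000,
                  z_threshold: float = -1.645) -> bool:
    """Flag a regression when the drop is larger than sampling noise explains."""
    return regression_z(pass_prev, pass_curr, n) < z_threshold


if __name__ == "__main__":
    # Hypothetical counts: 900/1000 passed last month, 850/1000 this month.
    print(is_regression(900, 850))
    # A 5-job drop is well within noise at this sample size.
    print(is_regression(900, 895))
```

At n = 1000 a drop of a few jobs is indistinguishable from noise, which is one reason a sample of this size is useful: it makes a drop of a few percentage points detectable.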
Sounds like someone who's on the hook for a bunch of repeatable processes (as repeatable as LLM-driven processes can be), operating at scale.
Just in the open, tools like open-webui bolt on evals so you can compare how different models, including new ones, perform on the tasks that you in particular care about.
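To make "the tasks that you in particular care about" concrete, here is a rough sketch of a per-model pass-rate comparison. This is not open-webui's API; the `pass_rates` helper, the task format (a prompt plus a checker function), and the stub models are all hypothetical stand-ins for whatever client and grading you actually use.

```python
from typing import Callable, Dict, List, Tuple

# A task is a prompt plus a checker that decides whether an output passes.
Task = Tuple[str, Callable[[str], bool]]
# A model is anything that maps a prompt to an output string (hypothetical interface).
Model = Callable[[str], str]


def pass_rates(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Run every task against every model and return the pass rate per model."""
    if not tasks:
        raise ValueError("need at least one task")
    rates = {}
    for name, generate in models.items():
        passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
        rates[name] = passed / len(tasks)
    return rates


# Stub tasks and models for illustration only; swap in real API calls.
tasks: List[Task] = [
    ("2+2=", lambda out: out.strip() == "4"),
    ("Capital of France?", lambda out: "Paris" in out),
]
models: Dict[str, Model] = {
    "model_a": lambda prompt: "4" if "2+2" in prompt else "Paris",
    "model_b": lambda prompt: "4",
}

if __name__ == "__main__":
    for name, rate in pass_rates(models, tasks).items():
        print(f"{name}: {rate:.0%}")
```

When a new model comes out, you add one entry to `models` and rerun the same tasks.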
Indeed, LLM providers mostly don't release models that do worse on benchmarks. Running your own evals is the same kind of testing, just done outside the provider's corporate boundary, pre-release feedback loop, and public evaluations.
https://chatgpt.com/share/69aa1972-ae84-800a-9cb1-de5d5fd7a4...