Download every github repo
-> Classify whether it could be used as an env, and for which types:
    -> Issues and PRs are great for coding RL envs
    -> If the software has a UI, awesome, UI env
    -> If the software is a game, awesome, game env
    -> If the software has xyz, awesome, ...
-> Do more detailed run checks:
    -> Can it build?
    -> Is it complex and/or distinct enough?
    -> Can you verify if it reached some generated goal?
    -> Can generated goals even be achieved?
-> Maybe some human review - maybe not
-> Generate goals
    -> For a coding env, you can imagine having an LLM introduce a new bug and checking that test cases now fail. The goal for the model is then to fix it
... Do the rest of the normal RL env stuff
So then the next next version is even better, because it got more data / better data. And it becomes better...
This feedback loop is mainly why we're seeing so many improvements, so fast (now month to month, up from every 3 months ~6 months ago and every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.
this part is nontrivial though
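For the non-verifiable case, a "council of LLMs" could be as simple as asking several judges and using the share that accept as the reward. A rough sketch, with loud caveats: the three judge functions below are hypothetical deterministic stubs standing in for real LLM calls, and the names `Judge` and `council_reward` are made up for this example.

```python
from typing import Callable, Sequence

# A judge takes (prompt, answer) and says whether it accepts the answer.
Judge = Callable[[str, str], bool]


def not_empty(prompt: str, answer: str) -> bool:
    """Stub judge: accepts any non-blank answer."""
    return bool(answer.strip())


def on_topic(prompt: str, answer: str) -> bool:
    """Stub judge: accepts if a longer prompt word shows up in the answer."""
    words = [w for w in prompt.lower().split() if len(w) > 3]
    return any(w in answer.lower() for w in words)


def concise(prompt: str, answer: str) -> bool:
    """Stub judge: accepts answers of at most 50 words."""
    return len(answer.split()) <= 50


def council_reward(prompt: str, answer: str, judges: Sequence[Judge]) -> float:
    """Fraction of judges that accept the answer, between 0.0 and 1.0."""
    if not judges:
        raise ValueError("need at least one judge")
    votes = [judge(prompt, answer) for judge in judges]
    return sum(votes) / len(votes)


COUNCIL = [not_empty, on_topic, concise]
good = council_reward(
    "Explain recursion", "Recursion is when a function calls itself.", COUNCIL
)
bad = council_reward("Explain recursion", "", COUNCIL)
```

With real models as judges the signal is noisier than a test suite, which fits the "slower, but it can still improve" point above.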