Benchmarks are really fun—lots of secret ones. Our main thesis is that you should be using the same benchmarks to measure human ability to use a computer, as you would an AI model. Definitely a suite of continuous long term planning tasks (games) and things such as marking emails as spam etc.
definitely! we are looking into more interp + visualizations in general as we scale up.