Is anyone familiar with gotree? It was mentioned as the most complex piece of code, but the metric was LOC. Based on the high-level description, gotree might be closer to a set of small programs/algorithms.

Interesting anyway. It will be nice to see these comparisons with open-weight models and how those fare.

reply
There's a more detailed description in "Appendix B: Qualitative discussion of the gotree task"

https://epoch.ai/blog/mirrorcode-preliminary-results#appendi...

reply
I would love to try this out. I have a horrible legacy project, written in Angular by a really amateur developer, full of huge blocks of copy-pasted code with minor modifications in each block. I’ve tried before to get an LLM to rewrite it into something more sensible, but I have not succeeded; usually it just ends up breaking everything. Is there a guide or some system to follow? What’s the best way to accomplish a task like this?
reply
The problem with these types of benchmarks is that it’s 100% certain the LLM has already been trained on all that code, so they’re all tainted: you don’t know whether you’re benchmarking recall or actual reasoning.

Same with SWE-bench and others.

reply
Surely the biggest difference is that you guys are mostly testing LLMs on simpler utilities, mostly involving higher-level languages, whereas ProgramBench's tasks are all very complex C programs (and much older programs with much more comprehensive test cases).

E.g., cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact, the only program you tested that comes anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed on it.

I think your results are consistent; you're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length. Yes, the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.

reply
I don't think so. The ProgramBench authors say no LLM fully resolves any task, i.e. even the easiest tasks in their benchmark are unsolved. We, by contrast, found that Opus 4.6 successfully reimplements almost every program up to gotree’s size (around 15-20 of them).

For Pkl, the preliminary results only went up to 1bn total tokens (costing $550, which would be cheap if LLMs could do the task). It might very well be solved at higher token budgets; see the report for more discussion of this.

The preliminary results are just on 4 targets. We have several Pkl-level and harder tasks in the full set, which we're releasing soon.

Several things in the following quote are not quite right:

> mostly involving higher-level languages, whereas ProgramBench's tasks are all very complex C programs (and much older programs with much more comprehensive test cases).

First, I think you're confusing the top end of ProgramBench's difficulty with the average. The quote in the OP is pretty clear that FFmpeg, SQLite, and PHP are the 3 hardest of the 200 tasks in the benchmark:

> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter.

Second, I don't see the relevance of C vs. higher-level languages; how does that make ProgramBench harder?

Third, on the test cases: I think you might be labouring under a misapprehension about how MirrorCode works. MirrorCode uses end-to-end tests drawn from a variety of sources (the original program’s test suites, real-world data, and LLM-assisted generation). End-to-end means the stdout/stderr has to match exactly for each test case.
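
To make "end-to-end" concrete, here is a minimal sketch in Go of that kind of check. The testCase struct, its field names, and the example invocation are illustrative assumptions for this comment, not MirrorCode's actual harness or test format:

    package main

    import (
        "bytes"
        "fmt"
        "os/exec"
    )

    // testCase is an assumed shape for one end-to-end case, for
    // illustration only; the real benchmark's format may differ.
    type testCase struct {
        args       []string
        stdin      []byte
        wantStdout []byte
        wantStderr []byte
    }

    // runCase runs the candidate binary and requires its stdout and
    // stderr to match the reference output byte-for-byte.
    func runCase(binary string, tc testCase) bool {
        cmd := exec.Command(binary, tc.args...)
        cmd.Stdin = bytes.NewReader(tc.stdin)
        var stdout, stderr bytes.Buffer
        cmd.Stdout = &stdout
        cmd.Stderr = &stderr
        _ = cmd.Run() // exit-status handling omitted; only the streams are compared here
        return bytes.Equal(stdout.Bytes(), tc.wantStdout) &&
            bytes.Equal(stderr.Bytes(), tc.wantStderr)
    }

    func main() {
        ok := runCase("./cal", testCase{
            args:       []string{"9", "1752"},
            wantStdout: []byte("..."), // placeholder; a real harness loads output captured from the reference program
        })
        fmt.Println("match:", ok)
    }

The point of exact stream comparison is that it leaves no room for "close enough": a single stray space or a differently worded error message fails the case.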

reply
> E.g., cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal.

This is incidental to the main disagreement, but btw I also doubt this.

Let's try to make the claim more precise. E.g., are you saying the average university undergraduate studying CS would reimplement cal from scratch (stdlib only), matching the output perfectly on all 1365 MirrorCode test cases, in (say) 3 days of full-time work (without AI assistance, obviously)? I'd bet against it!
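
For a flavour of why exact output matching is nontrivial: classic Unix cal switches from the Julian to the Gregorian calendar in September 1752, so even the leap-year rule depends on the year. Here is a minimal Go sketch of just that one rule (the 1752 cutoff is an assumption here; check the linked manual for what our cal actually does):

    package main

    import "fmt"

    // isLeap sketches a single cal edge case: before the (assumed) 1752
    // switch to the Gregorian calendar, the simpler Julian rule applies.
    func isLeap(year int) bool {
        if year <= 1752 { // assumed cutoff, as in classic Unix cal
            return year%4 == 0 // Julian rule
        }
        // Gregorian rule: century years are leap only if divisible by 400
        return year%4 == 0 && (year%100 != 0 || year%400 == 0)
    }

    func main() {
        fmt.Println(isLeap(1700), isLeap(1900)) // true false
    }

And classic cal 9 1752 prints a month with the 3rd through 13th missing entirely; getting details like that right byte-for-byte is a long way from "totally routine".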

Here is the manual for the cal that we use: https://media.githubusercontent.com/media/epoch-research/Mir...

You can also look at a full transcript of an LLM solving the task: https://epochai-public-eval-logs-manual.s3.amazonaws.com/eva...

reply