by kneel25 9 hours ago

by zamadatix 5 hours ago

Somehow they did use this as part of their approach to reach 0 regressions across 65k tests, no performance regressions, and identical AST and bytecode output, though. How much manual review was part of the hundreds of rounds of prompt steering is not stated, but I don't think it's possible to say it couldn't find any deep logical errors along the way and still achieve those results.

The part that concerns me is whether this part will actually come in time or not:

> The Rust code intentionally mimics things like the C++ register allocation patterns so that the two compilers produce identical bytecode. Correctness is a close second. We know the result isn’t idiomatic Rust, and there’s a lot that can be simplified once we’re comfortable retiring the C++ pipeline. That cleanup will come in time.

Of course, it wouldn't be the first time Andreas delivered more than I expected :).
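For readers unfamiliar with the "identical bytecode" check mentioned in the quote, here is a minimal sketch of that style of differential testing. Everything in it is hypothetical: `compile_reference` and `compile_port` are toy stand-ins, not Ladybird's compilers, and the "bytecode" is just a list of strings.

```python
from typing import Callable, Iterable

Bytecode = list[str]


def compile_reference(source: str) -> Bytecode:
    """Hypothetical stand-in for the existing, trusted compiler."""
    return [f"PUSH {tok}" if tok.isdigit() else f"OP {tok}" for tok in source.split()]


def compile_port(source: str) -> Bytecode:
    """Hypothetical stand-in for the ported compiler under test."""
    out: Bytecode = []
    for tok in source.split():
        out.append(f"PUSH {tok}" if tok.isdigit() else f"OP {tok}")
    return out


def find_divergences(
    corpus: Iterable[str],
    reference: Callable[[str], Bytecode],
    port: Callable[[str], Bytecode],
) -> list[str]:
    """Return every input for which the two compilers emit different bytecode."""
    return [src for src in corpus if reference(src) != port(src)]


corpus = ["1 2 +", "3 4 * 5 -", ""]
divergences = find_divergences(corpus, compile_reference, compile_port)
print(divergences)
```

An empty list only says the two implementations agree on this corpus; it says nothing about inputs the corpus does not cover, or about bugs both implementations share.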

by kneel25 2 hours ago

That’s convincing and impressive, but I wouldn’t say it proves it can spot deep errors. If it’s incredible at porting files and comparing against the source of truth then finding complicated issues isn’t being tested imo.

by zamadatix 1 hour ago

If completing the above successfully doesn't necessarily test these abilities then where does the concern about having these abilities come into play?

by herrkanin 9 hours ago

Your argument is just as applicable to human code reviewers. Obviously having others review the code will catch issues you would never have thought of. This includes agents as well.

by kneel25 9 hours ago

They’re not equal. Humans are capable of actually understanding and looking ahead at consequences of decisions made, whereas an LLM can’t. One is a review, one is mimicking the result of a hypothetical review without any of the actual reasoning. (And prompting itself in a loop is not real reasoning)

by iamleppert 6 hours ago

I keep hearing people say "but as humans we actually understand". What evidence do you have of a material difference between the understanding an LLM has and the version a human has? What processes do we fundamentally perform that an LLM does not or cannot? What is the definition of "understanding" here that humans presumably meet and an LLM currently does not?

by mcpar-land 5 hours ago

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

by kneel25 2 hours ago

Well, a material difference is that we don’t input/output in tokens, I guess. We have a concept of gaps and limits to knowledge, and factors like ego, preservation, and ambition go into our thoughts, where an LLM just has raw data. Understanding the implication of a code change means having an idea of a desired structure, some idea of where you want to head and how that meshes together. An LLM has zero of that. Just because it can copy the output that results from the factors I mention doesn’t mean it operates the same way.

by Fervicus 9 hours ago

With humans though, I wouldn't have to review 20k lines of code at once.

by glhaynes 8 hours ago

So ask the AI to just translate one little chunk at a time, right?

by Fervicus 8 hours ago

That's not what happened here though.

by DetroitThrow 9 hours ago

> Your argument is just as applicable to human code reviewers.

The tests many of us use to judge how capable a model or harness is are usually based on whether it can spot logical errors readily visible to humans.

Hence: https://news.ycombinator.com/item?id=47031580

by u_sama 9 hours ago

That is what the testing suite is there to check, no?

by layer8 9 hours ago

No. Testing generally can only falsify, not verify. It’s complementary to code review, not a substitute for it.
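A toy illustration of the "falsify, not verify" point, as a sketch only: the function and the test table below are made up for this example and have nothing to do with the project under discussion. The suite passes, yet the function is wrong for an input nobody thought to test.

```python
def is_leap_year(year: int) -> bool:
    # Deliberately buggy: ignores the century rule
    # (years divisible by 100 are leap years only if divisible by 400).
    return year % 4 == 0


# A plausible-looking suite that happens to miss the buggy case.
passing_suite = {2024: True, 2023: False, 2000: True, 1996: True}

suite_passes = all(
    is_leap_year(year) == expected for year, expected in passing_suite.items()
)

# 1900 was not a leap year, but the buggy function says it was.
counterexample = is_leap_year(1900)

print(suite_passes, counterexample)
```

A reviewer reading the one-line body can spot the missing century rule immediately, which is the sense in which review and testing complement each other.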

by kneel25 9 hours ago

You mean the testing suite generated by AI?

by trflynn89 7 hours ago

The primary JS test suite is maintained by the authors of the specification itself: https://github.com/tc39/test262

by Jolter 8 hours ago

It isn’t, in this case.

by u_sama 7 hours ago

No, a real test suite: either their own, which they developed, or the official ECMA one.

by cardanome 9 hours ago

Yeah, I lost all interest in the ladybird project now that it is AI slop.

No one wants to work with this generated, ugly, unidiomatic ball of Rust, other than other people using AI. So your dependency on AI grows and grows. It is a vicious trap.