> It's weird that Aisle wrote this.

No, writing an advertisement is not weird. What's weird is that it's top of HN. Or really, no, this isn't weird either if you think about it -- people looking for a gotcha ("Oh see, that new model really isn't that good/it's surely hitting a wall/plateau any day now") upvoted it.

by goekjclo3 hours ago|

It's not weird. Top of HN is worthless as a barometer at this point; people downvote for calling out AI slop.

by sanex3 hours ago|

Nah, Saturday post. Less news less content.

by SoftTalker3 hours ago|

It's also that humans are very bad at repetitive detailed tasks. Sitting down with a code base and looking at each function for integer overflow comparison bugs gets boring really fast. It's a rare person who can do that for as long as it takes to find a bug that they don't already have some clues about.

It's the flaw in the "given enough eyeballs, all bugs are shallow" argument. Because eyeballs grow tired of looking at endless lines of code.

Machines on the other hand are excellent at this. They don't get bored, they just keep doing what they are told to do with no drop-off in attention or focus.
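For anyone unsure what an "integer overflow comparison bug" looks like, here is a minimal sketch. It is Python simulating C's 32-bit unsigned wraparound, and the function names and buffer size are made up for illustration, not taken from any real codebase.

```python
U32_MASK = 0xFFFFFFFF  # simulate C's uint32_t wraparound


def in_bounds_buggy(offset: int, length: int, buf_size: int) -> bool:
    """Bounds check as it is often written in C: offset + length can wrap."""
    end = (offset + length) & U32_MASK  # wraps like uint32_t addition
    return end <= buf_size


def in_bounds_safe(offset: int, length: int, buf_size: int) -> bool:
    """Rearranged check that never adds two untrusted values together."""
    return offset <= buf_size and length <= buf_size - offset


if __name__ == "__main__":
    buf_size = 1024
    offset, length = 16, 0xFFFFFFF8  # length chosen so the sum wraps to 8
    print("buggy check accepts:", in_bounds_buggy(offset, length, buf_size))
    print("safe check accepts: ", in_bounds_safe(offset, length, buf_size))
```

The two checks look nearly identical on the page, which is the point: spotting the difference once is easy, spotting it in the ten-thousandth function is not.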

by throwatdem123112 hours ago|

idk man, pay me enough money and I’ll look at as much code as you want looking for integer overflows

Would it be cheaper than Claude Mythos doing it? No idea. Maybe, maybe not.

But it’s weird how we’re willing to throw away money to a megacorp to do it with “automation” for potentially just as much as, if not more than, it would cost to just have a big bounty program or hire someone and do it “normally”.

It would really have to be substantially less cost for me to even consider doing it with a bot.

by tredre31 hours ago|

> idk man, pay me enough money and I’ll look at as much code as you want looking for integer overflows

So would I, but it doesn't negate that we, humans, are bad at this. We will get bored and our focus will begin to drift. We might not notice it, we might not want to admit it, but after a few continuous hours we will start missing things.

by ____tom____2 hours ago|

And there aren't enough security researchers in the world to review ALL the files from OpenBSD.

And if there were, the cost would be more like $20M than $20K.

Having all code reviewed for security, by some level of LLM, should be standard at this point.

by tombert2 hours ago|

It's weird, because when working on a big project, taking a break for a week or two, and returning to it, I will find a bug and will see hundreds of lines of code that are absolutely terrible, and I will tell myself "Tom you know better than to do this, this is a rookie mistake".

I think people forget that it's hard to be clever and tidy 100% of the time. Big programs take a lot of discipline and an understanding of the context that can be really hard to maintain. This is one of several reasons that my second draft or third draft of code is almost always considerably better than the first draft.

by kennywinker3 hours ago|

If it’s obvious when you look close, then automate looking close. Seems simple to write tools that spider thru a code base, finding logical groupings and feeding them into an LLM with prompts like “there is a vulnerability in this code, find it”.

The thesis is, the tooling is what matters - the tools (what they call the harness) can turn a dumb llm into a smart llm.
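A rough sketch of the spidering half of that idea, assuming a C codebase and using one file as the crudest possible "logical grouping". Everything here (`SOURCE_EXTENSIONS`, `PROMPT_TEMPLATE`, `build_prompts`, the size cap) is a hypothetical name or value chosen for illustration, and actually sending the prompt to a model is left out because that depends on the provider.

```python
import os
from typing import Iterator, Tuple

SOURCE_EXTENSIONS = (".c", ".h")  # assumption: a C codebase
PROMPT_TEMPLATE = (
    "There is a vulnerability in this code, find it.\n\n"
    "File: {path}\n\n{code}"
)


def iter_source_files(root: str) -> Iterator[str]:
    """Walk the tree and yield source file paths in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(SOURCE_EXTENSIONS):
                yield os.path.join(dirpath, name)


def build_prompts(root: str, max_chars: int = 20_000) -> Iterator[Tuple[str, str]]:
    """Yield (path, prompt) pairs, one per file, truncating very large files."""
    for path in iter_source_files(root):
        with open(path, encoding="utf-8", errors="replace") as f:
            code = f.read(max_chars)
        yield path, PROMPT_TEMPLATE.format(path=path, code=code)


if __name__ == "__main__":
    for path, prompt in build_prompts("."):
        print(f"{path}: prompt of {len(prompt)} characters")
```

A real harness would presumably group by function or call chain rather than by file, and that grouping is where most of the work would be.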

by tptacek3 hours ago|

Hold on, I misread your comment because I'm knee-jerk about code scanners, which were the bane of my existence for a while. Reworking... and: done. The original comment was just the first graf without the LLM qualification. Sorry about that.

The general approach without LLMs doesn't work. 50 companies have built products to do exactly what you propose here; they're called static application security testing (SAST) tools, or, colloquially, code scanners. In practice, getting every "suspicious" code pattern in a repository pointed out isn't highly valuable, because every codebase is awash in them, and few of them pan out as actual vulnerabilities (because attacker-controlled data never hits them, or because the missing security constraint is enforced somewhere else in the call chain).

Could it work with LLMs? Maybe? But there's a big open question right now about whether hyperspecific prompts make agents more effective at finding vulnerabilities (by sparing context and priming with likely problems) or less effective (by introducing path dependent attractors and also eliminating the likelihood of spotting vulnerabilities not directly in the SAST pattern book).
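To make the false-positive problem with pattern-based scanners concrete, here is a toy sketch. The rule and the sample C snippet are invented for illustration and are far cruder than what real SAST products do, but the failure mode is the same in kind: the pattern fires whether or not the call is actually reachable with attacker-controlled data.

```python
import re
from typing import List, Tuple

# Hypothetical rule: flag every call to a few "risky" C functions.
RISKY_CALL = re.compile(r"\b(memcpy|strcpy|sprintf)\s*\(")


def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, function_name) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in RISKY_CALL.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


SAMPLE = """\
void copy(char *dst, const char *src, size_t n) {
    if (n > DST_MAX) return;   /* bound enforced right here */
    memcpy(dst, src, n);
}
void banner(char *out) {
    strcpy(out, "hello");      /* constant source, not attacker data */
}
"""

if __name__ == "__main__":
    for lineno, name in scan(SAMPLE):
        print(f"line {lineno}: suspicious call to {name}")
```

Both calls in the sample get flagged, and neither is a vulnerability as written: one has its bound checked on the line above, the other copies a constant.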

by bluGill1 hours ago|

I have long said that static checkers get ten false positives. Note that the size of the code is not a consideration: it doesn't matter if it is the four-line 'hello world' or the 10-million-line monster some of us work on, ten false positives is the max.

by roywiggins3 hours ago|

Right, but they didn't actually test that, did they?

by kennywinker3 hours ago|

[dead]

by drc500free3 hours ago|

It’s like not differentiating between solving and verifying.

“PKI is easy to break if someone gives us the prime factors to start with!”
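A toy illustration of that asymmetry, using tiny numbers and naive trial division. The function names are made up for this sketch; real RSA moduli are far too large for this approach, which is the whole point.

```python
from typing import Tuple


def verify_factors(n: int, p: int, q: int) -> bool:
    """Verifying: a single multiplication."""
    return p > 1 and q > 1 and p * q == n


def find_factors(n: int) -> Tuple[int, int]:
    """Solving: trial division, with work growing roughly like sqrt(n)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    raise ValueError("no nontrivial factors found")


if __name__ == "__main__":
    n = 101 * 103
    print("verify given factors:", verify_factors(n, 101, 103))
    print("find from scratch:   ", find_factors(n))
```

Being handed the vulnerable function and asked to confirm the bug is the `verify_factors` side of this; being handed the whole repository is the `find_factors` side.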

by tucnak2 hours ago|

The point of contention is whether Mythos is the product of its intelligence or its harness; results like this, and other similar testimonies, call into question the too-dangerous-to-release marketing, and for good reason, too. Because it is powerful marketing. Aisle merely says the intelligence is there in the small models. I say it's already clear that competent defenders could viably mimic, or perhaps even eclipse, what Mythos does by (a) making a better harness, (b) simply spending more on batch jobs, bootstrapping, caching better, etc. You may not be doing this yourself, but you probably should.

by tptacek2 hours ago|

Aisle and Anthropic are literally talking about two different problem spaces.