undefined

points

[-]

The company's product has its own classification model entirely dedicated to detecting unusual, dangerous prompt responses, and will redact or entirely block the model's response before it gets to the user. That's what their AIDR (AI Detection and Response) for runtime advertises it does, according to the datasheet I'm looking at on their website. Seems like the classification model is run as a proxy that sits between the model and the application, inspecting inputs/outputs, blocking and redacting responses as it deems fit. Filtering the input wouldn't always work, because they get really creative with the inputs. Regardless of how good your model is at detecting malicious prompts, or how good your guardrails are, there will always be a way for the user to write prompts creatively (creatively is an understatement considering what they did in this case), so redaction at the output is necessary.

Often, models know how to make bombs because they are LLMs trained on a vast range of data, for the purpose of being able to answer any possible question a user might have. For specialized/smaller models (MLMs, SLMs), not really as big of an issue. But with these foundational models, this will always be an issue. Even if they have no training data on bomb-making, if they are trained on physics at all (which is practically a requirement for most general purpose models), they will offer solutions to bomb-making.

by TerryBenedict188 days ago|

parent|

[-]

Right, so a filter that sits behind the model and blocks certain undesirable responses. Which you have to assume is something the creators already have, but products built on top of it would want the knobs turned differently. Fair enough.

I'm personally somewhat surprised that things like system prompts get through, as that's literally a known string, not a vague "such and such are taboo concepts". I also don't see much harm in it, but given _that_ you want to block it, do you really need a whole other network for that?

FWIW by "input" I was referring to what the other commenter mentioned: it's almost certainly explicitly present in the training set. Maybe that's why "leetspeak" works -- because that's how the original authors got it past the filters of reddit, forums, etc?

If the model can really work out how to make a bomb from first principles, then they're way more capable than I thought. And, come to think of it, probably also clever enough to encode the message so that it gets through...

by mpalmer189 days ago|

parent|

prev|

[-]

Are you affiliated with this company?

by 189 days ago|

prev|

[-]

deleted

by crooked-v189 days ago|

prev|

[-]

It knows that because all the current big models are trained on a huge mishmash of things like pirated ebooks, fanfic archives, literally all of Reddit, and a bunch of other stuff, and somewhere in there are the instructions for making a bomb. The 'safety' and 'alignment' stuff is all after the fact.