And it's also something that's dangerous to try to do stochastically.
reply
It's going to be stochastic in some sense whether you want it to be or not; human error never reaches zero percent. I would bet you a penny you'd get better results from one two-second automated pass plus your usual PII redaction than from your PII redaction alone.
reply
The advantage of computers was that they didn't make human errors; they did things repeatedly, quickly, and predictably. If I'm going to accept human error, I'd like it to come from a human.
reply
> The advantage of computers was that they didn't make human errors;

Sure they do: computers repeatedly, quickly, and predictably do what they are programmed to do, which includes any human errors in that programming.

reply
> predictably do what they are programmed to do

And now they predictably do what they are not programmed to do.

reply
I think the problem is most secrets aren't stochastic; they're deterministic. When the user types in the wrong password, it should be blocked. Using a probabilistic model means an attacker now only needs to be really close, not exactly correct.

Sure, there's some math that says the difference between really close and exact isn't a big deal; but then you're also saying your secrets don't need to be exact when decoding them, and right now they absolutely do.

It sure looks like a weird privacy veil that might sort of work for some things, like frosted glass. But think of a toilet stall made entirely of frosted glass: are you still comfortable going to the bathroom in there?

reply
I dunno what use case you're thinking this is for.

The use case for this is that many enterprise customers want SaaS products to strip PII from ingested content, and there's no non-model way to do it.

Think: ingesting call transcripts where those calls may include credit card numbers or private data. The call transcripts are very useful for various things, but for obvious reasons we don't want to ingest the PII.
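
To make that concrete, the pipeline ends up being "send each transcript chunk to a model with a redaction instruction, store only what comes back." A minimal sketch, assuming the standard OpenAI Python SDK (v1.x); the model name, prompt, and [REDACTED] token are placeholders I picked, not the linked filter's actual API:

  # Minimal sketch of model-based PII redaction (assumptions as noted above).
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  REDACTION_PROMPT = (
      "Replace every piece of personally identifiable information "
      "(names, phone numbers, card numbers, addresses, emails) in the "
      "following transcript with the token [REDACTED]. "
      "Return only the redacted transcript."
  )

  def redact(transcript_chunk: str) -> str:
      response = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model choice
          messages=[
              {"role": "system", "content": REDACTION_PROMPT},
              {"role": "user", "content": transcript_chunk},
          ],
          temperature=0,  # reduces, but does not eliminate, run-to-run variation
      )
      return response.choices[0].message.content

The downstream system only ever sees the model's output, which is both the whole point and, as the rest of this thread argues, the whole risk.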

reply
> Think: ingesting call transcripts where those calls may include credit card numbers or private data. The call transcripts are very useful for various things, but for obvious reasons we don't want to ingest the PII.

Credit card numbers are deterministic. A five-year-old could write a script to strip out credit card numbers.

As for other PII? You're seriously expecting an LLM to find every instance of every random piece of PII? Worldwide? In multiple languages? I've got an igloo I'd like to sell you...

reply
One could chain a regex-based system together with this; a sketch of what that pre-pass might look like is below.
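
For the well-structured stuff like card numbers, that pre-pass can be fully deterministic: match candidate digit runs, keep only the ones that pass a Luhn check, replace them, then hand whatever is left to the model pass. A rough sketch of that idea; the regex, the length limits, and the [CARD REDACTED] token are my own assumptions, not anything the linked tool specifies:

  import re

  # Candidate card numbers: 13-19 digits, optionally separated by spaces or dashes.
  CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

  def luhn_ok(digits: str) -> bool:
      """Standard Luhn checksum; weeds out most non-card digit runs."""
      total, parity = 0, len(digits) % 2
      for i, ch in enumerate(digits):
          d = int(ch)
          if i % 2 == parity:
              d *= 2
              if d > 9:
                  d -= 9
          total += d
      return total % 10 == 0

  def strip_card_numbers(text: str) -> str:
      def replace(match: re.Match) -> str:
          digits = re.sub(r"[ -]", "", match.group())
          return "[CARD REDACTED]" if luhn_ok(digits) else match.group()
      return CARD_RE.sub(replace, text)

  # strip_card_numbers(transcript) first, then hand the result to the model pass.
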
reply
The alternative being?
reply
For the confused: this link must have gotten revived or something; I posted this comment a few days ago. Looks like it's now getting the accolades I claim it deserves.
reply
It was put into the second-chance pool by the moderators. I originally submitted this link a few days ago and today got this (semi?)automated email from HN; an excerpt is below:

  The submission "OpenAI Privacy Filter" that you posted to Hacker News (https://news.ycombinator.com/item?id=47870901) looks good, but hasn't had much attention so far. We put it in the second-chance pool, so it will get a random placement on the front page some time in the next day or so.

  This is a way of giving good HN submissions multiple chances at the front page. If you're curious, you can read about it at https://news.ycombinator.com/item?id=26998308 and other links there.
reply
From a compliance POV it's not enough. For example: "<NAME PERSON ONE> is president of the United States" is still identifiable even though the name has been redacted.

Since you can't be 100% certain that a filter redacts all personal data, you'd have to make sure that you have measures in place that allow OpenAI to legally process personal data on your behalf. Otherwise you'd technically have a data breach (from a GDPR POV).

And if OpenAI can legally process personal data on your behalf, why bother filtering if processing without filtering is also compliant?

reply
Same here; this is an incredibly useful thing to have in the toolkit.
reply