undefined

points

[-]

From some minor historical experience with Mechanical Turk, I bet you could get humans to do this for one or two cents per receipt. You do them all three times for error checking for $0.03-$0.06 per receipt. I used to pay a nickel for much, much more than 5x this amount of transcription per job, and I got the feeling that I was overpaying based on how eagerly I got responses in and that I saw a lot of the same workers repeatedly.

These days, are MTurk workers simply feeding it into AI anyway, though? It's been a few years since I've run an MTurk campaign. At the time it was clear that humans were really doing it, as you get emails from the workers sometimes.

by PaulHoule7 hours ago|

prev|

[-]

My wife complains about people complaining about the price of eggs every time the subject comes up because it's her duty as a housewife to know about the price of all the protein sources and they are still a bargain -- who'd have thought that the price of transcribing the receipts would be seen as even more onerous?

Receipt scanning OCR has been around for a long time. Circa 2010 I ran enough HITs on Mechanical Turk [1] that I got my own account representative at AWS and I wondered what other kind of HITs other people were running and thought I would "go native" and try making $100 from Turk.

I am pretty good at making judgements for training sets, I have many times made data sets with 2,000-20,000 judgements; I can sustain the 2000 judgements/day of the median Freebase annotator and manage short burst much higher than that with mild perceptual side effects.

I gave up as a Turk though because the other HITs that were easy to find was the task of accurately transcribing cell phone snaps of mangled, damaged, crumpled, torn, poorly printed, poorly photographed or otherwise defective receipts. I can only imagine that these receipts had been rejected by a rather good classical OCR system. The damage was bad enough I could not honestly say I had done a 100% correct job on any single receipt, as I was being asked to do.

[1] in today's lingo: Multimodal with prompts like "Is this a photograph of an X?" and "Write a headline to describe this image"

by knollimar8 hours ago|

prev|

[-]

There's no way there wasnt a more efficient way of doing this. Way too many tokens per receipt.

I'd wager gemini flash could get decent results. Id be willing to try on 100 receipts and report cost

by andai2 hours ago|

parent|

[-]

I set up the same thing a few months ago with Flash, seemed to work fine. I didn't test more than a few receipts though (concluded that my spending wasn't a problem, I just needed to make more money lol), so can't vouch for the reliability at scale. But it handled really wrinkled, faded old receipts quite well.

by stavros9 hours ago|

prev|

[-]

One issue is that the human was less accurate than the LLM. The other is that the author probably didn't pay $1,500 for this, they probably paid $20 on a subscription.

by moron4hire8 hours ago|

prev|

[-]

You're counting just the egg-having receipts, but there were over 11 thousand receipts they had to go through to get to that 500-ish subset. I'm assuming OP wanted to process all of the receipts and then selected just eggs for a simple analytics job. With your rates, the human would cost almost $2000.

by cheschire6 hours ago|

parent|

[-]

Capturing the egg price from known egg receipts was the problem I was focused on, but you're right that there was also a filtering problem in the original spec. You get my upvote for continuing to make the problem interesting for me!

Had the filtering been done during the initial document storage, then the cost would have been much cheaper than your $2,000 estimate. Essentially binning the receipts based on "eggs" or "no eggs" would be free. But, crucially, what happens when the question changes from price per egg to price per gallon of milk? Now the whole stack would need to be sorted again. The $2,000 manual classification would need to be re-applied.

Isn't traditional ML-based classification cheaper for this problem at industrial scale than an LLM though? The OP did of course attempt more traditional generic off-the-shelf OCR tools, but let's consider proper bespoke industrial ML.

Just as a off-the-cuff example, I would probably start with building a tool that locates the date/time from a receipt and takes an image snip of it. Running ONLY image snips through traditional OCR is more successful than trying to extract text from an entire receipt. I would then train a separate tool that extracts images of line items from a receipt that includes item name and price. Yet another tool could then be trained to classify items based on the names of the items purchased, and a final tool to get the price. Now you have price, item, and date to put into your database.

Perhaps generating the training data to train the item classifier is the only place I could see an LLM being more cost effective than a human, but classifying tiny image snips is not the same as one-shotting an entire receipt. As an aside, if there's any desire to discuss how expensive training ML is, don't forget the price to train an LLM as well.

All of this is to say I believe traditional ML is the solution. I'm still not seeing the value prop of LLMs at the industrialization scale outside of very targeted training data generation. A more flippant conclusion might be that we can replace a lot of the parts of data science that makes PhD types get bored with creating traditional ML solutions.

by moron4hire4 hours ago|

parent|

[-]

Also, playing hotdog-not-hotdog on a receipt, looking for the price of eggs, and then entering them, is a very different job than the open-ended case of "enter all the relevant information from this receipt. There is large classification task that also has to take place to group name-brand items into generic categories (an open set that you don't know from the start) suitable for analyzing.

So, I've actually done similar work to this: getting paid piece-rate to manual enter data from paper invoices into an accounting system. It was so long ago I can't remember how fast I got at it, but it was way slower than 2 a minute/120 an hour. I doubt I got much more than a dozen an hour done. So, my gut reaction is that your estimate on the human cost is off by an order of magnitude.

by N_Lens9 hours ago|

prev|

[-]

AI has some weird unexpected uses that haven’t been fully uncovered yet, while it fails to scale or match the needed accuracy on expected usecases.

by extraduder_ire3 hours ago|

prev|

[-]

When I tried doing mechanical turk jobs out of curiosity, one of the tasks was checking/amending OCR'd receipts. (image on the left, textbox on the right)

It was less than a cent per receipt, but doing each was much quicker than 30 seconds. This was in 2017, to give you some idea how good OCR was.

Even before then, I've been disappointed no major chain encodes the receipt data into a QR code or something at the bottom of the receipt to side-step this whole thing. The closest you get is some places doing digital receipts nowadays.

by qoez8 hours ago|

prev|

[-]

Imagine how many 2001 era eggs he could have bought with that $101

by wolfram7410 hours ago|

prev|

[-]

I mean, at over 1000% the cost, the machine solution doesn't scale either?

by ProllyInfamous8 hours ago|

parent|

[-]

Not yet.

>>So I told Codex “we have unlimited tokens, let’s use them all,” and we pivoted to sending every receipt through Codex for structured extraction. From that one sentence, Codex came back with a parallel worker architecture - sharding, health management, checkpointing, retry logic. The whole thing. When I ran out of tokens on Codex mid-run, it auto-switched to Claude and kept going. I didn’t ask it to do that. I didn’t know it had happened until I read the logs.

----

For anybody still thinking my goodness, how wasteful is this SINGLE EXAMPLE: remember that all of the receipts from the article have helped better-train whichever GPT is deciphering all this thermalprinting.

For a small business owner (like my former self), paying $1500 to have an AI decipher all my receipts is still a heck of a lot cheaper than my accountant's rate. It would also motivate me to actually keep receipts (instead of throw-away/guessing), simply to undaunt the monumental task of recordskeeping.

----

>>But the runs kept crashing. Long CLI jobs died when sessions timed out. The script committed results at end-of-run, so early deaths lost everything. I watched it happen three times. On the fourth attempt I said “I would have expected we start a new process per batch.” That was the fix ... Codex patched it, launched it in a tmux session, and the ETA dropped from 12 hours to 3. Not a hard fix. Just the kind of thing you know after you’ve watched enough overnight jobs die at 3 AM.

>>11,345 receipts processed. The thing that was supposed to take all night finished before I went to bed.

by cheschire10 hours ago|

parent|

prev|

[-]

I think at a certain scale we're talking about switching to local trained models which don't have the same operating costs as running a frontier model for OCR. That would reduce the ongoing costs significantly. Might take longer than 30 seconds to read each receipt if you run multiple passes to ensure accuracy, but could run 24/7/365 without the same tax and administration overhead of humans.

Spherical cows aside though, I do agree with you that I should not consider scalability as a given.

by wolfram7410 hours ago|

parent|

[-]

I suppose if we had access to a public data set like this receipt bank, programmers could time themselves setting up a solution with off the shelf OCR algos. If they could clock in at under 10 hours they could advertise themselves as being "just as good as an LLM, but significantly cheaper." Downside for the managerial class that wants generative algos for the complete lack of legal protections.

by MarceliusK8 hours ago|

prev|

[-]

[dead]