
by fithisux 9 hours ago

by kube-system 9 hours ago

Since when is tamper resistance a part of PDF or any common image format?

by pwagland 8 hours ago

PDF files can be signed; that is tamper resistance. Tamper resistance doesn't have to make any difference to the readability of the document.

by kube-system 8 hours ago

So can any type of file -- that doesn't have any relevance to the supposed design of every file type in existence. Now, later versions of PDF do have explicit support for signatures, but what does this have to do with preventing OCR? OCR reads a file, it doesn't change the original file.

by ranger_danger 8 hours ago

Some OCR solutions do change the original file, like OCRmyPDF. They take layers that were just images before and replace them with text layers so that you can search the document.

by kube-system 8 hours ago

That isn't OCR, but an application of the resulting output of OCR. Again, a signature on a PDF or any type of file doesn't prevent you from reading it. (It also doesn't technically prevent you from changing it, it just enables the detection of changes to a particular file.)

There's nothing about PDFs or image formats that prevents anyone from doing OCR. The reason construction documents are difficult to OCR is that OCR models are not well trained for them, and they're very technical documents where small details are significant. It doesn't have anything to do with the file format.
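
To make the "detection, not prevention" point concrete, here is a rough Python sketch. It uses a bare SHA-256 digest as a stand-in for a signature, which is a simplification: real PDF signatures are key-based signatures over byte ranges of the file, not a plain hash. The sample bytes and the helper names (`digest`, `is_unmodified`) are made up for illustration.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """SHA-256 of the raw bytes; a stand-in for a real signature value."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, expected: str) -> bool:
    """True if the bytes still match the digest recorded earlier."""
    return hmac.compare_digest(digest(data), expected)


# Hypothetical document bytes, not a real PDF.
original = b"(Approved budget: 100) Tj"
recorded = digest(original)

# Nothing stops anyone from editing the bytes...
tampered = original.replace(b"100", b"900")

# ...or from reading them afterwards; the check only reports the change.
print(is_unmodified(original, recorded))
print(is_unmodified(tampered, recorded))
print(tampered.decode("ascii"))
```

The edited bytes stay perfectly readable; the only thing that changes is that the check stops passing.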

by fithisux 8 hours ago

True, but you can make modified copies if you reverse engineer it with OCR.

by jimjimjim 3 hours ago

That's not really what I would call reverse engineering. If you read a PDF and type it into Word, is that reverse engineering? Either way, whatever you get is in no way going to convince anybody that it is the original.

by ranger_danger 8 hours ago

Can't one just remove the signature and re-sign it with anything else after tampering? Who verifies PDFs that hard?

by kube-system 8 hours ago

If you're performing OCR, you're, almost by definition, disregarding the source file. The whole point of OCR is to be transformative.

by fithisux 8 hours ago

You can't change a PDF; by design it is not easy to OCR.

by kube-system 7 hours ago

PDFs are merely a collection of objects that can be read plainly from the file. Some of those are plain text that doesn't even need to be OCR'd; it can simply be extracted. It is also possible to embed image objects in PDFs (this is common for scanned files), which might be what you are thinking of. But that is not a design feature of PDF; it is just the output format of a scanner: an image. Editing a PDF is a matter of editing a file, which you can do as you would with any other.
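
As a rough illustration of reading objects and text straight out of the bytes, here is a Python sketch. The `SAMPLE` fragment is hand-written and hypothetical (it is not a complete PDF), and the two helper functions are my own names. It assumes uncompressed streams and simple literal strings; real files often compress streams and use escapes or hex strings, so treat this as a toy, not a parser.

```python
import re

# Hypothetical, hand-written fragment of an uncompressed PDF (not a full file).
SAMPLE = b"""4 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
5 0 obj
<< /Length 47 >>
stream
BT /myfont 50 Tf 100 200 Td (Hello World) Tj ET
endstream
endobj
"""


def list_objects(data: bytes) -> list[int]:
    """Return the object numbers from every 'N G obj' header found."""
    return [int(m.group(1)) for m in re.finditer(rb"(\d+) (\d+) obj\b", data)]


def shown_strings(data: bytes) -> list[str]:
    """Return simple literal strings passed to the Tj (show text) operator."""
    pattern = rb"\(([^()\\]*)\)\s*Tj"  # skips strings with escapes or nesting
    return [m.group(1).decode("latin-1") for m in re.finditer(pattern, data)]


print(list_objects(SAMPLE))
print(shown_strings(SAMPLE))
```

No OCR is involved anywhere: the text is sitting in the file as bytes, next to the operator that draws it.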

by jimjimjim 3 hours ago

It is not by design! PDFs that are made from scanned documents or collections of images would require OCRing, but that is true of any format that the scans/images are put into. These days the vast majority of PDFs do not need to be OCRed, as the pages are just made up of text, line drawings and images. And although it can get tricky, you can edit those text, line and image commands as much as you want.

For example: add this to the content stream of a PDF page and it'll put Hello World on the page

  BT
    /myfont 50 Tf
    100 200 Td
    (Hello World) Tj
  ET

(Note: a bit more is required to select the font, etc. For instance, /myfont has to be declared in the page's font resources.)
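
For anyone who wants to see the "bit more" spelled out, here is a minimal Python sketch that wraps those operators in a one-page PDF. The function name `build_pdf`, the object layout, the Letter-sized MediaBox and the choice of Helvetica for /myfont are all my assumptions; it also assumes the text contains no parentheses or backslashes, which would need escaping.

```python
def build_pdf(text: str = "Hello World") -> bytes:
    # Page content stream: the operators from the example above.
    content = (
        "BT\n"
        "  /myfont 50 Tf\n"
        "  100 200 Td\n"
        f"  ({text}) Tj\n"
        "ET"
    ).encode("ascii")

    # Objects 1..5: catalog, page tree, page, font, content stream.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /myfont 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result out with something like `open("hello.pdf", "wb").write(build_pdf())` should give a file a viewer can open, and changing the string passed to `build_pdf` changes what appears on the page.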