PDF Editor for repairing book scan OCR?

zabadoh · 1 year ago

sibloure@beehaw.org · 1 year ago

I have had good results with Tesseract. I had to export the PDF to individual JPEGs, batch OCR them with Tesseract, then merge the individual pages back into a single PDF. If you don’t want to use the command line and are okay with it not being open source, PDF24.org does a good job and is free.
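For anyone who wants to script those three steps, here is a rough sketch rather than the exact commands used above. It assumes Poppler's `pdftoppm` and `pdfunite` plus the `tesseract` CLI are installed and on the PATH; the `ocr_work` directory, the `page` prefix, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf, prefix, dpi=300):
    # pdftoppm (Poppler) writes one JPEG per page: <prefix>-<page number>.jpg
    return ["pdftoppm", "-jpeg", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    # "tesseract IMAGE OUTPUTBASE -l LANG pdf" writes OUTPUTBASE.pdf,
    # i.e. the page image with a searchable text layer
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang, "pdf"]


def merge_cmd(pages, output):
    # pdfunite (Poppler) concatenates the PDFs in the order given
    return ["pdfunite", *map(str, pages), str(output)]


def page_number(path):
    # "page-012.jpg" -> 12, so pages sort numerically rather than as strings
    return int(Path(path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, output, workdir="ocr_work", lang="eng"):
    # workdir and the "page" prefix are hypothetical names for this sketch
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    subprocess.run(render_cmd(pdf, work / "page"), check=True)
    images = sorted(work.glob("page-*.jpg"), key=page_number)
    for image in images:
        subprocess.run(ocr_cmd(image, lang), check=True)
    page_pdfs = [image.with_suffix(".pdf") for image in images]
    subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

Calling `ocr_pdf("scan.pdf", "scan_ocr.pdf")` would run the whole pipeline. The 300 dpi default is only a common starting point for book scans, not a recommendation from this thread.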

ChickenBoo@lemmy.jnks.xyz · 1 year ago

If you want to host it locally, Stirling PDF can be run in Docker, and it uses a library that uses Tesseract. It has a bunch of other handy PDF operations, too. I keep it around for the two times a year I need to merge, split, or decrypt PDFs.

https://github.com/Frooodle/Stirling-PDF/blob/main/HowToUseOCR.md

It can OCR straight from a PDF and handle multiple files at a time.

sibloure@beehaw.org · 1 year ago

This is amazing. Did not realize it existed. Thank you for sharing.

wvstolzing · 1 year ago

Another vote for Tesseract. Just to clarify the terminology, though: PDF is a fragile format best treated as read-only, so you really don’t want to edit a PDF; you want to make a new one from the same (or cleaned-up) bitmaps plus a new OCR text layer.

Now, Tesseract is excellent at recognizing glyphs, but its layout detection falters, especially if the scanned image is a little fuzzy. When it falters, you get redundant line breaks and chunks of text in the wrong order, all of which gets incredibly annoying for searching and copying. So if the text requires it and you can spare the time, you may need to mark regions (mainly paragraphs and titles) on the bitmap image manually. A few frontends to Tesseract help with a task like that; check out, e.g., https://github.com/manisandro/gImageReader. Inside single-paragraph blocks of text, Tesseract doesn’t get confused as easily, and the text output comes out in the correct reading order, without redundant breaks.
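To make the region idea concrete, here is a minimal sketch of doing the same thing by hand. It is not how gImageReader works internally. It assumes you already have pixel boxes for each paragraph or title, that the page is single-column (so top-to-bottom order is the reading order), and that Pillow and the `tesseract` CLI are installed. The function names and the `region-N.png` file names are made up for illustration. It returns plain text; it does not build a PDF text layer.

```python
import subprocess


def reading_order(regions):
    # Regions are (left, top, right, bottom) pixel boxes.
    # Assumption: single-column page, so sort top-to-bottom, then left-to-right.
    return sorted(regions, key=lambda box: (box[1], box[0]))


def block_ocr_cmd(image, lang="eng"):
    # --psm 6 makes Tesseract assume a single uniform block of text,
    # which skips the automatic layout analysis that tends to go wrong
    return ["tesseract", str(image), "stdout", "-l", lang, "--psm", "6"]


def ocr_regions(page_image, regions, lang="eng"):
    from PIL import Image  # Pillow (third-party), only needed for cropping

    texts = []
    with Image.open(page_image) as page:
        for n, box in enumerate(reading_order(regions)):
            crop_path = f"region-{n}.png"  # hypothetical scratch file name
            page.crop(box).save(crop_path)
            result = subprocess.run(
                block_ocr_cmd(crop_path, lang),
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout.strip())
    # One blank line between regions, none inside them beyond what Tesseract emits
    return "\n\n".join(texts)
```

Something like `ocr_regions("page-01.jpg", [(50, 100, 900, 180), (50, 220, 900, 700)])` would OCR a title box and a paragraph box in that order. Multi-column pages would need a smarter ordering than the simple sort here.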