What's a good Linux OCR application that supports PDFs?

AgreeableLandscape · 4 years ago

What's a good Linux OCR application that supports PDFs?

unperson · 4 years ago

ocrmypdf does everything. It even compresses your PDF for uploading if you add the -O2 or -O3 flag.

If it’s a raw scan, clean it up with scantailor first. Also experiment with various --tesseract-pagesegmode settings for best results. I use 6 for most books, 3 if there are multiple columns, and 1 if there are multiple languages.
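
If you want to script this for a pile of files, here is a rough sketch of how the flags above fit together. It only builds the argument list; the helper names (build_command, run) and the layout labels are made up for illustration, and the mode numbers are just the ones I mentioned, so adjust to taste.

```python
import subprocess

# Page segmentation modes from the comment above, keyed by document type.
# The labels are hypothetical shorthand, not something ocrmypdf defines.
PSM_BY_LAYOUT = {
    "book": 6,
    "columns": 3,
    "multilingual": 1,
}


def build_command(src, dst, layout="book", optimize=2):
    """Return the ocrmypdf argument list for one file (nothing is executed)."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be between 0 and 3")
    return [
        "ocrmypdf",
        f"-O{optimize}",
        "--tesseract-pagesegmode", str(PSM_BY_LAYOUT[layout]),
        src,
        dst,
    ]


def run(src, dst, **kwargs):
    """Run ocrmypdf; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(src, dst, **kwargs), check=True)
```

Calling run("scan.pdf", "out.pdf", layout="columns", optimize=3) would then be roughly the same as typing ocrmypdf -O3 --tesseract-pagesegmode 3 scan.pdf out.pdf, assuming ocrmypdf is installed and on your PATH.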

AgreeableLandscape · 4 years ago

Does it support mixed text and image PDFs? I’ve found similar apps but they only support fully image PDFs.

unperson · 4 years ago

There are options for several cases:

  --remove-vectors      EXPERIMENTAL. Mask out any vector objects in the PDF
                        so that they will not be included in OCR. This can
                        eliminate false characters.
  -f, --force-ocr       Rasterize any text or vector objects on each page,
                        apply OCR, and save the rastered output (this rewrites
                        the PDF)
  -s, --skip-text       Skip OCR on any pages that already contain text, but
                        include the page in final output; useful for PDFs that
                        contain a mix of images, text pages, and/or previously
                        OCRed pages
  --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                        files that were previously OCRed with OCRmyPDF or
                        another program. Apply OCR to text found in raster
                        images. Existing visible text objects will not be
                        changed. If there is no existing OCR, OCR will be
                        added.
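
For a mixed PDF, picking between those options could be sketched like this. The function name and its parameters are hypothetical, and the priority order is only my reading of the help text above, not anything ocrmypdf decides for you. As far as I know only one of the three flags can be given per run, so the sketch returns at most one.

```python
def ocr_args(src, dst, some_pages_have_text=False,
             has_old_ocr_layer=False, rasterize_all=False):
    """Build an ocrmypdf argument list with at most one OCR mode flag."""
    if rasterize_all:
        # Redo everything from pixels; this rewrites the PDF.
        mode = ["--force-ocr"]
    elif has_old_ocr_layer:
        # Replace a hidden OCR layer left by an earlier run.
        mode = ["--redo-ocr"]
    elif some_pages_have_text:
        # Leave pages that already contain text alone, OCR the rest.
        mode = ["--skip-text"]
    else:
        # Pure image scan: no mode flag needed.
        mode = []
    return ["ocrmypdf", *mode, src, dst]
```

For the mixed text and image case you asked about, ocr_args("in.pdf", "out.pdf", some_pages_have_text=True) gives the --skip-text variant.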

AgreeableLandscape · 4 years ago

Thanks!