I have a PDF containing scanned documents and screenshots, and I want to be able to search their contents. Is there software for Linux that can apply OCR to the images in a PDF and make them searchable in a PDF reader?

@unperson

ocrmypdf does everything; it even compresses your PDF for uploading if you add the -O2 or -O3 flag.

If it’s a raw scan, clean it up with scantailor first. Also experiment with various --tesseract-pagesegmode settings for best results. I use 6 for most books, 3 if there are multiple columns, 1 if there are multiple languages.
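
For anyone who wants to script this, here is a minimal sketch of how the invocation described above could be assembled from Python. The helper names `build_ocr_command` and `run_ocr` are made up for illustration, and the sketch assumes ocrmypdf is installed and on your PATH and takes the input and output files as its last two arguments. Only the -O2/-O3 and --tesseract-pagesegmode options mentioned above are used.

```python
import subprocess


def build_ocr_command(src, dst, optimize=2, pagesegmode=6):
    """Return the ocrmypdf argument list (hypothetical helper; runs nothing)."""
    if optimize not in (2, 3):
        # Only the two optimization levels mentioned in the post are covered.
        raise ValueError("this sketch only covers -O2 and -O3")
    return [
        "ocrmypdf",
        f"-O{optimize}",                            # compress the output PDF
        "--tesseract-pagesegmode", str(pagesegmode),  # e.g. 6, 3 or 1
        src,
        dst,
    ]


def run_ocr(src, dst, **options):
    """Run ocrmypdf; assumes the ocrmypdf executable is on PATH."""
    subprocess.run(build_ocr_command(src, dst, **options), check=True)
```

Calling `run_ocr("scan.pdf", "scan-ocr.pdf", pagesegmode=3)` would then try the multi-column setting on a file called scan.pdf (the file names are placeholders).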

@AgreeableLandscape

Does it support mixed text and image PDFs? I’ve found similar apps but they only support fully image PDFs.

@unperson

There are options for several cases:

  --remove-vectors      EXPERIMENTAL. Mask out any vector objects in the PDF
                        so that they will not be included in OCR. This can
                        eliminate false characters.
  -f, --force-ocr       Rasterize any text or vector objects on each page,
                        apply OCR, and save the rastered output (this rewrites
                        the PDF)
  -s, --skip-text       Skip OCR on any pages that already contain text, but
                        include the page in final output; useful for PDFs that
                        contain a mix of images, text pages, and/or previously
                        OCRed pages
  --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                        files that were previously OCRed with OCRmyPDF or
                        another program. Apply OCR to text found in raster
                        images. Existing visible text objects will not be
                        changed. If there is no existing OCR, OCR will be
                        added.
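
As a rough sketch of how these options could be picked for a mixed text-and-image PDF, the snippet below builds a command line with exactly one of the three mode flags. The helper name `build_mixed_pdf_command` and the `MODE_FLAGS` table are invented for illustration. It assumes, as I understand the tool, that --skip-text, --force-ocr and --redo-ocr are meant to be used one at a time, and that --remove-vectors (marked EXPERIMENTAL above) may not be available in every version.

```python
# Mode names used by this sketch mapped to the flags from the help text above.
MODE_FLAGS = {
    "skip-text": "--skip-text",   # leave pages that already contain text alone
    "force-ocr": "--force-ocr",   # rasterize everything and OCR it again
    "redo-ocr": "--redo-ocr",     # replace an existing hidden OCR layer
}


def build_mixed_pdf_command(src, dst, mode="skip-text", remove_vectors=False):
    """Return an ocrmypdf argument list using a single mode flag (hypothetical helper)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    command = ["ocrmypdf", MODE_FLAGS[mode]]
    if remove_vectors:
        # Experimental according to the help text; check your installed version.
        command.append("--remove-vectors")
    command += [src, dst]
    return command
```

For a PDF that mixes real text pages with scans, `build_mixed_pdf_command("mixed.pdf", "mixed-ocr.pdf")` gives the --skip-text variant, which can be passed to `subprocess.run(..., check=True)`.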

@AgreeableLandscape

Thanks!
