I have a PDF that contains some scanned documents and screenshots, and I want to be able to search their contents. Is there software for Linux that can apply OCR to the images in a PDF and make them searchable in a PDF reader?

  • unperson
    4 years ago

    ocrmypdf does everything; it even compresses your PDF for uploading if you add the -O2 or -O3 flag.

    If it’s a raw scan, clean it up with scantailor first. Also experiment with the various --tesseract-pagesegmode settings for best results. I use 6 for most books, 3 if there are multiple columns, and 1 if there are multiple languages.
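To make the flags above concrete, here is a minimal sketch in Python that assembles such a command line. The helper name `build_ocrmypdf_command` and the file names are made up for illustration; only `-O<level>` and `--tesseract-pagesegmode` come from the comment above. The command is actually run only when ocrmypdf is installed and the input file exists.

```python
import os
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, optimize=2, pagesegmode=6):
    """Return an ocrmypdf argument list (hypothetical helper, not part of ocrmypdf)."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize level must be between 0 and 3")
    return [
        "ocrmypdf",
        f"-O{optimize}",                              # compression level, e.g. -O2
        "--tesseract-pagesegmode", str(pagesegmode),  # e.g. 6 for most books
        src,
        dst,
    ]


if __name__ == "__main__":
    # "scan.pdf" and "scan-ocr.pdf" are placeholder file names.
    cmd = build_ocrmypdf_command("scan.pdf", "scan-ocr.pdf")
    print(" ".join(cmd))
    # Only run when the tool and the input file are both present.
    if shutil.which("ocrmypdf") and os.path.exists("scan.pdf"):
        subprocess.run(cmd, check=True)
```

Swapping `pagesegmode` to 3 or 1 would cover the multi-column and multi-language cases mentioned above.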

    • AgreeableLandscapeOPM
      4 years ago

      Does it support mixed text and image PDFs? I’ve found similar apps but they only support fully image PDFs.

      • unperson
        4 years ago

        There are options for several cases:

          --remove-vectors      EXPERIMENTAL. Mask out any vector objects in the PDF
                                so that they will not be included in OCR. This can
                                eliminate false characters.
          -f, --force-ocr       Rasterize any text or vector objects on each page,
                                apply OCR, and save the rastered output (this rewrites
                                the PDF)
          -s, --skip-text       Skip OCR on any pages that already contain text, but
                                include the page in final output; useful for PDFs that
                                contain a mix of images, text pages, and/or previously
                                OCRed pages
          --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                                files that were previously OCRed with OCRmyPDF or
                                another program. Apply OCR to text found in raster
                                images. Existing visible text objects will not be
                                changed. If there is no existing OCR, OCR will be
                                added.
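For a mixed text-and-image PDF, the choice comes down to picking one of the three OCR flags quoted above. A rough sketch, assuming (as far as I know) that ocrmypdf accepts only one of them per run; the function name, the short mode keys, and the file names are hypothetical:

```python
# Map a short, made-up mode name to the real ocrmypdf flag from the help text.
MODE_FLAGS = {
    "skip": "--skip-text",   # leave pages that already contain text untouched
    "force": "--force-ocr",  # rasterize everything and OCR it again
    "redo": "--redo-ocr",    # replace an old hidden OCR layer, keep visible text
}


def ocr_command_for_mixed_pdf(src, dst, mode="skip"):
    """Build an ocrmypdf argument list with exactly one OCR mode flag."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_FLAGS)}")
    return ["ocrmypdf", MODE_FLAGS[mode], src, dst]


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run() to execute it.
    print(" ".join(ocr_command_for_mixed_pdf("mixed.pdf", "mixed-ocr.pdf")))
```

Going by the help text, `--skip-text` looks like the natural default for a PDF that mixes real text pages with scans, since text pages are passed through unchanged.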