I have a PDF containing scanned documents and screenshots, and I want to be able to search their contents. Is there software for Linux that can apply OCR to the images in a PDF and make them searchable in a PDF reader?

  • unperson
    4 years ago

    OCRmyPDF has options for several cases:

      --remove-vectors      EXPERIMENTAL. Mask out any vector objects in the PDF
                            so that they will not be included in OCR. This can
                            eliminate false characters.
      -f, --force-ocr       Rasterize any text or vector objects on each page,
                            apply OCR, and save the rastered output (this rewrites
                            the PDF)
      -s, --skip-text       Skip OCR on any pages that already contain text, but
                            include the page in final output; useful for PDFs that
                            contain a mix of images, text pages, and/or previously
                            OCRed pages
      --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                            files that were previously OCRed with OCRmyPDF or
                            another program. Apply OCR to text found in raster
                            images. Existing visible text objects will not be
                            changed. If there is no existing OCR, OCR will be
                            added.
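
As a rough illustration of how the options above map to different kinds of PDFs, here is a minimal Python sketch that builds an `ocrmypdf` command line. It assumes the `ocrmypdf` executable is on your PATH and takes the input and output files as its last two arguments; the mode names and the helper functions `build_command` and `run_ocr` are made up for this example and are not part of OCRmyPDF.

```python
import shutil
import subprocess

# Flags copied from the help text quoted above.
# The dictionary keys are hypothetical labels chosen for this sketch.
MODE_FLAGS = {
    "default": [],                  # plain run, no special handling
    "skip-text": ["--skip-text"],   # mixed PDF: leave pages that already have text alone
    "redo-ocr": ["--redo-ocr"],     # replace an older hidden OCR layer
    "force-ocr": ["--force-ocr"],   # rasterize every page and OCR it again
}


def build_command(input_pdf, output_pdf, mode="skip-text"):
    """Return the argument list for one ocrmypdf run (hypothetical helper)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", *MODE_FLAGS[mode], input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf, mode="skip-text"):
    """Run ocrmypdf if it is installed; raise if it is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_command(input_pdf, output_pdf, mode), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr(...) to actually process a file.
    print(" ".join(build_command("scans.pdf", "scans-searchable.pdf")))
```

For a PDF that mixes scanned pages with pages that already contain text, as in the question, `--skip-text` is probably the gentlest choice, since it leaves the existing text pages untouched.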