I have a PDF containing scanned documents and screenshots, and I want to be able to search their contents. Is there software for Linux that can apply OCR to the images in a PDF and make them searchable in a PDF reader?

  • AgreeableLandscapeOPM
    4 years ago

    Does it support PDFs that mix text and images? I’ve found similar apps, but they only support image-only PDFs.

    • unperson
      4 years ago

      There are options for several cases:

        --remove-vectors      EXPERIMENTAL. Mask out any vector objects in the PDF
                              so that they will not be included in OCR. This can
                              eliminate false characters.
        -f, --force-ocr       Rasterize any text or vector objects on each page,
                              apply OCR, and save the rastered output (this rewrites
                              the PDF)
        -s, --skip-text       Skip OCR on any pages that already contain text, but
                              include the page in final output; useful for PDFs that
                              contain a mix of images, text pages, and/or previously
                              OCRed pages
        --redo-ocr            Attempt to detect and remove the hidden OCR layer from
                              files that were previously OCRed with OCRmyPDF or
                              another program. Apply OCR to text found in raster
                              images. Existing visible text objects will not be
                              changed. If there is no existing OCR, OCR will be
                              added.
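
For a PDF that mixes text pages and scanned pages, `--skip-text` is probably the one to try first. Here is a minimal sketch of calling it from Python, assuming the `ocrmypdf` command is installed and on your PATH. The helper names (`build_command`, `run_ocr`) and the file names are made up for illustration, and as far as I know only one of these three flags should be passed per run.

```python
import shlex
import subprocess

# Flags quoted from the help text above. The short mode names used as
# keys are hypothetical labels for this sketch, not part of OCRmyPDF.
MODES = {
    "skip": "--skip-text",   # leave pages that already contain text alone
    "redo": "--redo-ocr",    # replace a hidden OCR layer from an earlier run
    "force": "--force-ocr",  # rasterize every page and OCR it again
}


def build_command(src, dst, mode="skip"):
    """Return the ocrmypdf argument list for the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", MODES[mode], src, dst]


def run_ocr(src, dst, mode="skip"):
    """Run ocrmypdf and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(src, dst, mode), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_ocr() on a real file to execute it.
    print(shlex.join(build_command("input.pdf", "output.pdf")))
```

The plain shell equivalent is just `ocrmypdf --skip-text input.pdf output.pdf`. Write to a new output file rather than over the original until you have checked the result.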