I have a PDF containing some scanned documents and screenshots, and I want to be able to search their contents. Is there software for Linux that can apply OCR to the images in a PDF and make them searchable in a PDF reader?
ocrmypdf does everything, it even compresses your pdf for uploading if you add the -O2 or -O3 flag.
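For reference, a minimal sketch of a basic run. The file names `scan.pdf` and `scan_ocr.pdf` are placeholders I made up, and this assumes ocrmypdf is installed and on your PATH:

```shell
# Placeholder file names (assumptions); replace with your own files.
input="scan.pdf"
output="scan_ocr.pdf"

# Build the command as a string first so it can be checked before running.
cmd="ocrmypdf -O2 $input $output"
echo "$cmd"

# Only run it if ocrmypdf is available and the input file exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f "$input" ]; then
  ocrmypdf -O2 "$input" "$output"
fi
```

The output is a new PDF with a hidden text layer, so the original file is left untouched. As far as I know, the higher optimization levels can recompress images lossily, so compare the result against the original before deleting anything.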
If it’s a raw scan, clean it up with scantailor first. Also experiment with various --tesseract-pagesegmode settings for best results. I use 6 for most books, 3 if there are multiple columns, and 1 if there are multiple languages.
Does it support mixed text and image PDFs? I’ve found similar apps, but they only support fully image PDFs.
There are options for several cases:
--remove-vectors: EXPERIMENTAL. Mask out any vector objects in the PDF so that they will not be included in OCR. This can eliminate false characters.
-f, --force-ocr: Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF).
-s, --skip-text: Skip OCR on any pages that already contain text, but include the page in final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages.
--redo-ocr: Attempt to detect and remove the hidden OCR layer from files that were previously OCRed with OCRmyPDF or another program. Apply OCR to text found in raster images. Existing visible text objects will not be changed. If there is no existing OCR, OCR will be added.
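A rough sketch of how you might pick between the options quoted above. The `pick_flag` helper, its case names, and the file names `mixed.pdf` and `mixed_ocr.pdf` are all made up for illustration; only the flags themselves come from the ocrmypdf help text:

```shell
# Hypothetical helper: map a short description of the PDF to one of the flags above.
pick_flag() {
  case "$1" in
    mixed)     echo "--skip-text" ;;  # some pages already contain text; leave those alone
    redo)      echo "--redo-ocr" ;;   # replace an existing hidden OCR layer
    rasterize) echo "--force-ocr" ;;  # rasterize every page and OCR it (rewrites the PDF)
    *)         echo "unknown case: $1" >&2; return 1 ;;
  esac
}

# Placeholder file names (assumptions); replace with your own files.
flag=$(pick_flag mixed)
echo "ocrmypdf $flag mixed.pdf mixed_ocr.pdf"

# Only run it if ocrmypdf is available and the input file exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f mixed.pdf ]; then
  ocrmypdf "$flag" mixed.pdf mixed_ocr.pdf
fi
```

For a PDF that mixes real text pages with scanned pages, --skip-text is probably the one to try first, since it leaves pages that already have text alone.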
Thanks!