From a security standpoint, PDF is literally one of the worst formats I deal with daily. Very useful and predictable for the end user, yes, but very dangerous because of the capabilities it allows.
Dangerzone works like this: You give it a document that you don’t know if you can trust (for example, an email attachment). Inside of a sandbox, Dangerzone converts the document to a PDF (if it isn’t already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, in a separate sandbox, Dangerzone takes this pixel data and converts it back into a PDF.
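To make the "raw pixel data" step concrete, here is a minimal Python sketch of what the second sandbox could accept: a width, a height, and a flat buffer of RGB bytes, nothing else. This is not Dangerzone's actual code; the names `PixelPage`, `parse_pixel_page`, and `raw_page_bytes` are hypothetical, and the row-major, 3-bytes-per-pixel layout is an assumption based on the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelPage:
    # Hypothetical container: one page as a flat RGB buffer with no header.
    width: int
    height: int
    rgb: bytes  # assumed layout: width * height * 3 bytes, row-major


def parse_pixel_page(width: int, height: int, rgb: bytes) -> PixelPage:
    """Accept a page only if it is exactly width * height RGB triples."""
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    expected = width * height * 3
    if len(rgb) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(rgb)}")
    return PixelPage(width, height, rgb)


def raw_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size in bytes of one page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3


# A 2x1 page needs exactly 6 bytes; anything else is rejected.
page = parse_pixel_page(2, 1, bytes([255, 0, 0, 0, 0, 255]))

# US Letter (8.5 x 11 in) at an assumed 150 DPI.
letter_bytes = raw_page_bytes(8.5, 11, 150)
```

The point of the design is that a buffer like this has no room for scripts, fonts, or embedded files: either the byte count matches the dimensions or the page is rejected. It also hints at the file size cost. Under the assumed 150 DPI, one Letter page is 6,311,250 bytes before any image compression is applied in the rebuilt PDF.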
So it basically rasterizes it? I wonder how that affects file size.
Oh, I think you already know.
No mention of OCR? Copy-pasting links or data will be a joy…
There is an optional OCR pass, from what I understand.
Yeah, it definitely increases the size and removes some functionality that others may rely on. But for presenting content, which is what a PDF SHOULD BE for, it has typically worked fine. I’ve been using pandoc and some homegrown scripts to do this sort of thing for a while.