• rcbrk
    2 years ago

    If a raster image version would be ok, one avenue to look into is opening up the network tab in developer tools in Firefox, and browsing through all the pages. You may need to zoom in to ensure high enough resolution images are loaded.

    At the end of this, click the cog button and “save all as HAR file”.

    Then you’ll need to write some code to extract all the page images from the HAR file, put them in the correct order, and feed them into a PDF generator tool.

    I got as far as manually inspecting the HAR file with some tool and noting that page images are indeed there, but my need for the textbooks lapsed so I did no further work.
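
A minimal sketch of the extraction step described above, assuming the HAR file follows the usual HAR 1.2 layout (`log.entries[].response.content` with `mimeType`, `text`, and an optional `encoding` of `base64`). The helper names and the ordering rule (sort by the last number found in each image URL) are assumptions for illustration; the real site may need a different key. The PDF step is left out, since it needs a third-party tool.

```python
import base64
import json
import re


def extract_images(har):
    """Return (url, bytes) for every image response stored in a parsed HAR dict."""
    images = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        mime = content.get("mimeType") or ""
        text = content.get("text")
        if not mime.startswith("image/") or not text:
            continue  # not an image, or the body was not saved in the HAR
        if content.get("encoding") == "base64":
            data = base64.b64decode(text)
        else:
            data = text.encode("utf-8")
        images.append((entry["request"]["url"], data))
    return images


def page_number(url):
    """Assumption: the last run of digits in the URL is the page number."""
    numbers = re.findall(r"\d+", url)
    return int(numbers[-1]) if numbers else -1


def ordered_pages(har):
    """Images sorted by the guessed page number."""
    return sorted(extract_images(har), key=lambda item: page_number(item[0]))


def load_har(path):
    """Read a HAR file from disk (it is plain JSON)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Usage would be something like `ordered_pages(load_har("book.har"))`, then writing each item's bytes to a numbered file before handing the files to whatever PDF tool you prefer.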

  • commet-alt-w@lemmygrad.ml
    2 years ago

    my first question would be what book is it? because someone might have already published a pirated copy of it somewhere. it might just be difficult to find.

    i saw someone mention making a wget script to download each page, which is where i would start too.
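
A rough python equivalent of the "download each page" script idea, assuming the pages are reachable at a predictable numbered URL. The URL template, the `{page}` placeholder, and the function names are hypothetical; nothing here is known about the actual site, and any login cookies or headers would have to be passed in by hand.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def page_urls(template, first, last):
    """Build one URL per page from a template containing a {page} placeholder."""
    return [template.format(page=n) for n in range(first, last + 1)]


def fetch_bytes(url, headers=None):
    """Plain HTTP GET; headers (e.g. a session cookie) are optional."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_bytes):
    """Save each URL as a zero-padded numbered file so sorting keeps page order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        suffix = Path(urlparse(url).path).suffix or ".bin"
        target = out / f"page_{index:04d}{suffix}"
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

The `fetch` parameter is only there so the download function can be swapped out, for example for one that adds authentication or a delay between requests.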

    the other thing worth trying is wkhtmltopdf. it’s a standalone command line tool (not written in python itself), but there are python wrappers for it on pypi, e.g. pdfkit, that can be installed with pip once the wkhtmltopdf binary is installed. not sure what it would take to crawl and paginate the whole book without experimenting with the website tho. if there’s authentication involved and api-layer security on their system, scripting the download may become difficult/cumbersome

    crawling and paginating the webpages can be done with python too, using scrapy, or python’s own urllib and html.parser modules. scrapy on its own just makes http requests, though. if you want to drive a real browser, so you can log in on the webpage without worrying about headers in authentication requests through curl or wget, or about some crud/rest library for interacting with an api, that’s what something like selenium or playwright is for
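
A small sketch of the standard-library route mentioned above: collecting the links on a page with `html.parser` so a crawler can follow them to the next page. The class and function names are made up for illustration, and it assumes the book's pages are linked with ordinary `<a href>` tags, which may not hold for a heavily scripted reader.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The HTML itself would come from `urllib.request.urlopen(...)` or from a browser session, depending on how the site handles login.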

    it’s one of those problems where there’s no easy general solution if the system the website op is using doesn’t provide it as a feature itself and heavily drm’s/restricts access