I have been trying for hours to figure out web scraping. From following a build-your-own tutorial to hunting for prebuilt scrapers, I can’t seem to make it click.
For context, I am trying to scrape books that I can’t find elsewhere so I can use them and post them for others.
The scraper tutorial
Hackernoon tutorial by Ethan Jarrell
I initially tried to follow this but I kept getting a “couldn’t find module” error. Since I had never touched Python before this, I don’t know how to fix it, and the help links are not exactly helpful. If there’s someone who could guide me through this tutorial, that would be great.
Selenium
I don’t really get what this is, but I think it’s some sort of Python package. It tells me to download it using the pip command, but that doesn’t seem to work (syntax error). I don’t know how to add it manually because, again, I have little idea of what I’m doing.
Scrapy
This one seemed like it’d be an out-of-the-box deal, but not only does it need the pip command to download, it also has like 5 other dependencies it needs to function, which complicates it even more for me.
I am not criticizing these tools, I am just asking for help. If someone could help simplify it all or maybe even point me to an easier method, that would be amazing!
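One lower-dependency route, sketched here as an illustration and not as the method from the tutorial above: Python's standard library alone can download a page and pull out its links, so nothing has to be installed with pip. The names `LinkCollector`, `extract_links`, and `fetch`, the sample HTML, and the User-Agent string are all made up for this example, and it only works on static pages whose content is already in the HTML.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links as (text, href) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page as text. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical sample markup standing in for a downloaded page.
    sample = '<p><a href="/ch1">Chapter 1</a> <a href="/ch2">Chapter 2</a></p>'
    print(extract_links(sample))
```

To try it on a real page, something like `extract_links(fetch("https://example.com"))` should work, with the URL swapped for the page of interest.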
Updates
- Figured out that I am supposed to run the pip command in the Command Prompt on my computer, not in the Python interpreter: py -m followed by the pip request (for example, py -m pip install selenium).
- Got the Ethan Jarrell tutorial to work and managed to add in Selenium, which made me realize that Selenium isn’t really helpful for this project. rip xP
- Spent a bunch of time trying to rework the basic scraper to handle dynamic sites, without success.
- Online self-help doesn’t go as in-depth as I would like, probably due to the legal grey area.
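For anyone hitting the same “couldn’t find module” error, here is a rough sketch of a check that can be saved as a .py file and run. It lists which modules Python cannot see and prints the matching pip command to type into the Command Prompt. The package list is only a guess at what a scraping tutorial might need, and `missing_packages` is a made-up helper name.

```python
import importlib.util
import sys


def missing_packages(packages):
    """Return {import_name: pip_name} for every module Python cannot find.

    `packages` maps the name used in `import ...` to the name used with pip,
    because the two sometimes differ (bs4 is installed as beautifulsoup4).
    """
    return {
        module: pip_name
        for module, pip_name in packages.items()
        if importlib.util.find_spec(module) is None
    }


if __name__ == "__main__":
    # Assumed package list for illustration; adjust to what your script imports.
    wanted = {"requests": "requests", "bs4": "beautifulsoup4", "selenium": "selenium"}
    print("Python in use:", sys.executable)
    for module, pip_name in missing_packages(wanted).items():
        # This command goes in the Command Prompt, not inside Python.
        print(f"{module} is missing. In the Command Prompt run: py -m pip install {pip_name}")
```

Printing `sys.executable` helps when more than one Python is installed, since a package installed for one copy is invisible to the others.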
We use Node.js with Puppeteer for some of our web crawling at work. It’s pretty straightforward once you have a basic script to launch it. If you haven’t already, I’d highly suggest installing VS Code. You install Node.js, then use npm (Node Package Manager) to install Puppeteer and whatever other dependencies you might have. Someone out there probably has a basic JS file that will open Chrome, or just ask an LLM (I just use ChatGPT, they’re all the same shit). From there you just need to navigate to your pages, then use a querySelector and .click() to click on your elements. It’s all JavaScript from there.
Pro tip: write your querySelectors in your browser using the inspect element/console tab, then put them in your JS file. Nothing is worse than being 10 minutes into a crawl and finding out you’ve got a broken querySelector.
I don’t like to touch JS, so I’ve been going Python only (besides basic HTML & CSS), but I found Puppeteer and didn’t really get it.
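For staying in Python, the workflow described above (open a browser, navigate to a page, pick an element with a CSS selector, click it) can be roughly sketched with Selenium instead of Puppeteer. This is a swapped-in technique, not what the reply uses at work. It assumes Selenium 4 is installed and Chrome is available. The function name `click_and_read` and its parameters are made up for illustration.

```python
def click_and_read(url, click_selector, read_selector, timeout=10):
    """Open `url`, click one element, then return the text of the matching results.

    Both selectors are CSS selectors, the same kind you would test with
    document.querySelector in the browser console.
    """
    # Imported inside the function so this file still loads before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome is installed on this machine
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)

        # Wait until the element exists and can be clicked, then click it.
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, click_selector))
        )
        button.click()

        # Dynamic pages fill in content after the click, so wait for it to appear.
        items = wait.until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, read_selector))
        )
        return [item.text for item in items]
    finally:
        driver.quit()  # always close the browser, even after an error
```

A call might look like `click_and_read("https://example.com", "button.load-more", "div.item")`, where the URL and both selectors are placeholders. The explicit waits are the part that matters for dynamic sites: they pause until the content actually exists instead of reading the page too early.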