• alex [they/them]
    11 years ago

    From the GitHub page:

    The pages crawled are determined by a central server at api.crawler.mwmbl.org. They are restricted to a curated set of domains (currently determined by analysing Hacker News votes) and pages linked from those domains.

    The URLs to crawl are returned in batches from the central server. The browser extension then crawls each URL in turn. We currently use a single thread, as we want to use as little of our supporters' CPU and bandwidth as possible.
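
    For the curious, here is a rough Python sketch of that batch loop. The server hostname is the one named above, but the endpoint path and response shape are guesses for illustration, not the extension's actual code:

    ```python
    import requests

    CRAWLER_API = "https://api.crawler.mwmbl.org"  # central server named above
    BATCH_ENDPOINT = CRAWLER_API + "/batches/new"  # hypothetical path, illustration only


    def fetch_batch() -> list[str]:
        """Ask the central server for the next batch of URLs to crawl."""
        response = requests.post(BATCH_ENDPOINT, timeout=30)
        response.raise_for_status()
        return response.json()["urls"]  # assumed response shape


    def crawl_batch(crawl_one) -> list[dict]:
        """Crawl each URL in turn; single-threaded to keep CPU and bandwidth use low."""
        return [crawl_one(url) for url in fetch_batch()]
    ```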

    For each URL, it first checks whether downloading is allowed by the site's robots.txt. If it is, it then downloads the URL and attempts to extract the title and the beginning of the body text. An attempt is made to exclude boilerplate, but this is not 100% effective. The results are batched up, and the completed batch is then sent to the central server.
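
    Those per-URL steps look roughly like the following. This uses urllib.robotparser and BeautifulSoup as stand-ins for whatever the extension actually uses, and the extract length is an illustrative guess:

    ```python
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    import requests
    from bs4 import BeautifulSoup

    EXTRACT_CHARS = 1024  # illustrative limit on how much body text to keep


    def allowed_by_robots(url: str) -> bool:
        """Check robots.txt before fetching anything."""
        parsed = urlparse(url)
        parser = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        try:
            parser.read()
        except OSError:
            return False  # treat an unreachable robots.txt as a refusal
        return parser.can_fetch("*", url)


    def crawl_one(url: str) -> dict:
        """Download one URL and extract the title and the start of the body text."""
        if not allowed_by_robots(url):
            return {"url": url, "status": "disallowed"}
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.text, "html.parser")
        # Dropping obvious chrome is a crude stand-in for boilerplate removal,
        # which, as noted above, is not 100% effective.
        for tag in soup(["script", "style", "nav", "footer"]):
            tag.decompose()
        title = soup.title.string.strip() if soup.title and soup.title.string else ""
        body = " ".join(soup.get_text(separator=" ").split())[:EXTRACT_CHARS]
        return {"url": url, "status": "ok", "title": title, "extract": body}
    ```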

    The batches are stored in long-term storage (currently Backblaze) for later indexing. Indexing is currently a manual process, so you won’t necessarily see pages you’ve crawled in search results any time soon.
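
    On the server side, archiving a batch to Backblaze could look something like this. Backblaze B2 exposes an S3-compatible API, so boto3 works against it; the endpoint, bucket, key scheme, and credentials here are placeholders, not the project's real setup:

    ```python
    import gzip
    import json
    from datetime import datetime, timezone

    import boto3

    # Placeholder configuration: the endpoint region, bucket, and credentials
    # below are illustrative, not Mwmbl's actual values.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",
        aws_access_key_id="KEY_ID",
        aws_secret_access_key="APPLICATION_KEY",
    )


    def archive_batch(batch: list[dict], bucket: str = "crawl-batches") -> str:
        """Compress a completed batch and store it for later (manual) indexing."""
        key = datetime.now(timezone.utc).strftime("batches/%Y/%m/%d/%H%M%S.json.gz")
        body = gzip.compress(json.dumps(batch).encode("utf-8"))
        s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType="application/json")
        return key  # nothing is searchable until the batch is indexed later
    ```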