It depends on what you use on the Python side. Classically that would have been uWSGI or one of the *SGI interfaces, and lately ASGI.
Sure, one can totally make Python apps that serve HTTP directly. The same can be done with PHP (and Ruby and others) as well, but most people still run their PHP through PHP-FPM over FastCGI because you can offload a lot of the work to the much faster NGINX side. A fair amount of apps make use of X-Accel-Redirect
to serve private files, so you don’t tie up a PHP worker for an hour serving the user’s 2GB file.
But yes, as those languages all move to async computing and away from worker pools, it’s more common to see those serve HTTP directly, and there’s less and less need for a proxy that supports those other protocols. The async event loop is what made NGINX special when it came out, so naturally languages that moves to that model greatly reduce the need for that as well, they too can easily handle thousands of concurrent connections no problems. Plus these days people slap a CDN in front anyway so static file performance doesn’t matter quite as much.
I ran a YaCy instance for a while like a decade ago. It does federate index requests, and when you search it propagates the search request across a bunch of nodes. When my node came online it almost immediately started crawling stuff and it did get a bunch of search queries. But the network was still pretty small back then and the search results were… not great. That’s the price of independence from Google’s and Microsoft’s giant server farms, it’s hard to compete with that size.
But at the rate Google and Bing are enshittifying, I think it’s worth revisiting.
Using ActivityPub for this would be immensely wasteful. It’s just not feasable that all instances would have the whole index because it’s so large. Back when I tried it, the network still had several TBs worth of indexed pages. This is firmly in the realm of distributed P2P systems. One could have an ActivityPub plugin however to receive updates from social media near instantly and index those immediately with less overhead. But you still want to index wikipedia, forums, blogs, whatever the crawlers can find.