Lemmy.world had to shut down the front page and put up a message about the load, along with a graph. They seem to chalk it up to the tendency of social media sites to attract attacks.
I’d hack up the Rust code to be aware of how many PostgreSQL operations it has in flight and to return a new busy error when there are too many.
Federation connections, the RSS feed, the API, and anything else that hits the database need a concurrency count in the Rust code and a way to report a busy error.
I’d probably build a class to help with this. Once concurrency for an API goes over 5, record the high-water mark with a timestamp and start making decisions based on elapsed time: if concurrency is still > 5 and the elapsed time exceeds a threshold (say 1 minute), return the busy error.
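To make the idea concrete, here is a rough sketch of what such a helper could look like, using only the standard library. None of these names exist in Lemmy: `ConcurrencyGate`, `Permit`, `Busy`, and `try_enter_at` are all made up for illustration, and the limit and grace period are parameters rather than anything measured.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical "server is busy" error; not an existing Lemmy type.
#[derive(Debug, PartialEq, Eq)]
pub struct Busy;

struct State {
    in_flight: usize,
    /// When the count first went over the limit; None while at or below it.
    over_since: Option<Instant>,
}

/// Hypothetical helper: counts in-flight database work for one API and
/// rejects new work once the count has been over `limit` for `grace`.
pub struct ConcurrencyGate {
    limit: usize,
    grace: Duration,
    state: Mutex<State>,
}

/// Held while a request is in flight; dropping it decrements the count.
pub struct Permit<'a> {
    gate: &'a ConcurrencyGate,
}

impl ConcurrencyGate {
    pub fn new(limit: usize, grace: Duration) -> Self {
        let state = Mutex::new(State { in_flight: 0, over_since: None });
        ConcurrencyGate { limit, grace, state }
    }

    pub fn try_enter(&self) -> Result<Permit<'_>, Busy> {
        self.try_enter_at(Instant::now())
    }

    /// Same as `try_enter`, but takes the clock reading so it can be tested.
    pub fn try_enter_at(&self, now: Instant) -> Result<Permit<'_>, Busy> {
        let mut s = self.state.lock().unwrap();
        if s.in_flight >= self.limit {
            // Admitting this request puts us over the limit: start the
            // timer if it is not already running, then check elapsed time.
            let since = *s.over_since.get_or_insert(now);
            if now.duration_since(since) >= self.grace {
                return Err(Busy);
            }
        }
        s.in_flight += 1;
        Ok(Permit { gate: self })
    }

    pub fn in_flight(&self) -> usize {
        self.state.lock().unwrap().in_flight
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        let mut s = self.gate.state.lock().unwrap();
        s.in_flight -= 1;
        // Back at or under the limit: clear the timer.
        if s.in_flight <= self.gate.limit {
            s.over_since = None;
        }
    }
}
```

A handler would call something like `gate.try_enter()?` before touching the database and keep the returned permit alive until the query finishes, with `ConcurrencyGate::new(5, Duration::from_secs(60))` matching the numbers above. Short bursts over the limit still get through; only sustained overload produces the busy error.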
Is Prometheus the right way to expose these numbers for operators who want to know about the thresholds? I’d probably also add a dedicated log file to track concurrency thresholds and busy errors.
The front-end apps also need to cache “Trending communities”. I think lemmy-ui is still pulling that live from PostgreSQL on every page refresh; I need to check whether anyone has added caching.
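The caching itself does not need to be fancy. As a sketch of the idea (shown in Rust to match the backend, though the same logic could live in the front end), a single-value cache with a time-to-live is enough to stop every page refresh from reaching PostgreSQL. `TtlCache` and `get_or_refresh_at` are hypothetical names, not anything in Lemmy or lemmy-ui, and the `fetch` closure stands in for whatever query produces the trending list.

```rust
use std::time::{Duration, Instant};

/// Hypothetical single-value cache with a time-to-live.
pub struct TtlCache<T> {
    ttl: Duration,
    /// The cached value and when it was stored.
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        TtlCache { ttl, entry: None }
    }

    /// Returns the cached value if it is younger than `ttl`; otherwise
    /// calls `fetch` (the expensive query), stores the result, and returns it.
    pub fn get_or_refresh_at<F: FnOnce() -> T>(&mut self, now: Instant, fetch: F) -> T {
        if let Some((stored_at, value)) = &self.entry {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entry = Some((now, value.clone()));
        value
    }

    /// Convenience wrapper that reads the clock itself.
    pub fn get_or_refresh<F: FnOnce() -> T>(&mut self, fetch: F) -> T {
        self.get_or_refresh_at(Instant::now(), fetch)
    }
}
```

With a TTL of, say, 30 seconds, the query behind “Trending communities” would run at most once per 30 seconds however many refreshes arrive. Shared across threads, the cache would need to sit behind a `Mutex` or similar.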
Sometimes you wish you could have the API log everything and be able to play back API activity on the test data. Maybe I’ll play around with such a feature.
‘Lemmy.world has been down between 02:00 UTC and 05:45 UTC. This was caused by the database spiking to 100% cpu (all 32 cores/64 threads!) due to inefficient queries been fired to the db very often.’