Lemmy.world had to shut down the front page and put up a message about the load and a graph. They seem to chalk it down to the nature of social media sites to attract attacks.

I’d hack up the Rust code to have self-awareness of concurrency with PostgreSQL and return a new busy error.

Federation connections, RSS feed, API - and any other method that is hitting the database needs to have a concurrency count in the Rust code and an error message system for busy.

I’d probably build a a class to help with this and once concurrency for an API is over 5 mark the high water with a timestamp and start doing logic based on elapsed time. If > 5 and elapsed time exceeds a threshold (say 1 minute), then return the busy error.

is Prometheus the right way to expose these numbers for operators wanting to know about the thresholds.? I’d probably add a dedicated log file to track concurrency thresholds and busy errors.

the front-end apps also need to be caching “Trending communities”, I think lemmy-ui is still pulling that live from PostgreSQL for every refresh of the page. I need to check if anyone has added that.

  • RoundSparrowOP
    link
    fedilink
    arrow-up
    1
    ·
    1 year ago

    Sometimes you wish you could have the API log everything and be able to play back API activity on the test data. Maybe I’ll play around with such a feature.

    ‘Lemmy.world has been down between 02:00 UTC and 05:45 UTC. This was caused by the database spiking to 100% cpu (all 32 cores/64 threads!) due to inefficient queries been fired to the db very often.’