Lemmy.world had to shut down the front page and put up a message about the load, along with a graph. They seem to chalk it up to the nature of social media sites to attract attacks.

I’d hack up the Rust code to have self-awareness of concurrency with PostgreSQL and return a new busy error.

Federation connections, the RSS feed, the API, and any other path that hits the database need a concurrency count in the Rust code and an error message system for busy.

I’d probably build a class to help with this: once concurrency for an API goes over 5, mark the high-water point with a timestamp and start doing logic based on elapsed time. If concurrency is still > 5 and the elapsed time exceeds a threshold (say 1 minute), then return the busy error.
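A rough sketch of what that helper could look like, using only the standard library. All names here (`ConcurrencyGate`, `Permit`, `BusyError`, `enter_at`) are made up for illustration and are not part of the Lemmy codebase. It assumes one gate per entry point (API, RSS, federation) and that each request holds a permit while it is talking to PostgreSQL.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical error returned when the gate decides the database is too busy.
#[derive(Debug, PartialEq)]
pub struct BusyError {
    pub in_flight: usize,
}

struct GateState {
    in_flight: usize,
    /// When the in-flight count went over the limit (the high-water timestamp).
    over_limit_since: Option<Instant>,
}

pub struct ConcurrencyGate {
    limit: usize,
    max_over: Duration,
    state: Mutex<GateState>,
}

/// Held for the duration of one database-backed request.
/// Dropping it decrements the in-flight count.
pub struct Permit<'a> {
    gate: &'a ConcurrencyGate,
}

impl ConcurrencyGate {
    pub fn new(limit: usize, max_over: Duration) -> Self {
        let state = Mutex::new(GateState { in_flight: 0, over_limit_since: None });
        Self { limit, max_over, state }
    }

    pub fn enter(&self) -> Result<Permit<'_>, BusyError> {
        self.enter_at(Instant::now())
    }

    /// Same as `enter`, with the clock passed in so the logic can be tested.
    pub fn enter_at(&self, now: Instant) -> Result<Permit<'_>, BusyError> {
        let mut s = self.state.lock().unwrap();
        // Over the limit for too long: refuse new work instead of queueing it.
        if let Some(since) = s.over_limit_since {
            if now.duration_since(since) >= self.max_over {
                return Err(BusyError { in_flight: s.in_flight });
            }
        }
        s.in_flight += 1;
        // First time over the limit: record the high-water timestamp.
        if s.in_flight > self.limit && s.over_limit_since.is_none() {
            s.over_limit_since = Some(now);
        }
        Ok(Permit { gate: self })
    }

    pub fn in_flight(&self) -> usize {
        self.state.lock().unwrap().in_flight
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        let mut s = self.gate.state.lock().unwrap();
        s.in_flight -= 1;
        // Back at or under the limit: clear the timestamp.
        if s.in_flight <= self.gate.limit {
            s.over_limit_since = None;
        }
    }
}
```

With the numbers above this would be `ConcurrencyGate::new(5, Duration::from_secs(60))`, and a handler would call `gate.enter()?` before touching the database and map `BusyError` to whatever busy response the API ends up using.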

Is Prometheus the right way to expose these numbers for operators who want to know about the thresholds? I’d probably also add a dedicated log file to track concurrency thresholds and busy errors.
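If Prometheus does turn out to be the answer, the numbers would end up in its text exposition format on a scrape endpoint. The sketch below renders that format by hand just to show the shape of the output; a real service would more likely use a Prometheus client library. The metric names (`lemmy_db_in_flight` and friends), the `endpoint` label, and the `GateMetrics` struct are all invented for this example, and the label value is assumed to be a plain identifier that needs no escaping.

```rust
use std::fmt::Write;

/// Hypothetical snapshot of the counters for one entry point.
pub struct GateMetrics {
    pub in_flight: u64,
    pub high_water: u64,
    pub busy_errors_total: u64,
}

/// Append one metric: a `# TYPE` line followed by a single labelled sample.
fn metric(out: &mut String, name: &str, kind: &str, endpoint: &str, value: u64) {
    writeln!(out, "# TYPE {name} {kind}").unwrap();
    writeln!(out, "{name}{{endpoint=\"{endpoint}\"}} {value}").unwrap();
}

/// Render the snapshot in the Prometheus text exposition format.
pub fn render_prometheus(endpoint: &str, m: &GateMetrics) -> String {
    let mut out = String::new();
    metric(&mut out, "lemmy_db_in_flight", "gauge", endpoint, m.in_flight);
    metric(&mut out, "lemmy_db_high_water", "gauge", endpoint, m.high_water);
    metric(&mut out, "lemmy_db_busy_errors_total", "counter", endpoint, m.busy_errors_total);
    out
}
```

The same snapshot could be written to the dedicated log file whenever a threshold is crossed, so operators without Prometheus still see it.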

The front-end apps also need to be caching “Trending communities”. I think lemmy-ui is still pulling that live from PostgreSQL on every refresh of the page. I need to check if anyone has added that.
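The caching idea is just a value with an expiry time. lemmy-ui is not written in Rust, so treat this as a language-neutral sketch of the logic rather than a patch; `TtlCache` and `get_or_refresh_at` are hypothetical names, and the closure stands in for whatever call currently fetches the trending list.

```rust
use std::time::{Duration, Instant};

/// Hypothetical single-value cache that refetches after `ttl` has elapsed.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Return the cached value if it is still fresh at `now`;
    /// otherwise call `fetch`, store the result, and return it.
    pub fn get_or_refresh_at(&mut self, now: Instant, fetch: impl FnOnce() -> T) -> T {
        if let Some((stored_at, value)) = &self.entry {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = fetch();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

Even a short TTL (a minute, say) would turn one query per page refresh into one query per minute for that list.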

  • RoundSparrowOP
    1 year ago

    Maybe I’m overthinking the performance problems.

    Deleting accounts probably creates a swarm of activity (I opened a GitHub issue about it), and it has already been a source of problem-triggering bugs. But even without bugs, it’s still got to make the system run way slower. And there is nothing preventing someone from setting up a federation instance, creating a bunch of content, then deleting it, triggering multiple servers to overload.

    The variability of performance on reads could be directly tied to how many writes are going on with account deletion.

    Even comment reply chains seem to be triggering (replaceable) performance concerns.

  • RoundSparrowOP
    1 year ago

    So, some work to do:

    1. Rework the testing scripts so that they don’t actually delete data each run.
    2. Can I use a bash script to get pg_stat_statements between individual tests?
  • RoundSparrowOP
    1 year ago

    Sometimes you wish you could have the API log everything and be able to play back API activity on the test data. Maybe I’ll play around with such a feature.

    ‘Lemmy.world has been down between 02:00 UTC and 05:45 UTC. This was caused by the database spiking to 100% cpu (all 32 cores/64 threads!) due to inefficient queries been fired to the db very often.’