I already get rate-limited like crazy on Lemmy, and there are only about 60,000 users on my instance. Is each instance really just one server, or are there multiple containers running across several hosts? I’m concerned that federation will mean an inconsistent user experience. Some instances may be beefy, others will be under-resourced… so the average person might think Lemmy overall is slow or error-prone.

Reddit has millions of users. How the hell is this going to scale? Does anyone have any information about Lemmy’s DB and architecture?

I found this post about Reddit’s DB from 2012. Not sure if Lemmy has a similar approach to ensure speed and reliability as the user base and traffic grow.

https://kevin.burke.dev/kevin/reddits-database-has-two-tables/

  • Max-P@lemmy.max-p.me
    English
    arrow-up
    106
    arrow-down
    1
    ·
    2 years ago

    Bigger instances will indeed run multiple copies of the various components, it’s pretty standard software in that regard.

    Usually that starts with moving the PostgreSQL database to its own dedicated box, then adding more backend boxes, possibly with more caching in front so that the backend doesn’t have to do as much work. Once the database is pegged, the next step is usually a write primary and one or more read secondaries. When that gets too much, you get into sharding so that you can spread the database load across multiple servers. I don’t know much about PostgreSQL, but I have to assume it’s better than MySQL in that regard, and I’ve seen a 1 TB MySQL database in the wild running just fine.
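To make the primary/secondary and sharding steps concrete, here is a minimal Python sketch. It is not how Lemmy does it: the `Router` class, the `shard_for` helper, and the server names are all hypothetical, and the "starts with SELECT" check is a deliberately crude stand-in for real read/write detection.

```python
import hashlib
import itertools


class Router:
    """Send writes to the primary and spread reads over replicas (round-robin)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql):
        # Crude heuristic (assumption): anything starting with SELECT is a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


def shard_for(key, shard_count):
    """Map a key (e.g. a community name) to a shard index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


router = Router("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM post"))      # goes to a replica
print(router.route("INSERT INTO post ..."))    # goes to the primary
print(shard_for("technology@lemmy.world", 4))  # always the same shard for this key
```

The catch with read replicas is replication lag: a user may not see their own comment right away if the read lands on a secondary that has not caught up yet.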

    I think lemmy.world in general is hitting some scalability issues that they’re working on. Keep in mind the software is fairly new and is only now being truly tested at large scale, so there’s probably a ton of room for optimization. Also, lemmy.world is still on 0.17, and apparently 0.18 changed the protocol a lot in a way that makes it scale much better, so when they complete that upgrade it’ll probably run a lot better already.


    The part that worries me about scalability in the long term is the push nature of ActivityPub. My server is already getting several POST requests to /inbox per second, which makes me wonder how that’s gonna work if big instances have to push content updates to thousands of Lemmy instances where most of the data probably isn’t even seen. I was surprised it was a push system and not a pull system, as pull is much easier to scale and cache at the CDN level, and content can be fetched on demand for people who only check Lemmy once in a while.
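A rough back-of-the-envelope comparison of the two models, as a Python sketch. The numbers are made up for illustration (50,000 activities per day, 1,000 subscribed instances, polling every 5 minutes), and both function names are hypothetical.

```python
def push_requests_per_day(activities_per_day, subscriber_instances):
    """Push: every activity is POSTed to every subscribed instance's inbox."""
    return activities_per_day * subscriber_instances


def pull_requests_per_day(subscriber_instances, poll_interval_minutes):
    """Pull: every subscribed instance polls on a fixed interval."""
    polls_per_day = (24 * 60) // poll_interval_minutes
    return subscriber_instances * polls_per_day


# Hypothetical big instance: 50,000 activities/day, 1,000 subscribed instances.
print(push_requests_per_day(50_000, 1_000))  # 50000000 outgoing POSTs
print(pull_requests_per_day(1_000, 5))       # 288000 incoming GETs
```

The pull figure is also an upper bound on origin load, since identical GETs can be answered by a cache, while each push is a separate outgoing request. The trade-off is latency: with polling, content arrives up to one interval late.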

    I need to start digging into Lemmy’s code and get familiar with the internals, still only a couple days in with my private instance.

    • Schooner
      English
      arrow-up
      3
      ·
      2 years ago

      Would it be feasible to change it to a pull system at this point? I don’t think the Lemmy part of that is a problem, but ActivityPub may need to make big changes and I’m not sure how practical that is.

      • terribleplan@lemmy.nrd.li
        English
        arrow-up
        1
        ·
        2 years ago

        That would have to be a completely new version of ActivityPub, and would likely render it non-backwards-compatible (or at least things would have to still do all the old version stuff to interop with anything not on it). This has happened before (see OStatus and Diaspora vs current things using ActivityPub).

        OStatus was based on Atom (like RSS) and PubSubHubbub (PuSH, later standardized as WebSub), which was basically a pull system with a real-time notification layer on top (that could offload the fan-out work of notifications to more centralized PuSH hubs), and things moved away from that in favor of the more real-time ActivityPub protocol.

        I mean, I hate XML as much as the next guy, but I think there was maybe a little too much baby (general architecture) in with that bathwater (specifics of Atom/XML/Microblogging-oriented stuff).

      • DaEagle
        English
        arrow-up
        1
        ·
        2 years ago

        It doesn’t have to change, it can support both, so lemmy-lemmy can be pull, and lemmy-kbin can be push.

        Obviously not trivial, but definitely doable.

    • aaron@lemm.ee
      English
      arrow-up
      4
      arrow-down
      2
      ·
      2 years ago

      The implementation as far as I understand it is plain stupid. It prevents small instances from participating at any significant scale and seems happy to just drop data over the wire without reconciling. Whoever designed this needs to be hired at Reddit.

      • Max-P@lemmy.max-p.me
        English
        arrow-up
        7
        ·
        2 years ago

        I’m not sure about the data being dropped. My instance was misconfigured for a day or two, and as soon as I fixed it, the data came right in. Instances repeatedly trying to push data to my instance is what clued me in that something was missing from my NGINX config. It backfilled pretty fast.

        Although I wouldn’t mind if there was a fallback pull mechanism to remediate failed pushes.

        • aaron@lemm.ee
          English
          arrow-up
          2
          ·
          edit-2
          2 years ago

          Interesting. Curious if you have a better understanding of ActivityPub - do you happen to know if the protocol guarantees synchronicity and what mechanism guarantees it?

          • Max-P@lemmy.max-p.me
            English
            arrow-up
            3
            ·
            2 years ago

            I don’t really, I’m just going off ~1 week of running my essentially single-user instance and watching it do its thing. I need to read the spec and experiment with it when I have some more free time.

            Pure speculation, but my guess would be that the servers are expected to retry for a certain amount of time. I know there have been some tickets opened about big instances going out of sync with each other, and fixes are being worked on to address those. I don’t know if that only fixes it going forward or if it also backfills.
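For what "retry for a certain amount of time" could look like, here is a minimal Python sketch of delivery with exponential backoff. It is purely illustrative and not taken from Lemmy or the ActivityPub spec: the delays, the `send` callback, and both function names are assumptions.

```python
import time


def backoff_schedule(base_seconds=60, factor=2, max_retries=6, cap_seconds=3600):
    """Delays between retries: 60s, 120s, 240s, ... capped at one hour."""
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(max_retries)]


def deliver_with_retry(send, activity, schedule, sleep=time.sleep):
    """Try once immediately, then once after each delay in `schedule`.

    Returns the number of attempts used, or None if every attempt failed.
    `send` is a hypothetical callable that raises ConnectionError on failure.
    """
    for attempt, delay in enumerate([0] + list(schedule), start=1):
        if delay:
            sleep(delay)
        try:
            send(activity)
            return attempt
        except ConnectionError:
            continue
    return None
```

With the default schedule the sender gives up after roughly an hour of retries in total, so a receiver that stays down longer than that would still need some kind of backfill.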

            Also, nothing prevents Lemmy from implementing a fallback way of doing a resync if it detects drift. “Hey lemm.ee, I lost everything since an hour ago, backfill please”.
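A sketch of that "backfill please" idea in Python. It leans on a big assumption: that each sender stamps its activities with a per-sender sequence number (`seq` here), which is not something ActivityPub defines. `find_gap` and `backfill` are hypothetical names.

```python
def find_gap(last_seen_seq, incoming_seq):
    """Receiver side: True when one or more activities were missed."""
    return incoming_seq > last_seen_seq + 1


def backfill(sender_log, last_seen_seq):
    """Sender side: everything the receiver has not seen yet, in order."""
    return [a for a in sender_log if a["seq"] > last_seen_seq]


# Hypothetical sender log with five activities.
log = [{"seq": n, "type": "Create"} for n in range(1, 6)]

# The receiver last saw seq 2 and now receives seq 5: something was lost.
if find_gap(2, 5):
    missing = backfill(log, 2)
    print([a["seq"] for a in missing])  # [3, 4, 5]
```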

            • aaron@lemm.ee
              English
              arrow-up
              1
              ·
              edit-2
              2 years ago

              Yeah, or batching changes and confirming receipt with a hash, or doing pull instead of push. From what I’ve been reading, the design seems a little janky.
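Batching with a hash-confirmed receipt could look something like this Python sketch. It is only an illustration of the idea, not an existing Lemmy or ActivityPub mechanism; the field names and functions are made up.

```python
import hashlib
import json


def batch_digest(activities):
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra spaces)."""
    canonical = json.dumps(activities, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_batch(activities):
    """Sender side: bundle activities together with their digest."""
    return {"activities": activities, "sha256": batch_digest(activities)}


def receive_batch(batch):
    """Receiver side: acknowledge only if the payload matches the digest."""
    ok = batch_digest(batch["activities"]) == batch["sha256"]
    return {"ack": ok, "sha256": batch["sha256"]}


batch = make_batch([{"type": "Like", "object": "post/1"},
                    {"type": "Create", "object": "comment/7"}])
print(receive_batch(batch)["ack"])  # True
```

The sender keeps the batch queued until it gets an ack carrying the same digest, and resends otherwise.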

    • FarceMultiplier@lemmy.ca
      English
      arrow-up
      1
      ·
      2 years ago

      I’m thinking something like a large PostgreSQL database box, then front-facing servers running ldirectord or a similar HA load balancer to be able to add instances as necessary. However, my skills here are 20 years out of date, so I’m sure there’s better out there.

      • Max-P@lemmy.max-p.me
        English
        arrow-up
        0
        ·
        2 years ago

        Some people are already deploying it with Kubernetes, which pretty much handles load balancing and even scaling up and down automatically out of the box if you’re set up in the cloud. It’s pretty much a long-solved problem.

        Lots of nice free load balancers these days: HAProxy, NGINX, Traefik. I’ve even seen people do it in-kernel with eBPF or iptables.

    • weeezes@sopuli.xyz
      English
      arrow-up
      0
      ·
      2 years ago

      The part that worries me about scalability in the long term is the push nature of ActivityPub. My server is already getting several POST requests to /inbox per second, which makes me wonder how that’s gonna work if big instances have to push content updates to thousands of Lemmy instances where most of the data probably isn’t even seen. I was surprised it was a push system and not a pull system, as pull is much easier to scale and cache at the CDN level, and content can be fetched on demand for people who only check Lemmy once in a while.

      I think there’s a benefit to the push model, as the instances can prioritize who to push to first if there’s scaling issues, instead of having to throttle GETs. Either way the end result is the same: nothing reaches other instances in real time (which is fine). I don’t know how Lemmy works exactly, but could the push model just be a detail of ActivityPub? https://flak.tedunangst.com/post/what-happens-when-you-honk
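One way to picture "prioritize who to push to first" is a priority queue over target instances. This is a Python sketch under an assumed policy (more local subscribers means pushed earlier); nothing here reflects what Lemmy actually does, and the instance names are placeholders.

```python
import heapq


def push_order(targets):
    """Return instance names in delivery order, highest priority first.

    `targets` is a list of (instance_name, priority) pairs. The priority is
    whatever the sender chooses; subscriber count is just one hypothetical option.
    """
    # heapq is a min-heap, so negate the priority; the index keeps ties
    # in their original order.
    heap = [(-priority, index, name)
            for index, (name, priority) in enumerate(targets)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


print(push_order([("small.example", 3),
                  ("big.example", 900),
                  ("mid.example", 40)]))
# ['big.example', 'mid.example', 'small.example']
```

Under this particular policy the smallest instances are always served last.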

      • Max-P@lemmy.max-p.me
        English
        arrow-up
        2
        ·
        2 years ago

        It definitely is part of the ActivityPub protocol, but I only glanced at the spec so far. I should probably follow that link and implement a toy ActivityPub app to get more familiar with how it works.

        I think there’s a benefit to the push model, as the instances can prioritize who to push to first if there’s scaling issues, instead of having to throttle GETs,

        The downside to this is that smaller instances are penalized in that scenario, which in turn could cause users to flock to megainstances until it becomes centralized again.

        As I said, GETs are cacheable, so if one slaps Cloudflare in front you can handle millions of GETs for relatively cheap.
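To show why GETs are cheap to cache, here is a small Python sketch of a handler that sets standard `Cache-Control` and `ETag` headers and answers conditional requests with 304. The `outbox_response` function and the idea of serving an instance's feed this way are assumptions for illustration, not an existing Lemmy endpoint.

```python
import hashlib


def outbox_response(body, if_none_match=None, max_age=60):
    """Build (status, headers, body) for a cacheable GET.

    `public, max-age=N` lets a CDN serve the same bytes to every caller for
    N seconds; the ETag lets clients revalidate without re-downloading.
    """
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag, "Cache-Control": f"public, max-age={max_age}"}
    if if_none_match == etag:
        return 304, headers, b""   # unchanged: no body needed
    return 200, headers, body


status, headers, _ = outbox_response(b'{"items": []}')
print(status, headers["Cache-Control"])  # 200 public, max-age=60

# A second request that sends the ETag back gets an empty 304.
print(outbox_response(b'{"items": []}', if_none_match=headers["ETag"])[0])  # 304
```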

        Maybe it’s batched, though? I really need to read the spec. Pushing to thousands of servers every 1/5/10 minutes would certainly give a fair amount of headroom to make it work, I guess.