For a big instance getting an over-sized user base … roughly how many would that be and what could the instance do about it? I’d imagine a number of infrastructural things could be done before the core Lemmy code base and design needs to be substantially changed or redesigned. Big separate database service, big beefy primary server/instance or even a cluster like Kubernetes (which is what mastodon.social uses AFAIK).
As for the alternative where many users are distributed more evenly across many instances, how well would that or can that scale with all of the community data that would need to be synced up between all the instances? From what I’ve gathered, it’s precisely this kind of work that’s plagued lemmy.world somewhat and caused some of the issues that users have been having, largely, it seems, from the server being overloaded with “federation workers” timing out.
I can’t answer those questions because some active work would need to be done to get those insights, but those are the right questions to ask indeed.
I think my immediate assumption would be that the scale metrics that could end up starving resources would be something like this. On the current instance: number of users, posts and comments, plus new signups, new posts and new comments per minute. Across all federated communities: total number of posts and comments, plus new posts and new comments per minute. That’s my list, and I could be wrong about it since I know almost nothing about the underlying architecture; it would take a bit of teamwork to make it comprehensive.
From there, someone architect-level who knows the solution well should be able to prioritize that list. For instance: “the number of federated posts doesn’t concern me much, because we fetch the contents themselves directly from other instances, and if the concern is the size of the DB table, the number of comments will hit much higher much earlier anyway; so let’s look at comment stuff before we look at post stuff”. I have no idea if this is accurate, but you get the idea.
And then from there, you want to perform some load-testing. So, for instance, setting up two air-gapped test instances that can only federate with each other, and injecting a ton of fake data to hit higher and higher numbers on the listed metrics. While that’s going on, all relevant resource usage (CPU, memory, …) would be monitored, to see which resources grow faster than is comfortable.
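To make the ramp-up idea a bit more concrete, here is a minimal sketch in Python. It does not touch Lemmy at all: `fake_ingest` is a hypothetical stand-in for "inject fake comments into the test instance", and the only resources it samples are wall-clock time and Python heap size via the standard `time` and `tracemalloc` modules. A real test would sample the instance's CPU, memory, database size and so on instead.

```python
import time
import tracemalloc


def fake_ingest(store, n_comments):
    """Hypothetical stand-in for injecting fake comments into a test instance."""
    for _ in range(n_comments):
        store.append({"id": len(store), "body": "x" * 100})


def ramp(levels):
    """Push the fake data set to each level in turn and sample resource usage."""
    store = []
    results = []
    tracemalloc.start()
    for target in levels:
        start = time.perf_counter()
        fake_ingest(store, target - len(store))
        elapsed = time.perf_counter() - start
        current_bytes, _peak = tracemalloc.get_traced_memory()
        results.append(
            {"comments": len(store), "seconds": elapsed, "bytes": current_bytes}
        )
    tracemalloc.stop()
    return results


if __name__ == "__main__":
    for row in ramp([1_000, 10_000, 100_000]):
        print(row)
```

Comparing how `bytes` and `seconds` change from one level to the next is the interesting part: anything that grows faster than the metric you are ramping is a candidate for the "faster than comfortable" list.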
With those results, you’d want to go back to current resource usage on real-world instances, and that should allow you to extrapolate and prioritize. Like: “well, lemmy.world’s local posts are growing at that rate, and we’ve measured that the related metric only gets in trouble around that number, so basically at current rate we have 6 months to figure it out”.
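The "6 months to figure it out" kind of estimate is just arithmetic once you have the numbers. Here is a rough Python sketch, assuming growth stays linear (which it probably won't exactly); the function name and all the figures are made up for illustration and are not real lemmy.world measurements.

```python
import math


def months_until_threshold(current, growth_per_month, threshold):
    """Linear extrapolation: months left before a metric reaches its trouble point."""
    if current >= threshold:
        return 0.0  # already past the point where load-testing showed trouble
    if growth_per_month <= 0:
        return math.inf  # not growing, so the threshold is never reached
    return (threshold - current) / growth_per_month


# Hypothetical numbers: 400k local posts today, +100k per month,
# and load-testing showed trouble starting around 1M posts.
print(months_until_threshold(400_000, 100_000, 1_000_000))
```

If growth is closer to exponential, the same idea applies with a different formula, and the time left gets shorter.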
And from there, you now know the problems, and can prioritize the solutions, based on urgency and cost. Some may be low-cost: there may be easy computations to parallelize or shard, for instance. But of course you’d have to know what the worst ones are first, in order to tackle them in order. And then of course, some of them will probably be very tricky to get past.
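As a toy illustration of prioritizing by urgency and then cost, here is a small Python sketch. The `Bottleneck` type, the bottleneck names and every number are hypothetical; the only point is that the most urgent problem comes first and cheap fixes win ties.

```python
from dataclasses import dataclass


@dataclass
class Bottleneck:
    name: str
    months_left: float    # urgency: estimated time before it becomes a problem
    fix_cost_days: float  # rough cost of the fix, in person-days


def prioritize(items):
    """Most urgent first; among equally urgent items, cheapest fix first."""
    return sorted(items, key=lambda b: (b.months_left, b.fix_cost_days))


# Hypothetical findings from a load-testing round.
findings = [
    Bottleneck("comment table size", months_left=6, fix_cost_days=20),
    Bottleneck("federation queue", months_left=2, fix_cost_days=10),
    Bottleneck("signup rate", months_left=6, fix_cost_days=3),
]

for b in prioritize(findings):
    print(b.name, b.months_left, b.fix_cost_days)
```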
One thing I can tell you is that, without knowing much of Lemmy’s architecture, I have the same intuition you do, that the decentralization of it will help mitigate some resource usage in ways that Reddit couldn’t, for instance; but not all. I’m pretty sure that as instances add content, something grows in ALL instances federating that content, which might starve some critical resource at some point in all of them.
Thanks!
Can we pick your brain on this?
Awesome!! Thanks!!