Yea, for whatever reason, lemmy.world became the sort of de facto “main” instance, which isn’t a bad thing, and lemmy.world isn’t a bad choice at all; ruud, AFAICT, is a dedicated and experienced fediverse admin.
There may be issues with centralising the user load too much. I don’t have the technical knowledge to back this up, but it probably makes sense that there is such a thing as too much for one server to handle. If it has to handle all of the user requests as well as syncing all of the large and popular communities that a “main” instance is likely to host, then it’s just a lot, and probably requires technical solutions and investment beyond what one admin/team is willing or able to do. Plus, Lemmy the software may not be designed for that sort of load, which probably requires a distinct architecture from that of a smaller instance.
So it probably, at some point at least, makes sense to spread the load of both the users and the communities. However, it seems that redditors, accustomed as they are to a “central” and singular service, have kind of opted into re-creating a central “main” instance like they’re used to. It may very well be a bad habit, as it presumes that there’s just some giant server and a dedicated tech team sitting there waiting to scale up at a moment’s notice. Of course, lemmy.world are free to halt sign-ups and encourage users to pick other instances. But it remains to be seen how Lemmy, its software, and the fediverse/threadiverse in general handle communities/groups/magazines at this new scale.
In the meantime, intentionally spreading the load might help. As would donating to the developers and your admin!!
I have the technical knowledge to back it up, and I confirm your understanding is spot on.
Thanks!
Can we pick your brain on this?
For a big instance getting an over-sized user base … roughly how many users would that be, and what could the instance do about it? I’d imagine a number of infrastructural things could be done before the core Lemmy code base and design needs to be substantially changed or redesigned: a big separate database service, a big beefy primary server/instance, or even a cluster on something like Kubernetes (which is what mastodon.social uses, AFAIK).
As for the alternative, where many users are distributed more evenly across many instances, how well would or could that scale, with all of the community data that would need to be synced up between all the instances? From what I’ve gathered, it’s precisely this kind of work that’s plagued lemmy.world somewhat and caused some of the issues that users have been having, largely, it seems, from the server being overloaded with “federation workers” timing out.
I can’t answer those questions, because some active work would need to be done to get those insights, but those are indeed the right questions to ask.
I think my immediate assumption would be that the scale metrics that could end up starving resources would be something like:

- number of users on the current instance
- number of posts on the current instance
- number of comments on the current instance
- number of new signups per minute on the current instance
- number of new posts per minute on the current instance
- number of new comments per minute on the current instance
- number of total posts across all federated communities
- number of total comments across all federated communities
- number of new posts per minute across all federated communities
- number of new comments per minute across all federated communities

That’s my list, and I could be wrong about it since I know almost nothing about the underlying architecture; it would take a bit of teamwork to make it comprehensive.
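To make that concrete, here’s roughly how I’d picture a snapshot of those metrics: just a throwaway Python sketch with invented names and thresholds, nothing taken from Lemmy’s actual code or schema.

```python
from dataclasses import dataclass

@dataclass
class ScaleMetrics:
    """Hypothetical snapshot of the metrics listed above (names are made up)."""
    # totals on the current instance
    local_users: int
    local_posts: int
    local_comments: int
    # rates on the current instance (per minute)
    local_signups_per_min: float
    local_posts_per_min: float
    local_comments_per_min: float
    # totals across all federated communities
    federated_posts: int
    federated_comments: int
    # rates across all federated communities (per minute)
    federated_posts_per_min: float
    federated_comments_per_min: float

# Invented soft limits; finding the real numbers is the whole exercise below.
SOFT_LIMITS = {
    "local_signups_per_min": 50.0,
    "federated_comments_per_min": 2_000.0,
}

def over_limit(metrics: ScaleMetrics) -> list[str]:
    """Names of the metrics currently past their (made-up) soft limit."""
    return [name for name, limit in SOFT_LIMITS.items()
            if getattr(metrics, name) > limit]
```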
From there, someone architect-level who knows the solution well should be able to prioritize that list. For instance: “the number of federated posts doesn’t concern me much, because we fetch the contents themselves directly from other instances, and if the concern is the size of the DB table, the number of comments will get much higher much earlier anyway; so let’s look at comment stuff before we look at post stuff”. I have no idea if this is accurate, but you get the idea.
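To illustrate that kind of back-of-envelope reasoning (every number here is invented, not measured from any instance):

```python
# If every post attracts ~20 comments on average (an invented ratio; you'd
# measure the real one from an instance's database), the comment table grows
# ~20x faster than the post table and hits any row-count ceiling ~20x sooner.
AVG_COMMENTS_PER_POST = 20        # assumption
posts_per_day = 10_000            # assumption
comments_per_day = posts_per_day * AVG_COMMENTS_PER_POST

ROW_CEILING = 100_000_000         # whatever row count starts to hurt
print(ROW_CEILING / posts_per_day)     # 10000.0 days until posts get there
print(ROW_CEILING / comments_per_day)  # 500.0 days until comments get there
```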
And then from there, you want to perform some load-testing. So, for instance, setting up two air-gapped test instances that can only federate with each other, and injecting a ton of fake data to hit higher and higher numbers on the listed metrics. While that’s going on, all relevant resource usage (CPU, memory, …) would be monitored, to see which resources’ usage grows faster than is comfortable.
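For the “inject a ton of fake data” part, I’m picturing something like the sketch below. The endpoint path and payload shape are guesses at Lemmy’s HTTP API rather than anything I’ve verified, so treat them as placeholders; resource usage on both test instances would be watched separately (docker stats, node_exporter, that sort of thing) while it runs.

```python
import random
import time

import requests  # pip install requests

# Placeholders: the endpoint path and payload shape are assumptions about
# Lemmy's HTTP API, not checked against any real version.
TEST_INSTANCE = "http://lemmy-test-a.local"
POST_ENDPOINT = f"{TEST_INSTANCE}/api/v3/post"
AUTH_TOKEN = "jwt-from-a-test-account"   # placeholder
COMMUNITY_ID = 2                         # a community federated to the peer instance

def inject_posts(n: int, per_minute: float) -> list[float]:
    """Create n synthetic posts at roughly a fixed rate, recording each request's latency."""
    latencies = []
    for i in range(n):
        t0 = time.monotonic()
        requests.post(POST_ENDPOINT, json={
            "name": f"load test post {i} {random.randrange(1_000_000)}",
            "community_id": COMMUNITY_ID,
            "auth": AUTH_TOKEN,
        }, timeout=30)
        latencies.append(time.monotonic() - t0)
        # Crude pacing toward the target rate.
        time.sleep(max(0.0, 60.0 / per_minute - latencies[-1]))
    return latencies
```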
With those results, you’d want to go back to current resource usage on real-world instances, and that should allow you to extrapolate and prioritize. Like: “well, lemmy.world’s local posts are growing at that rate, and we’ve measured that the related metric only gets into trouble around that number, so basically at the current rate we have 6 months to figure it out”.
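The extrapolation itself is just arithmetic once you have the measured ceiling and the observed growth rate. Something like this, with invented numbers:

```python
def months_of_headroom(current: float, growth_per_month: float, measured_ceiling: float) -> float:
    """Assuming linear growth, months until a metric reaches the level where
    load-testing showed it gets into trouble."""
    if growth_per_month <= 0:
        return float("inf")
    return (measured_ceiling - current) / growth_per_month

# Invented numbers, just to show the shape of the conclusion
# ("at the current rate we have ~6 months to figure it out"):
print(months_of_headroom(current=2_000_000,
                         growth_per_month=500_000,
                         measured_ceiling=5_000_000))  # 6.0
```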
And from there, you now know the problems and can prioritize the solutions based on urgency and cost. Some may be low-cost; there may be easy computations to parallelize or shard, for instance. But of course you’d have to know which ones are the worst first, so you can tackle them in order. And then, of course, some of them will probably be very tricky to get past.
One thing I can tell you is that, without knowing much of Lemmy’s architecture, I have the same intuition you do: that its decentralization will help mitigate some resource usage in ways that Reddit couldn’t, but not all of it. I’m pretty sure that as instances add content, something grows in ALL the instances federating that content, which might starve some critical resource at some point in all of them.
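A back-of-envelope version of that intuition, with invented numbers: if every instance that subscribes to a community receives and stores its own copy of that community’s activity, the network-wide work is roughly multiplicative.

```python
# Invented numbers: one popular community's new activities per day, and the
# number of instances with at least one subscriber to it. Each subscribing
# instance receives, processes and stores its own copy of every activity.
activities_per_day = 50_000
subscribing_instances = 400

deliveries_per_day = activities_per_day * subscribing_instances
print(deliveries_per_day)  # 20000000: users are spread out, but the content
                           # (and the work) is duplicated on every instance
```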
Awesome!! Thanks!!