I suggest that Lemmy’s incoming federation inserts of votes, comments, and possibly posts be queued, so that concurrent INSERT operations into these very large database tables are kept linear and local-instance interactive web and API (app) users are given performance priority.
This could also be a way to keep server operating costs more predictable when using cloud-hosted PostgreSQL services.
There are several approaches that could be taken: a message-queue system, queuing to disk files, queuing to an empty PostgreSQL table, queuing to another database system such as SQLite, etc.
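As a rough illustration of the “queue to an empty PostgreSQL table” option, the sketch below appends incoming activities to a hypothetical `federation_inbox` staging table and drains it from a single background task. The table name, the sqlx usage (Lemmy itself uses Diesel), and the `apply_activity` helper are all assumptions for illustration, not existing Lemmy code.

```rust
// Minimal sketch, not Lemmy's actual schema: a small append-only staging table
// that incoming federation activities are written to, so the hot comment/vote
// tables are never touched on the request path.
//
// CREATE TABLE federation_inbox (
//     id          bigserial PRIMARY KEY,
//     received_at timestamptz NOT NULL DEFAULT now(),
//     activity    jsonb NOT NULL
// );

use sqlx::PgPool;

/// Called from the HTTP handler: one cheap INSERT into the staging table,
/// then the HTTPS connection can be released immediately.
async fn enqueue_activity(pool: &PgPool, activity: serde_json::Value) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO federation_inbox (activity) VALUES ($1)")
        .bind(activity)
        .execute(pool)
        .await?;
    Ok(())
}

/// Background task: drain the staging table one row at a time, in arrival
/// order, so inserts into the large comment/vote tables stay linear.
async fn drain_inbox(pool: &PgPool) -> Result<(), sqlx::Error> {
    loop {
        let row: Option<(i64, serde_json::Value)> = sqlx::query_as(
            "DELETE FROM federation_inbox
             WHERE id = (SELECT min(id) FROM federation_inbox)
             RETURNING id, activity",
        )
        .fetch_optional(pool)
        .await?;

        match row {
            Some((_id, activity)) => apply_activity(pool, activity).await?,
            None => tokio::time::sleep(std::time::Duration::from_millis(250)).await,
        }
    }
}

/// Placeholder for Lemmy's existing insert logic (comment/vote/post handling).
async fn apply_activity(_pool: &PgPool, _activity: serde_json::Value) -> Result<(), sqlx::Error> {
    Ok(())
}
```

Because the staging table is append-only and stays small, the request path never contends on the big tables, and the drain loop keeps those heavy inserts strictly sequential.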
This would also lay the groundwork for accepting incoming federation data while PostgreSQL is down or the website is offline for upgrades or other maintenance.
I would also suggest that the code for incoming federation data be moved to a different service and not run in-process within lemmy_server. This would be a step towards allowing replication integrity checks, backfill operations, firewall rules, CDN bypassing, etc.
EDIT: Really, much of this applies to outgoing federation as well, but that has gotten more attention in the 0.17.4 time period. Ultimately I was speculating that the incoming backend transactions are a big part of why outbound queues are bunching up so much.
So glad to see this, and I think this is a super important conversation to have if the Reddit exodus makes its way to ActivityPub platforms such as Lemmy.
The current issue has two parts:

- Messages are signed for only 10 seconds; I don’t know why, but I’m hoping the change in activitypub-federation-rust (line 70) will alleviate some of the backed-up queue issues.
- The protocol itself doesn’t seem very scalable; if every action must be emitted outwards via an HTTPS POST to every applicable federated server, then as more people embrace the Fediverse and spin up their own servers, and as communities grow, the outbound message volume grows with every new instance times every new action and quickly becomes unsustainable.
Having independent queues and message workers, all deployed as independently scalable components, is going to be a big step forward, but it will ultimately still impose a lot of load on the big servers such as lemmy.ml. I think that, on top of improvements to the implementation of ActivityPub, Lemmy needs to add extensions such as a statically cached, interval-based activity log (with tiered clumping and eventual fall-off) for each community that can be requested and ingested. That is, it would be very beneficial if, once a community reaches a certain scale (think !technology@lemmy.ml, for example), it could publish an activity log of the past 15 minutes, half hour, hour, day, and week. That way, even if there were missed or delayed messages, instances could “catch up” by consuming these cached files (which don’t even need to hit the DB). I hope this makes sense, and I hope we see Lemmy grow further :)
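To make the idea concrete, here is a hypothetical sketch of how a server could periodically materialize such a cached activity log to a static file that other instances fetch without touching PostgreSQL. All names and paths here are made up for illustration; nothing like this exists in Lemmy today.

```rust
// Hypothetical: a periodic job serializes the last N minutes of a community's
// activities to a flat JSON file, which the web server / CDN can serve
// directly so "catch up" requests never hit the database.

use std::time::Duration;

use serde::Serialize;

#[derive(Serialize)]
struct ActivityLog<'a> {
    community: &'a str,
    window_minutes: u64,
    activities: Vec<serde_json::Value>,
}

async fn write_activity_log(
    community: &str,
    window: Duration,
    activities: Vec<serde_json::Value>,
) -> std::io::Result<()> {
    let log = ActivityLog {
        community,
        window_minutes: window.as_secs() / 60,
        activities,
    };

    // e.g. /var/cache/lemmy/activity-logs/technology/15m.json, regenerated
    // every interval and served as a static file (path is an assumption).
    let path = format!(
        "/var/cache/lemmy/activity-logs/{community}/{}m.json",
        log.window_minutes
    );
    tokio::fs::create_dir_all(std::path::Path::new(&path).parent().unwrap()).await?;
    tokio::fs::write(&path, serde_json::to_vec_pretty(&log)?).await?;
    Ok(())
}
```

Running this on 15-minute, hourly, daily, and weekly windows would give remote instances a cheap, cacheable way to recover from missed deliveries.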
The way many Lemmy instance operators are going online, wanting to run on shoestring hardware budgets, I think a store-and-forward design like email or Usenet used (back in the 1990s) is a more proven approach. Even if it has to start as only “Lemmy to Lemmy”, some kind of bulk-transfer session concept would help.
I’ve even been tossing around the idea of some kind of (optionally exposed) notification to end users when delivery is delayed, say beyond 3 minutes, so there is some awareness in the community that Lemmy isn’t run on the kind of budget Facebook has for hardware, operations teams, etc.
Concepts like reliably deleting a message or editing a message are also going to need end-user education. On Reddit, edit wars aren’t unknown, especially in heated topics. Small-time owner/operators are a different world from what users have known with the Big Guys.
GitHub issue 3188 has mostly been ignored by the project; I commented on it today, because even with the fixes lemmy.world pushed into production today, PostgreSQL inserts are still slow once you get significant amounts of data into the tables: https://github.com/LemmyNet/lemmy/issues/3188
Ok, so attention is going to it: https://github.com/LemmyNet/lemmy/pull/3493
The comment database table is going to have a lot of concurrency concerns: remote federated servers are all going to be connecting at the same time with INSERT transactions into that table. The primary key on that table is going to see a lot of contention, and the interactive website users should be given the highest priority.
Let the HTTPS connection be accepted, take the data, queue it somewhere that does not rely on locking the comment database table, and release the HTTPS connection. Then do a linear insert of those new records, one at a time.
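A minimal sketch of that flow, assuming an actix-web style handler and a tokio channel (the names `shared_inbox`, `insert_worker`, and `persist_activity` are placeholders, not Lemmy’s actual code). In practice the in-memory channel would be replaced by a durable queue such as the staging table or disk files mentioned above:

```rust
// Accept, queue, release: the inbox handler only pushes the raw activity onto
// a channel and returns 202, while a single worker task performs the INSERTs
// one at a time, so remote instances never pile up lock contention on the big
// comment/vote tables.

use actix_web::{web, HttpResponse};
use tokio::sync::mpsc;

/// Inbox handler: validate/parse as today, then enqueue and return immediately.
async fn shared_inbox(
    queue: web::Data<mpsc::Sender<serde_json::Value>>,
    activity: web::Json<serde_json::Value>,
) -> HttpResponse {
    // HTTP signature verification etc. would happen before this point.
    if queue.send(activity.into_inner()).await.is_err() {
        return HttpResponse::ServiceUnavailable().finish();
    }
    HttpResponse::Accepted().finish()
}

/// Single consumer: inserts happen strictly one after another, so interactive
/// web/API users are not competing with a burst of federated INSERTs.
async fn insert_worker(mut rx: mpsc::Receiver<serde_json::Value>) {
    while let Some(activity) = rx.recv().await {
        if let Err(e) = persist_activity(activity).await {
            eprintln!("failed to persist federated activity: {e}");
        }
    }
}

/// Placeholder for the real comment/vote/post insert path.
async fn persist_activity(_activity: serde_json::Value) -> anyhow::Result<()> {
    Ok(())
}
```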
A GitHub issue was opened on the topic of using message queues for outbound.
https://github.com/LemmyNet/lemmy/issues/3230
I think both inbound and outbound should have it.
Moving federation out of lemmy_server into an independent app and service would also allow reworking the backend without breaking federation.
Comments are the bulk data of the site, and most end users are only going to be reading what is ‘fresh’ on the site, loading data from the past 7 days. There is potential for tiered storage.