So here’s the deal with kbin: kbin uses Symfony Messenger processes, which are roughly equivalent to Sidekiq in Mastodon.
After I moved fedia from the Docker-hosted environment to a bare metal instance, I had all manner of database issues: the dump and reload didn’t work well and created many duplicate records. That caused the messenger services to die, and the queue of ActivityPub records to process grew huge. Restarting the messenger service worked, but it would never finish working through the backlog, so I increased the number of messenger workers to 16. That kept the queue nice and clean.
HOWEVER, it appears that running multiple messenger processes creates race conditions: image IDs are created and assigned to different entity records (like posts), but no actual image record is created. When kbin goes to draw a page, it runs a complex query to pull magazine info, post info, comment info, user info and all of their respective images. Those records LOOK like they have an image, but there is no actual image, and so kbin says 💩 I ain’t working and gives the wonderful 500 error.
Setting the messenger services back to 1 seems to at least not be making the problem worse, but now I have to go find all the broken database record linkages.
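For anyone curious what hunting for those broken linkages looks like, here is a rough sketch of the general idea, not the actual queries run on fedia. It uses an in-memory SQLite database so it runs anywhere, and the table and column names (`entry`, `image`, `image_id`) are assumptions made up for illustration rather than kbin's real schema. The technique is a plain anti-join: left join each record to the image table and keep the rows that point at an image ID with no matching image.

```python
import sqlite3


def find_dangling_image_refs(conn, table, fk_column="image_id"):
    """Return (row id, image id) pairs in `table` whose image id has no row in `image`.

    `table` and `fk_column` are interpolated into the SQL, so only pass
    trusted, hard-coded identifiers here.
    """
    query = f"""
        SELECT t.id, t.{fk_column}
        FROM {table} AS t
        LEFT JOIN image AS i ON i.id = t.{fk_column}
        WHERE t.{fk_column} IS NOT NULL  -- records with no image at all are fine
          AND i.id IS NULL               -- referenced image does not exist
        ORDER BY t.id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a tiny hypothetical schema with one healthy and one broken record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE image (id INTEGER PRIMARY KEY, file_path TEXT);
        CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT, image_id INTEGER);
        INSERT INTO image (id, file_path) VALUES (1, 'a.jpg');
        INSERT INTO entry (id, title, image_id) VALUES
            (10, 'ok post', 1),
            (11, 'no image', NULL),
            (12, 'broken post', 99);
    """)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_dangling_image_refs(conn, "entry"))  # [(12, 99)]
```

The same check would have to be repeated for every kind of record that can carry an image (posts, comments, users, magazines), and what to do with the hits, such as nulling the reference or restoring the image row, is a separate decision.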
Thanks for all the work you’ve put in! It seems to have been working fine for a while now.
Ouch! Thanks for all your hard work on this. As somebody who has one foot in the IT world, I can empathise with the difficulty of managing parallel processes.
Mine was stable when you first posted this, but after maybe an hour I started getting 500s again. Currently, if I try to look at subscriptions it throws that error; other areas seem OK at the moment.
There are definitely still a few magazines or threads or messages or posts or users or all of the above that are hosed up. I’ve been running mad SQL statements to try to hunt them down and I’m now into the long tail of issues.
What a mess, thanks Jerry for digging into this.
The magazine I created (photography) seems to be suffering from many 500 errors, which makes sense since it had several image posts.
When it works, the images appear to be there, but I also got more 500s than Indianapolis in May. Though at the moment it seems to be stable.
Have you seen any errors over the past 3 hours?