The Lemmy.ml front page has been full of nginx errors (500, 502, etc.), as well as 404 errors coming from Lemmy itself.

Every new Lemmy install begins with no votes, comments, posts, or users to test against. So problems related to performance, scaling, error handling, and stability under user load cannot easily be reproduced, given that we cannot download the established content of communities.

Either the developers believe the logs are of low quality and not useful for identifying problems in the code and design, or getting these logs in front of the technical community and identifying the underlying patterns of faults is being given too low a priority.

It’s also important to make each logged failure traceable to the place in the code where the specific timeout, crash, exception, or resource limit was encountered. Generic, non-unique messages reported by users and operations personnel only slow down server operators, programmers, database experts, etc.
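Since Lemmy is a Rust application, one way to get this is to stamp every failure with a short unique code plus the source location captured at compile time by the standard `file!()`, `line!()` and `module_path!()` macros. The sketch below is a minimal illustration under that assumption only: `FailureSite`, `failure_site!`, `format_failure`, `fetch_remote_post` and the `"FED-FETCH-TIMEOUT"` code are all hypothetical names, not Lemmy’s actual logging code.

```rust
use std::fmt;

/// Where a failure was observed. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct FailureSite {
    pub code: &'static str,   // short, unique, greppable identifier (hypothetical scheme)
    pub module: &'static str, // filled in by module_path!()
    pub file: &'static str,   // filled in by file!()
    pub line: u32,            // filled in by line!()
}

impl fmt::Display for FailureSite {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}] {} ({}:{})", self.code, self.module, self.file, self.line)
    }
}

/// Captures the call site of the macro invocation at compile time.
macro_rules! failure_site {
    ($code:expr) => {
        FailureSite {
            code: $code,
            module: module_path!(),
            file: file!(),
            line: line!(),
        }
    };
}

/// Builds a log line that names both the failure code and where it happened.
fn format_failure(site: &FailureSite, detail: &str) -> String {
    format!("ERROR {} {}", site, detail)
}

/// Hypothetical federation call that always fails, standing in for a real timeout.
fn fetch_remote_post() -> Result<(), String> {
    Err("timed out after 10s".to_string())
}

fn main() {
    if let Err(e) = fetch_remote_post() {
        // Each distinct failure point gets its own code, so two reports of
        // "timed out" from different places in the code can be told apart.
        let site = failure_site!("FED-FETCH-TIMEOUT");
        eprintln!("{}", format_failure(&site, &e));
    }
}
```

The point of the unique code is that a user or admin can paste one log line and a developer can grep straight to the call site; a real deployment would more likely feed the same fields into an existing logging crate rather than `eprintln!`.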

There are also a number of problems with testing federation, given that multiple servers are involved and we are trying not to bring down servers in front of end users. It’s absolutely critical that failures of servers to federate data be taken seriously, and that logging be enhanced so the causes of missing data on peer instances can be triangulated and tracked down to protocol design issues, code failures, network failures, etc. Major Lemmy sites doing large amounts of data replication are an extremely valuable source of data about errors and performance. Please, for the love of god, share these logs and let us look for the underlying causes of hard-to-reproduce crashes and failures!

I really hope the internal logs and details of the inner workings of the biggest Lemmy instances are shared more openly, with more eyes on how to keep scaling the application as the number of posts, messages, likes, and votes continues to grow every day. Thank you.

Three recently created communities: !lemmyperformance@lemmy.ml, !lemmyfederation@lemmy.ml, !lemmycode@lemmy.ml

  • sunaurus@lemm.ee · 1 year ago

    Hey buddy, I understand you’re frustrated, but I just want to make a few points:

    1. I have personally seen many instance admins and Lemmy contributors note, many times over the past weeks, that Lemmy is unoptimized and not ready for the current traffic
    2. I have myself mentioned it several times in announcements to users of my own Lemmy instance
    3. Lemmy maintainers have asked for help with optimization in several channels
    4. Lemmy maintainers are clearly working hard at fixing Lemmy issues and improving performance - just look at the work that went into 0.18 - the fact that it’s far from perfect is clear to everybody, but progress is constantly being made
    5. Lemmy maintainers have mentioned multiple times that their inboxes are full of notifications and DMs - it’s not that they’re brushing anything under the rug, it’s just that they’re not physically able to keep up with the volume of communication that is being thrown at them

    I really believe that you have some useful insights and can be very helpful for Lemmy, but I’m afraid that if you take this accusatory tone and blame people for not doing enough then that will overshadow anything helpful that you’re actually saying.

    Having said all that, if you would like to take a look at some stats about queries on lemm.ee (a Lemmy instance with 4k users - definitely much smaller than lemmy.ml), I have put together a spreadsheet here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSPpqM6QCZYAAvnWe8p-xxN553ukRIquHw71j3nB763x7TNeqeUO-Oss51yPC7zVaT2x4jll39NCeMu/pubhtml#

    • RoundSparrow (OP) · 1 year ago

      > Lemmy maintainers have asked for help with optimization in several channels

      I do not see them using Lemmy itself to actually discuss the problems of Lemmy. Specifically for lemmy.ml and the developers’ relationship with this server, crash logs are not being shared.

      10 days ago: https://lemmy.ml/post/1271936

      I cannot emphasize the title of the post you are reading enough: “Lemmy as a project has suffered all month because Lemmy.ml has not been sharing critical logs from Nginx and Lemmy’s code logging itself”

      Logs, logs, logs. Why were these crash logs not shared as part of the Lemmy project? When the busiest server in the whole project is not sharing its Rust code logs and crashes, what are those of us working on the SQL and architecture problems supposed to do? I didn’t even report 1 in 100 of the crashes I was experiencing.

      It is a peer-to-peer network, server to server, and the central hub has encouraged everyone to run out and create new servers without any concern for reporting the crashes going on within the central hub itself. I just don’t get why everyone here is defending such behavior and leadership.

      What I saw was the sharing of CONCLUSIONS: that the federation worker count was the problem and needed to be increased. No, the problem is fundamental: the whole Rust application’s automatically generated SQL statements, the lack of data caching, and the lack of a proper MTA and queue for inbound and outbound federation data. Just declaring the federation worker count to be the problem and making the value infinite did nothing to get at the problems that sharing the server crash logs would have exposed.

      On June 14, the GitHub issue on “Scaling Federation” was CLOSED by project leadership! Meanwhile, lemmy.ml was crashing for me every hour, and failing to federate with any reliability too. https://lemmy.ml/post/1271936 was opened on June 15, the day after this GitHub issue was CLOSED:

      The DDoS is coming from WITHIN THE HOUSE. Lemmy’s performance problems are causing federation to bring down peer servers, and the LOGS of Rust code exceptions that are being KEPT SECRET will reveal this! Sharing the logs and making a federation-wide announcement that the hub is failing at data exchange is critical, not optional.

      It’s sad to me that the leadership of this project can’t just come out and openly admit it is an “experimental” and “unstable” project, and is instead ignoring https://lemmy.ml/post/1271936 and bragging on GitHub that it is “high performance Rust”. It might have seemed high performance when you sent 8 whole test messages to 4 servers a day, but that isn’t what “high performance” means. It is depressing to see such denial, and the people who believe in the “reality distortion field” around the project.