• RoundSparrowOPM · 2 years ago

    The page we are on, a “single posting page” (https://lemmy.ml/post/1160776), is what I’m talking about in this comment.

    It looks like we have an integer primary key for each post, 1160776 being the one you are currently reading. So that can serve as the key for any secondary persistent caching system (disk file storage or a NoSQL database, typically).
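
    To make that concrete, here is a minimal sketch (hypothetical names, not Lemmy’s actual code) of an in-memory cache keyed by that integer id, where the value is the pre-rendered post payload. In practice the backing store would more likely be a disk file named after the id, Redis, or a NoSQL store:

    ```rust
    use std::collections::HashMap;

    /// Hypothetical cache keyed by the post's integer primary key.
    /// The value would be the pre-rendered post + comment payload (JSON text),
    /// ready to return without touching the main database.
    struct PostCache {
        entries: HashMap<i64, String>, // post_id -> rendered JSON
    }

    impl PostCache {
        fn new() -> Self {
            Self { entries: HashMap::new() }
        }

        /// Return the cached payload for a post, if we have one.
        fn get(&self, post_id: i64) -> Option<&String> {
            self.entries.get(&post_id)
        }

        /// Store (or overwrite) the rendered payload for a post.
        fn put(&mut self, post_id: i64, rendered_json: String) {
            self.entries.insert(post_id, rendered_json);
        }
    }

    fn main() {
        let mut cache = PostCache::new();
        cache.put(1_160_776, r#"{"post_id":1160776,"comments":[]}"#.to_string());
        assert!(cache.get(1_160_776).is_some());
    }
    ```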

    I’ve been building social media message systems since the 8-bit days, 1984, on a 1 MHz processor with a very slow floppy disk for storage. I’ve also worked on large-scale enterprise systems and have seen the problems of data recovery and surge activity (like a website being featured on a TV show and getting a big surge of users).

    Here are some general brainstorming thoughts to start with:

    1. I would consider a different code path for a posting that has fewer than 20 comments vs. one that has 20 or more (see the sketch after this list). If a single posting has 200 comments, Reddit (for example) doesn’t load them all by default, and the maximum the user interface allows (on old Reddit) is 500.
    2. I would consider a different code path for a posting that is fresh and has recent activity vs. one that is inactive. Reddit ‘archives’ old posts at 6 months, as an example. But I would even consider treating anything that has not had a vote or a comment in the past 12 hours as a different code path in terms of potential caching.
    3. Nested replies beyond a certain level typically become a user interface issue, so I’ve seen Reddit use a “load more” link as an optimization to get past, say, 4 levels of thread depth on a reply chain.
    4. Pagination of comments. If you get a posting with 2500 comments, you need some way to navigate into blocks of them besides just changing the sort order.
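
    Pulling points 1–4 together, here is a rough sketch of what the branch point might look like. The type names and thresholds are purely illustrative, just restating the list above as code:

    ```rust
    use std::time::Duration;

    /// Hypothetical rendering strategies; names and thresholds are
    /// illustrative, not anything Lemmy actually implements.
    enum RenderPath {
        /// Few comments and recent activity: query live and render directly.
        LiveSmall,
        /// Many comments: serve from (or rebuild) a cached comment tree,
        /// paginated and depth-limited with "load more" stubs.
        CachedPaginated { page_size: usize, max_depth: usize },
        /// No recent votes/comments: serve a long-lived cached copy.
        Archived,
    }

    fn choose_path(comment_count: usize, since_last_activity: Duration) -> RenderPath {
        // "Inactive" here borrows the 12-hour idea from point 2 above.
        if since_last_activity > Duration::from_secs(12 * 60 * 60) {
            RenderPath::Archived
        } else if comment_count >= 20 {
            RenderPath::CachedPaginated { page_size: 200, max_depth: 4 }
        } else {
            RenderPath::LiveSmall
        }
    }

    fn main() {
        // A busy post with 2500 comments that saw a vote 5 minutes ago:
        match choose_path(2500, Duration::from_secs(300)) {
            RenderPath::CachedPaginated { page_size, max_depth } => {
                println!("cached path, {page_size} per page, depth limit {max_depth}");
            }
            _ => println!("other path"),
        }
    }
    ```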

    Fallback behaviors:

    1. Under extreme server load, is it acceptable to not show the comments and return just the posting - and inform the user via a stub comment that says they should return in a few minutes? (A sketch of this degraded path follows this list.)
    2. If you are referencing the status of a user profile (the user who made an individual comment), consider degrading that if the server is too busy. Example: a comment is from 8 hours ago, but the user deleted their account 2 hours ago. Having to look up the profile for every single comment may be something you consider degrading or skipping under heavy load, or on postings with a high number of comments.
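
    For fallback 1, a minimal sketch of the degraded handler (all names hypothetical; where the “overloaded” signal comes from is a separate question):

    ```rust
    /// Hypothetical degraded response: under heavy load, return the post
    /// body plus a single stub "comment" asking the reader to retry,
    /// instead of walking the full comment tree and every commenter's profile.
    struct PostView {
        post_id: i64,
        body: String,
        comments: Vec<String>,
    }

    fn build_post_view(post_id: i64, body: String, overloaded: bool) -> PostView {
        if overloaded {
            return PostView {
                post_id,
                body,
                comments: vec![
                    "Comments are temporarily unavailable; please check back in a few minutes.".to_string(),
                ],
            };
        }
        // Normal path (not shown here): load comments, look up each author's
        // profile status, sort, paginate, etc.
        PostView { post_id, body, comments: Vec::new() }
    }

    fn main() {
        let v = build_post_view(1_160_776, "post body".to_string(), true);
        println!("post {} ({} chars), {} comment(s) shown", v.post_id, v.body.len(), v.comments.len());
    }
    ```
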
    • RoundSparrowOPM · edited · 2 years ago

      Some thoughts on what to consider… again, we have an integer primary key for each post, 1160776 being the current one…

      I would have a dirty-status database table (or even something outside the database) for each posting, maybe just a timestamp of when the post was last write-touched: an edit to the post, a vote change on the post, a vote change on a comment in that post, a deleted or edited comment, etc. (Twitter and Reddit in recent years keep and expose traffic accounting of every time a post is read… that is a lot more data writing, but community mods and end users may want that information.)
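
      As a sketch, the tracker could be as small as a (post_id, last-touched timestamp) pair; in Rust terms (hypothetical names, and whether it lives in Postgres, Redis, or process memory is a separate decision):

      ```rust
      use std::collections::HashMap;
      use std::time::SystemTime;

      /// Hypothetical dirty-status tracker: one "last write-touched" timestamp
      /// per post id, bumped by any write that should invalidate the cached
      /// rendering (post edit, vote, new/edited/deleted comment, ...).
      struct DirtyStatus {
          last_touched: HashMap<i64, SystemTime>,
      }

      impl DirtyStatus {
          fn new() -> Self {
              Self { last_touched: HashMap::new() }
          }

          /// Call from every write path that affects how the post renders.
          fn touch(&mut self, post_id: i64) {
              self.last_touched.insert(post_id, SystemTime::now());
          }

          /// When was this post last write-touched, if ever?
          fn last_write(&self, post_id: i64) -> Option<SystemTime> {
              self.last_touched.get(&post_id).copied()
          }
      }

      fn main() {
          let mut dirty = DirtyStatus::new();
          dirty.touch(1_160_776); // e.g. someone just upvoted a comment on this post
          assert!(dirty.last_write(1_160_776).is_some());
      }
      ```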

      You would use this dirty-status information as the basis for deciding when a cached result gets rebuilt from the live database. I would perhaps think in terms of 10 or 30 seconds: if an intermediate cached result hasn’t been regenerated in over 30 seconds, you rebuild it (new comments or vote-count changes likely being the reason you need to rebuild it). You probably want to do this in a more linear, batch style, because you have to consider the case where your API is being crawled or heavily hit and many concurrent posts all trigger rebuilds in a short period of time.
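
      Sketching that staleness check as a helper (hypothetical; a real version would run as a batch pass over the dirty list rather than on every request):

      ```rust
      use std::time::{Duration, SystemTime};

      /// Hypothetical rebuild policy: regenerate the cached rendering only if
      /// the post has been write-touched since the cache was built AND the
      /// cached copy is older than some minimum interval (30 s here), so a
      /// burst of votes does not trigger a rebuild per vote.
      fn needs_rebuild(
          cached_at: Option<SystemTime>,
          last_write: Option<SystemTime>,
          min_age: Duration,
      ) -> bool {
          match (cached_at, last_write) {
              (None, _) => true,        // never cached yet
              (Some(_), None) => false, // cached, and no write recorded since
              (Some(cached), Some(written)) => {
                  written > cached
                      && cached.elapsed().unwrap_or(Duration::ZERO) >= min_age
              }
          }
      }

      fn main() {
          let now = SystemTime::now();
          let cached_at = Some(now - Duration::from_secs(45));
          let last_write = Some(now - Duration::from_secs(5));
          // Cached 45 s ago, touched 5 s ago: rebuild on the next batch pass.
          assert!(needs_rebuild(cached_at, last_write, Duration::from_secs(30)));
      }
      ```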

      I understand that in an ideal world you let the DBMS do all of this work for you intelligently. That’s the approach Lemmy is currently built on: the DBMS is assumed to be smart enough to manage buffers and caches internally to deal with lots of repeat-output activity. But classically, database systems are not that smart and not really that fast at shuffling lots of text data through database drivers. That is pretty much why NoSQL databases became so popular… performance issues (and predictability under surging loads).

    • Barbarian · 2 years ago

      Under extreme server load, is it acceptable to not show the comments and return just the posting - and inform the user via a stub comment that says they should return in a few minutes?

      From a user perspective, I think returning a small number of comments (maybe the top 10?) would be way better than none (followed by some “server is under extreme load, try later” message). It at least gives the user a general idea of the vibe while minimizing load.

      • RoundSparrowOPM · 2 years ago

        It’s the sorting that people do on a Reddit-style posting (old/new/hot) that is kind of the problem: the database has to touch every top-level comment to sort them. Touching comments at all is a live wire ;)
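
        A toy sketch of why, assuming a simplified Reddit/Lemmy-style “hot” formula (not the exact one Lemmy uses): the rank mixes score with age, so every top-level comment has to be ranked before you can pick even the top 10.

        ```rust
        /// Minimal illustration of why "hot" sorting touches every top-level
        /// comment: the rank depends on score AND age, so each comment's rank
        /// must be computed before the top N can be selected. The formula is a
        /// simplified decay curve, assumed for illustration only.
        struct CommentRow {
            id: i64,
            score: i64,
            age_hours: f64,
        }

        fn hot_rank(score: i64, age_hours: f64) -> f64 {
            ((1 + score.max(0)) as f64).log10() / (age_hours + 2.0).powf(1.8)
        }

        fn top_n_hot(mut comments: Vec<CommentRow>, n: usize) -> Vec<i64> {
            // Every row gets ranked, even though only the first `n` are returned.
            comments.sort_by(|a, b| {
                hot_rank(b.score, b.age_hours)
                    .partial_cmp(&hot_rank(a.score, a.age_hours))
                    .unwrap()
            });
            comments.into_iter().take(n).map(|c| c.id).collect()
        }

        fn main() {
            let rows = vec![
                CommentRow { id: 1, score: 50, age_hours: 20.0 },
                CommentRow { id: 2, score: 5, age_hours: 0.5 },
                CommentRow { id: 3, score: 200, age_hours: 100.0 },
            ];
            println!("{:?}", top_n_hot(rows, 2));
        }
        ```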