Lemmy currently uses distinct tables like post_like: (post_id, person_id, score) and post_saved. Unfortunately this causes performance issues when we have to join many of these tables to create views.

One suggestion in this PR is to combine these into a single post_action table, with many optional columns depending on the action. This solution scares me a little, because with so many optional columns I’m afraid we might lose data integrity and many of our constraints.
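
To make the integrity worry concrete, here is a minimal sketch of how a combined table could still carry constraints on its optional columns. It is not the schema from the PR: the column names (`like_score`, `liked_at`, `saved_at`) and the helpers `make_db` and `try_insert` are hypothetical, and SQLite stands in for PostgreSQL so the example runs with Python's standard library alone. The CHECK expressions themselves should carry over to PostgreSQL largely unchanged.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Lemmy's actual migration.
SCHEMA = """
CREATE TABLE post_actions (
    post_id    INTEGER NOT NULL,
    person_id  INTEGER NOT NULL,
    like_score INTEGER,   -- NULL means "no vote"
    liked_at   TEXT,      -- NULL means "no vote"
    saved_at   TEXT,      -- NULL means "not saved"
    PRIMARY KEY (post_id, person_id),
    -- A vote is either fully present or fully absent.
    CHECK ((like_score IS NULL) = (liked_at IS NULL)),
    -- Only upvotes and downvotes are valid scores.
    CHECK (like_score IS NULL OR like_score IN (-1, 1)),
    -- A row must record at least one action.
    CHECK (like_score IS NOT NULL OR saved_at IS NOT NULL)
);
"""


def make_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, post_id, person_id, like_score, liked_at, saved_at):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO post_actions VALUES (?, ?, ?, ?, ?)",
                (post_id, person_id, like_score, liked_at, saved_at),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this layout, a score without its timestamp, a score other than -1 or 1, and a row with no action at all are each rejected by the database rather than by application code.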

Is there a better way of doing this in SQL?

  • xmunk@sh.itjust.works
    ↑12 · 9 months ago

    Have you considered keeping the data writes to their current tables and deploying an indexed materialized view over those tables for more efficient reads? You can normalize your data and eat your denormalized cake too!

    Let me know if you have any technical questions about how to do it.

    • Dessalines [OP] [A]
      ↑3 · 9 months ago

      We stopped using materialized views in lemmy a few years ago, because of caching / stale data issues.

      • xmunk@sh.itjust.works
        ↑6 · 9 months ago

        That’s fair - though you can use trigger based refreshes, there will always be some kind of inconsistency window. That was my only real silver bullet suggestion. Otherwise, you’ll just need to modify the underlying structure.
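
        As a rough illustration of the trigger-based idea, and explicitly not a materialized view: SQLite (used here only so the sketch runs with Python's standard library) has no materialized views, so this swaps in a plain summary table that triggers keep in step with post_like. The `post_like_totals` table, the trigger names, and the helpers `make_db` and `total_for` are all hypothetical. Because these triggers run inside the writing statement, this variant moves the cost onto every write instead of leaving a refresh window.

```python
import sqlite3

# Hypothetical tables and triggers; not Lemmy's real schema.
SCHEMA = """
CREATE TABLE post_like (
    post_id   INTEGER NOT NULL,
    person_id INTEGER NOT NULL,
    score     INTEGER NOT NULL,
    PRIMARY KEY (post_id, person_id)
);
CREATE TABLE post_like_totals (
    post_id INTEGER PRIMARY KEY,
    total   INTEGER NOT NULL DEFAULT 0
);
CREATE TRIGGER post_like_ai AFTER INSERT ON post_like BEGIN
    INSERT OR IGNORE INTO post_like_totals (post_id, total)
        VALUES (NEW.post_id, 0);
    UPDATE post_like_totals SET total = total + NEW.score
        WHERE post_id = NEW.post_id;
END;
CREATE TRIGGER post_like_au AFTER UPDATE OF score ON post_like BEGIN
    UPDATE post_like_totals SET total = total - OLD.score + NEW.score
        WHERE post_id = NEW.post_id;
END;
CREATE TRIGGER post_like_ad AFTER DELETE ON post_like BEGIN
    UPDATE post_like_totals SET total = total - OLD.score
        WHERE post_id = OLD.post_id;
END;
"""


def make_db():
    """Create an in-memory database with the base table and its summary."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def total_for(conn, post_id):
    """Read the precomputed score for a post (0 if it has never been voted on)."""
    row = conn.execute(
        "SELECT total FROM post_like_totals WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0] if row else 0
```

        Reads then hit `post_like_totals` directly instead of aggregating post_like. The sketch does not handle a vote being moved to a different post_id, which a real version would need to consider.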

        From what I’ve read of the change, it doesn’t look too dangerous. My only concern would be around concurrent writes where someone upvotes and saves a post in rapid succession. The logic for both actions needs to support a post_action record existing or not existing and potentially starting to exist mid-write. There will be some ugly edge cases with that approach, but it should be doable if carefully done.

        Sorry if that isn’t super helpful, but thanks for your work maintaining the platform. It’s appreciated!

        • Dessalines [OP] [A]
          ↑2 · 9 months ago

          No probs, thx for your help! We should be okay with writes in close succession, as long as it only updates that specific column. I think I’m coming around to the idea that a post_action table would be fine.
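
          A minimal sketch of what "only updates that specific column" could look like, assuming a hypothetical post_actions table with `like_score` and `saved_at` columns. Each action is an upsert that inserts the row if it is missing and otherwise touches only its own column. SQLite (3.24 or newer) stands in for PostgreSQL here; the `ON CONFLICT ... DO UPDATE` syntax is shared, but this single-connection demo only shows that the result is independent of ordering and does not exercise real concurrency.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical post_actions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE post_actions (
               post_id    INTEGER NOT NULL,
               person_id  INTEGER NOT NULL,
               like_score INTEGER,
               saved_at   TEXT,
               PRIMARY KEY (post_id, person_id))"""
    )
    return conn


def like_post(conn, post_id, person_id, score):
    """Insert the row if needed; otherwise change only like_score."""
    with conn:
        conn.execute(
            """INSERT INTO post_actions (post_id, person_id, like_score)
               VALUES (?, ?, ?)
               ON CONFLICT (post_id, person_id)
               DO UPDATE SET like_score = excluded.like_score""",
            (post_id, person_id, score),
        )


def save_post(conn, post_id, person_id, saved_at):
    """Insert the row if needed; otherwise change only saved_at."""
    with conn:
        conn.execute(
            """INSERT INTO post_actions (post_id, person_id, saved_at)
               VALUES (?, ?, ?)
               ON CONFLICT (post_id, person_id)
               DO UPDATE SET saved_at = excluded.saved_at""",
            (post_id, person_id, saved_at),
        )


def get_action(conn, post_id, person_id):
    """Return (like_score, saved_at) for one person on one post, or None."""
    return conn.execute(
        "SELECT like_score, saved_at FROM post_actions"
        " WHERE post_id = ? AND person_id = ?",
        (post_id, person_id),
    ).fetchone()
```

          Whether the like or the save arrives first, the row ends up with both values, and a later vote change leaves `saved_at` alone.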

          • xmunk@sh.itjust.works
            ↑4 · 9 months ago

            Yea, I’ve worked as a data architect - I share your knee-jerk fear of denormalization but I read the proposal over and I agree: it’s dangerous if done sloppily but as long as you’re careful it’s do-able.

            I’ve been working for a while on a product for medical professionals. It’s an absolute blessing to my hairline that doctors consider anything faster than two minutes to be essentially instantaneous. Unfortunately, decades of horribly written software haven’t convinced the Lemmy user base that a better world can’t exist. Locks and setting aside a few dozen milliseconds for mat view refreshes are perfectly acceptable in my day job… but these darn Lemmy users expect a performant and stable product.

            Just again though, thanks for the good work!

  • eluvatar@programming.dev
    ↑12 ↓1 · 9 months ago

    This isn’t SQL specific, but a PR whose target is improving performance should measure the performance. It can be a lot of work, especially to get a representative dataset, but it will be worth it: you can then make tweaks to maximize performance, with numbers in hand. Who knows, maybe this new design has a flaw and the performance is actually worse, or maybe it’s better but not worth the change. Right now you have no idea.
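
    A toy sketch of that kind of measurement, under loud assumptions: the table layouts, the two queries, and the helpers `build` and `timed` are hypothetical; the data is synthetic; and SQLite stands in for PostgreSQL. It only shows the shape of the exercise (load the same data into both layouts, check that both queries return identical rows, then time them). Its numbers say nothing about Lemmy itself; on PostgreSQL the equivalent step would be running `EXPLAIN (ANALYZE, BUFFERS)` on the real queries against a representative dataset.

```python
import random
import sqlite3
import time

# Per-post vote and saved flag for one person, in each hypothetical layout.
SEPARATE = """
SELECT p.id, l.score, s.post_id IS NOT NULL
FROM post p
LEFT JOIN post_like  l ON l.post_id = p.id AND l.person_id = ?
LEFT JOIN post_saved s ON s.post_id = p.id AND s.person_id = ?
ORDER BY p.id"""

COMBINED = """
SELECT p.id, a.like_score, a.saved IS NOT NULL
FROM post p
LEFT JOIN post_actions a ON a.post_id = p.id AND a.person_id = ?
ORDER BY p.id"""


def build(n_posts=200, n_people=50, seed=0):
    """Fill both layouts with the same synthetic actions."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE post (id INTEGER PRIMARY KEY);
        CREATE TABLE post_like (post_id INTEGER, person_id INTEGER,
            score INTEGER, PRIMARY KEY (post_id, person_id));
        CREATE TABLE post_saved (post_id INTEGER, person_id INTEGER,
            PRIMARY KEY (post_id, person_id));
        CREATE TABLE post_actions (post_id INTEGER, person_id INTEGER,
            like_score INTEGER, saved INTEGER,
            PRIMARY KEY (post_id, person_id));
    """)
    for post_id in range(n_posts):
        conn.execute("INSERT INTO post VALUES (?)", (post_id,))
        for person_id in range(n_people):
            score = rng.choice([None, 1, -1])
            saved = 1 if rng.random() < 0.2 else None
            if score is not None:
                conn.execute("INSERT INTO post_like VALUES (?, ?, ?)",
                             (post_id, person_id, score))
            if saved:
                conn.execute("INSERT INTO post_saved VALUES (?, ?)",
                             (post_id, person_id))
            if score is not None or saved:
                conn.execute("INSERT INTO post_actions VALUES (?, ?, ?, ?)",
                             (post_id, person_id, score, saved))
    conn.commit()
    return conn


def timed(conn, sql, params, repeats=20):
    """Return the query's rows and its best wall-clock time over `repeats` runs."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return rows, best
```

    Calling `timed(conn, SEPARATE, (3, 3))` and `timed(conn, COMBINED, (3,))` on the same `build()` connection gives two row lists that should be equal, plus one timing for each layout to compare.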

    • xmunk@sh.itjust.works
      ↑4 · 9 months ago

      Something like a pgmustard or depesz analysis of some painful real world queries can be invaluable here.

    • morrowind
      ↑10 · 9 months ago

      I’m pretty sure the developers have better than ChatGPT-level knowledge. There’s a reason they’re asking for experts.

    • xmunk@sh.itjust.works
      ↑6 · 9 months ago

      ChatGPT has its uses; performance tuning is not (yet) one of them. I think ML in performance tuning will eventually be a huge deal, once we can leverage rapid iterations and culling to try a large number of small tweaks and compare their performance… I haven’t seen a good solution to this yet. It’s a really complex problem, but I think it’s inevitable.

    • RonSijm@programming.dev
      ↑2 ↓3 · 9 months ago

      You’re getting downvoted, but it’s not the worst idea to ask it in the general sense:

      Q: In which cases is denormalization of SQL tables a good idea, especially in PostgreSQL?

      A:

      Denormalization of SQL tables, including PostgreSQL, might be a good idea in the following cases:

      1. Read-Heavy Workloads: When your system mostly reads data from tables rather than writing or updating them, denormalization might enhance the performance of the system. It does this by minimizing the number of joins required to fetch the data.
      2. Performance Optimization: Denormalization can speed up data retrieval times by storing related data in a single table, reducing the need for complex, time-consuming joins.
      3. Complex Queries: If the database is dealing with extremely complicated queries, then making these queries simple, clear, and less complicated can be managed by denormalizing the tables.
      4. Reporting and Analytics: In a reporting database or data warehouse, denormalization helps in faster data retrieval and simplifies queries. It can also make it easier to process aggregated data and generate reports.
      5. Data Redundancy: To increase data redundancy, you might consider denormalization. It helps in data recovery as well.
      6. Schema Simplicity: Denormalization can also simplify the application code by reducing the number of tables that need to be addressed.

      Remember, with denormalization comes redundancies and anomalies, more complex updates/inserts/deletes due to redundant data, along with increased storage costs due to redundancy. So the decision to denormalize should be made carefully considering all these aspects. It’s often a balancing act between improved read performance versus drawbacks like increased storage requirements and potential data inconsistencies.