It seems pretty clear that all of Huffman’s recent decisions are driven by Reddit’s hoped-for IPO. On one front is the ugly fact that Reddit’s valuation is sinking.

  • blabboy
    1 year ago

    I work with LLMs, and yes, the current barrier is the need for more data. These models improve as they get larger and are trained on more data, so you really need all the data you can get your hands on.

      • Em Adespoton@lemmy.ca
        1 year ago

        That’s actually where Reddit is useful as a training corpus, because different subreddits are at different levels of quality. It’s pretty easy to identify the high-quality ones for training answers, and the low-quality ones are excellent for training basic transforms (making sense of an input that is niche and flawed in some way).

        There are very few other sources of lightly structured training data that span all of humanity broken down into topics, graded to different levels of quality. Over time, the data will become less relevant as society moves on, so a living training set is important.

        Having said that, Lemmy could prove to be an even better training source for expert system LLMs, as there could be curated instances of high quality with the ability to pull in more federated data as needed.

      • blabboy
        1 year ago

        Ah yeah, I’ve seen this document; it argues that we need both quality and quantity. Really, without quantity we cannot scale these large deep learning algorithms (check out the Chinchilla paper from DeepMind, for example: they estimate that compute-optimal training needs roughly 20 training tokens per model parameter).
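
To put rough numbers on that rule of thumb, here is a minimal sketch of the arithmetic. It assumes the ratio is a flat 20 tokens per parameter, which is only an approximation of the paper's estimate; the names `TOKENS_PER_PARAM` and `optimal_tokens` are made up for illustration and do not come from any library.

```python
# Back-of-the-envelope sketch of the "about 20 tokens per parameter" rule of thumb.
# Assumption: a flat ratio of 20, which only approximates the paper's estimate.
TOKENS_PER_PARAM = 20


def optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Return the approximate number of training tokens for a model of n_params parameters."""
    return n_params * tokens_per_param


if __name__ == "__main__":
    # Illustrative model sizes (parameter counts), not taken from the thread.
    for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
        tokens = optimal_tokens(n_params)
        print(f"{n_params / 1e9:.0f}B parameters -> about {tokens / 1e12:.2f}T tokens")
```

For a 70B-parameter model this works out to about 1.4 trillion tokens, which matches the size and token count reported for Chinchilla itself. That gives a sense of why the amount of available text becomes the bottleneck as models grow.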