I work with LLMs, and yes, the barrier right now is needing more data. These models get better as they grow larger and are trained on more data, so you really need all the data you can get your hands on.
Ah yeah, I’ve seen this document; it argues that we need both quality and quantity. Really, without quantity we can’t scale these large deep learning models (check out the Chinchilla paper from DeepMind, for example: they estimate we need roughly 20 training tokens per model parameter for compute-optimal training).
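
To get a rough sense of why quantity becomes the bottleneck, here’s a minimal sketch of that ~20:1 heuristic from the Chinchilla paper (Hoffmann et al., 2022). The constant and the example model sizes are my own illustration, not exact figures from the paper:

```python
# Rough sketch of the Chinchilla heuristic: for compute-optimal training,
# use roughly 20 training tokens per model parameter.
# The 20:1 ratio is an approximation; model sizes below are illustrative.

CHINCHILLA_TOKENS_PER_PARAM = 20

def optimal_training_tokens(num_params: float) -> float:
    """Approximate compute-optimal token budget for a given parameter count."""
    return CHINCHILLA_TOKENS_PER_PARAM * num_params

for params in (7e9, 70e9, 175e9):  # e.g. 7B, 70B, 175B parameter models
    tokens = optimal_training_tokens(params)
    print(f"{params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.2f}T training tokens")
```

Running this, a 70B-parameter model already wants on the order of 1.4 trillion training tokens, which is why you end up scraping every source of text you can find.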