This was excellent and actually something I was looking for. Thank you so much <3
Holy Flurking Schnitt. So . . . no one understands exactly how it works or even how it works past the first couple of abstractions.
That explains so much.
This is a really terrific explanation. The author puts some very technical concepts into accessible terms, but not so far from reality as to cloud the original concepts. Most other attempts I’ve seen at explaining LLMs or any other NN-based pop tech are either waaaay oversimplified, heavily abstracted, or are meant for a technical audience and are dry and opaque. I’m saving this for sure. Great read.
In the language of classical probability theory: the models learn the probability distribution of words in language from their training data, and then approximate this distribution using their parameters and network structure.
When given a prompt, they then calculate the conditional probabilities of the next word, given the words they have already seen, and sample from that space.
It is a rather simple idea, all of the complexity comes from trying to give the high-dimensional vector operations (that it is doing to calculate conditional probabilities) a human meaning.
I’d like to add one more layer to this great explanation.
Usually, this kind of predictions should be made in two steps:
-
calculate the conditional probability of the next word (given the data), for all possible candidate words;
-
choose one word among these candidates.
The choice in step 2. should be determined, in principle, by two factors: (a) the probability of a candidate, and (b) also a cost or gain for making the wrong or right choice if that candidate is chosen. There’s a trade-off between these two factors. For example, a candidate might have low probability, but also be a safe choice, in the sense that if it’s the wrong choice no big problems arise – so it’s the best choice. Or a candidate might have high probability, but terrible consequences if it were the wrong choice – so it’s better to discard it in favour of something less likely but also less risky.
This is all common sense! but it’s at the foundation of the theory behind this (Decision Theory).
The proper calculation of steps 1. and 2. together, according to fundamental rules (probability calculus & decision theory) would be enormously expensive. So expensive that something like chatGPT would be impossible: we’d have to wait for centuries (just a guess: could be decades or millennia) to train it, and then to get an answer. This is why Large Language Models do two approximations, which obviously can have serious drawbacks:
-
they use extremely simplified cost/gain figures – in fact, from what I gather, the researchers don’t have any clear idea of what they are;
-
they directly combine the simplified cost/gain figures with probabilities;
-
They search for the candidate with the highest gain+probability combination, but stopping as soon as they find a relatively high one – at the risk of missing the one that was actually the real maximum.
(Sorry if this comment has a lecturing tone – it’s not meant to. But I think that the theory behind these algorithms can actually be explained in very common-sense term, without too much technobabble, as @TheChurn’s comment showed.)
-
Superb summary!
Very interesting. But also very complex.
In a very badly and short as possible way, they are very complex probabilities machines, which can compare the probability of each words in your sentence to chose the words to say to you.
There is also Kyle Hill who explained how Chatgpt works in a video https://youtu.be/-4Oso9-9KTQ
however from what I remember I think the article has more info on how the tool manages to reduce confusion and history on the evolution with gpt 1, 2 and 3.
But what helped me understand easier was the video, even if it doesn’t describe every thing to the tiniest detail.
“As a result, no one on Earth fully understands the inner workings of LLMs. Researchers are working to gain a better understanding, but this is a slow process that will take years—perhaps decades—to complete.”
Maybe I missed it in the article, but can someone please explain-like-i’m-5 how this is possible.
It’s not like we are interacting with a biologic with mysterious chemistry. Everything about LLMs are completely man-made.
It’s not it’s biological origins that make it hard to understand the brain, but the complexity. For example, we understand how the heart works pretty well.
While LLMs are nowhere near as complex as a brain, they’re complex enough to make it extremely difficult to understand.
But then there comes the question: if they’re so difficult to understand, how did people make them in the first place?
The way they did it actually bears some similarities to evolution. They created an “empty” model - a large neural network that wasn’t doing anything useful or meaningful. But it depended on billions of parameters, and if you tweak a parameter, its behavior changes slightly.
Then they expended enormous amount of computing power tweaking parameters, each tweak slightly improving its ability to model language. While doing this, they didn’t know what each number meant. They didn’t know how or why each tweak was improving the model. Just that each tweak was making an improvement.
Unlike evolution, each tweak isn’t random. There’s an algorithm called back-propagation that can tell you how to tweak the neural network to make it predict some known data slightly better. But unfortunately it doesn’t tell you anything about the “why” this tweak is good, or “what” each parameter change means. Hence why we don’t understand how LLMs work.
One final clarification: It’s not a complete black box. We do have some understanding of how LLM works, mostly on high level. Kind of like we have some basic understanding of how a brain works. We understand LLMs much better than brains, of course.
We don’t understand it because no one designed it. We designed how to train a nn, we designed some parts of the structure, but not the individual parts inside. For the largest LLMs there are upwards of 70 billion different parameters. Each being individual numbers they were can tweak. The are just too many of them to understand what any individual one does, and since we just left a optimization algorithm do it’s optimizing we can’t really even know what groups of them do.
We can get around this, we can study it like we do the brain. Instead of looking at what an individual part does, group them together and figure out how they group influences things (AI explanability), or even get a different NN to look at it and generate an explanation (post hoc rationale generation). But that’s not really the same as actually understand what it is actually doing under the hood. What it is doing under the hood is more or less fundamentally unknowable, there is just to much information and it’s not well organized enough for us to be able to understand. Maybe one day we will be able to abstract what is going on in there and organize it in an understandable manner, but not yet.
Very interesting read.
Awesome! I always wondered how Skynets got made.