What’s remarkable about DeepSeek V3 is its use of a mixture-of-experts approach. While it has 671 billion parameters overall, it only activates about 37 billion per token, which makes it very efficient. For comparison, Meta’s Llama 3.1 uses all of its 405 billion parameters at once. It also has a 128K-token context window, so it can process and understand very long documents, and it generates text at 60 tokens per second, twice as fast as GPT-4o.
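If it helps, here’s roughly what top-k expert routing looks like as a toy sketch. This is the generic MoE idea, not DeepSeek’s actual implementation, and the layer sizes and expert count here are made up:

```python
# Toy sketch of top-k mixture-of-experts routing (generic idea, not
# DeepSeek V3's actual code; sizes and expert count are made up).
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2                      # hypothetical sizes
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,  # expert W_in
     rng.standard_normal((4 * d_model, d_model)) * 0.02)  # expert W_out
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02  # gating weights

def moe_layer(x):
    """Route one token vector x through only top_k of the n_experts."""
    logits = x @ router                        # score every expert
    chosen = np.argsort(logits)[-top_k:]       # keep the k highest-scoring
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                   # softmax over the chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # simple ReLU FFN expert
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)   # only 2 of the 8 experts did any work for this token
print(y.shape)         # (64,)
```

For each token only the selected experts do any work, which is why the active parameter count can be so much smaller than the total.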
I’d say the fact that it was trained in FP8 is an even bigger deal. Spending less than $6 million to train something this good changes how training is approached. Cost has been the big barrier to training until now, so this could have a real impact on how it’s done going forward.
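Some back-of-the-envelope arithmetic on why the precision alone matters (this ignores optimizer states, activations, and the fact that real recipes keep some tensors in higher precision):

```python
# Rough memory arithmetic for FP8 vs BF16 weights. Ignores optimizer
# states, activations, and the mixed-precision details of any real recipe.
params = 671e9                      # DeepSeek V3's total parameter count

def weight_gb(bytes_per_param: int) -> float:
    return params * bytes_per_param / 1e9

print(f"BF16 weights: {weight_gb(2):,.0f} GB")   # ~1,342 GB
print(f"FP8 weights:  {weight_gb(1):,.0f} GB")   # ~671 GB
```

Halving the bytes per weight roughly halves the memory traffic, and on Hopper-class GPUs FP8 matmuls also run at about twice the FP16 rate, which is a big chunk of where the savings come from.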
For sure, it’s revolutionary in several ways at once. More generally, it shows that we’ve really only scratched the surface with this tech, and it’s hard to predict what other tricks people will find in the coming years. Another very interesting approach I learned about recently is neurosymbolic architecture, which combines deep learning with symbolic logic: the deep learning component analyzes raw data and identifies patterns, then encodes them into tokens that a symbolic logic system can use to do actual reasoning. That addresses a key weakness of LLMs, since it makes it possible for the system to actually explain how it arrived at a solution.
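A toy illustration of that split, just to show where the pattern extraction ends and the symbolic reasoning begins. Everything here is made up for illustration: the “neural” part is faked with a hard-coded extractor and the rules are trivial:

```python
# Toy neurosymbolic split: a "perception" step turns raw input into
# symbolic facts, and a separate rule engine reasons over them.
# Everything here is invented for illustration; the neural part is faked.

def neural_extractor(description: str) -> set[tuple[str, str]]:
    """Stand-in for a neural model: raw input -> symbolic facts."""
    facts = set()
    if "stripes" in description:
        facts.add(("has", "stripes"))
    if "four legs" in description:
        facts.add(("has", "four_legs"))
    return facts

RULES = [
    # (required premises, conclusion)
    ({("has", "stripes"), ("has", "four_legs")}, ("is", "zebra_like")),
]

def symbolic_reasoner(facts):
    """Apply rules and keep a trace so each conclusion is explainable."""
    trace = []
    for premises, conclusion in RULES:
        if premises <= facts:                  # all premises are present
            facts = facts | {conclusion}
            trace.append((sorted(premises), conclusion))
    return facts, trace

facts = neural_extractor("an animal with stripes and four legs")
facts, trace = symbolic_reasoner(facts)
print(facts)   # includes ('is', 'zebra_like')
print(trace)   # shows exactly which premises led to the conclusion
```

The point is the trace: because the reasoning happens over explicit facts and rules, you can read back exactly which premises led to which conclusion.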
That’s an important distinction, yes: it’s effectively a lot of smaller expert models added together. I haven’t been able to test it yet, since I’m working with downstream tools and just don’t have the raw model set up (plus I have like 90 gigs of RAM, not… well). I read in one place that you need 500 GB+ of RAM to run it, so it seems all 600+ billion parameters need to be in memory at once, and even then you need a quantized model to fit it in that space, which kinda sucks. That’s how it is for Mistral’s mixture-of-experts models too, though, so no difference there. MoEs are pretty promising.
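Rough sizing arithmetic on why the routing doesn’t help with RAM: every expert has to be resident even though only ~37B parameters fire per token (this ignores KV cache and runtime overhead):

```python
# Rough sizing of a 671B-parameter MoE in RAM at different quantization
# levels. All experts must be resident even though only ~37B parameters
# are active per token; routing saves compute, not memory.
total_params  = 671e9
active_params = 37e9

for label, bits in [("FP16 ", 16), ("8-bit", 8), ("4-bit", 4)]:
    total_gb  = total_params  * bits / 8 / 1e9
    active_gb = active_params * bits / 8 / 1e9
    print(f"{label}: ~{total_gb:,.0f} GB resident, "
          f"~{active_gb:,.0f} GB touched per token")
```

So the 500 GB+ figure lines up with something in the 5- to 6-bit quant range plus overhead; the win is that per-token compute and bandwidth only touch the ~37B active parameters.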
Exactly, it’s the approach itself that’s really valuable. Now that we know the benefits, they’ll translate to all the other models too.