• RedClouds@lemmygrad.ml · 12 points · 5 days ago

    This isn’t a super surprising result. Even American companies have been talking about how quickly China is catching up in the AI space, and if Americans are admitting it, you know it’s true. Also, anybody who’s been watching the open-source scene has seen that the Chinese models are very competitive. There are plenty of leaderboards comparing models, and Qwen, built by Alibaba Cloud, is consistently near the top. In fact, on one list I’m watching, Qwen-based models make up the entire top 20.

    Then, of course, they have their own closed-source language models, which are a little harder to test against, but by most accounts they are right behind ChatGPT and Claude.

    DeepSeek V3 is an exceptionally large model, so exact head-to-head comparisons are a little hard, but it’s blowing things out of the water, and that’s pretty crazy.

    • redtea@lemmygrad.ml · 7 points · 5 days ago

      Good points. I had thought that China was always ahead, though. The Western companies just launched the public-facing apps in a flashier way, while in China it’s been used ‘behind the scenes’ rather than released for general public consumption. Now there’s some Chinese competition for the public apps, and they seem to btfo the western versions.

      That is, I’d thought China had been ahead in AI from the start, but the western LLMs took the spotlight for a while and are now already losing it to Chinese LLMs.

      • RedClouds@lemmygrad.ml · 7 points · 4 days ago

        Good distinction. China hasn’t been behind in AI in general, just in this new LLM stuff. But China also uses its AI for more useful things than just advertisements and recommendation engines, though I’m sure they use it there too. But yeah, China is catching up on LLMs, and fast. The chip war has limited their access to faster chips, which would have let them catch up even faster, but the hardware they have is sufficient and improving faster than westerners predicted (as always, the west is WAY too confident in itself and WAY underestimates China’s abilities).

    • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 5 points · 5 days ago

      What’s remarkable about DeepSeek V3 is its use of a mixture-of-experts approach. While it has 671 billion parameters overall, it only activates 37 billion per token, making it very efficient. For comparison, Meta’s Llama 3.1 has 405 billion parameters that are all used at once. It also has a 128K-token context window, which means it can process and understand very long documents, and it generates text at 60 tokens per second, twice as fast as GPT-4o.
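
      As a rough sketch of how that routing works (toy values here, not DeepSeek’s actual configuration): a small router network scores the experts for each token and only the top-k of them are run.

          import torch
          import torch.nn as nn

          class ToyMoELayer(nn.Module):
              """Toy mixture-of-experts layer: a router picks the top-k experts
              per token, so only a fraction of all parameters run per token."""
              def __init__(self, d_model=64, n_experts=8, top_k=2):
                  super().__init__()
                  self.experts = nn.ModuleList(
                      nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
                      for _ in range(n_experts))
                  self.router = nn.Linear(d_model, n_experts)
                  self.top_k = top_k

              def forward(self, x):                    # x: (tokens, d_model)
                  scores = self.router(x)              # (tokens, n_experts)
                  weights, idx = scores.topk(self.top_k, dim=-1)
                  weights = weights.softmax(dim=-1)
                  out = torch.zeros_like(x)
                  for slot in range(self.top_k):       # run only the chosen experts
                      for e, expert in enumerate(self.experts):
                          mask = idx[:, slot] == e
                          if mask.any():
                              out[mask] += weights[mask, slot, None] * expert(x[mask])
                  return out

          x = torch.randn(16, 64)                      # a batch of 16 token vectors
          print(ToyMoELayer()(x).shape)                # torch.Size([16, 64])

      The compute per token scales with the active parameters rather than the total, which is where the efficiency comes from.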

      • RedClouds@lemmygrad.ml · 6 points · 4 days ago

        That’s an important distinction, yes: it’s effectively a bunch of smaller expert models combined. I haven’t been able to test it yet since I’m working with downstream tools and haven’t set up the raw stuff (plus, I have like 90 gigs of RAM, not… well). I read in one place that you need 500 GB+ of RAM to run it, so I think all 600+ billion params need to be in memory at once, and you need a quantized model to fit it even in that space, which kinda sucks. That’s how it is for Mistral’s mixture-of-experts models too, though, so no difference there. MoEs are pretty promising.
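
        Some back-of-the-envelope math on why the whole model still has to sit in memory even though only ~37B parameters are active per token (the bytes-per-parameter figures are the usual rough values, not measurements):

            total_params = 671e9   # every expert must be resident; the router picks among them
            active_params = 37e9   # parameters actually used for any one token

            for name, bytes_per_param in [("FP16", 2), ("FP8", 1), ("4-bit", 0.5)]:
                resident = total_params * bytes_per_param / 1e9
                touched = active_params * bytes_per_param / 1e9
                print(f"{name:6s} ~{resident:,.0f} GB resident, ~{touched:.0f} GB touched per token")

        So quantization shrinks the resident footprint, but the sparsity only saves compute and bandwidth, not capacity, which is roughly in line with that 500 GB+ number depending on the quantization level.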

      • KrasnaiaZvezda@lemmygrad.ml · 8 points · 5 days ago

        I’d say the fact that it was trained in FP8 is an even bigger deal. Spending less than 6 million dollars to train something this good kinda changes how training is approached. Cost has been the big barrier to training until now, so this could have a real impact on how it’s done.
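
        Rough arithmetic on how a training bill in that range breaks down. The GPU-hour total and hourly rate below are assumptions for the sketch, not official figures:

            gpu_hours = 2.8e6        # assumed total accelerator-hours for the run
            usd_per_gpu_hour = 2.0   # assumed rental price per GPU-hour
            print(f"~${gpu_hours * usd_per_gpu_hour / 1e6:.1f}M")   # ~$5.6M

            # FP8 stores each weight/activation in 1 byte instead of 2 (BF16/FP16),
            # so each step moves half the data and more of the model fits per GPU,
            # which is a big part of how the accelerator-hour count stays that low.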

        • ☆ Yσɠƚԋσʂ ☆@lemmygrad.ml (OP) · 6 points · 5 days ago

          For sure, it’s revolutionary in several ways at once. In general, this shows that we’ve really only scratched the surface with this tech, and it’s hard to predict what other tricks people will find in the coming years. Another very interesting approach I learned about recently is neurosymbolic architecture, which combines deep learning with symbolic logic. The deep learning system analyzes raw data and identifies patterns, then encodes them into symbols that a symbolic logic system can use to do actual reasoning. That addresses a key weakness of LLMs, making it possible for the system to actually explain how it arrived at a solution.
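
          A toy sketch of that split, with made-up facts and rules just to show the shape of it: a stand-in for the neural perception step emits symbols, and a small rule engine reasons over them while keeping a derivation you can print out.

              def neural_perception(image):
                  """Stand-in for a deep model: maps raw input to symbolic facts."""
                  # A real system would run a vision/language model here.
                  return {("left_of", "obj1", "obj2"),
                          ("shape", "obj1", "cube"),
                          ("shape", "obj2", "sphere")}

              # One illustrative rule: (conclusion, premise), with ?variables.
              RULES = [(("right_of", "?b", "?a"), ("left_of", "?a", "?b"))]

              def match(pattern, fact):
                  """Unify a pattern like ('left_of', '?a', '?b') against a fact."""
                  binding = {}
                  for p, f in zip(pattern, fact):
                      if p.startswith("?"):
                          if binding.setdefault(p, f) != f:
                              return None
                      elif p != f:
                          return None
                  return binding

              def forward_chain(facts):
                  """Apply rules until nothing new is derived, keeping the trace."""
                  trace, changed = [], True
                  while changed:
                      changed = False
                      for conclusion, premise in RULES:
                          for fact in list(facts):
                              b = match(premise, fact)
                              if b is None:
                                  continue
                              new = tuple(b.get(t, t) for t in conclusion)
                              if new not in facts:
                                  facts.add(new)
                                  trace.append(f"{new} because {fact}")
                                  changed = True
                  return facts, trace

              facts, why = forward_chain(neural_perception("scene.png"))
              print(*why, sep="\n")   # shows how each derived fact was reached

          The learned part only has to get the perception right; the reasoning itself is plain rule application, which is why the chain of inference can be shown step by step.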