That’s an important distinction, yes: it’s effectively a lot of smaller expert models added up. I haven’t been able to test it yet, since I work with downstream tools and haven’t set up the raw model myself (plus, I have like 90 GB of RAM, not… well). I read in one place that you need 500 GB+ of RAM to run it, so even though only a few experts are active for any given token, all 600+ billion parameters still need to sit in memory at once, and you need a quantized model just to fit in even that much space, which kinda sucks. But that’s how it is for Mistral’s mixture-of-experts models too, so no difference there. MoEs are pretty promising.
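Quick back-of-the-envelope on why the 500 GB+ figure is plausible. This is just a sketch: the ~670B parameter count and the per-parameter sizes are rough assumptions, not exact specs, and it only counts the weights (no KV cache or runtime overhead).

```python
# Rough memory math for a ~670B-parameter MoE, assuming every expert
# has to stay resident in RAM even though only a few run per token.
PARAMS = 670e9  # assumed total parameter count (illustrative)

for precision, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{precision:>5}: ~{gib:,.0f} GiB just for the weights")

# fp16:  ~1,248 GiB  -> way past any workstation
# 8-bit:   ~624 GiB  -> still above the ~500 GB figure
# 4-bit:   ~312 GiB  -> why quantization is basically mandatory
```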
Exactly, it’s the approach itself that’s really valuable. Now that the benefits are clear, they should translate to other models too.