What should I use: big model-small quant or small model-no quant?

Smorty [she/her]@lemmy.blahaj.zone · edited 20 days ago


SGforce@lemmy.ca · 27 days ago

The technology for quantisation has improved a lot this past year, making very small quants viable for some uses. I think the general consensus is that an 8-bit quant will be nearly identical to the full model, and a 6-bit quant can feel so close that you may not even notice any loss of quality.

Going smaller than that is where the real trade-off occurs. 2-bit and 3-bit quants of much larger models can absolutely surprise you, though they will probably be inconsistent.

So it comes down to the task you’re trying to accomplish. If it’s programming related, go 6-bit and up for consistency, with the largest coding model you can fit. If it’s creative writing or something similar, a much lower quant of a larger model is the way to go, in my opinion.
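To get a feel for what fits where, here is a rough back-of-the-envelope sketch of weight size versus bit width. It only counts the weights (parameters × bits / 8). Real quant formats store extra scaling data, so the effective bits per weight is somewhat higher than the nominal number. The function name and the example model sizes are just illustrations.

```python
def weight_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Ignores quantisation metadata, the KV cache and runtime buffers,
    so treat the result as a lower bound on memory use.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative combinations only; pick the ones you are actually considering.
candidates = [
    ("7B at 16 bits", 7, 16),
    ("7B at 8 bits", 7, 8),
    ("7B at 6 bits", 7, 6),
    ("70B at 3 bits", 70, 3),
    ("70B at 2 bits", 70, 2),
]

for label, params, bits in candidates:
    print(f"{label}: ~{weight_size_gib(params, bits):.1f} GiB")
```

The point is that a heavily quantised big model and an unquantised small one can land in the same memory range, so the choice really is about the task, not just about what fits.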

Smorty [she/her]@lemmy.blahaj.zone · 27 days ago

Hmm, so what you’re saying is that for creative generations one should use big-parameter models with heavy quantisation, but when good structure is required, like with coding and JSON output, we want a higher-bit quant of a model which actually fits into our VRAM?

I’m currently testing JSON output, so I guess a small Qwen model it is! (they advertised good JSON generations)

Does the difference between fp8 and fp16 influence the structure strongly, or are fp8 models fine for structured content?

SGforce@lemmy.ca · 27 days ago

fp8 would probably be fine, though the method used to make the quant would greatly influence that.
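One way to check this for a particular quant is to collect a batch of raw outputs and measure how many of them actually parse. A minimal sketch, assuming you already have the model outputs as strings. The function name, the required keys and the sample strings are made up for illustration.

```python
import json


def json_validity_rate(outputs, required_keys=()):
    """Fraction of outputs that parse as a JSON object containing all required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and all(key in obj for key in required_keys):
            ok += 1
    return ok / len(outputs)


# Made-up sample outputs: one valid, one syntactically broken, one missing a key.
samples = [
    '{"name": "test", "value": 1}',
    '{"name": "test", "value": }',
    '{"name": "test"}',
]

rate = json_validity_rate(samples, required_keys=("name", "value"))
print(f"valid: {rate:.0%}")
```

Running the same prompts through two quants of the same model and comparing the rates gives a more concrete answer than going by feel.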

I don’t know exactly how Ollama works, but I would think a more ideal model would be one of these quants:

https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF

A GGUF model would also allow some overflow into system RAM, if Ollama has that capability like some other inference backends do.

Smorty [she/her]@lemmy.blahaj.zone · 27 days ago

Ollama does indeed have the ability to share the memory between VRAM and RAM, but I always assumed it wouldn’t make sense, since it would massively slow down the generation.

I think Ollama already uses GGUF, since that is how you import the model from HF to Ollama anyway, you gotta use the *.GGUF file.

As someone who has experience with shader development in GLSL, I know very well that communication between the GPU and CPU is super slow, and sending data from the GPU to the CPU is a pretty heavy task. So I just assumed it wouldn’t make any sense. I will try a full 7B model (fp16) now, using my 32GB of normal RAM, to check out the speed. I’ll edit this comment once I’m done and share results.

SGforce@lemmy.ca · 27 days ago

With modern methods, running a larger model split between GPU and CPU can sometimes be fast enough. Here’s an example: https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

Smorty [she/her]@lemmy.blahaj.zone · 27 days ago

oooh a windows only feature, now I see why I haven’t heard of this yet. Well, too bad I guess. It’s time to switch to AMD for me anyway…

ffhein@lemmy.world · 25 days ago

The article is written in a slightly confusing way, but you’ll most likely want to turn off Nvidia’s automatic VRAM swapping if you’re on Windows, so it doesn’t happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU, and it’s easier to find how many layers you can offload when the model simply fails to load if you set the number too high.
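Since a too-high layer count shows up as a clean load failure, you can search for the limit instead of guessing. A minimal sketch of that idea as a binary search. `try_load` is a hypothetical callback that you would implement yourself (for example by starting llama.cpp with a given `--n-gpu-layers` value and checking whether it loads). The fake loader and its limit of 23 layers are invented for the demo.

```python
from typing import Callable


def max_offloadable_layers(total_layers: int, try_load: Callable[[int], bool]) -> int:
    """Largest layer count in [0, total_layers] for which try_load succeeds.

    Assumes that if n layers fit on the GPU, any smaller count fits too,
    and that 0 layers (CPU only) always loads.
    """
    lo, hi = 0, total_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if try_load(mid):
            lo = mid       # mid layers fit, try more
        else:
            hi = mid - 1   # mid layers failed, try fewer
    return lo


def fake_try_load(n_gpu_layers: int) -> bool:
    """Stand-in for a real loader: pretends at most 23 layers fit in VRAM."""
    return n_gpu_layers <= 23


best = max_offloadable_layers(40, fake_try_load)
print(best)
```

With 40 layers this takes about six load attempts instead of up to 40. In practice you would probably back off a layer or two from the result to leave room for the context.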

Also, if you want to experiment with partial offload, maybe a 12B around Q4 would be more interesting than the same 7B model with higher precision? I haven’t checked if anything new has come out in the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit context to 4k or something.
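The context limit matters because the KV cache grows linearly with context length and sits on top of the weights. A rough sketch of the usual estimate (2 tensors × layers × KV heads × head dimension × context × bytes per value). The architecture numbers below (40 layers, 8 KV heads, head dimension 128) are my assumption for a Mistral Nemo-sized model and should be checked against the model’s own config. An fp16 cache is assumed as well.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a single sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to an fp16 cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30


# Assumed architecture values; verify them against the model you actually use.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128

for context_len in (4096, 32768, 131072):
    size = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, context_len)
    print(f"context {context_len}: ~{size:.2f} GiB")
```

Under these assumptions a 4k context costs well under 1 GiB, while a very long context can need more memory than the quantised weights themselves.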

SGforce@lemmy.ca · 27 days ago

Oh, that part is Windows-only. But the splitting tech is built into llama.cpp.