Faster Ollama alternative

RandomlyRight@sh.itjust.works · 12 hours ago

theunknownmuncher@lemmy.world · edit-2 11 hours ago

Ummm… did you try /set parameter num_ctx # and /set parameter num_predict #? Are you using a model that actually supports the context length that you desire…?
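If you're going through the API rather than the interactive prompt, the same two settings go in the `options` object of the request body, as far as I know. A rough sketch, assuming a local Ollama server on the default port 11434. The model name, prompt, and the numbers are placeholders, and `build_request` / `generate` are just helper names made up for this example:

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=8192, num_predict=4096):
    """Build the JSON body for a non-streaming /api/generate call.

    num_ctx is the context window size, num_predict caps the number of
    generated tokens. Both values here are placeholders, not recommendations.
    """
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # ask for JSON output
        "stream": False,    # return one response object instead of chunks
        "options": {
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }


def generate(body, url=OLLAMA_URL):
    """POST the body to the server and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # "your-model:tag" is a placeholder; use whatever model you pulled.
    body = build_request("your-model:tag", "Generate the config as JSON.")
    print(generate(body))
```

If the output still gets cut off with these set, it would be worth checking that the model itself supports the context length you asked for.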

RandomlyRight@sh.itjust.works · 5 hours ago

Yeah, but there are many open issues on GitHub about these settings not working right. I’m using the API, and I just couldn’t get it to work. I sent a request to generate a JSON file, and the output was never longer than about 500 lines. With the same model on vLLM, it worked right away and generated about 2000 lines.

theunknownmuncher@lemmy.world · edit-2 19 minutes ago

Are you using a tiny model (1.5B-7B parameters)? Ollama pulls a 4-bit quant by default. It looks like vLLM does not use quantized models by default, so this is likely the difference. Tiny models are impacted more by quantization.