I’m currently shopping around for something a bit faster than ollama and because I could not get it to use a different context and output length, which seems to be a known and long ignored issue. Somehow everything I’ve tried so far did miss one or more critical features, like:

  • “Hot” model replacement, so loading and unloading models on demand
  • Function calling
  • Support of most models
  • OpenAI API compatibility (to work well with Open WebUI)

I’d be happy about any recommendations!

  • theunknownmuncher@lemmy.world
    link
    fedilink
    English
    arrow-up
    5
    arrow-down
    1
    ·
    edit-2
    11 hours ago

    Ummm… did you try /set parameter num_ctx # and /set parameter num_predict #? Are you using a model that actually supports the context length that you desire…?

    • RandomlyRight@sh.itjust.worksOP
      link
      fedilink
      English
      arrow-up
      1
      ·
      5 hours ago

      Yeah, but there are many open issues on GitHub related to these settings not working right. I’m using the API, and just couldn’t get it to work. I used a request to generate a json file, and it never generated one longer than about 500 lines. With the same model on vllm, it worked instantly and generated about 2000 lines

      • theunknownmuncher@lemmy.world
        link
        fedilink
        English
        arrow-up
        1
        ·
        edit-2
        19 minutes ago

        Are you using a tiny model (1.5B-7B parameters)? ollama pulls 4bit quant by default. It looks like vllm does not used quantized models by default so this is likely the difference. Tiny models are impacted more by quantization