I’ve been messing around with GPTQ models using ExLlama in ooba, and have gotten 33B models at 3k context running smoothly, but I was looking to try something bigger than my VRAM can hold.

However, I’m clearly doing something wrong, and the koboldcpp.exe documentation isn’t clear to me. Does anyone have a good setup guide? My understanding is that koboldcpp.exe is preferable for GGML models, since ooba’s llama.cpp loader doesn’t support GGML at >4k context yet.

  • h3ndrik@feddit.de · 1 year ago

    KoboldCpp has documentation on the GitHub page. Maybe just google for other guides if the documentation doesn’t do it for you.

    My advice is: do one step at a time. Get it running first, without the fancy stuff. Start with a small model and without GPU acceleration. Then get the acceleration/CUDA working. Then try a bigger model. And only then do the elaborate stuff, like splitting layers between VRAM and RAM and pushing the context size past the default 2048. Don’t do it all at once; that way you can figure out which step a problem shows up at.
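
    For reference, here’s roughly what that progression can look like on the command line. Treat it as a sketch, not gospel: the model filenames are placeholders, and the flags (--usecublas, --gpulayers, --contextsize) are the ones from KoboldCpp’s README as I remember them, so check koboldcpp.exe --help on your version.

        :: step 1: small model, CPU only, default 2048 context
        koboldcpp.exe --model small-model.ggml.bin

        :: step 2: same model, CUDA on, offload some layers to VRAM
        :: (how many layers fit depends on the model and your card)
        koboldcpp.exe --model small-model.ggml.bin --usecublas --gpulayers 24

        :: step 3: bigger model, more layers offloaded, context pushed past the default
        koboldcpp.exe --model bigger-model.ggml.bin --usecublas --gpulayers 40 --contextsize 4096

    If a step breaks, you know exactly which change caused it.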

    (Edit: And make sure to always use the latest version. You’re playing with pretty recent stuff that still might have bugs.)

    I can’t say much about the Windows side of things or the current state of the GGML integration in oobabooga’s webui.