Hi,
I’m looking for something that could generate code and provide technical assistance on a level similar to GPT-4, or at least GPT-3.5. I’m generally satisfied with ChatGPT, but for privacy and security reasons I can’t share some details and code listings with OpenAI. Hence, I’m looking for a self-hosted alternative.
Any recommendations? If nothing specific comes to mind, what parameters should I look at in my search? I haven’t worked with LLMs before, and there are so many of them. I just know that I could use oobabooga/text-generation-webui to access a model through a friendly interface.
Thanks in advance.
You can keep an eye on p9. The aim is to make a decent (maybe even good) quality local copilot LLM.
Hopefully they’ll succeed! Thanks for the recommendation.
Specifically on what LLM to use: I’ve been meaning to try StarCoder, but can’t vouch for how good it is. In general I’ve found Vicuna-13B pretty good at generating code.
As for general recommendations, I’d say the main determinant will be whether you can afford the hardware requirements of local hosting. I presume you’re familiar with the fact that at 16-bit precision each parameter takes 2 bytes, so you’ll (usually) need roughly 2x the parameter count in VRAM (e.g. a 7B-parameter model needs about 14GB of VRAM). Techniques like quantization to 8 bits halve that requirement, and the more extreme 4-bit quantization halves it again (at the expense of generation quality).
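To make that arithmetic concrete, here’s a quick back-of-the-envelope sketch (the helper name and the overhead factor are my own assumptions, not from any library):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for the model weights.

    bits: precision of the weights (16 = fp16, 8 or 4 = quantized).
    overhead: fudge factor for activations / KV cache; 1.2 is a guess.
    """
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

# A 7B model: ~16.8 GB at fp16, ~8.4 GB at 8-bit, ~4.2 GB at 4-bit
for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{estimate_vram_gb(7, bits):.1f} GB")
```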
And if you don’t have enough VRAM, there’s always llama.cpp, which runs inference on the CPU - I think the list of supported models in its README is outdated, and it actually supports far more than those.
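If you’d rather drive llama.cpp from Python than from its CLI, the llama-cpp-python bindings wrap it. A minimal sketch (the model path is a placeholder for a quantized model file you’d download or convert yourself):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any ggml-quantized model file llama.cpp can load.
llm = Llama(model_path="./models/vicuna-13b.q4_0.bin")

out = llm(
    "Write a Python function that reverses a linked list.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```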
On the “what software to use for self-hosting” front, I’ve quite liked FastChat; it can even run an OpenAI-compatible API server, which is useful if your tools expect the OpenAI API.
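As a sketch of what that looks like from the client side (assuming you’ve already started FastChat’s controller, a model worker, and its OpenAI API server on the default port 8000; the model name is whatever your worker actually serves):

```python
# pip install openai; point the client at the local FastChat server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # FastChat's OpenAI-compatible endpoint
    api_key="EMPTY",  # FastChat doesn't check the key by default
)

resp = client.chat.completions.create(
    model="vicuna-13b-v1.5",  # placeholder: the model your worker loaded
    messages=[{"role": "user", "content": "Explain what a mutex is."}],
)
print(resp.choices[0].message.content)
```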
Hope this is helpful!
Thank you for the information and suggestions!
There is a bit of a conundrum here: for a model that is any good at coding you want a lot of parameters (the more the better), but since it’s code rather than natural language, precision also matters. Home hardware like a 3090 can run ~30B models, but there’s a catch: they only just fit, and only in quantized form, typically 4-bit weights instead of 16-bit, i.e. a quarter of the bits per weight. Unless we see some breakthrough that makes inference of huge models possible at full precision on consumer hardware, hosted AI will always have the edge for coding. Not that such a breakthrough is impossible - quite the opposite, in my opinion.
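For what it’s worth, this is roughly what that 4-bit loading looks like with Hugging Face transformers and bitsandbytes (a sketch; the model ID is just an example, and the exact footprint depends on the model):

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "lmsys/vicuna-13b-v1.5"  # example model; substitute your own

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights: ~4x smaller than fp16
    bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spill layers to CPU RAM if the GPU is too small
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```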