I hope llama.cpp supports SuperHOT at some point. I never use GPTQ but may need to make an exception to try out the larger context sized. Are you using exllama? Curious why you’re getting garbage output
Yeah llama.cpp with SuperHOT support would be great, and yeah I’m using exllama with oobabooga UI.
I found out why I’m getting garbage output with 2k. It seems like SuperHOT 8K models, when run with 2k context, have a massive increase in perplexity.
(Higher perplexity, the worse the output quality).
So I’ll need to figure out if I can get at least 4K running without running out of VRAM.
Also, there is a new PR for exllama which uses a different method of getting higher context (not SuperHOT) and also has less perplexity loss. So that might be a better alternative potentially.
I read the guy’s blog post on SuperHOT and it sounded like it didn’t increase perplexity and kept perplexity super low with large contexts. I could have read it wrong but I thought it wasn’t supposed to increase perplexity.
The increase in perplexity is very small, but there is still some with 8K content. But it seems like with 2K its much larger. I could be misunderstanding something myself. But my little test with 2K context does suggest there’s something going on with 2K contexts on SuperHOT models
I hope llama.cpp supports SuperHOT at some point. I never use GPTQ but may need to make an exception to try out the larger context sized. Are you using exllama? Curious why you’re getting garbage output
Yeah llama.cpp with SuperHOT support would be great, and yeah I’m using exllama with oobabooga UI. I found out why I’m getting garbage output with 2k. It seems like SuperHOT 8K models, when run with 2k context, have a massive increase in perplexity.
(Higher perplexity, the worse the output quality).
So I’ll need to figure out if I can get at least 4K running without running out of VRAM.
Also, there is a new PR for exllama which uses a different method of getting higher context (not SuperHOT) and also has less perplexity loss. So that might be a better alternative potentially.
I read the guy’s blog post on SuperHOT and it sounded like it didn’t increase perplexity and kept perplexity super low with large contexts. I could have read it wrong but I thought it wasn’t supposed to increase perplexity.
The increase in perplexity is very small, but there is still some with 8K content. But it seems like with 2K its much larger. I could be misunderstanding something myself. But my little test with 2K context does suggest there’s something going on with 2K contexts on SuperHOT models