I have been getting good results with IQ4_NL and TurboQuant at 4 bits on a 24 GB card (3090). A 256k context easily fits with that setup, but it starts slowing down quite a bit after 80-100k tokens. Quality in my testing is also still good:

- Coding task test: https://github.com/sleepyeldrazi/llm_programming_tests/
- Design task test: https://github.com/sleepyeldrazi/llm-design-showcase

The coding tests were compared against minimax-m2.7 and glm-5, and the design tests against other small models.
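For anyone wondering why a 4-bit cache makes 256k fit at all, here is a rough back-of-the-envelope sketch of KV cache size. The layer count, KV head count, and head dimension below are made-up placeholder values, not the dimensions of any model mentioned here, and the formula ignores the per-block scale metadata that real quantization schemes add on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Approximate KV cache size in bytes for a plain transformer.

    K and V each hold n_kv_heads * head_dim values per token per layer,
    hence the leading factor of 2. Quantization overhead is ignored.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return values * bits_per_value / 8


# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT = 256 * 1024  # 256k tokens

for bits in (16, 4):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, bits) / 2**30
    print(f"{bits}-bit KV cache at 256k: {gib:.1f} GiB")
# prints:
# 16-bit KV cache at 256k: 32.0 GiB
# 4-bit KV cache at 256k: 8.0 GiB
```

With these placeholder numbers, a 16-bit cache alone would exceed 24 GB, while the 4-bit one leaves room for the weights. Plug in the real dimensions of whatever model you run to get a meaningful figure.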
