Llama 3 8B Instruct quantized with GPTQ to fit in 10 GB of VRAM
2 points by jlaneve 1 year ago | 3 comments

- jlaneve 1 year ago
Hey HN!
We've had lots of success using quantized LLMs for inference speed and cost because you can fit them on smaller GPUs (Nvidia T4, Nvidia K80, RTX 4070, etc.). There's no need for everyone to quantize it themselves, so we quantized Llama 3 8B Instruct to 8 bits using GPTQ and figured we'd share it with the community. Excited to see what everyone does with it!
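For anyone who wants to reproduce something similar, here's a minimal sketch using the Hugging Face transformers GPTQ integration. This is not exactly our pipeline: the calibration dataset ("c4") and the output directory name below are assumptions for illustration.

    # Minimal 8-bit GPTQ quantization sketch (assumed setup, not our exact pipeline).
    # Requires: pip install transformers optimum auto-gptq accelerate
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 8-bit weights; using "c4" as the calibration dataset is an assumption here.
    quant_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

    # Quantization happens layer by layer while the model loads, so this step
    # needs a GPU and takes a while for an 8B model.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )

    # Save the quantized weights so others can load them directly.
    model.save_pretrained("llama-3-8b-instruct-gptq-8bit")
    tokenizer.save_pretrained("llama-3-8b-instruct-gptq-8bit")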
- jaggs 1 year ago
Can you get it down to an RTX 4060 VRAM level?
- jlaneve 1 year ago
This fits in about 9 GB of VRAM, just over the 8 GB on an RTX 4060 (unless you're talking about the 16 GB version). You could keep quantizing below 8 bits, but at that point you lose a lot more precision, so YMMV.
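If you want to sanity-check the footprint yourself, something like this works (the checkpoint name is a placeholder, not our actual upload):

    # Load an 8-bit GPTQ checkpoint and report GPU memory use.
    # Requires: pip install transformers optimum auto-gptq accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your-org/llama-3-8b-instruct-gptq-8bit"  # placeholder repo name

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Weights alone are roughly 8B params * 1 byte = ~8 GB; the KV cache and
    # activations during generation push peak usage toward ~9 GB.
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")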