Llama 3 8B Instruct quantized with GPTQ to fit in 10 GB of VRAM
2 points by jlaneve 1 year ago | 3 comments

- jlaneve 1 year ago
Hey HN!
We've had lots of success using quantized LLMs for inference speed and cost because you can fit them on smaller GPUs (Nvidia T4, Nvidia K80, RTX 4070, etc.). There's no need for everyone to quantize it themselves, so we quantized Llama 3 8B Instruct to 8 bits using GPTQ and figured we'd share it with the community. Excited to see what everyone does with it!
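For anyone who wants to reproduce something similar, here's a minimal sketch using the Hugging Face transformers GPTQ integration. This is not exactly our pipeline: the calibration dataset ("c4") and the output directory name below are assumptions for illustration.

    # Minimal 8-bit GPTQ quantization sketch (assumed setup, not our exact pipeline).
    # Requires: pip install transformers optimum auto-gptq accelerate
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 8-bit weights; using "c4" as the calibration dataset is an assumption here.
    quant_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

    # Quantization happens layer by layer while the model loads, so this step
    # needs a GPU and takes a while for an 8B model.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )

    # Save the quantized weights so others can load them directly.
    model.save_pretrained("llama-3-8b-instruct-gptq-8bit")
    tokenizer.save_pretrained("llama-3-8b-instruct-gptq-8bit")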
- jaggs 1 year ago
Can you get it down to an RTX 4060 VRAM level?
- jlaneve 1 year ago
This fits in about 9 GB of VRAM, just over the 8 GB on an RTX 4060 (unless you're talking about the 16 GB version). You could keep quantizing below 8 bits, but at that point you lose a lot more precision, so YMMV.
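If you want to sanity-check the footprint yourself, something like this works (the checkpoint name is a placeholder, not our actual upload):

    # Load an 8-bit GPTQ checkpoint and report GPU memory use.
    # Requires: pip install transformers optimum auto-gptq accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "your-org/llama-3-8b-instruct-gptq-8bit"  # placeholder repo name

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    # Weights alone are roughly 8B params * 1 byte = ~8 GB; the KV cache and
    # activations during generation push peak usage toward ~9 GB.
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")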