Show HN: Llama 3.1 8B CPU Inference in a Browser via WebAssembly
4 points by om8 6 months ago | 4 comments
- om8 6 months ago
This is a demo of what's possible to run on edge devices using SOTA quantization. Other similar projects that try to run 8B models in the browser either use WebGPU or a 2-bit quantization that breaks the model. I implemented inference over the AQLM quantized representation, which gives a model that is quantized to 2 bits and does not blow up.
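For the curious, decoding a 2-bit AQLM weight boils down to a codebook lookup plus a per-channel scale. A simplified sketch, assuming a single codebook of 2^16 vectors over groups of 8 weights (16 bits per group, i.e. ~2 bits per weight); names and layout here are illustrative, not the actual kernel:

    // Hypothetical 2-bit AQLM-style decode: each group of 8 weights is one
    // 16-bit code into a codebook of 65536 vectors of length 8.
    function dequantizeRow(
      codes: Uint16Array,     // one 16-bit code per group of 8 weights
      codebook: Float32Array, // 65536 * 8 entries
      scale: number,          // per-output-channel scale
      groupSize = 8
    ): Float32Array {
      const out = new Float32Array(codes.length * groupSize);
      for (let g = 0; g < codes.length; g++) {
        const base = codes[g] * groupSize;
        for (let i = 0; i < groupSize; i++) {
          out[g * groupSize + i] = codebook[base + i] * scale;
        }
      }
      return out;
    }

In practice the lookup would typically be fused into the matmul so the full-precision weights never materialize, which is what makes a custom kernel necessary.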
- westurner 6 months ago
Could this be done with WebNN? If not, how should the spec change?
Re: WebNN, window.ai, navigator.ml: https://news.ycombinator.com/item?id=40834952
- om8 6 months ago
Not really a WebNN expert, but it looks like it doesn't support CPU inference yet and only works in Chrome. It also lacks support for custom kernels, which we need for running AQLM-quantized models.
When you ask about spec change, do you mean WebNN spec or something else?
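For reference, a rough probe of what a given browser exposes, assuming the API shape in the current W3C draft (navigator.ml.createContext with a deviceType hint); option names may still change and the probeWebNN helper is made up:

    // Check whether WebNN is exposed and whether it will hand out a CPU context.
    async function probeWebNN(): Promise<string> {
      const ml = (navigator as any).ml;
      if (!ml) return "WebNN is not exposed in this browser";
      try {
        await ml.createContext({ deviceType: "cpu" });
        return "WebNN CPU context created";
      } catch {
        return "WebNN is exposed, but no CPU context is available";
      }
    }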
- westurner 6 months ago
There's WebNN and WebGPU, but no WebTPU; so I guess WebNN would be it.
"Web Neural Network API" W3C Candidate Recommendation Draft https://www.w3.org/TR/webnn/
"WebGPU" W3C Candidate Recommendation Snapshot: https://www.w3.org/TR/webgpu/
WebGPU: https://en.wikipedia.org/wiki/WebGPU
Tested koboldcpp with a 7B GGUF model because it should work with 8 GB of VRAM, but FWIU kobold somehow pages between RAM and GPU VRAM. How does the user, in WASM, know that they have insufficient RAM for a model?
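One way to guess up front is the Device Memory API (navigator.deviceMemory in Chromium browsers), which is coarse and capped at 8; inside WASM the only hard signal is an allocation failure when linear memory can't grow. A rough sketch, with made-up names and thresholds:

    // Heuristic pre-check before downloading a model (assumed footprint in bytes).
    function canProbablyFit(modelBytes: number): boolean {
      const deviceGiB = (navigator as any).deviceMemory ?? 4; // rounded, capped at 8
      return modelBytes < deviceGiB * 0.5 * 1024 ** 3;        // keep generous headroom
    }

    // The hard check: WebAssembly.Memory.grow throws a RangeError when the
    // requested pages exceed the maximum or the OS refuses the allocation.
    const memory = new WebAssembly.Memory({ initial: 256, maximum: 65536 }); // 64 KiB pages
    try {
      memory.grow(32768); // ask for ~2 GiB more
    } catch (e) {
      console.warn("Not enough memory for this model", e);
    }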