Show HN: Llama 3.1 8B CPU Inference in a Browser via WebAssembly

4 points by om8 6 months ago | 4 comments
  • om8 6 months ago
    This is a demo of what's possible to run on edge devices using SOTA quantization. Other projects that try to run 8B models in the browser either rely on WebGPU or on 2-bit quantization schemes that break the model. I implemented inference over the AQLM quantized representation, which gives a 2-bit quantized model that does not blow up.
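
    Roughly, the decode step looks like this (a simplified TypeScript sketch, not the real kernel; the actual implementation follows the AQLM scheme with codebooks plus per-channel scales and runs inside the WASM module):

      // Sketch of single-codebook AQLM-style dequantization: each group of 8
      // weights is stored as one 16-bit code (~2 bits/weight) indexing into a
      // learned codebook of 8-dimensional vectors. Scales are omitted here.
      function dequantizeGroups(
        codes: Uint16Array,       // one code per group of 8 weights
        codebook: Float32Array,   // 65536 entries x 8 values, learned offline
        out: Float32Array,        // dequantized weights, length = codes.length * 8
      ): void {
        const groupSize = 8;
        for (let g = 0; g < codes.length; g++) {
          const base = codes[g] * groupSize;  // start of the codebook vector
          for (let i = 0; i < groupSize; i++) {
            out[g * groupSize + i] = codebook[base + i];
          }
        }
      }
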
    • westurner 6 months ago
      Could this be done with WebNN? If not, how should the spec change?

      Re: WebNN, window.ai, navigator.ml: https://news.ycombinator.com/item?id=40834952

      • om8 6 months ago
        Not really a WebNN expert, but it looks like it doesn't support CPU inference yet and only works in Chrome. It also lacks support for custom kernels, which we need to run AQLM-quantized models.
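
        For reference, this is roughly what WebNN usage looks like (a sketch against the spec draft; descriptor field names vary between drafts, e.g. "dimensions" vs "shape", and type declarations for navigator.ml / MLGraphBuilder are assumed). The builder only exposes a fixed op set, so a custom AQLM dequantization kernel has nowhere to go:

          // WebNN sketch (W3C spec draft API; Chromium-only at the moment).
          // createContext accepts a deviceType option ("cpu" / "gpu" / "npu")
          // in the spec, even if CPU execution isn't available everywhere yet.
          const context = await (navigator as any).ml.createContext();
          const builder = new MLGraphBuilder(context);

          // A single dense layer: the weights already have to be dequantized
          // to float32 before they can be handed to the graph as a constant.
          const x = builder.input("x", { dataType: "float32", dimensions: [1, 4096] });
          const w = builder.constant(
            { dataType: "float32", dimensions: [4096, 4096] },
            new Float32Array(4096 * 4096),
          );
          const y = builder.matmul(x, w);

          const graph = await builder.build({ y });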

        When you ask about a spec change, do you mean the WebNN spec or something else?

        • westurner 6 months ago
          There's WebNN and WebGPU, but no WebTPU, so I guess WebNN would be it.

          "Web Neural Network API" W3C Candidate Recommendation Draft https://www.w3.org/TR/webnn/

          "WebGPU" W3C Candidate Recommendation Snapshot: https://www.w3.org/TR/webgpu/

          WebGPU: https://en.wikipedia.org/wiki/WebGPU

          I tested koboldcpp with a 7B GGUF model because it should work with 8 GB of VRAM, but FWIU kobold somehow pages between RAM and GPU VRAM. How does the user, in WASM, know that they have insufficient RAM for a model?
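
          The closest thing I can think of is navigator.deviceMemory (Chromium-only and coarse) plus catching the RangeError when growing WebAssembly.Memory fails; a rough sketch (the model size below is a made-up figure):

            // Rough pre-flight memory check in the browser (sketch).
            const MODEL_BYTES = 2.5 * 1024 * 1024 * 1024;  // made-up size for a 2-bit 8B model
            const PAGE = 64 * 1024;                        // WASM page size

            // navigator.deviceMemory reports approximate RAM in GB, rounded and
            // clamped; Chromium-only, so it may be undefined.
            const approxRamGB: number | undefined = (navigator as any).deviceMemory;
            if (approxRamGB !== undefined && approxRamGB * 1024 ** 3 < MODEL_BYTES) {
              console.warn("Device likely has too little RAM for this model");
            }

            // The hard signal: WebAssembly.Memory.grow() throws RangeError when
            // the allocation cannot be satisfied.
            const initialPages = 256;  // 16 MiB
            const memory = new WebAssembly.Memory({ initial: initialPages, maximum: 65536 });
            try {
              memory.grow(Math.ceil(MODEL_BYTES / PAGE) - initialPages);
            } catch (e) {
              if (e instanceof RangeError) {
                console.error("Not enough memory available for this model");
              }
            }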