Show HN: Llama 3.1 8B CPU Inference in a Browser via WebAssembly

4 points by om8 6 months ago | 4 comments
  • om8 6 months ago
    This is a demo of what's possible to run on edge devices using SOTA quantization. Other projects that try to run 8B models in the browser either rely on WebGPU or on 2-bit quantization schemes that break the model. I implemented inference over the AQLM quantized representation, which gives a 2-bit quantized model that does not blow up.
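
    Roughly, the decode step looks like this (a simplified TypeScript sketch, not the real kernel; the actual implementation follows the AQLM scheme with codebooks plus per-channel scales and runs inside the WASM module):

      // Sketch of single-codebook AQLM-style dequantization: each group of 8
      // weights is stored as one 16-bit code (~2 bits/weight) indexing into a
      // learned codebook of 8-dimensional vectors. Scales are omitted here.
      function dequantizeGroups(
        codes: Uint16Array,       // one code per group of 8 weights
        codebook: Float32Array,   // 65536 entries x 8 values, learned offline
        out: Float32Array,        // dequantized weights, length = codes.length * 8
      ): void {
        const groupSize = 8;
        for (let g = 0; g < codes.length; g++) {
          const base = codes[g] * groupSize;  // start of the codebook vector
          for (let i = 0; i < groupSize; i++) {
            out[g * groupSize + i] = codebook[base + i];
          }
        }
      }
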
    • westurner 6 months ago
      Could this be done with WebNN? If not, how should the spec change?

      Re: WebNN, window.ai, navigator.ml: https://news.ycombinator.com/item?id=40834952

      • om8 6 months ago
        Not really a WebNN expert, but it looks like it doesn't support CPU inference yet and only works in Chrome. It also lacks support for custom kernels, which we need to run AQLM-quantized models.
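
        For reference, this is roughly what WebNN usage looks like (a sketch against the spec draft; descriptor field names vary between drafts, e.g. "dimensions" vs "shape", and type declarations for navigator.ml / MLGraphBuilder are assumed). The builder only exposes a fixed op set, so a custom AQLM dequantization kernel has nowhere to go:

          // WebNN sketch (W3C spec draft API; Chromium-only at the moment).
          // createContext accepts a deviceType option ("cpu" / "gpu" / "npu")
          // in the spec, even if CPU execution isn't available everywhere yet.
          const context = await (navigator as any).ml.createContext();
          const builder = new MLGraphBuilder(context);

          // A single dense layer: the weights already have to be dequantized
          // to float32 before they can be handed to the graph as a constant.
          const x = builder.input("x", { dataType: "float32", dimensions: [1, 4096] });
          const w = builder.constant(
            { dataType: "float32", dimensions: [4096, 4096] },
            new Float32Array(4096 * 4096),
          );
          const y = builder.matmul(x, w);

          const graph = await builder.build({ y });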

        When you ask about a spec change, do you mean the WebNN spec or something else?

        • westurner 6 months ago
          There's WebNN and WebGPU, but no WebTPU, so I guess WebNN would be it.

          "Web Neural Network API" W3C Candidate Recommendation Draft https://www.w3.org/TR/webnn/

          "WebGPU" W3C Candidate Recommendation Snapshot: https://www.w3.org/TR/webgpu/

          WebGPU: https://en.wikipedia.org/wiki/WebGPU

          I tested koboldcpp with a 7B GGUF model because it should work with 8 GB of VRAM, but FWIU kobold somehow pages between RAM and GPU VRAM. How does the user, in WASM, know that they have insufficient RAM for a model?
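
          The closest thing I can think of is navigator.deviceMemory (Chromium-only and coarse) plus catching the RangeError when growing WebAssembly.Memory fails; a rough sketch (the model size below is a made-up figure):

            // Rough pre-flight memory check in the browser (sketch).
            const MODEL_BYTES = 2.5 * 1024 * 1024 * 1024;  // made-up size for a 2-bit 8B model
            const PAGE = 64 * 1024;                        // WASM page size

            // navigator.deviceMemory reports approximate RAM in GB, rounded and
            // clamped; Chromium-only, so it may be undefined.
            const approxRamGB: number | undefined = (navigator as any).deviceMemory;
            if (approxRamGB !== undefined && approxRamGB * 1024 ** 3 < MODEL_BYTES) {
              console.warn("Device likely has too little RAM for this model");
            }

            // The hard signal: WebAssembly.Memory.grow() throws RangeError when
            // the allocation cannot be satisfied.
            const initialPages = 256;  // 16 MiB
            const memory = new WebAssembly.Memory({ initial: initialPages, maximum: 65536 });
            try {
              memory.grow(Math.ceil(MODEL_BYTES / PAGE) - initialPages);
            } catch (e) {
              if (e instanceof RangeError) {
                console.error("Not enough memory available for this model");
              }
            }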