Bringing K/V context quantisation to Ollama

220 points by mchiang 7 months ago | 32 comments
  • smcleod 7 months ago
    Shout out to everyone from Ollama and the wider community that helped with the reviews, feedback and assistance along the way. It's great to contribute to such a fantastic project.
    • octocop 7 months ago
      shout out to llama.cpp
    • smcleod 7 months ago
      Today I ran some perplexity benchmarks comparing F16 and Q8_0 for the K/V cache. I used Qwen 2.5 Coder 7b, as I've heard people say things to the effect of Qwen being more sensitive to quantisation than some other models.

      Well, it turns out there's barely any increase in perplexity at all - an increase of just 0.0043.

      Added to the post: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
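
      For anyone who wants to interpret that number: perplexity is just the exponential of the mean negative log-likelihood over the evaluation tokens. A rough Go sketch of the calculation (illustrative only, with made-up probabilities - not the llama.cpp benchmark code):

          package main

          import (
              "fmt"
              "math"
          )

          // perplexity returns exp(-mean(ln p)) over per-token probabilities,
          // i.e. the standard language-model perplexity.
          func perplexity(tokenProbs []float64) float64 {
              var sumLog float64
              for _, p := range tokenProbs {
                  sumLog += math.Log(p)
              }
              return math.Exp(-sumLog / float64(len(tokenProbs)))
          }

          func main() {
              // Hypothetical per-token probabilities for the same text from two
              // runs (e.g. F16 vs Q8_0 K/V cache); the numbers are made up.
              f16 := []float64{0.42, 0.31, 0.55, 0.12, 0.48}
              q8 := []float64{0.41, 0.31, 0.54, 0.12, 0.47}
              fmt.Printf("F16: %.4f  Q8_0: %.4f\n", perplexity(f16), perplexity(q8))
          }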

      • satvikpendem 7 months ago
        What's the best way to use Ollama with a GUI, just OpenWebUI? Are there any options as well for mobile platforms like Android? (Or can we even run LLMs on a phone in the first place?)
        • qudat 7 months ago
          For hosting a web gui for ollama I use https://tuns.sh

          It's really convenient because it's just an SSH tunnel, and then you get automatic TLS and it protects your home IP.

          With that you can access it from your mobile phone, just gotta require a password to access it.

        • huijzer 7 months ago
          I have Open WebUI on a Hetzner instance connected to Deep Infra. Works on mobile by turning the web page into an app. I find the web framework that WebUI uses quite bloated/slow, but apart from that it does work reliably. Price at Deep Infra is typically about $0.04 per month even when actively asking lots of questions during programming.
          • sadeshmukh 7 months ago
            A lot of the UIs, including Open WebUI, have an option to expose the server over LAN with user accounts - that's what I did to use my GPU while still being on my phone. Not entirely sure about native UIs though.

            Also, I normally use Groq's (with a q) API since it's really cheap with no upfront billing info required - it's a whole order of magnitude cheaper iirc than OpenAI/Claude. They literally have a /openai endpoint if you need compatibility.
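
            For example, pointing any OpenAI-style client (or a plain HTTP request) at Groq's compatibility endpoint looks roughly like this - a sketch that assumes a GROQ_API_KEY environment variable and a model id that may have changed since:

                package main

                import (
                    "bytes"
                    "encoding/json"
                    "fmt"
                    "io"
                    "net/http"
                    "os"
                )

                func main() {
                    // Groq exposes an OpenAI-compatible API under /openai/v1.
                    payload, _ := json.Marshal(map[string]any{
                        "model": "llama-3.1-8b-instant", // example id; check Groq's current model list
                        "messages": []map[string]string{
                            {"role": "user", "content": "Say hello in five words."},
                        },
                    })

                    req, _ := http.NewRequest("POST",
                        "https://api.groq.com/openai/v1/chat/completions",
                        bytes.NewReader(payload))
                    req.Header.Set("Authorization", "Bearer "+os.Getenv("GROQ_API_KEY"))
                    req.Header.Set("Content-Type", "application/json")

                    resp, err := http.DefaultClient.Do(req)
                    if err != nil {
                        panic(err)
                    }
                    defer resp.Body.Close()

                    body, _ := io.ReadAll(resp.Body)
                    fmt.Println(string(body)) // raw OpenAI-style JSON response
                }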

            You can look in the direction of Google's Gemma if you need a lightweight open weights LLM - there was something there that I forgot.

            • rkwz 7 months ago
              If you’re using a Mac, I’ve built a lightweight native app - https://github.com/sheshbabu/Chital
              • antirez 7 months ago
                That's very cool, finally a native app that runs fast. Thanks.
                • rkwz 7 months ago
                  Thanks for the kind words :)
              • smcleod 7 months ago
                I personally use a mix of Open WebUI, Big AGI, BoltAI and AnythingLLM on the desktop. The mobile space is where things are really lacking at the moment - really I just end up browsing to Open WebUI, but that's not ideal. I'd love an iOS native client that's well integrated into Siri, Shortcuts, Sharing etc...
                • magicalhippo 7 months ago
                  As a Windows user who just wanted something bare bones for playing around, I found this[1] small project useful. It also supports multi-modal models, which is nice.

                  [1]: https://github.com/jakobhoeg/nextjs-ollama-llm-ui

                  • paradite 7 months ago
                    I built a custom GUI for coding tasks specifically, with built-in code context management and workspaces:

                    https://prompt.16x.engineer/

                    Should work well if you have 64G vRAM to run SOTA models locally.

                    • accrual 7 months ago
                      Great looking GUI, I find simple black/white/boxy/monospace UIs very effective.
                      • throwaway314155 7 months ago
                        > Should work well if you have 64G vRAM to run SOTA models locally.

                        Does anyone have this?

                        edit: Ah, it's a Mac app.

                        • gzer0 7 months ago
                          M4 Max with 128 GB RAM here. ;) Love it. A very expensive early Christmas present.
                          • paradite 7 months ago
                            Yeah Mac eats Windows on running LLMs.

                            My app does support Windows though, you can connect to OpenAI, Claude, OpenRouter, Azure and other 3rd party providers. Just running SOTA LLMs locally can be challenging.

                          • vunderba 7 months ago
                            As far as open source goes, I'd probably recommend LibreChat. It has connections for ollama, openai, anthropic, etc. It lets you set up auth so you can theoretically use it from anywhere (phone, etc.).

                            Fair warning: it's relatively heavyweight insofar as it has to spin up a number of Docker containers, but it works very well.

                            https://github.com/danny-avila/LibreChat

                            • zerop 7 months ago
                              There are many; apart from what others mentioned, I am exploring Anything LLM - https://anythingllm.com/. I liked the workspace concept in it: you can group documents into workspaces, and the RAG scope is managed per workspace.
                                • seb314 7 months ago
                                  For running llms _locally_ on Android, there's "pocketpal" (~7tok/s on a pixel 7 pro for some quant of llama 3.2 3B).

                                  (Not sure if it uses ollama though)

                                  • lastdong 7 months ago
                                    Great project! Do you think there might be some advantages to bringing this over to LLaMA-BitNet?
                                    • wokwokwok 7 months ago
                                      Nice.

                                      That said... I mean...

                                      > The journey to integrate K/V context cache quantisation into Ollama took around 5 months.

                                      ??

                                      They incorrectly tagged #7926, which is a 2 line change, instead of #6279 where it was implemented. That made me dig a bit deeper, and reading the actual change it seems:

                                      The commit (1) is:

                                          > params := C.llama_context_default_params()
                                          > ...
                                          > params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
                                          > params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
                                      
                                      Which has been part of llama.cpp since Dec 7, 2023 (2).
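
                                      For context, kvCacheTypeFromStr just maps that configured string onto a ggml cache type - roughly something like the sketch below (paraphrased from memory, not the actual Ollama source):

                                          // Paraphrased sketch of the helper the commit calls; the real
                                          // version lives in Ollama's cgo bindings and may differ in detail.
                                          func kvCacheTypeFromStr(s string) C.enum_ggml_type {
                                              switch s {
                                              case "q8_0":
                                                  return C.GGML_TYPE_Q8_0
                                              case "q4_0":
                                                  return C.GGML_TYPE_Q4_0
                                              default:
                                                  return C.GGML_TYPE_F16 // unquantised default
                                              }
                                          }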

                                      So... mmmm... while this is great, somehow I'm left feeling kind of vaguely put-off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.

                                      > It took 5 months, but we got there in the end.

                                      ... I guess... yay? The challenges don't seem like they were technical, but I guess, good job getting it across the line in the end?

                                      [1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...

                                      [2] - since https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...

                                      • meesles 7 months ago
                                        The author describes why it took as long as it did in the post, so I don't think they're trying to be disingenuous. Getting minor changes merged upstream in large projects is difficult for newer concepts, since you need adoption and support.

                                        The full release seems to contain more code[1], and the author references the llama.cpp pre-work and its author as well.

                                        This person is also not a core contributor, so this reads as a hobbyist and fan of AI dev that is writing about their work. Nothing to be ashamed of IMO.

                                        [1] - https://github.com/ollama/ollama/compare/v0.4.7...v0.4.8-rc0

                                        • smcleod 7 months ago
                                          > this reads as a hobbyist and fan of AI dev that is writing about their work

                                          Bingo, that's me!

                                          I suspect the OP didn't actually read the post.

                                          1. As you pointed out, it's about getting the feature working, enabled and contributed into Ollama, not in llama.cpp

                                          2. Digging through git commits isn't useful when you work hard to squash commits before merging a PR - there were a _lot_ over the last 5 months.

                                          3. While I'm not a Go dev (and the introduction of cgo partway through threw me a bit), there certainly were technicalities along the way. I suspect they not only didn't bother to read the post, they also didn't bother to read the PR.

                                          Also, just to clarify - I didn't even share this here; it's just my personal blog of things I did, so I can remember them when I look back years later.

                                        • guywhocodes 7 months ago
                                          This is par for Ollama - look at the log_probs issues/PRs and you get an idea of how well Ollama is run.

                                          Ollama is IMO a model downloader for llama.cpp so you can do roleplay with ease.

                                          • smcleod 7 months ago
                                            I'm going to be generous here and assume you didn't bother to actually read the post (or even the PR) before writing a snarky, non-constructive comment, but skimming through your HN comment history this appears to be on-brand.
                                            • wokwokwok 7 months ago
                                              I'll be generous and just say, maybe people should just use llama.cpp and not ollama if they care about having nice things, if merging support for existing features is that difficult.

                                              It seems like it's probably a better choice overall.

                                              That said, I'm sure people worked very hard on this, and it's nice to see it as a part of ollama for the people that use it.

                                              Also:

                                              > Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

                                              https://news.ycombinator.com/newsguidelines.html

                                              • smcleod 7 months ago
                                                I'm not sure what kind of vendetta you have against Ollama, but I'll paste here what I've written before when I've heard claims along the lines of "Ollama is just a wrapper for llama.cpp":

                                                With llama.cpp running on a machine, how do you connect your LLM clients to it and request that a model be loaded with a given set of parameters and templates?

                                                … you can’t, because llama.cpp is the inference engine - and its bundled llama-cpp-server binary only provides relatively basic server functionality - it’s really more of a demo/example or MVP.

                                                Llama.cpp is configured entirely at the time you run the binary, when you manually provide command line args for the one specific model and configuration you start it with.

                                                Ollama provides a server and client for interfacing and packaging models, such as:

                                                  Hot loading models (e.g. when you request a model from your client, Ollama will load it on demand - see the sketch at the end of this comment).
                                                  Automatic model parallelisation.
                                                  Automatic model concurrency.
                                                  Automatic memory calculations for layer and GPU/CPU placement.
                                                  Layered model configuration (basically docker images for models).
                                                  Templating and distribution of model parameters and templates in a container image.
                                                  Near feature-complete OpenAI-compatible API, as well as its own native API that supports more advanced features such as model hot loading, context management, etc…
                                                  Native libraries for common languages.
                                                  Official container images for hosting.
                                                  Provides a client/server model for running remote or local inference servers with either Ollama or OpenAI-compatible clients.
                                                  Support for both an official and self hosted model and template repositories.
                                                  Support for multi-modal / Vision LLMs - something that llama.cpp is not focusing on providing currently.
                                                  Support for serving safetensors models, as well as running and creating models directly from their Huggingface model ID.
                                                  In addition to the llama.cpp engine, Ollama are working on adding further model backends.
                                                
                                                
                                                Ollama is not “better” or “worse” than llama.cpp because it’s an entirely different tool.
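
                                                To make the hot loading / client-server point concrete: any client can simply POST to the Ollama server and the model is loaded on demand. A minimal Go sketch against a local server on the default port (it assumes Ollama is listening on 11434 and that the model tag has already been pulled):

                                                    package main

                                                    import (
                                                        "bytes"
                                                        "encoding/json"
                                                        "fmt"
                                                        "net/http"
                                                    )

                                                    func main() {
                                                        // Ollama's native /api/generate endpoint; the server loads the
                                                        // requested model on demand if it isn't already resident.
                                                        payload, _ := json.Marshal(map[string]any{
                                                            "model":  "qwen2.5-coder:7b", // any model tag you've pulled
                                                            "prompt": "Explain K/V cache quantisation in one sentence.",
                                                            "stream": false,
                                                        })

                                                        resp, err := http.Post("http://localhost:11434/api/generate",
                                                            "application/json", bytes.NewReader(payload))
                                                        if err != nil {
                                                            panic(err)
                                                        }
                                                        defer resp.Body.Close()

                                                        var out struct {
                                                            Response string `json:"response"`
                                                        }
                                                        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
                                                            panic(err)
                                                        }
                                                        fmt.Println(out.Response)
                                                    }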
                                            • yard2010 7 months ago
                                              "Judge not, that ye be not judged"