How to run DeepSeek R1 locally
84 points by grinich 5 months ago | 32 comments
- lxe 5 months ago: These mini models are NOT the DeepSeek R1.
> DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
The amount of confusion on the internet because of this seems surprisingly high. DeepSeek R1 has 671B parameters, and it's not easy to run on local hardware.
There are some ways to run it locally, like https://unsloth.ai/blog/deepseekr1-dynamic, which should let you fit the dynamic quant into 160GB of VRAM, but the quality will suffer.
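(A rough sketch of what the unsloth recipe looks like with llama.cpp; the repo/file names, offloaded layer count, and thread count below are illustrative and depend on which quant you download and on your hardware:)

    # after downloading the ~131GB 1.58-bit dynamic quant (split GGUFs) from Hugging Face:
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --cache-type-k q4_0 \
        --threads 16 \
        --n-gpu-layers 30 \
        --prompt "<|User|>What is 1+1?<|Assistant|>"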
There's also an MLX attempt on a cluster of Mac Ultras: https://x.com/awnihannun/status/1881412271236346233
- vunderba 5 months ago: Agreed - these are Qwen/Llama models that have been fine-tuned on data FROM DeepSeek. It kind of annoys me that the names of these models hosted on Ollama start with "DeepSeek-R1-XXX-XXX", since I think it's confusing a lot of people.
- drillsteps5 5 months ago: The two DeepSeek R1 distilled model families available through Ollama are actually very low quality Qwen and Llama models frankensteined with DeepSeek R1. They cannot be used to judge the capabilities of the original DeepSeek R1 in any way, shape, or form.
Ollama and LM Studio cause so much confusion because people simply use their pre-packaged models instead of exploring and comparing them to what else is available on the market (HuggingFace).
- diggan 5 months ago: Ollama makes it kind of cumbersome to download straight from Huggingface, unless something changed lately. It doesn't help that they felt the need to store files differently on disk either (inspired by Docker, seemingly), making it even harder to share stuff between applications.
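(For what it's worth, recent Ollama releases can pull a GGUF straight from Hugging Face by repo name, which softens this a bit; a minimal sketch, with the repo and quant chosen only as examples:)

    # pull and run a GGUF directly from Hugging Face (repo and quant tag are illustrative)
    ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF:Q4_K_M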
LM Studio, though, lets you browse and download straight from Huggingface (assuming GGUF), so people could spend more time looking for models. But I don't think many have the interest to do what many of us do: download tens of models and compare them against each other to find the best one for our use case.
- deeviant 5 months ago: You are correct, except it's not that hard to run locally. Here, somebody made a $6k rig to do it: https://x.com/carrigmat/status/1884244369907278106
- abound 5 months ago: For people who don't want to click into X or don't have an account to see the thread: the rig in question uses dual-socket EPYCs and no GPUs, relying on 768 GB of fast DDR5 RAM. It gets about 6-8 tokens/sec for the full DeepSeek R1 model.
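(Runs like this are typically done with llama.cpp; a rough sketch of a CPU-only invocation, where the model path, thread count, and context size are placeholders rather than values from the linked build:)

    # build llama.cpp, then point it at the first file of the split Q8_0 GGUF
    cmake -B build && cmake --build build --config Release -j
    ./build/bin/llama-cli \
        --model DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
        --threads 64 \
        --ctx-size 8192 \
        --prompt "<|User|>Hello<|Assistant|>"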
- lukan 5 months ago: That link is being discussed as well:
- pdntspa 5 months ago: I haven't played with the distillations extensively, but I think the internal monologue helps the LLM produce higher quality output, even if it's just Qwen or Llama.
- lherron 5 months ago: Ollama is making this worse by not denoting what you're getting in the model name. You have to look at individual model cards to see it's distilled.
- lxe 5 months ago: Both Ollama and Groq are spreading this misinfo for some reason.
- z3t4 5 months ago: So what hardware do I need to run DeepSeek R1 with 671B parameters?
- paxys 5 months ago: https://www.reddit.com/r/LocalLLaMA/comments/1ic8cjf/6000_co...
According to this, you can fit it on a CPU-only setup (no GPUs) with 2 x AMD EPYC CPUs and 24 x 32GB DDR5-RDIMM RAM. About $6000 MSRP for the rig. Doubt you are going to get very many tokens/sec out of it though (6-8, according to the author).
- lxe 5 months ago: Pretty impressive TPS numbers for CPU-only.
- kristjansson 5 months ago: Per the technical report:
> The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs.
but realistically, >=671GB of VRAM to run at full precision on GPU, or >=131GB of VRAM to run the most heavily quantized version[1], or >=671GB of RAM and a dose of patience to run on CPU.
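(Rough back-of-envelope for where those numbers come from, counting weights only and ignoring KV cache and activations; R1's native weights are FP8:)

    671e9 params x 8 bits / 8      ~ 671 GB  (full precision)
    671e9 params x ~1.58 bits / 8  ~ 131 GB  (unsloth dynamic quant)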
- makerdiety 4 months ago: Dang, that's a bad oversight. It's truly misinformation indeed.
- Flux159 5 months ago: To run the full 671B Q8 model relatively cheaply (around $6k), you can get a dual EPYC server with 768GB RAM - CPU inference only, at around 6-8 tokens/sec. https://x.com/carrigmat/status/1884244369907278106
There are a lot of low-quant ways to run it in less RAM, but the quality will be worse. Also, running a distill is not the same thing as running the larger model, so unless you have access to an 8xGPU server with lots of VRAM (>$50k), CPU inference is probably your best bet today.
If the new M4 Ultra Macs have 256GB unified RAM as expected, then you may still need to connect 3 of them together via Thunderbolt 5 in order to have enough RAM to run the Q8 model. I'm assuming that will be faster than the EPYC server, but it will need to be tested empirically once that machine is released.
- DrPhish 5 months ago:
> you can get a dual EPYC server with 768GB RAM - CPU inference only at around 6-8 tokens/sec
This is what I run at home. I built it just over a year ago and have run every single model that has been released.
- j45 5 months ago: For a second I thought I had missed the M4 Ultra Macs coming out.
- coder543 5 months ago:
> By default, this downloads the main DeepSeek R1 model (which is large). If you're interested in a specific distilled variant (e.g., 1.5B, 7B, 14B), just specify its tag
No… it downloads the 7B model by default. If you think that is large, then you better hold on to your seat when you try to download the 671B model.
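(The tag list on the Ollama library page makes the difference explicit; the notes below are approximate and the default tag may have changed since this was written:)

    ollama run deepseek-r1:7b      # Qwen-based distill, the small default
    ollama run deepseek-r1:70b     # Llama-based distill
    ollama run deepseek-r1:671b    # the actual R1 (4-bit quant), a several-hundred-GB download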
- robotnikman 5 months ago:
> then you better hold on to your seat when you try to download the 671B model.
I ended up downloading it in case it ever gets removed from the internet for whatever reason. Who knows, if VRAM becomes much cheaper in 10 years I might be able to run it locally without spending a fortune on GPUs!
- mdp2021 5 months ago: Maybe you could share information about the size of the different packages?
- coder543 5 months ago: They're all listed here: https://ollama.com/library/deepseek-r1/tags
- jascha_eng 5 months ago: None of these models are the real DeepSeek R1 that you can access via the API or chat! The big one is a quantized version (it uses 4 bits per weight), and even that you probably can't run.
The other ones are fine-tunes of Llama 3.3 and Qwen 2.5 which have been additionally trained on outputs of the big "DeepSeek V3 + R1" model.
I'm happy people are looking into self-hosting models, but if you want to get an idea of what R1 can do, this is not a good way to do so.
- kristjansson 5 months ago: Ollama is doing the community a serious disservice by presenting the various distillations of R1 as different versions of the same model. They're good improvements on their base models (on reasoning benchmarks, at least), but grouping them all under the same heading masks the underlying differences and contributions of the base models. I know they have further details on the page for each tag, but it still seems seriously misleading.
- paradite 5 months ago: It's worth mentioning that you need to adjust the context window manually for coding tasks (the default 2k is not enough).
Here's how to run deepseek-r1:14b (DeepSeek-R1-Distill-Qwen-14B) and set it to an 8k context window:
    ollama run deepseek-r1:14b
    /set parameter num_ctx 8192
    /save deepseek-r1:14b-8k
    ollama serve
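(An alternative sketch that bakes the larger context into a Modelfile instead of the interactive /set and /save steps; the tag name is just an example:)

    # Modelfile
    FROM deepseek-r1:14b
    PARAMETER num_ctx 8192

    # then create the variant and run it:
    ollama create deepseek-r1:14b-8k -f Modelfile
    ollama run deepseek-r1:14b-8k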
- Hizonner 5 months ago: Those distilled models are actually pretty good, but they're more Qwen than R1.
- rcarmo 5 months ago: FYI, the 14b model fits and runs fine on a 12GB NVIDIA 3060, and is pretty capable, but still frustrating to use: https://news.ycombinator.com/item?id=42863228
- _0xdd 5 months ago: The 14B model runs pretty quickly on a MacBook Pro with an M1 Max and 64 GB of RAM.
- xtracto 5 months ago: It is so exciting to see that we are once again in a position where hardware is the main bottleneck for new computing/software technologies.
I remember back in the late 80s (I was a kid) when we had MS-DOS and Windows was just starting. We had all kinds of crazy stuff like Quattro Pro, and later in the early 90s we had Wolf3D and all sorts of software that was limited by the hardware of the time.
With the current state of LLMs... imagine what it is going to look like in 10-15 years when the hardware race goes raging again.
The position of the different countries is also very interesting: China is in a great position economically and with the amount of talent it has, in addition to its culture and political direction (a centralized government that finds it easy to direct policies quickly). Meanwhile, the US is in a more fragile stance, with lots of internal fighting and even pushing [intelligent] people out of the country.
The previous race was to get to the moon (US vs Russia); it seems we are now watching a race to AGI.
- TMWNN 5 months ago:
> Meanwhile, the US is in a more fragile stance, with lots of internal fighting and even pushing [intelligent] people out of the country.
Oh, good grief. Amazing how people find ways of squeezing Orange Man Bad into anything and everything.
PS - I am not going to stay awake at night worrying that ICE will be deporting hordes of illegal aliens working at US AI companies.
- TMWNN 5 months ago: It also runs quite well on a 24GB MacBook, which is nowadays more or less the base configuration unless you go out of your way to buy a suffixless-CPU MacBook Pro.
- j45 5 months ago: DeepSeek is available via HuggingFace and can be run in/from tools like LM Studio locally with enough RAM. The 14B model is pretty impressive.
If you want the option of using the full model, there are privately hosted models that can be connected cheaply for those use cases, giving you "one place for all the models" locally.