Ask HN: How to execute a 180B+ LLM on a Turing machine?
Picture this: you're stuck with your “potato computer” (small RAM, no external GPU, very large SSD), and your LLM is saved on an external SSD.
Your task: run that LLM on your “potato PC” and try to achieve reasonable response times (e.g., 1 h to 24 h). Response times of a year or more would be impractical for most use cases.
And on a side note: how would you estimate the response time of a language model on low-end devices (e.g., Raspberry Pi, business laptops, MSP430)? Would you take basic linear-algebra operations as a given and estimate the number of steps from there?
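
For the estimation question, one rough model: when the weights don't fit in RAM, every generated token has to stream the full set of weights past the CPU, so the SSD's read bandwidth usually dominates over raw FLOPS. A minimal back-of-the-envelope sketch (all device numbers here are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope latency estimate; hypothetical numbers, not benchmarks.
# Autoregressive decoding touches every weight once per generated token, so
# on a machine whose RAM can't hold the model, SSD read bandwidth is usually
# the bottleneck rather than compute.

def seconds_per_token(n_params, bytes_per_param, ssd_gbps, cpu_gflops):
    weight_bytes = n_params * bytes_per_param
    io_time = weight_bytes / (ssd_gbps * 1e9)          # stream all weights from SSD
    compute_time = 2 * n_params / (cpu_gflops * 1e9)   # ~2 FLOPs per param per token
    return max(io_time, compute_time)                  # slower side dominates

# 180B model, 4-bit quantized (~0.5 byte/param), 0.5 GB/s SSD, 50 GFLOPS CPU
t = seconds_per_token(180e9, 0.5, ssd_gbps=0.5, cpu_gflops=50)
print(f"{t:.0f} s/token -> {t * 500 / 3600:.1f} h for a 500-token reply")
```

With those made-up numbers, I/O outweighs compute by more than an order of magnitude, which is exactly the bottleneck the techniques below try to attack.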
I expect the usual suspects to come up in this discussion:
— Memory-mapped I/O, aka treating an I/O device such as an SSD as if it were actual RAM (mmap); see the sketch after this list. BTW: `mmap` makes our secondary storage somewhat akin to an infinite tape in a Turing machine
— “LLM in a flash: Efficient Large Language Model Inference with Limited Memory”, https://arxiv.org/html/2312.11514v2 (04 Jan 2024)
— SSD “wear and tear” (a read-only mmap of the weights mostly issues reads; it's swap writes that burn through flash endurance)
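
On the mmap point, here's a minimal sketch of what that looks like in practice, assuming a hypothetical raw fp16 weight dump at /mnt/ssd/weights.bin (the path, dtype, and sizes are made up for illustration):

```python
import numpy as np

# Map the (hypothetical) weight file on the external SSD into virtual
# memory. mode="r" means read-only: the OS pages chunks into the small
# RAM on demand and can evict them freely, so no swap writes occur.
weights = np.memmap("/mnt/ssd/weights.bin", dtype=np.float16, mode="r")

# Slicing creates a view; still no I/O has happened.
layer0 = weights[0 : 4096 * 4096]

# Only when the data is actually touched does the kernel fault the
# backing pages in from the SSD.
acc = np.asarray(layer0).astype(np.float32).sum()
print(acc)
```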