StreamingLLM: tiny tweak to KV LRU improves long conversations

91 points by lucasluitjes 1 year ago | 8 comments

TrueDuality 1 year ago
There was a really interesting post a while ago about adjusting the softmax function so that attention heads are allowed to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and run tests on this, but boy, time gets away from you...
- zorgmonkey 1 year ago
  Feel free to mess with it. His tweak to softmax was actually supported by PyTorch before the article was written, but it is off by default. Maybe it needs to be more widely used, though; after all, good ideas are often independently discovered multiple times. Details are in this tweet: https://twitter.com/SamuelMullr/status/1683582347793530884. Or, if you don't like Twitter, the option is add_zero_attn on PyTorch's MultiheadAttention.
- magicalhippo 1 year ago
  Interesting! HN discussion of it here: https://news.ycombinator.com/item?id=36851494
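For anyone who wants to poke at the softmax tweak discussed above without installing anything, here is a minimal sketch in plain Python. It assumes the variant in question is exp(x_i) / (1 + sum_j exp(x_j)), and that appending a zero key/value slot (roughly what add_zero_attn does) amounts to adding one extra zero logit whose weight is then discarded. The function names softmax1 and softmax_with_zero_slot are made up for this illustration and are not part of any library.

```python
import math

def softmax1(logits):
    """Softmax with an extra 1 in the denominator: exp(x_i) / (1 + sum_j exp(x_j))."""
    # Shift by the max for numerical stability; 0.0 stands in for the implicit extra logit.
    m = max(max(logits), 0.0)
    exps = [math.exp(x - m) for x in logits]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]

def softmax_with_zero_slot(logits):
    """Ordinary softmax over the logits plus one appended zero logit,
    returning only the weights of the original positions."""
    padded = list(logits) + [0.0]
    m = max(padded)
    exps = [math.exp(x - m) for x in padded]
    total = sum(exps)
    return [e / total for e in exps[:-1]]

# When every score is strongly negative, the weights no longer have to sum to 1,
# so the head can effectively attend to nothing.
scores = [-4.0, -5.0, -6.0]
weights = softmax1(scores)
print(weights, sum(weights))
```

The two functions should agree up to floating-point error, which is the sense in which the zero slot and the modified softmax are the same idea. Whether that removes the need for attention sinks in a trained model is a separate empirical question that this sketch does not answer.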
popinman322 1 year ago
Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932
Translationaut 1 year ago
This seems to work only because large GPTs have redundant, undercomplex attention patterns. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128
gremlinsinc 1 year ago
I wonder if it could make sense to have breakaway bots, where at 10k tokens a new one launches with the first 2k tokens, the last 1k, and a table of contents, such that when you go back to something you're handed off to a model where that data is more strongly reinforced, or something like that. Sort of like mixture of experts, but each one is only an expert on an individual snippet of a long conversational thread.
- kgeist 1 year ago
  Here they simply used different models for different turns and apparently it gave more "engaging" results:
  https://arxiv.org/abs/2401.02994
- joshspankit 1 year ago
  You’re right: A lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same “neuronal neighborhood” as the previous conversation.
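The "first 2k plus last 1k at 10k tokens" handoff floated in this subthread could be sketched roughly as below. This is only an illustration under stated assumptions: tokens are a plain Python list, the thresholds are the ones from the comment, and handoff_context is a hypothetical helper, not something from StreamingLLM or any library. The table of contents and the routing back to an earlier bot are not implemented; the dropped middle is simply returned so a caller could index or summarize it.

```python
def handoff_context(tokens, limit=10_000, head=2_000, tail=1_000):
    """Return (kept, dropped) for a conversation that may have hit the limit.

    Below the limit, everything is kept. At or above it, the new bot gets the
    first `head` tokens and the last `tail` tokens; the middle is dropped.
    """
    if head + tail > limit:
        raise ValueError("head + tail must not exceed limit")
    if len(tokens) < limit:
        return list(tokens), []
    cut = len(tokens) - tail
    kept = tokens[:head] + tokens[cut:]
    dropped = tokens[head:cut]  # what a table of contents would need to cover
    return kept, dropped

# Toy example: integers stand in for token ids.
conversation = list(range(10_000))
kept, dropped = handoff_context(conversation)
print(len(kept), len(dropped))
```

Keeping the opening tokens alongside the most recent ones is similar in spirit to the attention-sink idea in the linked post, though there it is applied to the KV cache of a single model rather than to a handoff between bots.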