StreamingLLM: tiny tweak to KV LRU improves long conversations
91 points by lucasluitjes 1 year ago | 8 comments

- TrueDuality 1 year ago: There was a really interesting post a while ago about adjusting the softmax function to allow attention heads to not make a choice (https://www.evanmiller.org/attention-is-off-by-one.html). It seems like that might remove the need for these attention sinks entirely. I keep meaning to go in and perform tests on this, but boy, time gets away from you...
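The off-by-one softmax from the linked article adds 1 to the denominator, so the weights can sum to less than 1 and a head can effectively abstain. A quick NumPy sketch of the idea (the scores are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_one(x):
    # Off-by-one softmax: exp(x_i) / (1 + sum_j exp(x_j)).
    # The max-subtraction is the usual numerical-stability trick;
    # exp(-max) is the "+1" term rescaled to match.
    e = np.exp(x - np.max(x))
    return e / (np.exp(-np.max(x)) + e.sum())

# A head with nothing relevant to attend to: all scores are very negative.
scores = np.array([-8.0, -9.0, -7.5])
print(softmax(scores).sum())      # always 1: the head is forced to choose
print(softmax_one(scores).sum())  # ~0.001: the head can effectively abstain
```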
- zorgmonkey 1 year ago: Feel free to mess with it; his tweak to softmax was actually supported by PyTorch before the article was written, but off by default. Maybe it needs to be more widely used, though; after all, good ideas are often independently discovered multiple times. Details are in this tweet: https://twitter.com/SamuelMullr/status/1683582347793530884 or, if you don't like Twitter, the option is add_zero_attn on PyTorch's MultiheadAttention.
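For what it's worth, appending an all-zero key/value pair (which is what add_zero_attn does) reproduces the off-by-one denominator exactly: a zero key scores 0 against every query, so it contributes exp(0) = 1 to the softmax sum. A small NumPy check of that equivalence (scores are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([1.2, -0.3, 0.7])  # made-up attention logits

# Attention with an extra zero-score slot, then drop that slot's weight...
with_zero_slot = softmax(np.append(scores, 0.0))[:-1]
# ...equals the off-by-one softmax exp(x_i) / (1 + sum_j exp(x_j)).
off_by_one = np.exp(scores) / (1.0 + np.exp(scores).sum())

print(np.allclose(with_zero_slot, off_by_one))  # True
```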
- magicalhippo 1 year ago: Interesting! HN discussion of it here: https://news.ycombinator.com/item?id=36851494
- popinman322 1 year ago: Previous discussion, on a link to the implementation: https://news.ycombinator.com/item?id=37740932
- Translationaut 1 year ago: This seems to work only because large GPTs have redundant, under-complex attention heads. See this issue in BertViz about attention in Llama: https://github.com/jessevig/bertviz/issues/128
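For context, the submission's trick is to pin the first few "attention sink" tokens in the KV cache and keep only a sliding window of recent tokens beyond them. A rough sketch of that eviction policy (sizes are illustrative, not the paper's values):

```python
# Illustrative StreamingLLM-style KV-cache eviction: always keep the first
# few "attention sink" entries plus a sliding window of the most recent ones.
N_SINK = 4   # hypothetical number of pinned sink tokens
WINDOW = 8   # hypothetical recent-token window

def evict(cache):
    """cache: list of per-token KV entries, oldest first."""
    if len(cache) <= N_SINK + WINDOW:
        return cache
    return cache[:N_SINK] + cache[-WINDOW:]

cache = list(range(20))  # stand-in for 20 cached token entries
print(evict(cache))      # keeps entries 0-3 and 12-19
```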
- gremlinsinc 1 year ago: I wonder if it could make sense to have break-away bots, where at 10k tokens a new one launches with the first 2k tokens, the last 1k, and a table of contents, such that when you go back to something you're handed off to a model where that data is more strongly reinforced. Sort of like mixture of experts, but each one is only an expert on individual snippets of a long conversational thread.
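That hand-off could be sketched roughly like this, working on characters instead of tokens; every name, threshold, and the summarizer stub here is hypothetical:

```python
BUDGET = 10_000  # spawn a new context past this size (hypothetical)
HEAD = 2_000     # opening span kept verbatim
TAIL = 1_000     # most recent span kept verbatim

def summarize(chunk: str) -> str:
    # Stand-in for a model call that writes a one-line table-of-contents entry.
    return chunk[:40].replace("\n", " ") + "..."

def breakaway_context(text: str) -> str:
    """Build the new bot's context: head + table of contents + tail."""
    if len(text) <= BUDGET:
        return text
    head, tail = text[:HEAD], text[-TAIL:]
    middle = text[HEAD:-TAIL]
    toc = "\n".join("[TOC] " + summarize(middle[i:i + 1_000])
                    for i in range(0, len(middle), 1_000))
    return head + "\n" + toc + "\n" + tail
```

Looking something up in the condensed middle would then mean routing back to the bot that still holds that span verbatim.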
- kgeist 1 year ago: Here they simply used different models for different turns, and apparently it gave more "engaging" results:
- joshspankit 1 year ago: You're right: a lot of the conversation can be condensed, especially if there are enough cues for the AI to arrive in the same "neuronal neighborhood" as the previous conversation.