Visualizing Attention, a Transformer's Heart [video]

999 points by rohitpaulk 1 year ago | 172 comments
  • Xcelerate 1 year ago
    As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much) it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

    In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.

    From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector alone. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems a lot to me like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one step time translation, and finally transform the resulting vector back into its original basis.

    IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

    • francasso 1 year ago
      I don't think the analogy holds: even if you forget all the preceding non linear steps, you are still left with just a linear dynamical system. It's neither complex nor unitary, which are two fundamental characteristics of quantum mechanics.
      • bdjsiqoocwk 1 year ago
        I think you're just describing a state machine, no? The fact that you encode the state in a vector and steps by matrices is an implementation detail...?
        • Xcelerate 1 year ago
          Perhaps a probabilistic FSM describes the actual computational process better since we don’t have a concept equivalent to superposition with transformers (I think?), but the framework of a FSM alone doesn’t seem to capture the specifics of where the model/machine comes from (what I’m calling the Hamiltonian), nor how a given context window (the subsystem) relates to it. The change of basis that involves the attention mechanism (to achieve context-awareness) seems to align better with existing concepts in QM.

          One might model the human brain as a FSM as well, but I’m not sure I’d call the predictive ability of the brain an implementation detail.

          • BoGoToTo 1 year ago
            > context window

            I actually just asked a question on the physics stack exchange that is semi relevant to this. https://physics.stackexchange.com/questions/810429/functiona...

            In my question I was asking about a hypothetical time-evolution operator that includes an analog of a light cone that you could think of as a context window. If you had a quantum state that was evolved through time by this operator, then I think the speed of light could be seen as a byproduct of the width of the context window of the operator that progresses the quantum state forward by some time interval.

            Note I am very much hobbyist-tier with physics so I could also be way off base and this could all be nonsense.

          • feoren 1 year ago
            Not who you asked (and I don't quite understand everything) but I think that's about right, except in the continuous world. You pick an encoding scheme (either the Lagrangian or the Hamiltonian) to go from state -> vector. You have a "rules" matrix, very roughly similar to a Markov matrix, H, and (stretching the limit of my knowledge here) exp(-iHt) very roughly "translates" from the discrete stepwise world to the continuous world. I'm sure that last part made more knowledgeable people cringe, but it's roughly in the right direction. The part I don't understand at all is the -i factor: exp(-it) just circles back on itself after t=2pi, so it feels like exp(-iHt) should be a periodic function?
            • empiricus 1 year ago
              Yes, exp(-iHt) means the vector state is rotating as time passes, and it rotates faster when the Hamiltonian (energy) is bigger. This rotation gives the wave like behavior. Slightly related, there is an old video of Feynman where he tries to teach quantum mechanics to some art students, and he explains this complex rotation and its effects without any reference to math.
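              A tiny numerical sketch of that rotation (illustrative values, assuming a diagonal two-level Hamiltonian): each component just picks up a phase exp(-iEt), the norm stays 1, and the overall state only returns to itself at t = 2*pi if the energies are commensurate -- which is why exp(-iHt) isn't simply periodic in general.

                import numpy as np

                # Hypothetical two-level system with diagonal H = diag(E1, E2).
                # U(t) = exp(-iHt) then just multiplies each component by exp(-i*E*t).
                E = np.array([1.0, np.sqrt(2)])               # incommensurate energies
                psi0 = np.array([0.6, 0.8], dtype=complex)    # normalized initial state

                for t in (0.0, 2 * np.pi, 4 * np.pi):
                    psi_t = np.exp(-1j * E * t) * psi0        # rotation in the complex plane
                    print(t, psi_t, np.linalg.norm(psi_t))    # norm (total probability) stays 1
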
          • BoGoToTo 1 year ago
            I've been thinking about this a bit lately. If time is non-continuous, then could you model the time evolution of the universe as some operator recursively applied to the quantum state of the universe? If each application of the operator progresses the state of the universe by a single Planck time, could we even observe a difference between that and a universe where time is continuous?
            • tweezy 1 year ago
              So one of the most "out there" non-fiction books I've read recently is called "Alien Information Theory". It's a wild ride and there's a lot of flat-out crazy stuff in it but it's a really engaging read. It's written by a computational neuroscientist who's obsessed with DMT. The DMT parts are pretty wild, but the computational neuroscience stuff is intriguing.

              In one part he talks about a thought experiment modeling the universe as a multidimensional cellular automaton, where fundamental particles are nothing more than the information they contain, and particles colliding is a computation that tells that node and the adjacent nodes how to update their state.

              Way out there, and I'm not saying there's any truth to it. But it was a really interesting and fun concept to chew on.

              • phrotoma 1 year ago
                Definitely way out there and later chapters are what I can only describe as wild conjecture, but I also found it to be full of extremely accessible foundational chapters on brain structure and function.
                • andoando 1 year ago
                  I'm working on a model to do just that :) The game of life is not too far off either.
                  • Gooblebrai 1 year ago
                    You might enjoy his next book: Reality Switch.
                  • BobbyTables2 1 year ago
                    I think Wolfram made news proposing something roughly along these lines.

                    Either way, I find Planck time/energy to be a very spooky concept.

                    https://wolframphysics.org/

                    • pas 1 year ago
                      This sounds like the Bohmian pilot wave theory (which is a global formulation of QM). ... Which might be not that crazy, since spooky action at a distance is already a given. And in cosmology (or quantum gravity) some models describe a region of space based only on its surface. So in some sense the universe is much less information-dense than we think.

                      https://en.m.wikipedia.org/wiki/Holographic_principle

                    • cmgbhm 1 year ago
                        Not a direct comment on the question, but I had a math PhD as an intern before. One of his comments was that all this high-dimensional linear algebra was super advanced math in the early 1900s and still has plenty of room for new CS discovery.

                        Didn’t make the “what was going on in physics then” connection until now.

                      • tpurves 1 year ago
                        So what you are saying is that, we've reached the point where our own most sophisticated computer models are starting to approach the same algorithms that define the universe we live in? Aka, the simulation is showing again?
                        • lagrange77 1 year ago
                          I only understand half of it, but it sounds very interesting. I've always wondered, if the principle of stationary action could be of any help with machine learning, e.g. provide an alternative point of view / formulation.
                        • seydor 1 year ago
                          I have found the youtube videos by CodeEmporium to be simpler to follow https://www.youtube.com/watch?v=Nw_PJdmydZY

                          Transformer is hard to describe with analogies, and TBF there is no good explanation why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting on one another

                          • mjburgess 1 year ago
                            The explanation is just that NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words). Their weights are a model of this distribution. LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data.

                              Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent word in the dataset, and the NN is a model of those frequencies.

                              Why is 'London in UK' "known" but 'London in France' isn't? Just because 'UK' occurs much more frequently in the dataset.

                              The algorithm isn't doing anything other than aligning computation to hardware; the computation isn't doing anything interesting. The value comes from the conditional probability structure in the data -- and that structure comes from people arranging words usefully, because they're communicating information with one another.
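
                              As a toy sketch of that claim (hypothetical five-sentence corpus; a real NN fits a smoothed model of this conditional structure rather than storing an explicit count table):

                                from collections import Counter, defaultdict

                                # Toy illustration of P(next_word | previous_words) as raw frequencies.
                                corpus = ("the cat sat on the mat . the dog sat on the rug . "
                                          "the cat sat on the mat .").split()

                                n = 4  # condition on the previous 4 words
                                counts = defaultdict(Counter)
                                for i in range(n, len(corpus)):
                                    counts[tuple(corpus[i - n:i])][corpus[i]] += 1

                                prev = ("cat", "sat", "on", "the")
                                total = sum(counts[prev].values())
                                for word, c in counts[prev].most_common():
                                    print(word, c / total)   # 'mat' wins simply because it is most frequent here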

                            • nerdponx 1 year ago
                              I think you're downplaying the importance of the attention/transformer architecture here. If it was "just" a matter of throwing compute at probabilities, then we wouldn't need any special architecture at all.

                              P(next_word|previous_words) is ridiculously hard to estimate in a way that is actually useful. Remember how bad text generation used to be before GPT? There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

                              • JeremyNT 1 year ago
                                > There is innovation in discovering an architecture that makes it possible to learn P(next_word|previous_words), in addition to the computing techniques and hardware improvements required to make it work.

                                Isn't that essentially what mjburgess said in the parent post?

                                > LLMs are a hardware innovation: they make it possible for GPUs to compute this at scale across TBs of data... The algorithm isnt doing anything other than aligning computation to hardware

                                • mjburgess 1 year ago
                                    Yes, it's really hard -- the innovation is aligning the really basic dot-product similarity mechanism to hardware. You can use basically any NN structure to do the same task; the issue is that they're untrainable because they aren't parallelizable.

                                  There is no innovation here in the sense of a brand new algorithm for modelling conditional probabilities -- the innovation is in adapting the algorithm for GPU training on text/etc.

                                • IanCal 1 year ago
                                  This is wrong, or at least a simplification to the point of removing any value.

                                  > NNs are a stat fitting alg learning a conditional probability distribution, P(next_word|previous_words).

                                  They are trained to maximise this, yes.

                                  > Their weights are a model of this distribution.

                                  That doesn't really follow, but let's leave that.

                                    > Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent word in the dataset, and the NN is a model of those frequencies.

                                  Here's the rub. If how you describe them is all they're doing then a sequence of never-before-seen words would have no valid response. All words would be equally likely. It would mean that a single brand new word would result in absolute gibberish following it as there's nothing to go on.

                                  Let's try:

                                  Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

                                  Result: You now have two kjsdhlisrnj.

                                    I would wager a solid amount that kjsdhlisrnj never appears in the input data. If it does, pick another one; it doesn't matter.

                                  So we are learning something more general than the frequencies of sequences of tokens.

                                  I always end up pointing to this but OthelloGPT is very interesting https://thegradient.pub/othello/

                                  While it's trained on sequences of moves, what it does is more than just "sequence a,b,c is followed by d most often"

                                  • mjburgess 1 year ago
                                    Any NN "trained on" data sampled from an abstract complete outcome space (eg., a game with formal rules; mathematical sequences, etc) can often represent that space completely. It comes down to whether you can form conditional probability models of the rules, and that's usually possible because that's what abstract rules are.

                                    > I have one kjsdhlisrnj and I add another kjsdhlisrnj, tell me how many kjsdhlisrnj I now have.

                                    1. P(number-word|tell me how many...) > P(other-kinds-of-words|tell me how many...)

                                    2. P(two|I have one ... I add another ...) > P(one|...) > P(three|...) > others

                                    This is trivial.

                                    • pas 1 year ago
                                      how does it work underneath?

                                      "kjsdhlisrnj" is in the context, it gets tokenized, and now when the LLM is asked to predict/generate next-token sequences somehow "kjsdhlisrnj" is there too. it learns patterns. okay sure, they ger encoded somehow, but during infernce how does this lead to application of a recalled pattern on the right token(s)?

                                      also, can it invent new words?

                                    • albertzeyer 1 year ago
                                        You are speaking more about n-gram models here. NNs do far more than that.

                                      Or if you just want to say that NNs are used as a statistical model here: Well, yea, but that doesn't really tell you anything. Everything can be a statistical model.

                                        E.g., you could also say "this is exactly the way the human brain works", but that doesn't really tell you anything about how it really works.

                                      • cornholio 1 year ago
                                        > "this is exactly the way the human brain works"

                                          I'm always puzzled by such assertions. A cursory look at the technical aspects of an iterated attention-perceptron transformation clearly shows it's just a convoluted and powerful way to query the training data, a "fancy" Markov chain. The only rationality it can exhibit is that which is already embedded in the dataset. If trained on nonsensical data it would generate nonsense, and if trained with a partially nonsensical dataset it would generate an average between truth and nonsense that maximizes some abstract algorithmic goal.

                                          There is no knowledge generation going on, no rational examination of the dataset through the lens of an internal model of reality that allows the rejection of invalid premises. The intellectual food has already been chewed and digested in the form of the training weights, with the model just mechanically extracting the nutrients, as opposed to venturing out into the world to hunt.

                                        So if it works "just like the human brain", it does so in a very remote sense, just like a basic neural net works "just like the human brain", i.e individual biological neurons can be said to be somewhat similar.

                                        • mjburgess 1 year ago
                                          My description is true of any statistical learning algorithm.

                                          The thing that people are looking to for answers, the NN itself, does not have them. That's like looking to Newton's compass to understand his general law of gravitation.

                                            The reason that LLMs trained on the internet and every ebook produce output with the structure of human communication is that the dataset has that structure. Why does the data have that structure? That requires science; there is no explanation "in the compass".

                                            NNs are statistical models trained on data -- drawing analogies to animals is a mystification that causes people's ability to think clearly to jump out the window. No one compares stock price models to the human brain; no banking regulator says, "well, your volatility estimates were off because your machines had the wrong thoughts". This is pseudoscience.

                                            Animals are not statistical learning algorithms, so the reason that's uninformative is that it's false. Animals are in direct causal contact with the world and uncover its structure through interventional action and counterfactual reasoning. The structure of animal bodies and their general learning strategies are well known, and have nothing to do with LLMs/NNs.

                                          The reason that I know "The cup is in my hand" is not because P("The cup is in my hand"|HistoricalTexts) > P(not "The cup is in my hand"|HistoricalTexts)

                                        • michaelt 1 year ago
                                          That's not really an explanation that tells people all that much, though.

                                            I can explain that car engines 'just' convert gasoline into forward motion. But if the person hearing the explanation is hoping to learn what a cam belt or a gearbox is, or why cars are more reliable now than they were in the 1970s, or what premium gas is for, or whether helicopter engines work on the same principle - they're going to need a more detailed explanation.

                                          • mjburgess 1 year ago
                                              It explains the LLM/NN. If you want to explain why it emits words in a certain order, you need to explain how reality generated the dataset, i.e., you need to explain how people communicate (and so on).

                                            There is no mystery why an NN trained on the night sky would generate nightsky-like photos; the mystery is why those photos have those patterns... solving that is called astrophysics.

                                              Why do people, in reasoning through physics problems, write symbols in a certain order? Well, explain physics, reasoning, mathematical notation, and so on. The ordering of the symbols gives rise to a certain utility in imitating that order -- but it isn't explained by that order. That's circular: "LLMs generate text in the order they do, because that's the order of the text they were given."

                                          • forrestthewoods 1 year ago
                                            I find this take super weak sauce and shallow.

                                            This recent $10,000 challenge is super super interesting imho. https://twitter.com/VictorTaelin/status/1778100581837480178

                                            State of the art models are doing more than “just” predicting the probability of the next symbol.

                                            • mjburgess 1 year ago
                                              You underestimate the properties of the sequential-conditional structure of human communication.

                                              Consider how a clever 6yo could fake being a physicist with access to a library of physics textbooks and a shredder. All the work is done for them. You'd need to be a physicist to spot them faking it.

                                              Of course, LLMs are in a much better position than having shredded physics textbooks -- they have shreddings of all books. So you actually have to try to expose this process, rather than just gullibly prompt using confirmation bias. It's trivial to show they work this way, both formally and practically.

                                              The issue is, practically, gullible people aren't trying.

                                            • sirsinsalot 1 year ago
                                              It isn't some kind of Markov chain situation. Attention cross-links the abstract meaning of words, subtle implications based on context and so on.

                                              So, "mat" follows "the cat sat on the" where we understand the entire worldview of the dataset used for training; not just the next-word probability based on one or more previous words ... it's based on all previous meaning probability, and those meaning probablility and so on.

                                              • seydor 1 year ago
                                                People specifically would like to know what the attention calculations add to this learning of the distribution
                                                • ffwd 1 year ago
                                                    Just speculating, but I think attention enables differentiation of semantic concepts for a word or sentence within a particular context. For any total set of training data you have a smaller number of semantic concepts (say you have 10,000 words; they might contain 2,000 semantic concepts, and those concepts are defined by the sentence structure and surrounding words, which is why they have a particular meaning), and attention then allows you to differentiate those different contexts at different levels (words/etc). Also, the fact that you can do this attention at runtime/inference means you can generate the context from the prompt, which enables the flexibility of variable prompt/variable output, but you lose the precision of giving an exact prompt and getting an exact answer.
                                                • astrange 1 year ago
                                                  LLMs don't work on words, they work on sequences of subword tokens. "It doesn't actually do anything" is a common explanation that's clearly a form of cope, because you can't even explain why it can form complete words, let alone complete sentences.
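
                                                    For example (a sketch assuming the tiktoken library and its cl100k_base encoding; the exact split depends on the tokenizer), a made-up word isn't "unknown" at all, it just decomposes into familiar subword pieces:

                                                      # pip install tiktoken
                                                      import tiktoken

                                                      enc = tiktoken.get_encoding("cl100k_base")
                                                      ids = enc.encode("I have one kjsdhlisrnj and I add another kjsdhlisrnj")
                                                      print(ids)                                # token ids
                                                      print([enc.decode([i]) for i in ids])     # the subword pieces they stand for
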
                                                  • fspeech 1 year ago
                                                      There are an infinite number of distributions that can fit the training data well (e.g., one that completely memorizes the data and therefore replicates the frequencies). The trick is to find the distributions that generalize well, and here the NN architecture is critical.
                                                    • fellendrone 1 year ago
                                                        > Why does 'mat' follow from 'the cat sat on the ...'?

                                                      You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

                                                      This is tantamount to a straw man. Not only do few people use untuned base models, it completely ignores in-context learning that allows the model to build complex semantic structures from the relationships learnt from its training data.

                                                      Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

                                                        You're also conflating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest". While technically correct, complex behaviours evolve to satisfy this 'survival' metric.

                                                      You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

                                                      For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

                                                      For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

                                                      For three, it shows a lack of understanding in how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/

                                                      For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.

                                                      • raindear 1 year ago
                                                          But why do transformers perform better than older language models, including other neural language models?
                                                        • nextaccountic 1 year ago
                                                          > Why does 'mat' follow from 'the cat sat on the ...'? Because 'mat' is the most frequent word in the dataset, and the NN is a model of those frequencies.

                                                          What about cases that are not present in the dataset?

                                                          The model must be doing something besides storing raw probabilities to avoid overfitting and enable generalization (imagine that you could have a very performant model - when it works - but it sometimes would spew "Invalid input, this was not in the dataset so I don't have a conditional probability and I will bail out")

                                                        • blt 1 year ago
                                                          As a computer scientist, the "differentiable hash table" interpretation worked for me. The AIAYN paper alludes to it by using the query/key/value names, but doesn't explicitly say the words "hash table". I guess some other paper introduced them?
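
                                                            A minimal numpy sketch of that interpretation (illustrative shapes only): a hard hash table returns the one value whose key matches the query exactly, while attention returns a softmax-weighted blend of all values scored by query-key similarity, so the lookup is differentiable:

                                                              import numpy as np

                                                              def soft_lookup(q, K, V):
                                                                  # "how well does each key match the query?" -> weights -> blended value
                                                                  scores = K @ q / np.sqrt(q.shape[0])
                                                                  w = np.exp(scores - scores.max())
                                                                  w = w / w.sum()                 # softmax over the n keys
                                                                  return w, w @ V                 # weights and the blended "value"

                                                              rng = np.random.default_rng(0)
                                                              q = rng.normal(size=8)              # one query vector
                                                              K = rng.normal(size=(5, 8))         # 5 keys
                                                              V = rng.normal(size=(5, 4))         # 5 values
                                                              weights, value = soft_lookup(q, K, V)
                                                              print(weights.round(2), value.round(2))
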
                                                          • nerdponx 1 year ago
                                                            > TBF there is no good explanation why it works

                                                            My mental justification for attention has always been that the output of the transformer is a sequence of new token vectors such that each individual output token vector incorporates contextual information from the surrounding input token vectors. I know it's incomplete, but it's better than nothing at all.

                                                            • eurekin 1 year ago
                                                              > TBF there is no good explanation why it works

                                                              I thought the general consensus was: "transformers allow neural networks to have adaptive weights".

                                                              As opposed to the previous architectures, where every edge connecting two neurons always has the same weight.

                                                              EDIT: a good video, where it's actually explained better: https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay

                                                              • rcarmo 1 year ago
                                                                You're effectively steering the predictions based on adjacent vectors (and precursors from the prompt). That mental model works fine.
                                                            • rayval 1 year ago
                                                              Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

                                                              This complements the detailed description provided by 3blue1brown

                                                              • bugthe0ry 1 year ago
                                                                  When visualised this way, the scale of GPT-3 is insane. I can't imagine what GPT-4 would look like here.
                                                                • spi 1 year ago
                                                                    IIRC, GPT-4 would actually be a bit _smaller_ to visualize than GPT-3. Details are not public, but from the leaks GPT-4 (at least, some by-now old version of it) was a mixture of experts, with every model having around 110B parameters [1]. So, while the total number of parameters is bigger than GPT-3's (1800B vs. 175B), it is "just" 16 copies of a smaller (110B-parameter) model. So if you wanted to visualize it in any meaningful way, the plot wouldn't grow bigger - or it would, if you included all different experts, but they are just copies of the same architecture with different parameters, which is not all that useful for visualization purposes.

                                                                  [1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...

                                                                  • joaogui1 1 year ago
                                                                      Mixture of Experts is not just 16 copies of a network, it's a single network where for the feed-forward layers the tokens are routed to different experts, but the attention layers are still shared. Also there are interesting choices around how the routing works, and I believe the exact details of what OpenAI is doing are not public. In fact I believe someone making a visualization of that would dispel a ton of myths around what MoEs are and how they work
                                                                • lying4fun 1 year ago
                                                                  amazing visualisation
                                                                • tylerneylon 1 year ago
                                                                  Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.

                                                                  One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

                                                                  https://learnandburn.ai/p/how-to-build-a-10m-token-context

                                                                  (I edited that article.)
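
                                                                    A quick back-of-the-envelope for that SxS matrix (assuming fp16 scores, a single head and a single layer; purely illustrative):

                                                                      # bytes = S * S * 2 for fp16 attention scores, per head and per layer
                                                                      for S in (2_048, 32_768, 1_000_000):
                                                                          gib = S * S * 2 / 2**30
                                                                          print(f"S = {S:>9,}: {gib:,.3f} GiB for a single attention matrix")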

                                                                  • danielhanchen 1 year ago
                                                                      Oh, with Flash Attention you never have to construct the (S, S) matrix at all (also in the article). Since it's softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output in tiles.

                                                                    In Unsloth, memory usage scales linearly (not quadratically) due to Flash Attention (+ you get 2x faster finetuning, 80% less VRAM use + 2x faster inference). Still O(N^2) FLOPs though.

                                                                    On that note, on long contexts, Unsloth's latest release fits 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K context on H100.
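
                                                                      A numpy sketch of the tiling idea (illustrative only, not the actual Flash Attention kernel or Unsloth's code): keep a running max, softmax denominator and numerator per query row, so only an (S, block) tile of scores ever exists at once, while the FLOPs stay O(S^2):

                                                                        import numpy as np

                                                                        def attention_tiled(Q, K, V, block=64):
                                                                            S, d = Q.shape
                                                                            m = np.full(S, -np.inf)            # running row-wise max of scores
                                                                            l = np.zeros(S)                    # running softmax denominator
                                                                            acc = np.zeros((S, V.shape[1]))    # running unnormalized numerator
                                                                            for s in range(0, S, block):
                                                                                Kb, Vb = K[s:s + block], V[s:s + block]
                                                                                scores = Q @ Kb.T / np.sqrt(d)          # only an (S, block) tile
                                                                                m_new = np.maximum(m, scores.max(axis=1))
                                                                                scale = np.exp(m - m_new)               # rescale what we have so far
                                                                                p = np.exp(scores - m_new[:, None])
                                                                                l = l * scale + p.sum(axis=1)
                                                                                acc = acc * scale[:, None] + p @ Vb
                                                                                m = m_new
                                                                            return acc / l[:, None]

                                                                        # sanity check against the naive dense version
                                                                        rng = np.random.default_rng(0)
                                                                        Q, K, V = rng.normal(size=(3, 256, 32))
                                                                        scores = Q @ K.T / np.sqrt(32)
                                                                        w = np.exp(scores - scores.max(axis=1, keepdims=True))
                                                                        dense = (w / w.sum(axis=1, keepdims=True)) @ V
                                                                        assert np.allclose(attention_tiled(Q, K, V), dense)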

                                                                    • rahimnathwani 1 year ago
                                                                      He lists Ring Attention and half a dozen other techniques, but they're not within the scope of this video: https://youtu.be/eMlx5fFNoYc?t=784
                                                                    • promiseofbeans 1 year ago
                                                                      His previous post 'But what is a GPT?' is also really good: https://www.3blue1brown.com/lessons/gpt
                                                                      • YossarianFrPrez 1 year ago
                                                                        This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, Attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product.)
                                                                        • hackinthebochs 1 year ago
                                                                          Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention, the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.
                                                                        • abotsis 1 year ago
                                                                          I think what made this so digestible for me were the animations. The timing, how they expand/contract and unfold while he’s speaking.. is all very well done.
                                                                        • nostrebored 1 year ago
                                                                          Working in a closely related space and this instantly became part of my team's onboarding docs.

                                                                          Worth noting that a lot of the visualization code is available in Github.

                                                                          https://github.com/3b1b/videos/tree/master/_2024/transformer...

                                                                          • sthatipamala 1 year ago
                                                                            Sounds interesting; what else is part of those onboarding docs?
                                                                          • bilsbie 1 year ago
                                                                            I finally understand this! Why did every other video make it so confusing!
                                                                            • chrishare 1 year ago
                                                                              It is confusing, 3b1b is just that good.
                                                                              • visarga 1 year ago
                                                                                At the same time it feels extremely simple

                                                                                      attention(Q, K, V) = softmax(Q K^T / √d_k) @ V

                                                                                      is just half a row; the multi-head, masking and positional stuff are just toppings

                                                                                      we have many basic algorithms in CS that are more involved; it's amazing we get language understanding from such simple math

                                                                                • diedyesterday 1 year ago
                                                                                  Do not be fooled by the simplicity; The magic itself is in the many Q, K and V matrices (each of which is huge) which are learned and depend on the language(s). This is just the form of the application of those matrices/transformations: Making the embedding for the last token of a context "attend to" (hence attention) all information (at all layers of meaning and not just syntactic or semantic meaning but logical, scientific, poetic, discoursal, etc. => multi-head attention) contained in the context so far.

                                                                                  Any complex function can be made to look simple in some representation (e.g its Fourier series or Taylor series, etc.).

                                                                                  • bilsbie 1 year ago
                                                                                    For me I never had too much trouble understanding the algorithm. But this is the first time I can see why it works.
                                                                                • ur-whale 1 year ago
                                                                                  > Why did every other video make it so confusing!

                                                                                    In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

                                                                                    I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.

                                                                                  • namaria 1 year ago
                                                                                    It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.
                                                                                  • thomasahle 1 year ago
                                                                                    I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

                                                                                    I'm curious what things other videos did worse compared to 3b1b?

                                                                                    • bilsbie 1 year ago
                                                                                      I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

                                                                                        I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.

                                                                                    • thinkingtoilet 1 year ago
                                                                                      Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.
                                                                                      • Al-Khwarizmi 1 year ago
                                                                                        Not sure if you mean it as rhetorical question but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

                                                                                        1. The standard terminology is "meh" at most. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

                                                                                        2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly compatible to write a hugely impactful, transformative paper describing an amazing breakthrough, but just don't explain it very well. And that's exactly what happened, IMO.

                                                                                          3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There is no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

                                                                                        Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

                                                                                        • cmplxconjugate 1 year ago
                                                                                          >This is probably an unpopular opinion.

                                                                                            I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

                                                                                          I think though your third point is the most important; right now people are focused on results.

                                                                                          • maleldil 1 year ago
                                                                                            > This is probably an unpopular opinion

                                                                                            There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

                                                                                            [1] https://jalammar.github.io/illustrated-transformer/

                                                                                            • Solvency 1 year ago
                                                                                              Because:

                                                                                              1. good communication requires an intelligence that most people sadly lack

                                                                                              2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

                                                                                              most documents are written by authors subconsciously desperate to mentally flex on their peers.

                                                                                              • penguin_booze 1 year ago
                                                                                                Pedagogy requires empathy, to know what it's like to not know something. They'll often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to the mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

                                                                                                Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this...") etc.

                                                                                                • WithinReason 1 year ago
                                                                                                  2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.
                                                                                              • shahbazac 1 year ago
                                                                                                Is there a reference which describes how the current architecture evolved? Perhaps from a very simple core idea to the famous “all you need” paper?

                                                                                                Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

                                                                                                Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

                                                                                                • HarHarVeryFunny 1 year ago
                                                                                                  Roughly:

                                                                                                  1) The initial seq-2-seq approach was using LSTMs - one to encode the input sequence, and one to decode the output sequence. It's amazing that this worked at all - encode a variable length sentence into a fixed size vector, then decode it back into another sequence, usually of different length (e.g. translate from one language to another).

                                                                                                  2) There are two weaknesses of this RNN/LSTM approach - the fixed size representation, and the corresponding lack of ability to determine which parts of the input sequence to use when generating specific parts of the output sequence. These deficiencies were addressed by Bahdanau et al in an architecture that combined encoder-decoder RNNs with an attention mechanism ("Bahdanau attention") that looked at each past state of the RNN, not just the final one.

                                                                                                  3) RNNs are inefficient to train, so Jakob Uszkoreit was motivated to come up with an approach that better utilized available massively parallel hardware, and noted that language is as much hierarchical as sequential, suggesting a layered architecture where at each layer the tokens of the sub-sequence would be processed in parallel, while retaining a Bahdanau-type attention mechanism where these tokens would attend to each other ("self-attention") to predict the next layer of the hierarchy. Apparently in initial implementation the idea worked, but not better than other contemporary approaches (incl. convolution), but then another team member, Noam Shazeer, took the idea and developed it, coming up with an architecture (which I've never seen described) that worked much better, which was then experimentally ablated to remove unnecessary components, resulting in the original transformer. I'm not sure who came up with the specific key-based form of attention in this final architecture.

                                                                                                  4) The original transformer, as described in the "attention is all you need paper", still had a separate encoder and decoder, copying earlier RNN based approaches, and this was used in some early models such as Google's BERT, but this is unnecessary for language models, and OpenAI's GPT just used the decoder component, which is what everyone uses today. With this decoder-only transformer architecture the input sentence is input into the bottom layer of the transformer, and transformed one step at a time as it passes through each subsequent layer, before emerging at the top. The input sequence has an end-of-sequence token appended to it, which is what gets transformed into the next-token (last token) of the output sequence.

                                                                                                  • krat0sprakhar 1 year ago
                                                                                                    Thank you for this summary! Very well explained. Any tips on what resources you use to keep updated on this field?
                                                                                                    • HarHarVeryFunny 1 year ago
                                                                                                      Thanks. Mostly just Twitter, following all the companies & researchers for any new announcements, then reading any interesting papers mentioned/linked. I also subscribe to YouTube channels like Dwarkesh Patel (interviewer) and Yannic Kilcher (AI News), and search out YouTube interviews with the principals. Of course I also read any AI news here on HN, and sometimes there may be interesting information in the comments.

                                                                                                      There's a summary of social media AI news here, that sometimes surfaces something interesting.

                                                                                                      https://buttondown.email/ainews/archive/

                                                                                                  • ollin 1 year ago
                                                                                                    karpathy gave a good high-level history of the transformer architecture in this Stanford lecture https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618
                                                                                                  • namelosw 1 year ago
                                                                                                    You might also want to check out other 3b1b videos on neural networks since there are sort of progressions between each video https://www.3blue1brown.com/topics/neural-networks
                                                                                                    • jiggawatts 1 year ago
                                                                                                      It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

                                                                                                      I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

                                                                                                      • chrishare 1 year ago
                                                                                                        His content and impact is phenomenal
                                                                                                      • mastazi 1 year ago
                                                                                                        That example with the "was" token at the end of a murder novel is genius (at 3:58 - 4:28 in the video) really easy for a non technical person to understand.
                                                                                                        • hamburga 1 year ago
                                                                                                          I think Ilya gets credit for that example — I’ve heard him use it in his interview with Jensen Huang.
                                                                                                        • justanotherjoe 1 year ago
                                                                                                          It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months... etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me'. Of course the real positional encoding is much smoother and doesn't have abrupt ends like dates/times, but I think this was the original motivation for positional encodings.
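
                                                                                                          For reference, the sinusoidal encoding from the original paper is pretty much that picture of nested 'wavelengths' (a sketch; note that GPT-style models often learn their position embeddings instead):

                                                                                                            import numpy as np

                                                                                                            def sinusoidal_pe(seq_len, d_model):
                                                                                                                # Each sin/cos pair uses a different wavelength, short ("minutes")
                                                                                                                # to long ("months"), so every position gets a unique pattern.
                                                                                                                pos = np.arange(seq_len)[:, None]
                                                                                                                i = np.arange(0, d_model, 2)[None, :]
                                                                                                                angles = pos / (10000 ** (i / d_model))
                                                                                                                pe = np.zeros((seq_len, d_model))
                                                                                                                pe[:, 0::2] = np.sin(angles)
                                                                                                                pe[:, 1::2] = np.cos(angles)
                                                                                                                return pe

                                                                                                            print(sinusoidal_pe(8, 16).shape)   # (8, 16), added to the token embeddings
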
                                                                                                          • nerdponx 1 year ago
                                                                                                            That's one way to think about it.

                                                                                                            It's a clever way to encode "position in sequence" as some kind of smooth signal that can be added to each input vector. You might appreciate this detailed explanation: https://towardsdatascience.com/master-positional-encoding-pa...

                                                                                                            Incidentally, you can encode dates (e.g. day of week) in a model as sin(day of week) and cos(day of week) to ensure that "day 7" is mathematically adjacent to "day 1".
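
                                                                                                                Concretely, that trick looks something like this (a hypothetical snippet, not taken from the linked article):

                                                                                                                  import numpy as np

                                                                                                                  def encode_day_of_week(day):            # day in 1..7
                                                                                                                      angle = 2 * np.pi * (day - 1) / 7
                                                                                                                      return np.sin(angle), np.cos(angle)

                                                                                                                  # Day 7 lands right next to day 1 on the circle instead of 6 "units" away.
                                                                                                                  print(encode_day_of_week(7), encode_day_of_week(1))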

                                                                                                          • bjornsing 1 year ago
                                                                                                            This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.
                                                                                                            • stillsut 1 year ago
                                                                                                              In training we learn a.) the embeddings and b.) the KQ/MLP-weights.

                                                                                                              How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce word soup of related concepts? Anything syntactically coherent?

                                                                                                                    Once a well-trained, high-dimensional representation of the tokens is established, can they learn the KQ/MLP weights significantly faster?
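
                                                                                                                    One way you could set up that experiment, sketched in PyTorch with toy sizes (nothing here comes from the video; it's only to make the question concrete):

                                                                                                                      import torch
                                                                                                                      import torch.nn as nn

                                                                                                                      emb = nn.Embedding(50_000, 512)          # pretend these are the well-trained embeddings
                                                                                                                      block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
                                                                                                                      unembed = nn.Linear(512, 50_000)

                                                                                                                      for p in emb.parameters():
                                                                                                                          p.requires_grad = False              # keep the embeddings fixed

                                                                                                                      # Only the randomly initialized attention/MLP (and unembedding) weights get trained.
                                                                                                                      trainable = list(block.parameters()) + list(unembed.parameters())
                                                                                                                      opt = torch.optim.AdamW(trainable, lr=3e-4)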

                                                                                                              • rollinDyno 1 year ago
                                                                                                                      Hold on, every predicted token is only a function of the previous token? I must have something wrong. This would mean the whole novel is packed into the embedding of "was", which has 12,288 dimensions in this example. Is it really possible that this space is so rich that a single point in it can encapsulate a whole novel?
                                                                                                                • jgehring 1 year ago
                                                                                                                  That's what happens in the very last layer. But at that point the embedding for "was" got enriched multiple times, i.e., in each attention pass, with information from the whole context (which is the whole novel here). So for the example, it would contain the information to predict, let's say, the first token of the first name of the murderer.

                                                                                                                  Expanding on that, you could imagine that the intent of the sentence to complete (figuring out the murderer) would have to be captured in the first attention passes so that other layers would then be able to integrate more and more context in order to extract that information from the whole context. Also, it means that the forward passes for previous tokens need to have extracted enough salient high-level information already since you don't re-compute all attention passes for all tokens for each next token to predict.
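
                                                                                                                        In shapes, that last step looks roughly like this (toy sizes; GPT-3 itself uses 12,288 dimensions and a ~50k vocabulary, and the names here are mine):

                                                                                                                          import numpy as np

                                                                                                                          seq_len, d_model, vocab = 16, 64, 1000
                                                                                                                          hidden = np.random.randn(seq_len, d_model)    # context-aware vectors after the final layer
                                                                                                                          W_unembed = np.random.randn(d_model, vocab)

                                                                                                                          logits = hidden[-1] @ W_unembed               # only the last position's vector is used
                                                                                                                          probs = np.exp(logits - logits.max())
                                                                                                                          probs /= probs.sum()                          # softmax -> next-token distribution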

                                                                                                                  • causal 1 year ago
                                                                                                                    > you don't re-compute all attention passes for all tokens for each next token to predict.

                                                                                                                    You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

                                                                                                                          Edit: Or maybe you just meant that you don't recompute attention for all n previous tokens each time a new token is predicted?
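
                                                                                                                          For reference, the caching idea under discussion looks roughly like this (single head, toy sizes, names are mine): earlier tokens' keys and values are kept around, and each new token only computes its own q/k/v and attends over the cached history.

                                                                                                                            import numpy as np

                                                                                                                            d = 64
                                                                                                                            W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
                                                                                                                            k_cache, v_cache = [], []

                                                                                                                            def decode_step(x):                       # x: embedding of the newest token, shape (d,)
                                                                                                                                q, k, v = x @ W_q, x @ W_k, x @ W_v
                                                                                                                                k_cache.append(k); v_cache.append(v)  # grow the cache instead of recomputing history
                                                                                                                                K, V = np.stack(k_cache), np.stack(v_cache)
                                                                                                                                scores = K @ q / np.sqrt(d)
                                                                                                                                w = np.exp(scores - scores.max()); w /= w.sum()
                                                                                                                                return w @ V                          # attention output for the new token only

                                                                                                                            out = decode_step(np.random.randn(d))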

                                                                                                                  • diedyesterday 1 year ago
                                                                                                                    > "Is it really possible that this space is so rich as to have a single point in it encapsulate a whole novel?"

                                                                                                                          Not with this GPT. The context size would not allow keeping attention to the total meaning of more than 2048 tokens (as reflected in the transformed embedding of that context's last token). For a substantial part of a novel, it would require a much larger context size, which would then presumably need a higher-dimensional embedding/semantic space.

                                                                                                                    • causal 1 year ago
                                                                                                                      I read this comment yesterday and keep thinking about it. That final token really must "comprehend" everything leading up to it, right? In which case longer context lengths are just trying to pack more meaning into that embedding state.

                                                                                                                      Which means the embedding model must do a lot of the lifting to be able to accurately represent meaning across long contexts so well. Now I want to know more about how those models are derived.

                                                                                                                      • vanjajaja1 1 year ago
                                                                                                                              At that point, what it has is not a representation of the input; it's a representation of what the next output could be. I.e., it's a lossy process, and you can't extract what came in the past, only the details relevant to next-word prediction.

                                                                                                                        (is my understanding)

                                                                                                                        • rollinDyno 1 year ago
                                                                                                                                If that point were a representation of only the next token, and predicted tokens were a function of only the preceding token, then the vector of the new token wouldn't carry the information needed to produce further tokens that keep the novel going.
                                                                                                                        • faramarz 1 year ago
                                                                                                                          it's not about a single point encapsulating a novel, but how sequences of such embeddings can represent complex ideas when processed by the model's layers.

                                                                                                                          each prediction is based on a weighted context of all previous tokens, not just the immediately preceding one.
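
                                                                                                                                A minimal single-head sketch of that weighted context (toy sizes, no batching; the causal mask is what stops a token from looking ahead):

                                                                                                                                  import numpy as np

                                                                                                                                  T, d = 8, 16
                                                                                                                                  X = np.random.randn(T, d)                      # token embeddings
                                                                                                                                  Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
                                                                                                                                  Q, K, V = X @ Wq, X @ Wk, X @ Wv

                                                                                                                                  scores = Q @ K.T / np.sqrt(d)                  # how much each token attends to each other token
                                                                                                                                  scores[np.triu_indices(T, k=1)] = -np.inf      # causal mask: no attending to later tokens
                                                                                                                                  w = np.exp(scores - scores.max(axis=-1, keepdims=True))
                                                                                                                                  w /= w.sum(axis=-1, keepdims=True)
                                                                                                                                  context = w @ V                                # each row mixes all earlier tokens' values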

                                                                                                                          • rollinDyno 1 year ago
                                                                                                                                  That weighted context is the 12,288-dimensional vector, no?

                                                                                                                                  I suppose that when each element in the vector takes 16 bits, the space is immense and capable of holding a whole novel in a single point.
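
                                                                                                                                  Back-of-the-envelope version (raw bit capacity only; what the model can meaningfully use is of course far smaller):

                                                                                                                                    dims, bits_per_dim = 12_288, 16
                                                                                                                                    print(dims * bits_per_dim)      # 196,608 bits per vector, so ~2**196608 nominal states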

                                                                                                                            • faramarz 1 year ago
                                                                                                                              GPT-4 is configurable up to 96 layers, each running their own embeddings. I think it was a business choice to afford the compute while they scale.
                                                                                                                              • causal 1 year ago
                                                                                                                                But if I understand correctly, GPT-4 reduces that to a 1536-dimensional vector. Roughly 1/8th. It's counterintuitive to me.
                                                                                                                            • evolvingstuff 1 year ago
                                                                                                                              You are correct, that is an error in an otherwise great video. The k+1 token is not merely a function of the kth vector, but rather all prior vectors (combined using attention). There is nothing "special" about the kth vector.
                                                                                                                            • kordlessagain 1 year ago
                                                                                                                              What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.
                                                                                                                              • thomasahle 1 year ago
                                                                                                                                I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!
                                                                                                                                • imjonse 1 year ago
                                                                                                                                      It is the first time I hear about the Value matrix being low rank, so for me this was the confusing part. Codebases I have seen also have value + output matrices, so it is clearer that Q, K, V are similar sizes and there's a separate projection matrix that adapts to the dimensions of the next network layer. UPDATE: He mentions this in the last sections of the video.
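
                                                                                                                                      For what it's worth, the two framings are just a matrix factorization apart. An illustrative single-head sketch (toy shapes; the names are mine):

                                                                                                                                        import numpy as np

                                                                                                                                        d_model, d_head = 512, 64
                                                                                                                                        W_v = np.random.randn(d_model, d_head)     # per-head "value down" projection
                                                                                                                                        W_o = np.random.randn(d_head, d_model)     # that head's slice of the output projection

                                                                                                                                        W_value_full = W_v @ W_o                   # one (d_model, d_model) map of rank <= d_head
                                                                                                                                        x = np.random.randn(d_model)
                                                                                                                                        assert np.allclose((x @ W_v) @ W_o, x @ W_value_full)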
                                                                                                                                • cs702 1 year ago
                                                                                                                                  Fantastic work by Grant Sanderson, as usual.

                                                                                                                                  Attention has won.[a]

                                                                                                                                  It deserves to be more widely understood.

                                                                                                                                  ---

                                                                                                                                  [a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

                                                                                                                                  • mehulashah 1 year ago
                                                                                                                                        This is one of the best explanations that I've seen on the topic. I wish there were more work, however, not on how Transformers work but on why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.
                                                                                                                                    • spacecadet 1 year ago
                                                                                                                                      Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.
                                                                                                                                    • kjhenner 1 year ago
                                                                                                                                      The first time I really dug into transformers (back in the BERT days) I was working on a MS thesis involving link prediction in a graph of citations among academic documents. So I had graphs on the brain.

                                                                                                                                          I have a spatial intuition for transformers as a sort of analog to a message-passing network over a "leaky graph" in an embedding space. If each token is a node, its key vector sets the position of an outlet pipe through which it spews value out to diffuse into the embedding space, while its query vector sets the position of an intake pipe that sucks up the value other tokens have pumped into the same space. Then we repeat over multiple attention layers, meaning we have these higher-order semantic flows through the space.

                                                                                                                                      Seems to make a lot of sense to me, but I don't think I've seen this analogy anywhere else. I'm curious if anybody else thinks of transformers in this way. (Or wants to explain how wrong/insane I am?)
