AI Research Recap: OK Transformers, You Have My Attention
A simplified summary of a stepping stone towards ChatGPT
In 2017, Ashish Vaswani (and 7 other researchers, mostly at Google) published a research paper at the 31st Conference on Neural Information Processing Systems titled, very simply, “Attention Is All You Need”. Despite this quaint title, their 15-page paper described a new architecture for language models that revolutionized the state of the art at the time and kickstarted the future of AI, leading to the GPT-based large language models we have today. Since its publication, “Attention Is All You Need” has been cited in over 120,000 other works. I recently sat down with this paper (assisted heavily by the paper’s great-great-grandchild, ChatGPT) and tried to understand what made it so revolutionary.
The paper can be found here for anyone wanting to follow along at home. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Note: I’m keeping it pretty high level so dig into the details at your own peril!
Authors Corner: Ashish Vaswani
When cracking open this paper, the first thing I wondered was who was behind it. The first thing you’ll notice is that the authors are almost all Google researchers. Ashish Vaswani is the first name listed, which would normally (as I understand it) mean he was one of the largest contributors. However, an asterisk on each author and a footnote on page one tell otherwise: “∗Equal contribution. Listing order is random.” As luck would have it, Ashish Vaswani won the internal Google-Brain-rock-paper-scissors and now has the honor of being the name before the “et al.” in more than 120,000 papers across the Computer Science world.
Now CEO at Essential AI (https://www.linkedin.com/in/ashish-vaswani-99892181/), Vaswani has had a productive career in AI. In May 2023 he entered the hallowed halls of the wikipediable (https://en.wikipedia.org/wiki/Ashish_Vaswani), one of only two of the authors to reach that pinnacle. (The other being Aidan Gomez, https://en.wikipedia.org/wiki/Aidan_Gomez). Following his studies at the Birla Institute of Technology, Mesra, Vaswani earned his PhD at the University of Southern California.
Other notable author notes:
For any blockchain fans out there, Illia Polosukhin went on to found NEAR protocol: https://www.linkedin.com/in/illia-polosukhin-77b6538, https://twitter.com/ilblackdragon
Background Info By Danny
First, some background info to help understand what makes this paper so groundbreaking. Keep in mind this is provided with apologies by yours truly and is just intended to help understand how language models work. This also relies heavily on how much I retained from Andrew Ng’s famous CS 229 Machine Learning course so be forewarned and check my understanding here (https://see.stanford.edu/Course/CS229, https://www.coursera.org/collections/machine-learning)
Text ⇒ Token Embeddings
At the end of the day all machine learning is just math and statistics. We start with things like images, text, or speech but the first step is to turn those very human-things into machine-things: numbers. This is where giant companies employ large banks of GPUs (driving up their price and keeping us from putting them in our gaming PCs) to “train” over huge datasets of all human knowledge and text. But we will handwave that away for now. The moral of the story is after a lot of training and math these machine learning models are left with a giant lookup table of word ⇒ number. We call these numbers “embeddings”.
Different words get different embeddings, and thanks to that training we handwaved away, the embeddings assigned to words often carry meaning about those words. Similar words get similar embeddings; dissimilar words end up farther apart.
(I’m representing these embeddings using just one number but in reality they are often made up of hundreds of numbers, with each number capturing even more information about that token)
So, when doing natural language processing you take the input text and you use this lookup table to turn it into numbers.
“Machine Learning is interesting and not nerdy” might become:
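As a toy sketch of that lookup (all the numbers here are invented for illustration; real models learn the table during training, and each entry would be hundreds of numbers, not one):

```python
# A made-up word -> embedding lookup table (one number per word here
# just to keep things readable; real embeddings have hundreds of dimensions).
embeddings = {
    "machine": 51.2, "learning": 50.9, "is": 3.1,
    "interesting": 22.7, "and": 3.3, "not": 4.0, "nerdy": 21.5,
}

sentence = "Machine Learning is interesting and not nerdy"
# Turn the text into numbers by looking up each word.
tokens = [embeddings[word.lower()] for word in sentence.split()]
print(tokens)  # [51.2, 50.9, 3.1, 22.7, 3.3, 4.0, 21.5]
```

Notice that “machine” and “learning” landed near each other, as did “interesting” and “nerdy” — that’s the kind of meaning a learned embedding table can carry.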
Neural Networks Need Structure Too
Once you have a numeric form for your text there are lots of interesting things you can do with it! If you have a machine learning model you can insert your string of numbers and get something out that is (hopefully) useful. Averaging all those words together can be useful for things like sentiment analysis. In some cases you could also use it for things like detecting if text was not safe for work. But for many tasks you need more structure. It isn’t only the words that matter but where they are in relation to each other. Plus you can end up having a lot more text (for example the truth about whether machine learning is interesting or nerdy might be more complicated).
Two solutions that proved effective for that are Recurrent Neural Networks and Convolutional Neural Networks. In Recurrent Neural Networks, each token is considered one at a time, in order. The network also has “memory”, so the output from processing each word is used as an input alongside the next word. In this way the structure of the sentence is baked into the network, and that structure can be used in both training and inference (inference is what it’s called when you use the model after it’s been trained).
Convolutional Neural Networks work similarly, but they use a grid-like pattern instead of just taking the words one-by-one. This works well for things like images.
In both cases though, they are limited in two ways:
They can only look back so far - If something several sentences ago impacts text much later on, their “memory” is not enough to understand that relationship.
Going one-by-one is slow and unscalable! - You need the output from some tokens in order to compute the next tokens, so you have to process everything in order, one by one. You can’t scale it up or parallelize it!
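To see why the one-by-one problem bites, here’s a minimal sketch of a recurrent loop (the update rule and numbers are made up; real RNNs use learned weight matrices):

```python
# A stand-in for an RNN's learned update: the new "memory" mixes the
# old memory with the current token. Real RNNs use weight matrices here.
def rnn_step(memory, token):
    return 0.5 * memory + token

tokens = [1.0, 2.0, 3.0]
memory = 0.0
for t in tokens:
    # Each step needs the PREVIOUS step's output, so the steps
    # cannot run in parallel across GPUs.
    memory = rnn_step(memory, t)
print(memory)  # 4.25
```

Each pass through the loop is stuck waiting on the one before it — that data dependency is exactly what makes recurrent models hard to scale.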
Maybe You Don’t Need All That Structure Nonsense
This brings us to a chief insight of the “Attention Is All You Need” paper. How can we build a neural network that can understand sentence structure across large pieces of text without needing to go through it one-by-one? It turns out you can, using a simple new neural network architecture they call a “Transformer”.

When reading this paper there is a lot of discussion of “Attention” and how it works in training and inference. Attention in this context means which parts of the input text the model focuses on (or focuses its “attention” on). How these neural nets implement Attention is highly technical (though maybe a future question for us to explore in another post), and attention had already been used in Recurrent Neural Nets and Convolutional Neural Nets in the past. The novel idea here is to give the model the flexibility to focus its attention wherever in the text it needs to, without going through the text word-by-word in order.
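For a flavor of the idea (this is a heavily simplified sketch, not the paper’s actual mechanism, which computes scores from learned “query”/“key”/“value” projections): attention boils down to scoring how relevant each input word is, then squashing those scores into weights that sum to 1 with a softmax. The scores below are invented for illustration.

```python
import math

def softmax(scores):
    # Turn arbitrary relevance scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy relevance scores: how much each input word matters while
# the model processes the word "it".
scores = {"the": 0.1, "animal": 3.0, "crossed": 0.2, "it": 1.0}
weights = softmax(list(scores.values()))
# Most of the "attention" lands on "animal", wherever it sits in the text.
print(dict(zip(scores, [round(w, 2) for w in weights])))
```

The key property: nothing here is sequential, so the weights for every word can be computed at the same time.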
Maybe the Position Was Inside the Tokens All Along
Maybe we can put the position in the tokens! Vaswani et al. came up with the idea of altering the input text embeddings to include information about their position within the actual embeddings. While other neural networks try to capture sentence structure through the way they process words one-by-one, Vaswani et al. add position information to the input tokens and process the entire input sequence all at once. This relies on the neural network to learn, during the training step, that position is encoded inside the input sequence as just another bit of information for it to train on.
For the more technically inclined, the way they did this was to take our embeddings (the groups of numbers from above) and simply add a “positional embedding” corresponding to each position in the sequence.
And just so that they could feel like they had earned their PhDs, the authors decided to use these functions to calculate those positional embeddings:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
This looks fancy, but it just means that if you are the third word in the input text you will always get the same “positional embedding” added to you, based on the above functions. The model is then able to learn that if it sees an input text embedding that has had PE(3) added to it, it means it’s looking at the third word in the input.
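For the curious, here’s a small Python sketch of the paper’s sinusoidal scheme. Even-numbered dimensions use sine, odd-numbered ones use cosine, and `d_model` is the embedding size (8 here just to keep things short; the paper used 512):

```python
import math

def positional_encoding(pos, d_model=8):
    # Build the positional embedding vector for one position,
    # following the paper's sin/cos formulas.
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

# The third word (position 3) gets the same vector added to its
# embedding no matter what sentence it appears in.
pe3 = positional_encoding(3)
```

Because the vector depends only on the position, the network can learn to read it back out and figure out where each token sat in the sentence.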
FWIW, they explain why they chose this function in Section 3.5 of the paper.
This solves both of our problems from before! With the position information encoded in the input embeddings themselves and all tokens considered at once, the neural net has the (degrees of) freedom to understand relationships between very distant parts of the text, AND since the whole token sequence is passed in at once, it’s perfect for distributing the computation and scaling up training across multiple machines.
“Attention Is All You Need” found that just 12 hours of training on 8 GPUs with their architecture produced a new state of the art in translating English to German (as well as English to French). According to a random website I just found, GPT-4 was trained on 25,000 Nvidia A100 GPUs for 90–100 days, so 12 hours seems pretty short to achieve the best English-to-German translation of its time! (Back in the archaic yesteryear of 2017.) Overall the Transformer network architecture proved simpler, faster, more flexible, and just better!
But Wait, What About…
Some future items that came out of this that I now want to look into further, and you may want to too if this interested you!
They say this led to ChatGPT? How does that work, and what is a GPT compared to a Transformer?
How does the input/tokenization work for things like Images (Sight) or Songs (Sound)? (And the natural logical progression of… how could you tokenize smells?) Do the positional encodings work for those as well?
WTF is actually going on in the Self-Attention section with “queries”, “keys” and “values”? (I glossed over this ’cause it’s beyond the scope)
Thanks for listening and please let me know if you see any significant errors that I should correct!