The year is 2018. Avengers: Infinity War recently tore up the box office, the Spice Girls just announced they were getting back together, and one year ago the transformative neural network research paper “Attention Is All You Need” was published, marking a giant step forward in the state of the art with a new “Transformer” architecture built on self-attention layers and position-aware token embeddings. You are Alec Radford, a young researcher at small upstart OpenAI (founded in 2015 with $1 billion in pledged funding from Elon Musk, Y Combinator president Sam Altman, and several other high-profile CTOs and researchers), and you are looking to make a name for yourself in the AI world. You may be younger and less experienced than your peers, but you sit down, stare down that inexperience, and ask yourself: so you can train Transformer-based neural networks to be good at translating English to German, big whoop. Can we instead make these good at……everything? It turns out that Transformer-architecture neural networks can indeed be generalized to a wide variety of tasks with just a small amount of fine-tuning, via a “Generative Pre-training Transformer” (cough*GPT*cough), as Radford et al. showed in their 2018 paper “Improving Language Understanding by Generative Pre-Training”.
Author’s Corner: Alec Radford
Alec Radford is a bit unique among AI researchers in that he does not have a PhD or a master’s degree. However, he has shown you hardly need one: he has been instrumental at OpenAI and has been cited over 135,000 times across more than 40 papers on which he was at least a coauthor. In addition to Radford, three other coauthors contributed, most notably Ilya Sutskever, who until very recently was chief scientist at OpenAI. In May 2024 Ilya left OpenAI after some turmoil over AI safety (https://www.theverge.com/2024/5/14/24156920/openai-chief-scientist-ilya-sutskever-leaves). Who watches the watchmen? As of May 2024, no longer Ilya Sutskever.
Background
As we found out in “Attention is All You Need”, Transformer-based neural networks are very good at specific tasks after being trained on lots of samples of that task. In that paper the task was specifically translating English to German. Traditionally, when you are building a language model to do something, you show a neural network lots and lots of examples of that thing and it learns how to do it very well, with little ability to do anything else. But researchers have long been searching for a more “general” intelligence (https://en.wikipedia.org/wiki/Artificial_general_intelligence). After all, human brains can apply knowledge and experience across domains, so why can’t language models?
Remember that when you are training a neural network model, you are really just adjusting millions (or billions, or these days trillions) of “weights” (numbers) based on the training data. As you show it more examples, it gets better at predicting that training data. But what if, with Transformer models, it’s not just learning what you are showing it? What if it is also baking into those numbers a more general understanding of language, of relationships between words and concepts, dare I say…of Humanity?
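To make “adjusting weights” concrete, here’s a toy sketch of my own (nothing to do with the paper): the “model” is literally just a handful of numbers, and each training step nudges those numbers so the prediction on a training example gets a little less wrong. Scale the handful up to hundreds of millions of weights and you have the spirit of the thing.

```python
# Toy sketch, not from the paper: a "model" is just a list of numbers ("weights"),
# and a training step nudges each number so the model's prediction gets closer
# to the training example's target (plain gradient descent on squared error).
weights = [0.1, -0.3, 0.7]  # imagine hundreds of millions of these instead of 3

def predict(features):
    # The model's output is just a weighted sum of the input features.
    return sum(w * x for w, x in zip(weights, features))

def train_step(features, target, lr=0.01):
    error = predict(features) - target
    for i, x in enumerate(features):
        weights[i] -= lr * error * x  # nudge the weight to shrink the error

# One "example" shown to the model; real training repeats this billions of times.
train_step([1.0, 2.0, 3.0], target=1.5)
```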
Generative Pre-Training
This was the fundamental idea behind Radford’s paper: that just by showing a Transformer-based neural network large amounts of more or less any long strings of text, it might learn relationships in human language that help it on other tasks too. To this end they trained their network on the BooksCorpus dataset, which “contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.”
(For contrast, GPT-4 is estimated to have been trained on the equivalent of roughly 104,000,000 books.)
Note: They call the overall technique “semi-supervised”: during pre-training the model is simply made to predict the next word in the text, with the “weights” adjusted according to how well it predicts that word. It isn’t “fully supervised” like the English-to-German translation task because nobody hand-labeled a correct answer; the text itself provides the target. (This author’s opinion is that it’s pretty much fully supervised anyway, since the next word is the answer, but what do I know.)
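If you like code more than words, here is a minimal PyTorch sketch of that pre-training loop, assuming toy sizes and PyTorch’s stock Transformer layers rather than GPT-1’s actual architecture, tokenizer, or hyperparameters: feed in a stretch of text, ask the network to predict every next token, and push the error back into the weights.

```python
# Minimal sketch of generative pre-training (toy sizes, NOT the paper's real setup):
# read a stretch of text and train the network to predict every next token.
import torch
import torch.nn as nn

vocab_size, d_model, context = 10000, 256, 128  # toy numbers

class TinyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos_emb = nn.Embedding(context, d_model)      # position-aware embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)      # scores for the next token

    def forward(self, tokens):                             # tokens: (batch, seq)
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may only look at earlier positions.
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        return self.lm_head(self.blocks(x, mask=mask))     # (batch, seq, vocab)

model = TinyGPT()
tokens = torch.randint(0, vocab_size, (2, context))        # stand-in for BooksCorpus text
logits = model(tokens[:, :-1])                             # predict token i+1 from tokens up to i
loss = nn.functional.cross_entropy(                        # how well did it guess the next word?
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()                                            # adjust the weights accordingly
```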
They focused on 4 specific types of task for evaluation:
“natural language inference” - Given two sentences, determine whether the second follows from the first (entailment), contradicts it, or is neutral.
“Danny likes great music” ⇒ “Danny likes Post Malone”. That pair has a high “entailment” score (given that you accept that Post Malone makes great music). Other sentence pairs may instead score high on “contradiction” or “neutral”.
“question answering” - Literally just what ChatGPT does (here: read a passage and answer questions about it)
“semantic similarity” - Given two sentences, determine how close they are to conveying the same thing
“text classification” - Determine which group a piece of text fits best in (in this case, positive or negative sentiment)
For each of these they used existing datasets that have training examples and the correct answer for each example, so they could score how well their models did. (A sketch of how each task gets squeezed into a single input sequence for the model follows below.)
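Here is a rough sketch of the paper’s trick for reusing one pre-trained model across all four task types. The special-token spellings and helper names below are my own stand-ins, not the paper’s exact vocabulary: each task’s inputs get packed into a single token sequence with delimiter tokens, and a small linear layer on top of the transformer’s final hidden state produces the task-specific answer.

```python
# Hedged sketch of the paper's "input transformation" idea: every task becomes one
# token sequence so the same pre-trained transformer can handle it, with only a
# small linear head on top. Token spellings and names here are illustrative only.
import torch
import torch.nn as nn

START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # special tokens added around the text

def entailment_input(premise, hypothesis):
    # Natural language inference: premise and hypothesis joined by a delimiter.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(sent_a, sent_b):
    # Semantic similarity has no natural ordering, so both orderings get scored.
    return [f"{START} {sent_a} {DELIM} {sent_b} {EXTRACT}",
            f"{START} {sent_b} {DELIM} {sent_a} {EXTRACT}"]

class ClassifierHead(nn.Module):
    """Linear layer on top of the pre-trained transformer's final hidden state."""
    def __init__(self, d_model, num_classes):
        super().__init__()
        self.linear = nn.Linear(d_model, num_classes)

    def forward(self, hidden_states):           # (batch, seq, d_model) from the transformer
        last_token = hidden_states[:, -1, :]    # state at the final "extract" token
        return self.linear(last_token)          # e.g. entailment / contradiction / neutral

print(entailment_input("Danny likes great music", "Danny likes Post Malone"))
head = ClassifierHead(d_model=256, num_classes=3)
fake_hidden = torch.randn(1, 12, 256)           # pretend output of the pre-trained transformer
print(head(fake_hidden).shape)                  # torch.Size([1, 3])
```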
Fine-tuning
OKAY, you caught me. They didn’t only do the generative pre-training. (Hence the “pre”.) After the pre-training, they used each task’s training data to train the model further on that specific task. “Then what’s the freaking point? This is just my Dad’s neural networks with some fancy lipstick.” True, they still trained it on the task at hand, but they had a hypothesis that the pre-training might make it even better. If you teach a child algebra, does it make them better at their future job doing journalism? (My high school algebra teacher said that it does.) They wanted to test whether pre-training can teach the model fundamentals about the English language that pay dividends across multiple tasks. And guess what they found…
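For the curious, the fine-tuning objective itself is simple. A hedged sketch (the tensor names here are hypothetical placeholders, not the paper’s code): the model is trained on the supervised task loss, plus the original next-word-prediction loss thrown in as an auxiliary term with weight 0.5, which the paper reports helps generalization, at least on the larger datasets.

```python
# Hedged sketch of the fine-tuning loss: supervised task loss plus 0.5 times the
# original next-word-prediction loss as an auxiliary term. Names are placeholders.
import torch.nn.functional as F

def fine_tune_loss(task_logits, task_labels, lm_logits, next_tokens, lam=0.5):
    # Task loss: e.g. did the model pick entailment / contradiction / neutral correctly?
    task_loss = F.cross_entropy(task_logits, task_labels)
    # Auxiliary loss: keep predicting the next word, exactly as in pre-training.
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.shape[-1]),
                              next_tokens.reshape(-1))
    return task_loss + lam * lm_loss
```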
Results
Can you believe it??? IT WORKED! They were indeed able to show that just by showing the neural network lots and lots of text, it learned enough to be adept at a wide variety of tasks. They used “Generative” (generating the next word in a sequence of text) “Pre-Training” (training it ahead of time, before testing it on the various tasks) to create a model that could excel across the board. On 9 out of the 12 tasks they tested it against, their model achieved a new “state of the art”, outperforming models that were trained on those tasks explicitly.
One of the things that I find notable about this paper is that it’s not a particularly surprising idea (though of course hindsight == 20/20). They mostly just took the model from the “Attention is All You Need” paper, threw a bunch of text at it, and noticed that it does more than just English-to-German translation. To me it seems like a natural progression, but this simple idea had a large impact. I’ll note that one of the tasks it excelled at is ChatGPT’s bread and butter: “Question Answering” (cue the future of OpenAI and its $80 billion valuation). I wonder if Alec Radford realized that his paper would set OpenAI and the entire AI industry down a path towards far and away the closest thing to artificial human-level intelligence we have ever seen: bigger and better Generative Pre-training Transformer (GPT) models such as ChatGPT. The only question remaining is: is sufficient generative pre-training enough to achieve Artificial General Intelligence? OpenAI is certainly aiming to find out.