GPT-1 by OpenAI
In June 2018, OpenAI published "Improving Language Understanding by Generative Pre-Training," introducing GPT-1 (Generative Pre-trained Transformer). With 117 million parameters -- modest by later standards -- GPT-1 demonstrated that a simple two-stage recipe of unsupervised pre-training on a large text corpus followed by supervised fine-tuning could achieve strong results across a range of natural language tasks. In doing so, it set the stage for the scaling revolution that followed.
The Approach
GPT-1's method was straightforward. First, the model was pre-trained on a large corpus of text (the BookCorpus, containing about 7,000 unpublished books) using a standard language modeling objective: predict the next token in a sequence. This unsupervised pre-training phase taught the model general linguistic knowledge along with facts, reasoning patterns, and writing styles. Second, the model was fine-tuned on specific tasks such as text classification, natural language inference, and question answering using labeled datasets.
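The pre-training objective can be sketched numerically. The snippet below is a minimal illustration, not code from the paper: `lm_loss` is a hypothetical helper that computes the average next-token cross-entropy -- the quantity a language model minimizes during pre-training -- from per-position logits over a toy three-token vocabulary.

```python
import math

def lm_loss(logits, targets):
    """Average next-token cross-entropy (the language modeling objective).

    logits:  one list of vocabulary scores per sequence position
    targets: the correct next-token id at each position
    """
    total = 0.0
    for scores, t in zip(logits, targets):
        # log-sum-exp for a numerically stable softmax normalizer
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # negative log-likelihood of the target token
        total += log_z - scores[t]
    return total / len(targets)

# Toy example: 3-token vocabulary, two positions in the sequence.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
targets = [0, 1]
loss = lm_loss(logits, targets)
```

Fine-tuning then reuses the pre-trained weights and optimizes a task-specific loss (optionally keeping this language modeling loss as an auxiliary term, as the paper does).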
The Architecture
GPT-1 used a 12-layer Transformer decoder, following the architecture from "Attention Is All You Need" but using only the decoder portion. This meant the model was autoregressive -- it generated text one token at a time, each time attending to all previously generated tokens. The choice of a decoder-only architecture, in contrast to BERT's encoder-only approach, would prove significant: it made GPT naturally suited for text generation, not just understanding.
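The autoregressive behavior described above can be sketched in a few lines. This is an illustrative toy, not OpenAI's implementation: `causal_mask` shows the attention constraint a decoder-only Transformer enforces, and `generate` shows the decoding loop, with a stand-in callable in place of a real model.

```python
def causal_mask(n):
    """Lower-triangular attention mask: position i may attend only to
    positions j <= i (itself and earlier tokens), never to the future."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def generate(model, prompt, n_new):
    """Autoregressive decoding: each new token is predicted from the
    full prefix, then appended so later steps can attend to it."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))  # model conditions on all prior tokens
    return tokens

# Toy stand-in "model": predicts last token id + 1.
out = generate(lambda ts: ts[-1] + 1, [0], 3)
```

The mask is what distinguishes a decoder from BERT-style bidirectional attention: with it, generation is just this loop, one token at a time.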
Why It Mattered
GPT-1 was not the first transfer learning approach for NLP -- ELMo and ULMFiT had shown promising results earlier in 2018. However, GPT-1 demonstrated that the Transformer architecture was particularly well-suited for this paradigm and that generative pre-training could compete with or exceed discriminative approaches. The results improved the state of the art on 9 of 12 evaluated benchmarks.
The OpenAI Context
OpenAI, founded in 2015 with $1 billion in pledged funding, had been exploring various approaches to AI. GPT-1 represented a strategic bet on scaling language models that would define the organization's trajectory. The paper's lead author, Alec Radford, along with colleagues Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, laid the foundation for what would become one of the most impactful research directions in AI history.
The Scaling Hypothesis
Perhaps the most important aspect of GPT-1 was not the model itself but what it suggested about scaling. The paper showed that pre-training quality improved with more data and more parameters, and that this improvement transferred to downstream tasks. This observation -- that bigger models trained on more data would be better -- became the driving principle behind GPT-2, GPT-3, and GPT-4. OpenAI recognized early that scaling would be the key to capability gains.
Comparing GPT-1 and BERT
GPT-1 and BERT were published within months of each other in 2018, and they represented complementary approaches. BERT excelled at understanding tasks (classification, question answering) thanks to its bidirectional training. GPT excelled at generation tasks thanks to its autoregressive nature. Initially, BERT received more attention due to its stronger benchmark results. But the GPT line's ability to generate coherent text would prove to be the more commercially significant capability.
Lasting Impact
GPT-1 proved the viability of generative pre-training with Transformers and established the foundation for the scaling revolution. It set OpenAI on the trajectory that would lead to GPT-2, GPT-3, GPT-4, and ChatGPT.