GPT-1 by OpenAI
In June 2018, OpenAI published "Improving Language Understanding by Generative Pre-Training," introducing GPT-1 (Generative Pre-trained Transformer). With 117 million parameters -- modest by later standards -- GPT-1 demonstrated that a simple two-stage recipe of unsupervised pre-training on a large text corpus followed by supervised fine-tuning could achieve strong results across a range of natural language tasks. In doing so, it set the stage for the scaling revolution that followed.
The Approach
GPT-1's method was straightforward. First, the model was pre-trained on a large corpus of text (the BookCorpus, containing about 7,000 unpublished books) using a standard language modeling objective: predict the next token in a sequence. This unsupervised pre-training phase taught the model general linguistic knowledge along with facts, reasoning patterns, and writing styles. Second, the model was fine-tuned on specific tasks such as text classification, natural language inference, and question answering using labeled datasets.
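The pre-training objective can be sketched numerically. The snippet below is a minimal illustration, not code from the paper: `lm_loss` is a hypothetical helper that computes the average next-token cross-entropy -- the quantity a language model minimizes during pre-training -- from per-position logits over a toy three-token vocabulary.

```python
import math

def lm_loss(logits, targets):
    """Average next-token cross-entropy (the language modeling objective).

    logits:  one list of vocabulary scores per sequence position
    targets: the correct next-token id at each position
    """
    total = 0.0
    for scores, t in zip(logits, targets):
        # log-sum-exp for a numerically stable softmax normalizer
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # negative log-likelihood of the target token
        total += log_z - scores[t]
    return total / len(targets)

# Toy example: 3-token vocabulary, two positions in the sequence.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
targets = [0, 1]
loss = lm_loss(logits, targets)
```

Fine-tuning then reuses the pre-trained weights and optimizes a task-specific loss (optionally keeping this language modeling loss as an auxiliary term, as the paper does).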
The Architecture
GPT-1 used a 12-layer Transformer decoder, following the architecture from "Attention Is All You Need" but using only the decoder portion. This meant the model was autoregressive -- it generated text one token at a time, each time attending to all previously generated tokens. The choice of a decoder-only architecture, in contrast to BERT's encoder-only approach, would prove significant: it made GPT naturally suited for text generation, not just understanding.
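The autoregressive behavior described above can be sketched in a few lines. This is an illustrative toy, not OpenAI's implementation: `causal_mask` shows the attention constraint a decoder-only Transformer enforces, and `generate` shows the decoding loop, with a stand-in callable in place of a real model.

```python
def causal_mask(n):
    """Lower-triangular attention mask: position i may attend only to
    positions j <= i (itself and earlier tokens), never to the future."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def generate(model, prompt, n_new):
    """Autoregressive decoding: each new token is predicted from the
    full prefix, then appended so later steps can attend to it."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))  # model conditions on all prior tokens
    return tokens

# Toy stand-in "model": predicts last token id + 1.
out = generate(lambda ts: ts[-1] + 1, [0], 3)
```

The mask is what distinguishes a decoder from BERT-style bidirectional attention: with it, generation is just this loop, one token at a time.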
Why It Mattered
GPT-1 was not the first transfer learning approach for NLP -- ELMo and ULMFiT had shown promising results earlier in 2018. However, GPT-1 demonstrated that the Transformer architecture was particularly well-suited for this paradigm and that generative pre-training could compete with or exceed discriminative approaches. The results improved the state of the art on 9 of 12 evaluated benchmarks.
The OpenAI Context
OpenAI, founded in 2015 with $1 billion in pledged funding, had been exploring various approaches to AI. GPT-1 represented a strategic bet on scaling language models that would define the organization's trajectory. The paper's lead author, Alec Radford, along with colleagues Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, laid the foundation for what would become one of the most impactful research directions in AI history.
The Scaling Hypothesis
Perhaps the most important aspect of GPT-1 was not the model itself but what it suggested about scaling. The paper showed that pre-training quality improved with more data and more parameters, and that this improvement transferred to downstream tasks. This observation -- that bigger models trained on more data would be better -- became the driving principle behind GPT-2, GPT-3, and GPT-4. OpenAI recognized early that scaling would be the key to capability gains.
Comparing GPT-1 and BERT
GPT-1 and BERT were published within months of each other in 2018, and they represented complementary approaches. BERT excelled at understanding tasks (classification, question answering) thanks to its bidirectional training. GPT excelled at generation tasks thanks to its autoregressive nature. Initially, BERT received more attention due to its stronger benchmark results. But the GPT line's ability to generate coherent text would prove to be the more commercially significant capability.
Lasting Impact
GPT-1 proved the viability of generative pre-training with Transformers and established the foundation for the scaling revolution. It set OpenAI on the trajectory that would lead to GPT-2, GPT-3, GPT-4, and ChatGPT.