2018 · Model

BERT by Google

Google released BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model that achieved state-of-the-art results across eleven NLP benchmarks. BERT's bidirectional training approach allowed it to understand context from both directions in a sentence. It was integrated into Google Search the following year, improving understanding of about one in ten English queries.

In October 2018, Google AI researchers published "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," introducing a model that would fundamentally change natural language processing. BERT achieved state-of-the-art results on eleven different NLP benchmarks simultaneously, often by significant margins, demonstrating that a single pre-trained model could be adapted to a wide range of language tasks.

The Key Innovation

BERT's breakthrough was bidirectional pre-training. Previous language models, including GPT-1, were trained to predict the next word in a sequence -- reading left to right. BERT instead used a technique called Masked Language Modeling (MLM), where random words in a sentence were replaced with a special [MASK] token, and the model had to predict the original word based on context from both directions. This allowed BERT to develop a deeper understanding of language, since the meaning of a word often depends on what comes after it as well as what comes before.
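The masking recipe from the paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random word, 10% left unchanged) can be sketched in a few lines of Python. The tiny vocabulary and sentence here are invented for illustration; this shows only how training inputs are corrupted, not the model that predicts the masked words:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "ran", "the", "on", "mat"]  # toy vocabulary

def mask_tokens(tokens, rng, mask_prob=0.15):
    """BERT-style masking: select ~15% of tokens; of those, 80% become
    [MASK], 10% become a random word, 10% stay unchanged. Returns the
    corrupted sequence and the positions/words the model must predict."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            corrupted.append(tok)
    return corrupted, targets

rng = random.Random(0)
sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence, rng)
print(corrupted, targets)
```

Because prediction targets include unmasked and randomly replaced positions, the model cannot rely on the [MASK] token alone and must build a representation of every word in context.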

Pre-Training and Fine-Tuning

BERT introduced a powerful paradigm: pre-train a large model on massive amounts of unlabeled text, then fine-tune it on specific tasks with much smaller labeled datasets. The pre-training phase, conducted on the entirety of English Wikipedia and the BookCorpus (about 3.3 billion words), taught BERT general language understanding. Fine-tuning required only small modifications and far less data, making it accessible to researchers without massive computational resources.
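To make the two-phase split concrete, here is a deliberately tiny, self-contained sketch — not BERT and not its actual training code. "Pre-training" builds crude co-occurrence features from unlabeled sentences, and "fine-tuning" fits a nearest-centroid classifier on just two labeled examples, reusing those frozen features. All data, labels, and function names are invented for illustration:

```python
from collections import Counter

# Unlabeled "corpus" for the pre-training phase (illustrative only).
UNLABELED = [
    "the movie was great and fun",
    "great acting and a fun plot",
    "the film was dull and boring",
    "a boring plot and dull acting",
]

def pretrain(corpus):
    """'Pre-train' on unlabeled text: for each word, count which other
    words co-occur with it in a sentence. The count vector serves as a
    crude stand-in for a learned representation."""
    vectors = {}
    for sentence in corpus:
        words = sentence.split()
        for w in words:
            vectors.setdefault(w, Counter()).update(x for x in words if x != w)
    return vectors

def embed(sentence, vectors):
    """Represent a sentence as the sum of its words' co-occurrence vectors."""
    total = Counter()
    for w in sentence.split():
        total.update(vectors.get(w, Counter()))
    return total

def finetune(labeled, vectors):
    """'Fine-tune' on a handful of labeled examples: build one centroid
    per label from the frozen pre-trained features."""
    centroids = {}
    for sentence, label in labeled:
        centroids.setdefault(label, Counter()).update(embed(sentence, vectors))
    return centroids

def classify(sentence, centroids, vectors):
    feats = embed(sentence, vectors)
    def score(c):  # dot product between feature vectors
        return sum(feats[w] * c[w] for w in feats)
    return max(centroids, key=lambda label: score(centroids[label]))

vectors = pretrain(UNLABELED)  # expensive phase: lots of unlabeled text
centroids = finetune([("great fun", "pos"), ("dull boring", "neg")], vectors)
print(classify("the plot was fun", centroids, vectors))  # prints "pos"
```

The point of the sketch is the division of labor: the bulk of the knowledge comes from the unlabeled corpus, while the labeled task needs only two examples — the same economics that made BERT's fine-tuning accessible to small research groups.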

Technical Details

BERT came in two sizes: BERT-Base (110 million parameters, 12 layers) and BERT-Large (340 million parameters, 24 layers). Both used the Transformer encoder architecture from the "Attention Is All You Need" paper. In addition to masked language modeling, BERT was trained with a Next Sentence Prediction (NSP) task, where it learned to determine whether two sentences naturally follow each other -- a capability useful for tasks like question answering and natural language inference.
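The NSP training pairs are simple to construct: half the time keep the true next sentence (label IsNext), half the time substitute a random one (NotNext). In the paper the negative sentence is drawn from a different document; sampling from the same short list below is a simplification, and the sentences are invented:

```python
import random

DOC = [
    "bert reads text bidirectionally",
    "it predicts masked words",
    "fine tuning adapts it to tasks",
    "search queries benefit too",
]

def make_nsp_pairs(sentences, rng):
    """Build NSP-style training pairs from an ordered list of sentences:
    for each sentence, keep its true successor half the time (IsNext)
    and substitute a randomly chosen sentence otherwise (NotNext).
    Simplification: real BERT draws negatives from a different document."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

pairs = make_nsp_pairs(DOC, random.Random(0))
for first, second, label in pairs:
    print(label, "|", first, "->", second)
```

The model sees both sentences at once (separated by a special token in real BERT) and is trained to output the label, which is what pushes it to learn inter-sentence relationships useful for question answering and inference.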

The Benchmark Results

BERT's results were striking. It improved the state of the art on the GLUE language understanding benchmark by 7.7 percentage points. On the SQuAD question-answering dataset, it surpassed human-level performance. These were not incremental improvements -- they were generational leaps that made previous approaches obsolete virtually overnight.

Impact on Google Search

Google integrated BERT into its search engine in October 2019, applying it to improve understanding of search queries. Google reported that BERT helped with about 10 percent of English-language queries, particularly those involving prepositions and nuanced phrasing where context matters. For example, BERT could better understand that "2019 brazil traveler to usa need a visa" is about a Brazilian traveling to the US, not an American traveling to Brazil.

Broader Impact

BERT catalyzed an explosion of research in transfer learning for NLP. Dozens of BERT variants followed: RoBERTa (optimized training), ALBERT (parameter-efficient), DistilBERT (compressed), and domain-specific versions for biomedical text, legal documents, and code. The pre-train-then-fine-tune paradigm became the standard approach for NLP research and applications.

The Team

The paper was authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI Language. Devlin, the lead author, developed BERT during a relatively focused research effort, and the paper's clear writing and reproducible results contributed to its rapid adoption by the research community.

Key Figures

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Lasting Impact

BERT established the pre-train-then-fine-tune paradigm that became the standard approach for NLP, achieving dramatic improvements across virtually every language understanding benchmark. Its integration into Google Search directly improved the search experience for billions of users.

Related Events

2017 · Research
Transformer Architecture Paper

Google researchers published "Attention Is All You Need," introducing the Transformer architecture that replaced recurrence with self-attention mechanisms. Transformers enabled massively parallel training and captured long-range dependencies in text far more effectively than previous approaches. This paper became the foundation for virtually every major language model that followed.

2018 · Model
GPT-1 by OpenAI

OpenAI released GPT-1 (Generative Pre-trained Transformer), demonstrating that unsupervised pre-training on large text corpora followed by supervised fine-tuning could produce strong NLP results. With 117 million parameters, it was modest by later standards but proved the viability of the generative pre-training approach. GPT-1 set the stage for the scaling revolution that followed.

2019 · Model
GPT-2 Released

OpenAI initially withheld GPT-2, citing concerns that its 1.5 billion parameter model could be misused to generate convincing fake text at scale. The decision sparked widespread debate about responsible AI disclosure and the dual-use nature of powerful language models. GPT-2 was eventually released in stages, and its text generation quality surprised many researchers.