Ever since its November 2022 release, ChatGPT has taken the world by storm. Given the rapid rate of change in this space, it can be difficult to keep up with iterative developments—especially for those new to the field or out of touch with recent progress.

While there have been many important contributions that have made ChatGPT a reality, there are essentially three papers published in the last six years that have had an outsized impact on making LLMs exciting to the broader world.

LLMs Timeline
A chronological timeline of pivotal LLM research milestones, charting the evolution from Google's 2017 Transformer paper and BERT in 2018 to Meta's open-source LLaMA in 2023.

So what can you expect to learn from reading these papers? We thought it would be interesting to show a high-level overview of each paper with the help of a word cloud.

1. 2017: Google publishes the "Transformer" paper


Word Cloud of Google's Transformer Paper
A semantic word cloud extracted from Google's seminal 2017 'Attention Is All You Need' paper, highlighting the dominant concepts of self-attention, encoding, and translation.

The "Transformer" paper, officially published as "Attention Is All You Need", made the following important contributions:

  • Before the Transformer paper, recurrent neural networks (RNNs), including LSTMs, were the state of the art for sequence tasks. Their primary drawbacks were that they struggled to retain information across a long input sequence, and that they could not be parallelized because they process a sequence one step at a time, in order. Transformers addressed both of these drawbacks.
  • Transformers are built entirely around attention, i.e. the ability of a model to focus on the most relevant parts of an input. Attention mechanisms already existed, but the paper showed that a model could rely on them alone, without any recurrence. Because a transformer processes all positions of a sequence at once rather than step by step, it is parallelizable and hence efficient to train.
  • The Transformer uses an encoder and decoder architecture built from multi-head attention, with positional encodings that tell the model where each token sits in the sequence.
  • The "encoder" turns the input sequence into context-aware representations, and the "decoder" uses those representations to generate the output sequence (in the paper, a translation) one token at a time.
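
To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification and not the paper's full model: a real transformer first passes each token through learned projection matrices and runs several attention "heads" in parallel, both of which are left out here. The token vectors are made-up numbers and the function names are our own.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (illustrative values only). In self-attention the
# same sequence supplies the queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence. No row depends on the result of another row, which is why the computation can be parallelized.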

2. 2018: Google publishes the "BERT" paper


Word Cloud of Google's BERT Paper
A semantic word cloud from Google's 2018 BERT paper, illustrating the focus on bidirectional context, masking, and pre-training representations.

The "BERT" paper, officially published as "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", made the following important contributions:

  • Following the success of the Transformer paper, Google released BERT, a model based on the transformer architecture. The key difference from the original Transformer is that a BERT model only utilizes "encoders" and does not use any "decoders".

  • BERT models have proven to be very effective at several NLP tasks such as:

    • Sentiment Analysis: E.g. determine if a movie review is positive or negative.
    • Question Answering: E.g. find the passage in a document that answers an input question.
    • Text Prediction: E.g. fill in a missing word in a message, email, or search query.
    • Summarization: E.g. pick out the key sentences of legal contracts, PDF documents, blog posts etc.
    • Polysemy Resolution: E.g. differentiate words that have multiple meanings (like ‘bank’) based on the surrounding text.

  • Open-ended text generation, such as ChatGPT writing an article about a given topic from a simple input query, is a weaker fit for BERT, because producing text token by token is the job of a decoder.

  • Prior to BERT, each of these tasks typically required a separate, purpose-built NLP model. With BERT, a single pre-trained model can be fine-tuned for each task.

3. 2023: Meta publishes the "LLaMA" paper


Word Cloud of Meta's LLaMA Paper
A semantic word cloud from Meta's 2023 LLaMA paper, emphasizing parameter efficiency, open-source model optimization, and downstream benchmarks.

The "LLaMA" paper, officially published as "LLaMA: Open and Efficient Foundation Language Models", made the following important contributions:

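  • Released a family of models ranging from 7 billion to 65 billion parameters, trained using only publicly available datasets. The largest, the 65-billion-parameter model, was trained on 1.4 trillion tokens. The paper reports that LLaMA-13B outperforms GPT-3 (175 billion parameters) on most benchmarks and that LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Meta made the models available to the research community, in stark contrast to OpenAI's ChatGPT models, which are closed-source.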
  • The LLaMA paper helped advance open-source research into LLMs and has led to the release of several open-source assistant-style large language models such as GPT4All.
  • The LLaMA model is based on the transformer architecture. However, unlike the BERT model, the LLaMA model only utilizes "decoders" and does not use any "encoders". The GPT models behind ChatGPT are decoder-only as well.
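
One practical way to see the difference between encoder-only and decoder-only models is the attention mask, i.e. which positions each token is allowed to look at. The sketch below is our own illustration rather than code from either paper: a BERT-style encoder lets every token see the whole sequence, while a LLaMA-style or GPT-style decoder only lets a token see itself and the tokens before it, which is what makes left-to-right text generation possible. The function names are hypothetical.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] = True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def show(mask):
    """Render a mask as a grid: 'x' means visible, '.' means hidden."""
    return "\n".join(" ".join("x" if allowed else "." for allowed in row)
                     for row in mask)

encoder_mask = attention_mask(4, causal=False)  # BERT-style: bidirectional
decoder_mask = attention_mask(4, causal=True)   # LLaMA/GPT-style: no looking ahead

print("encoder-only (bidirectional):")
print(show(encoder_mask))
print("decoder-only (causal):")
print(show(decoder_mask))
```

In the causal grid, row i has no visible positions to the right of column i, so the model can never peek at the token it is being asked to predict.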

One interesting characteristic of all these models is that they are considered "foundation" or "general-purpose" models, because they are trained on large amounts of unlabeled text rather than for one particular task. Many interesting applications come from fine-tuning these models for specific tasks using a machine learning technique known as "transfer learning". See the wildly popular Hugging Face community for examples of such fine-tuned models.
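
As a rough illustration of transfer learning, the toy sketch below keeps a stand-in "pre-trained" model frozen and trains only a small task-specific layer (a "head") on top of it. Everything here is an assumption made for the example: the frozen weights are random numbers rather than a real foundation model, the task is invented, and in practice fine-tuning often updates some or all of the pre-trained weights as well.

```python
import math
import random

rng = random.Random(0)

# Stand-in for a pre-trained model: a fixed projection that is never updated.
# (Hypothetical toy weights; a real foundation model has billions of parameters.)
FROZEN_WEIGHTS = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]

def pretrained_features(x):
    """Map a raw 2-D input to 4 features in [-1, 1] using the frozen weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def predict(head, bias, feats):
    """Logistic-regression head: probability that the label is 1."""
    z = sum(h * f for h, f in zip(head, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(head, bias, data):
    """Average cross-entropy of the head over (features, label) pairs."""
    total = 0.0
    for feats, y in data:
        # Clamp to avoid log(0) if a prediction saturates.
        p = min(max(predict(head, bias, feats), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

# A small labeled dataset for the downstream task (toy rule: is x0 + x1 positive?).
raw = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(40)]
data = [(pretrained_features(x), 1 if x[0] + x[1] > 0 else 0) for x in raw]

head, bias = [0.0] * 4, 0.0
frozen_before = [row[:] for row in FROZEN_WEIGHTS]
loss_before = mean_loss(head, bias, data)

# Fine-tune: gradient descent updates only the head, never FROZEN_WEIGHTS.
for _ in range(100):
    grad_head, grad_bias = [0.0] * 4, 0.0
    for feats, y in data:
        err = predict(head, bias, feats) - y
        grad_head = [g + err * f / len(data) for g, f in zip(grad_head, feats)]
        grad_bias += err / len(data)
    head = [h - 0.5 * g for h, g in zip(head, grad_head)]
    bias -= 0.5 * grad_bias

loss_after = mean_loss(head, bias, data)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The point of the sketch is the division of labor: the expensive, general-purpose part is trained once and reused, while only a handful of task-specific parameters need a small labeled dataset.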