Ever since its November 2022 release, ChatGPT has taken the world by storm. Given the rapid rate of change in this space, it can be difficult to keep up with iterative developments—especially for those new to the field or out of touch with recent progress.
While there have been many important contributions that have made ChatGPT a reality, there are essentially three papers published in the last six years that have had an outsized impact on making LLMs exciting to the broader world.
So what can you expect to learn from reading these papers? We thought it would be interesting to show a high-level overview of each paper with the help of a word cloud.
1. 2017: Google publishes the "Transformer" paper
The "Transformer" paper, officially published as "Attention Is All You Need", made the following important contributions:
- Before the transformer paper, RNNs (Recurrent Neural Networks) and LSTMs were the state of the art. The primary drawbacks of these approaches were that they struggled to retain information across a long input sequence, and that they could not be parallelized because they had to process a sequence one element at a time, in order. Transformers addressed both of these drawbacks.
- Transformers are built entirely around attention, i.e. the ability of a model to pay attention to the most important parts of an input. Moreover, the transformer model does not require sequences of data to be processed in any fixed order, making it parallelizable and hence efficient to train.
- Transformers introduced an encoder and decoder architecture that utilizes positional encodings and multi-head attention.
- An "encoder" builds a contextual representation of the input text and a "decoder" generates the output for a specific task from it.
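To make attention and positional encodings a little more concrete, here is a minimal sketch in plain Python. It follows the scaled dot-product attention and sinusoidal encoding formulas from the paper, but the function names, the use of plain lists, and the toy input are assumptions made for illustration; a real model adds learned projections, multiple heads, and optimized tensor code.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (illustrative helper)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    # Each query is handled independently of the others, which is why
    # the computation can be parallelized across positions.
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    return [
        math.sin(position / 10000 ** (i / d_model))
        if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


# Toy "embeddings" for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

In this toy input, the first token gives more weight to itself and to the third token than to the second, because its vector overlaps with theirs. Since attention by itself ignores word order, the positional encoding is added to each token embedding to tell the model where in the sequence the token sits.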
2. 2019: Google publishes the "BERT" paper
The "BERT" paper, officially published as "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", made the following important contributions:
- Following up on the success of the transformer paper, Google next released the "BERT" model, which is based on the transformer architecture. The key difference compared to the original transformer is that a BERT model only utilizes "encoders" and does not use any "decoders".
- BERT models have proven to be very effective at several NLP tasks such as:
  - Sentiment Analysis: E.g. determining if a movie review is positive or negative.
  - Question Answering: E.g. finding the answer to a question within a given passage of text.
  - Text Prediction: E.g. auto-completing text when writing a message, email, or search query.
  - Text Generation: E.g. generating an article about a given topic based on a simple input query, although decoder-based models such as ChatGPT are better suited to this task.
  - Summarization: E.g. summarizing legal contracts, PDF documents, blog posts, etc.
  - Polysemy Resolution: E.g. differentiating words that have multiple meanings (like ‘bank’) based on the surrounding text.
- Prior to BERT, each of these tasks typically required a separate NLP model.
3. 2023: Meta publishes the "LLaMA" paper
The "LLaMA" paper, officially published as "LLaMA: Open and Efficient Foundation Language Models", made the following important contributions:
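- Released to the research community a family of models with up to 65 billion parameters, trained on trillions of tokens using only publicly available datasets. The paper reports that the 65-billion-parameter model is competitive with much larger closed models such as PaLM-540B. This is in stark contrast to OpenAI's ChatGPT-based models, which are closed-source.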
- The LLaMA paper helped advance open-source research into LLMs and has led to the release of several open-source assistant-style large language models such as GPT4All.
- The LLaMA model is based on the transformer architecture. However, unlike the BERT model, the LLaMA model only utilizes "decoders" and does not use any "encoders". This is also how ChatGPT works.
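One way to picture the practical difference between encoder-only and decoder-only models is the attention mask each one uses. The sketch below is an illustration under that assumption rather than code from either paper, and the helper names are hypothetical: an encoder such as BERT lets every token look at every other token, while a decoder such as LLaMA only lets a token look at itself and the tokens before it, which is what allows it to generate text one token at a time.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


print("Encoder (bidirectional):")
print(show(bidirectional_mask(4)))
print("Decoder (causal):")
print(show(causal_mask(4)))
```

Each row is a token and each column is a token it may attend to. The causal mask forms a lower triangle, so the model never sees the words it is about to predict.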
One interesting characteristic of all these models is that they are considered "foundational" and "general-purpose", as they are trained on large amounts of unlabeled data. Many interesting applications come from fine-tuning these models for specific tasks using a machine learning technique known as "transfer learning". See the wildly popular "Hugging Face" community for examples of such fine-tuned models.
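The idea behind transfer learning can be sketched in a few lines of plain Python. Everything here is a toy assumption: `pretrained_features` is a hypothetical stand-in for a frozen pretrained model (a real one would return learned embeddings, not word counts), and the four labeled reviews are made up. The point is only the shape of the workflow, in which the general-purpose part stays fixed and a small task-specific "head" is trained on a handful of labeled examples. In practice, fine-tuning often updates some or all of the pretrained weights as well.

```python
import math

POSITIVE = {"great", "good", "loved"}
NEGATIVE = {"bad", "awful", "boring"}


def pretrained_features(text):
    """Hypothetical stand-in for a frozen pretrained model (never updated)."""
    words = text.lower().split()
    return [
        float(sum(w in POSITIVE for w in words)),
        float(sum(w in NEGATIVE for w in words)),
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(head, text):
    """Probability that `text` is a positive review, given a trained head."""
    weights, bias = head
    x = pretrained_features(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_head(examples, epochs=200, lr=0.5):
    """Train only the small task-specific head with gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_features(text)
            error = predict((weights, bias), text) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# A tiny made-up labeled dataset: 1 = positive review, 0 = negative review.
reviews = [
    ("loved it great acting", 1),
    ("good fun", 1),
    ("boring and awful", 0),
    ("bad plot", 0),
]
head = train_head(reviews)
print(predict(head, "a great film"))   # above 0.5: classified as positive
print(predict(head, "an awful film"))  # below 0.5: classified as negative
```

Because the feature extractor is reused as-is, only three numbers (two weights and a bias) have to be learned here, which is why fine-tuning can work with far less labeled data than training a model from scratch.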