Transformer in NLP

This list is also at https://github.com/cedrickchee/awesome-transformer-nlp

Transformer & Transfer Learning in NLP

This repository contains a hand-curated list of great machine (deep) learning resources for Natural Language Processing (NLP) with a focus on Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), attention mechanism, Transformer architectures/networks, ChatGPT, and transfer learning in NLP.

Transformer

Transformer (BERT) (Source)

Papers

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.

Uses smart caching to improve the learning of long-term dependencies in Transformer. Key results: state-of-the-art on 5 language modeling benchmarks, including a perplexity of 21.8 on One Billion Word (LM1B) and 0.99 bits per character on enwik8. The authors claim that the method is more flexible, faster during evaluation (1874 times speedup), generalizes well on small datasets, and is effective at modeling short and long sequences.

Conditional BERT Contextual Augmentation by Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han and Songlin Hu.
SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering by Chenguang Zhu, Michael Zeng and Xuedong Huang.
Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.
The Evolved Transformer by David R. So, Chen Liang and Quoc V. Le.

They used architecture search to improve the Transformer architecture. The key is to use evolution and seed the initial population with the Transformer itself. The resulting architecture is better and more efficient, especially for small models.

XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.

A new pretraining method for NLP that significantly improves upon BERT on 20 tasks (e.g., SQuAD, GLUE, RACE).
"Transformer-XL is a shifted model (each hyper-column ends with next token) while XLNet is a direct model (each hyper-column ends with contextual representation of same token)." — Thomas Wolf.
Comments from HN:
A clever dual masking-and-caching algorithm.
- This is NOT "just throwing more compute" at the problem.
- The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from all possible permutations of the factorization order of all other tokens in the same input sequence.
- In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.
- For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry".
- In another training step, the model might see "furry" first, then "The", then "cat".
- Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.
- The masking-and-caching algorithm that accomplishes this does not seem trivial to me.
- The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.
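
The four-token example in the comments above can be sketched in a few lines. This is only an illustration of the permutation idea: the function name `permutation_mask` is hypothetical, and XLNet's actual masking-and-caching machinery is not reproduced here. Given a sampled factorization order, the mask records which positions each token may attend to, while every token keeps its original position.

```python
def permutation_mask(order):
    """Return mask[i][j] == True when position i may attend to position j.

    `order` lists token positions in the sampled factorization order.
    Illustrative helper only, not XLNet's real implementation.
    """
    step_of = {position: step for step, position in enumerate(order)}
    size = len(order)
    return [[step_of[j] < step_of[i] for j in range(size)] for i in range(size)]


tokens = ["The", "cat", "is", "furry"]

# First example from the comments: see "The", "cat", "furry", then predict "is".
order = [0, 1, 3, 2]
mask = permutation_mask(order)

target = 2  # "is" keeps its original position in the sequence
context = [tokens[j] for j in range(len(tokens)) if mask[target][j]]
print(f"predict {tokens[target]!r} at position {target} from {context}")
```

Sampling a different `order` on each training step exposes the model, in expectation, to context from both sides of every token.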

CTRL: Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar, Richard Socher et al. [Code].
PLMpapers - BERT (Transformer, transfer learning) has catalyzed research in pretrained language models (PLMs) and has sparked many extensions. This repo contains a list of papers on PLMs.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Google Brain.

The group performs a systematic study of transfer learning for NLP using a unified Text-to-Text Transfer Transformer (T5) model and pushes the limits to achieve SoTA on the SuperGLUE (approaching human baseline), SQuAD, and CNN/DM benchmarks. [Code].

Reformer: The Efficient Transformer by Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya.

"They present techniques to reduce the time and memory complexity of Transformer, allowing batches of very long sequences (64K) to fit on one GPU. Should pave way for Transformer to be really impactful beyond NLP domain." — @hardmaru

Supervised Multimodal Bitransformers for Classifying Images and Text (MMBT) by Facebook AI.
A Primer in BERTology: What we know about how BERT works by Anna Rogers et al.

"Have you been drowning in BERT papers?". The group survey over 40 papers on BERT's linguistic knowledge, architecture tweaks, compression, multilinguality, and so on.

Key idea: the architecture uses a subset of parameters on every training step and on each example. Upside: the model trains much faster. Downside: a super large model that won't fit in a lot of environments.

An Attention Free Transformer by Apple.
A Survey of Transformers by Tianyang Lin et al.
Evaluating Large Language Models Trained on Code by OpenAI.

Codex, a GPT language model that powers GitHub Copilot.
They investigate their model limitations (and strengths).
They discuss the potential broader impacts of deploying powerful code generation techs, covering safety, security, and economics.

Training language models to follow instructions with human feedback by OpenAI. They call the resulting models InstructGPT. ChatGPT is a sibling model to InstructGPT.
LaMDA: Language Models for Dialog Applications by Google.
Training Compute-Optimal Large Language Models by Hoffmann et al. at DeepMind. TLDR: introduces a new 70B LM called "Chinchilla" that outperforms much bigger LMs (GPT-3, Gopher). DeepMind has found the secret to cheaply scale large language models — to be compute-optimal, model size and training data must be scaled equally. It shows that most LLMs are severely starved of data and under-trained. Given the new scaling law, even if you pump a quadrillion parameters into a model (GPT-4 urban myth), the gains will not compensate for 4x more training tokens.
Improving language models by retrieving from trillions of tokens by Borgeaud et al. at DeepMind - The group explores an alternate path for efficient training with Internet-scale retrieval. The method is known as RETRO, for "Retrieval Enhanced TRansfOrmers". With RETRO the model is not limited to the data seen during training: it has access to the entire training dataset through the retrieval mechanism. This results in significant performance gains compared to a standard Transformer with the same number of parameters. RETRO obtains comparable performance to GPT-3 on the Pile dataset, despite using 25 times fewer parameters. They show that language modeling improves continuously as they increase the size of the retrieval database. [blog post]
Scaling Instruction-Finetuned Language Models by Google - They explore instruction finetuning with a focus on scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought (CoT) data. They find that instruction finetuning along these lines dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks. Flan-PaLM 540B achieves SoTA performance on several benchmarks. They also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B.
Emergent Abilities of Large Language Models by Google Research, Stanford University, DeepMind, and UNC Chapel Hill.
Nonparametric Masked (NPM) Language Modeling by Meta AI et al. [code] - Nonparametric models with 500x fewer parameters outperform GPT-3 on zero-shot tasks.
It, crucially, does not have a softmax over a fixed output vocabulary, but instead has a fully nonparametric distribution over phrases. This is in contrast to a recent (2022) body of work that incorporates nonparametric components in a parametric model.
Results show that NPM is significantly more parameter-efficient, outperforming up to 500x larger parametric models and up to 37x larger retrieve-and-generate models.
Transformer models: an introduction and catalog by Xavier Amatriain, 2023 - The goal of this paper is to offer a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovation in Transformer models.
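
The "scaled equally" claim in the Chinchilla entry above can be made concrete with a little arithmetic. The sketch below rests on two assumptions that are not stated in this list: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D training tokens), and the compute-optimal ratio is taken as roughly 20 tokens per parameter, a commonly quoted rule of thumb. Under those assumptions, both N and D grow with the square root of the compute budget.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb optimal ratio


def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the assumptions above."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


base_params, base_tokens = compute_optimal(1e21)
big_params, big_tokens = compute_optimal(4e21)

# Quadrupling compute doubles both the model size and the token count.
print(f"params x{big_params / base_params:.2f}, tokens x{big_tokens / base_tokens:.2f}")
```

The constants are illustrative; the point is only that the budget is split evenly between model size and data, rather than spent mostly on parameters.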

Articles

BERT and Transformer

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing from Google AI.
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning).
Dissecting BERT by Miguel Romero and Francisco Ingham - Understand BERT in depth with an intuitive, straightforward explanation of the relevant concepts.
A Light Introduction to Transformer-XL.
Generalized Language Models by Lilian Weng, Research Scientist at OpenAI.
What is XLNet and why it outperforms BERT

Permutation Language Modeling objective is the core of XLNet.

DistilBERT (from HuggingFace), released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations paper from Google Research and Toyota Technological Institute. — Improvements for more efficient parameter usage: factorized embedding parameterization, cross-layer parameter sharing, and Sentence Order Prediction (SOP) loss to model inter-sentence coherence. [Blog post | Code]
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning - A BERT variant, like ALBERT, that costs less to train. They trained a model that outperforms GPT using only one GPU, and matched the performance of RoBERTa using 1/4 of the compute. It uses a new pre-training approach, called replaced token detection (RTD), that trains a bidirectional model while learning from all input positions. [Blog post | Code]
Visual Paper Summary: ALBERT (A Lite BERT)
Cramming: Training a Language Model on a Single GPU in One Day (paper) (2022) - While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? ... Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
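
The replaced token detection (RTD) objective from the ELECTRA entry above can be sketched as a labeling problem. This is a toy illustration, not ELECTRA's code: the `corrupt` function is a hypothetical stand-in that swaps tokens uniformly at random, whereas ELECTRA samples replacements from a small masked language model. The part that carries over is the labeling: every position gets a target, which is why the discriminator learns from all input positions.

```python
import random


def corrupt(tokens, vocab, replace_prob, rng):
    """Toy stand-in for the generator: randomly swap in tokens from vocab."""
    corrupted = []
    for token in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(token)
    return corrupted


def rtd_labels(original, corrupted):
    """1 where the token differs from the original, 0 where it matches."""
    return [int(o != c) for o, c in zip(original, corrupted)]


rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = corrupt(original, vocab, replace_prob=0.3, rng=rng)
labels = rtd_labels(original, corrupted)

for token, label in zip(corrupted, labels):
    print(f"{token:>8}  {'replaced' if label else 'original'}")
```

Note that a sampled token that happens to equal the original is labeled as original, since the comparison is made on the token itself.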

Attention Concept

The Annotated Transformer by Harvard NLP Group - Further reading to understand the "Attention is all you need" paper.
Attention? Attention! - Attention guide by Lilian Weng from OpenAI.
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) by Jay Alammar, an Instructor from Udacity ML Engineer Nanodegree.
Making Transformer networks simpler and more efficient - FAIR released an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient (reduce computation time and memory footprint).
What Does BERT Look At? An Analysis of BERT’s Attention paper by Stanford NLP Group.

Transformer Architecture

The Transformer blog post.
The Illustrated Transformer by Jay Alammar, an Instructor from Udacity ML Engineer Nanodegree.
Watch Łukasz Kaiser’s talk walking through the model and its details.
Transformer-XL: Unleashing the Potential of Attention Models by Google Brain.
Generative Modeling with Sparse Transformers by OpenAI - an algorithmic improvement of the attention mechanism to extract patterns from sequences 30x longer than possible previously.
Stabilizing Transformers for Reinforcement Learning paper by DeepMind and CMU - they propose architectural modifications to the original Transformer and its XL variant: moving layer-norm and adding gating yields the Gated Transformer-XL (GTrXL). It substantially improves the stability and learning speed (integrating experience through time) in RL.
The Transformer Family by Lilian Weng - since the paper "Attention Is All You Need", many new things have happened to improve the Transformer model. This post is about that.
DETR (DEtection TRansformer): End-to-End Object Detection with Transformers by FAIR - 🔥 Computer vision has not yet been swept up by the Transformer revolution. DETR completely changes the architecture compared with previous object detection systems. (PyTorch Code and pretrained models). "A solid swing at (non-autoregressive) end-to-end detection. Anchor boxes + Non-Max Suppression (NMS) is a mess. I was hoping detection would go end-to-end back in ~2013)" — Andrej Karpathy
Transformers for software engineers - This post will be helpful to software engineers who are interested in learning ML models, especially anyone interested in Transformer interpretability. The post walks through a (mostly) complete implementation of a GPT-style Transformer, but the goal is not running code; instead, the authors use the language of software engineering and programming to explain how these models work and articulate some of the perspectives they bring to them when doing interpretability work.
Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance - PaLM is a dense decoder-only Transformer model trained with the Pathways system, which enabled Google to efficiently train a single model across multiple TPU v4 Pods. The example explaining a joke is remarkable. This shows that it can generate explicit explanations for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding.
Efficient Long Sequence Modeling via State Space Augmented Transformer (paper) by Georgia Institute of Technology and Microsoft - The quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. They propose SPADE, short for State sPace AugmenteD TransformEr, which outperforms various baselines, including Mega, on the Long Range Arena benchmark and various LM tasks. This is an interesting direction. SSMs and Transformers were combined a while back.
DeepNet: Scaling Transformers to 1,000 Layers (paper) by Microsoft Research (2022) - The group introduced a new normalization function (DEEPNORM) to modify the residual connection in Transformer and showed that model updates can be bounded in a stable way. This improves the training stability of deep Transformers and scales the model depth by orders of magnitude (10x) compared to GPipe (pipeline parallelism) by Google Brain (2019). (who remembers what ResNet (2015) did to ConvNet?)
A Length-Extrapolatable Transformer (paper) by Microsoft (2022) [TorchScale code] - This improves modeling capability of scaling Transformers.
Hungry Hungry Hippos (H3): Towards Language Modeling with State Space Models (SSMs) (paper) by Stanford AI Lab (2022) - A new language modeling architecture. It scales nearly linearly with context size instead of quadratically. No more fixed context windows, long context for everyone. Despite that, SSMs are still slower than Transformers due to poor hardware utilization. So, a Transformer successor? [Tweet]
Accelerating Large Language Model Decoding with Speculative Sampling (paper) by DeepMind (2023) - The speculative sampling algorithm enables the generation of multiple tokens from each transformer call. It achieves a 2–2.5x decoding speedup with Chinchilla in a distributed setup, without compromising the sample quality or making modifications to the model itself.
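
The reason speculative sampling does not compromise sample quality is its accept/reject rule. The sketch below shows that rule for a single token over a toy three-token vocabulary. The function names and the distributions `p` (target model) and `q` (cheap draft model) are illustrative assumptions; a real system drafts several tokens per call and scores them with the large model in parallel, which is not shown here.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Draw a token from draft q, then accept it or resample so the result follows p."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True  # draft token accepted
    r = residual_distribution(p, q)
    return rng.choices(range(len(r)), weights=r)[0], False  # rejected, resampled


p = [0.5, 0.3, 0.2]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy draft distribution
rng = random.Random(0)

draws = [speculative_step(p, q, rng)[0] for _ in range(20000)]
frequencies = [draws.count(token) / len(draws) for token in range(len(p))]
print([round(f, 3) for f in frequencies])
```

The empirical frequencies should land close to `p` even though every proposal came from `q`. The speedup comes from how often the cheap draft is accepted.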

Generative Pre-Training Transformer (GPT)

Better Language Models and Their Implications.
Improving Language Understanding with Unsupervised Learning - this is an overview of the original OpenAI GPT model.
🦄 How to build a State-of-the-Art Conversational AI with Transfer Learning by Hugging Face.
The Illustrated GPT-2 (Visualizing Transformer Language Models) by Jay Alammar.
MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism by NVIDIA ADLR.
OpenGPT-2: We Replicated GPT-2 Because You Can Too - the authors trained a 1.5 billion parameter GPT-2 model on a similar sized text dataset and they reported results that can be compared with the original model.
MSBuild demo of an OpenAI generative text model generating Python code [video] - The model was trained on GitHub OSS repos. It uses English-language code comments or simply function signatures to generate entire Python functions. Cool!
GPT-3: Language Models are Few-Shot Learners (paper) by Tom B. Brown (OpenAI) et al. - "We train GPT-3, an autoregressive language model with 175 billion parameters 😱, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting."
elyase/awesome-gpt3 - A collection of demos and articles about the OpenAI GPT-3 API.
How GPT3 Works - Visualizations and Animations by Jay Alammar.
GPT-Neo - Replicate a GPT-3 sized model and open source it for free. GPT-Neo is "an implementation of model parallel GPT2 & GPT3-like models, with the ability to scale up to full GPT3 sizes (and possibly more!), using the mesh-tensorflow library." [Code].
GitHub Copilot, powered by OpenAI Codex - Codex is a descendant of GPT-3. Codex translates natural language into code.
GPT-4 Rumors From Silicon Valley - GPT-4 is almost ready. GPT-4 would be multimodal, accepting text, audio, image, and possibly video inputs. Release window: Dec - Feb. #hype
New GPT-3 model: text-davinci-003 - Improvements:

Handle more complex intents — you can get even more creative with how you make use of its capabilities now.
Higher quality writing — clearer, more engaging, and more compelling content.
Better at longer form content generation.

ChatGPT

What is ChatGPT?

TL;DR: ChatGPT is a conversational web interface, backed by OpenAI's newest language model fine-tuned from a model in the GPT-3.5 series (which finished training in early 2022), optimized for dialogue. It is trained using Reinforcement Learning from Human Feedback (RLHF); human AI trainers provide supervised fine-tuning by playing both sides of the conversation.

It's evidently better than GPT-3 at following user instructions and context. People have noticed ChatGPT's output quality seems to represent a notable improvement over previous GPT-3 models.

For more, please take a look at ChatGPT Universe. These are my fleeting notes on everything I understand about ChatGPT, along with a collection of interesting things about ChatGPT.

Large Language Model (LLM)

GPT-J-6B - Can't access GPT-3? Here's GPT-J — its open-source cousin.
Fun and Dystopia With AI-Based Code Generation Using GPT-J-6B - Prior to the GitHub Copilot tech preview launch, Max Woolf, a data scientist, tested GPT-J-6B's code "writing" abilities.
GPT-Code-Clippy (GPT-CC) - An open source version of GitHub Copilot. The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.
GPT-NeoX-20B - A 20 billion parameter model trained using EleutherAI’s GPT-NeoX framework. They expect it to perform well on many tasks. You can try out the model on GooseAI playground.
Metaseq - A codebase for working with Open Pre-trained Transformers (OPT).
YaLM 100B by Yandex is a GPT-like pretrained language model with 100B parameters for generating and processing text. It can be used freely by developers and researchers from all over the world.
BigScience's BLOOM-176B from the Hugging Face repository [paper, blog post] - BLOOM is a 176-billion parameter model for language processing, able to generate text much like GPT-3 and OPT-175B. It was developed to be multilingual, being deliberately trained on datasets containing 46 natural languages and 13 programming languages.
bitsandbytes-Int8 inference for Hugging Face models - You can run BLOOM-176B/OPT-175B easily on a single machine, without performance degradation. If true, this could be a game changer in enabling people outside of big tech companies to use these LLMs.
WeLM: A Well-Read Pre-trained Language Model for Chinese (paper) by WeChat. [online demo]
GLM-130B: An Open Bilingual (Chinese and English) Pre-Trained Model (code and paper) by Tsinghua University, China [article] - One of the major contributions is making LLMs cost affordable using int4 quantization so it can run in limited compute environments.
The resultant GLM-130B model offers significant outperformance over GPT-3 175B on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization, without quantization aware training and with almost no performance loss, making it the first among 100B-scale models. More importantly, the property allows its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most ever affordable GPUs required for using 100B-scale models.
Teaching Small Language Models to Reason (paper) - They finetune a student model on the chain of thought (CoT) outputs generated by a larger teacher model. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% when finetuned on PaLM-540B generated chains of thought.
ALERT: Adapting Language Models to Reasoning Tasks (paper) by Meta AI - They introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability, comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. It covers 10 different reasoning skills including logical, causal, common-sense, abductive, spatial, analogical, argument and deductive reasoning as well as textual entailment, and mathematics.
Evaluating Human-Language Model Interaction (paper) by Stanford University and Imperial College London - They find that better non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor (paper) by Meta AI [data] - Fine-tuning a T5 on a large dataset collected with virtually no human labor leads to a model that surpasses the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
OPT-IML (OPT + Instruction Meta-Learning) (paper) by Meta AI - OPT-IML is a set of instruction-tuned versions of OPT, trained on a collection of ~2000 NLP tasks, for research use cases. It boosts the performance of the original OPT-175B model using instruction tuning to improve zero-shot and few-shot generalization abilities, allowing it to adapt to more diverse language applications (e.g., answering questions, summarizing text). This improves the model's ability to process natural instruction-style prompts. Ultimately, humans should be able to "talk" to models as naturally and fluidly as possible. [code (available soon), weights released]
jeffhj/LM-reasoning - This repository contains a collection of papers and resources on reasoning in Large Language Models.
Rethinking with Retrieval: Faithful Large Language Model Inference (paper) by University of Pennsylvania et al., 2022 - They show the potential of enhancing LLMs by retrieving relevant external knowledge based on decomposed reasoning steps obtained through chain-of-thought (CoT) prompting. I predict we're going to see many of these types of retrieval-enhanced LLMs in 2023.
REPLUG: Retrieval-Augmented Black-Box Language Models (paper) by Meta AI et al., 2023 - TL;DR: Enhancing GPT-3 with world knowledge — a retrieval-augmented LM framework that combines a frozen LM with a frozen/tunable retriever. It improves GPT-3 in language modeling and downstream tasks by prepending retrieved documents to LM inputs. [Tweet]
Progressive Prompts: Continual Learning for Language Models (paper) by Meta AI et al., 2023 - Current LLMs have a hard time with catastrophic forgetting and leveraging past experiences. The approach learns a prompt for each new task and concatenates it with the frozen, previously learned prompts. This efficiently transfers knowledge to future tasks. [code]
Large Language Models Can Be Easily Distracted by Irrelevant Context (paper) by Google Research et al., 2023 - Adding the instruction "Feel free to ignore irrelevant information given in the questions." consistently improves robustness to irrelevant context.
Toolformer: Language Models Can Teach Themselves to Use Tools (paper) by Meta AI, 2023 - A smaller model trained to translate human intention into actions (i.e. decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction).
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation (paper) by Baidu et al., 2021 - ERNIE 3.0 Titan is the latest addition to Baidu's ERNIE (Enhanced Representation through kNowledge IntEgration) family. It's inspired by the masking strategy of Google's BERT. ERNIE is also a unified framework. They also proposed a controllable learning algorithm and a credible learning algorithm. They apply online distillation technique to compress their model. To their knowledge, it is the largest (260B parameters) Chinese dense pre-trained model so far. [article]
Characterizing Attribution and Fluency Tradeoffs for Retrieval-Augmented Large Language Models (paper) by Google Research, 2023 - Despite recent progress, it has been difficult to prevent semantic hallucinations in generative LLMs. One common solution to this is augmenting LLMs with a retrieval system and making sure that the generated output is attributable to the retrieved information.
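
Several entries in this section (bitsandbytes Int8 inference, GLM-130B's INT4 quantization) rely on storing weights in low-precision integers. A minimal sketch of the underlying idea, using plain absmax quantization to int8, is shown below. This is an assumption-laden simplification with hypothetical function names: it is not the scheme either project implements, and real methods are more elaborate (for example, per-row or per-block scales and special handling of outlier values).

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"codes={codes} scale={scale:.4f} worst_error={worst_error:.6f}")
```

Each weight now needs one byte instead of two or four, at the cost of a rounding error of at most half a quantization step per value.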

Transformer Reinforcement Learning

Transformer Reinforcement Learning from Human Feedback (RLHF).

Illustrating Reinforcement Learning from Human Feedback - Recent advances with language models (ChatGPT for example) have been powered by RLHF.
Training a Helpful and Harmless Assistant with RLHF (paper) by Anthropic. [code and red teaming data, tweet]
The Wisdom of Hindsight Makes Language Models Better Instruction Followers (paper) by UC Berkeley, 2023 - The underlying RLHF algo is complex and requires an additional training pipeline for reward and value networks. They consider an alternative approach, Hindsight Instruction Relabeling (HIR): converting feedback to instruction by relabeling the original one and training the model for better alignment.

Tools for RLHF

lvwerra/TRL - Train transformer language models with reinforcement learning.

Open source effort towards ChatGPT:

CarperAI/TRLX - Originated as a fork of TRL. It allows you to fine-tune Hugging Face language models (GPT2, GPT-NeoX based) up to 20B parameters using Reinforcement Learning. Brought to you by CarperAI (born at EleutherAI, an org that is part of the StabilityAI family). CarperAI is developing production-ready open-source RLHF tools. They have announced plans for the first open-source "instruction-tuned" LM.
allenai/RL4LMs - RL for language models (RL4LMs) by Allen AI. It's a modular RL library to fine-tune language models to human preferences.

Additional Reading

How to Build OpenAI's GPT-2: "The AI That's Too Dangerous to Release".
OpenAI’s GPT2 - Food to Media hype or Wake Up Call?
How the Transformers broke NLP leaderboards by Anna Rogers. 🔥🔥🔥

A well put summary post on problems with large models that dominate NLP these days.
Larger models + more data = progress in Machine Learning research ❓

Transformers From Scratch tutorial by Peter Bloem.
Real-time Natural Language Understanding with BERT using NVIDIA TensorRT on Google Cloud T4 GPUs achieves 2.2 ms latency for inference. Optimizations are open source on GitHub.
NLP's Clever Hans Moment has Arrived by The Gradient.
Language, trees, and geometry in neural networks - a series of expository notes accompanying the paper, "Visualizing and Measuring the Geometry of BERT" by Google's People + AI Research (PAIR) team.
Benchmarking Transformers: PyTorch and TensorFlow by Hugging Face - a comparison of inference time (on CPU and GPU) and memory usage for a wide range of transformer architectures.
Evolution of representations in the Transformer - An accessible article that presents the insights of their EMNLP 2019 paper. They look at how the representations of individual tokens in Transformers trained with different objectives change.
The dark secrets of BERT - This post probes fine-tuned BERT models for linguistic knowledge. In particular, the authors analyse how many self-attention patterns with some linguistic interpretation are actually used to solve downstream tasks. TL;DR: They are unable to find evidence that linguistically interpretable self-attention maps are crucial for downstream performance.
A Visual Guide to Using BERT for the First Time - Tutorial on using BERT in practice, such as for sentiment analysis on movie reviews by Jay Alammar.
Turing-NLG: A 17-billion-parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. This work would not be possible without breakthroughs produced by the DeepSpeed library (compatible with PyTorch) and ZeRO optimizer, which can be explored more in this accompanying blog post.
MUM (Multitask Unified Model): A new AI milestone for understanding information by Google.

Based on transformer architecture but more powerful.
Multitask means: supports text and images, knowledge transfer between 75 languages, understand context and go deeper in a topic, and generate content.

GPT-3 is No Longer the Only Game in Town - GPT-3 was by far the largest AI model of its kind last year (2020). Now? Not so much.
OpenAI's API Now Available with No Waitlist - GPT-3 access without the wait. However, apps must be approved before going live. This release also allows them to review applications, monitor for misuse, and better understand the effects of this tech.
The Inherent Limitations of GPT-3 - One thing missing from the article if you've read Gwern's GPT-3 Creative Fiction article before is the mystery known as "Repetition/Divergence Sampling":
when you generate free-form completions, they have a tendency to eventually fall into repetitive loops of gibberish.
For those using Copilot, you should have experienced this weirdness where it generates the same line or block of code over and over again.
Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval by DeepMind - The paper presents an analysis of Transformer-based language model performance across a wide range of model scales, from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.
Competitive programming with AlphaCode by DeepMind - AlphaCode uses transformer-based language models to generate code that can create novel solutions to programming problems which require an understanding of algorithms.
Building games and apps entirely through natural language using OpenAI's code-davinci model - The author built several small games and apps without touching a single line of code, simply by telling the model what they want.
Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers
GPT-3 can run code - You provide an input text and a command and GPT-3 will transform them into an expected output. It works well for tasks like changing coding style, translating between programming languages, refactoring, and adding docs. For example, it converts JSON into YAML, translates Python code to JavaScript, and improves the runtime complexity of a function.
Using GPT-3 to explain how code works by Simon Willison.
Character AI announces they're building a full-stack AGI company so you can create your own AI to help you with anything, using conversational AI research. The co-founders are Noam Shazeer (co-invented Transformers, scaled them to supercomputers for the first time, and pioneered large-scale pretraining) and Daniel de Freitas (led the development of LaMDA). All of this work is foundational to recent AI progress.
How Much Better is OpenAI’s Newest GPT-3 Model? - In addition to ChatGPT, OpenAI released text-davinci-003, a model tuned with reinforcement learning that produces better long-form writing. For example, it can explain code in the style of Eminem. 😀
OpenAI rival Cohere launches language model API - Backed by AI experts, it aims to bring Google-quality predictive language to the masses. Aidan Gomez co-wrote a seminal 2017 paper at Google Brain that invented a concept known as "Transformers".
Startups competing with OpenAI's GPT-3 all need to solve the same problems - Last year, two startups released their own proprietary text-generation APIs: AI21 Labs launched its 178-billion-parameter Jurassic-1 in Aug 2021, and Cohere released a range of models. Cohere hasn't disclosed how many parameters its models contain. ... There are other up-and-coming startups looking to solve the same issues. One is Anthropic, the AI safety and research company started by a group of ex-OpenAI employees. Several researchers have also left Google Brain to join two new ventures started by their colleagues: one outfit is named Character.ai, and the other Persimmon Labs.
Cohere Wants to Build the Definitive NLP Platform - Beyond generative models like GPT-3.
Transformer Inference Arithmetic technical write-up from Carol Chen, ML Ops at Cohere. This article presents detailed few-principles reasoning about LLM inference performance, with no experiments or difficult math.
State of AI Report 2022 - Key takeaways:

New independent research labs are rapidly open sourcing the closed source output of major labs.
AI safety is attracting more talent... yet remains extremely neglected.
OpenAI's Codex, which drives GitHub Copilot, has impressed the computer science community with its ability to complete code on multiple lines or directly from natural language instructions. This success spurred more research in this space.
DeepMind revisited LM scaling laws and found that current LMs are significantly undertrained: they’re not trained on enough data given their large size. They train Chinchilla, a 4x smaller version of their Gopher, on 4.6x more data, and find that Chinchilla outperforms Gopher and other large models on BIG-bench.
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key method to finetune LLMs and align them with human values. This involves humans ranking language model outputs sampled for a given input, using these rankings to learn a reward model of human preferences, and then using this as a reward signal to finetune the language model using RL.
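The reward-model step in the RLHF takeaway above can be sketched in a few lines. This is a minimal illustration, not the method of any specific lab: it assumes a pairwise logistic ranking loss (a common choice, but not stated in the report), and the names `pairwise_ranking_loss`, `ranking_to_pairs`, `reward_model_loss`, and the toy `reward_fn` are all hypothetical.

```python
import math


def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)), computed stably.

    Assumption: a pairwise logistic loss; the loss is small when the reward
    model scores the human-preferred output higher than the rejected one.
    """
    margin = reward_preferred - reward_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_to_pairs(ranked_outputs):
    """Expand one human ranking (best first) into (preferred, rejected) pairs."""
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]


def reward_model_loss(ranked_outputs, reward_fn):
    """Average pairwise loss of a reward function over one human ranking."""
    pairs = ranking_to_pairs(ranked_outputs)
    losses = [pairwise_ranking_loss(reward_fn(a), reward_fn(b)) for a, b in pairs]
    return sum(losses) / len(losses)


# Toy usage: a hypothetical "reward model" that simply scores by text length.
ranking = ["a detailed answer", "a short one", ""]  # best first, as ranked by a human
loss = reward_model_loss(ranking, reward_fn=len)
```

In a real pipeline the reward function would be a trained neural network, and minimizing this loss over many rankings is what fits it to human preferences. The fitted reward then serves as the signal for the RL finetuning step.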

The Scaling Hypothesis by Gwern - On GPT-3: meta-learning, scaling, implications, and deep theory.
AI And The Limits Of Language — An AI system trained on words and sentences alone will never approximate human understanding by Jacob Browning and Yann LeCun - What LLMs like ChatGPT can and cannot do, and why AGI is not here yet.
Use GPT-3 foundational models incorrectly: reduce costs 40x and increase speed by 5x - When fine-tuning a model, it is important to keep a few things in mind. There's still a lot to learn about working with these models at scale. We need a better guide.
The Next Generation Of Large Language Models - It highlights 3 emerging areas: 1) models that can generate their own training data to improve themselves, 2) models that can fact-check themselves, and 3) massive sparse expert models.

Educational

minGPT by Andrej Karpathy - A PyTorch re-implementation of GPT, both training and inference. minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can be a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code.
- nanoGPT - A rewrite of minGPT, still under active development. The associated, ongoing video lecture series Neural Networks: Zero to Hero builds GPT from scratch, in code, and aspires to spell everything out. Note that Karpathy's bottom-up approach and the fast.ai teaching style work well together. (FYI, fast.ai has both top-down ("part 1") and bottom-up ("part 2") approaches.)
A visual intro to large language models (LLMs) by Jay Alammar/Cohere - A high-level look at LLMs and some of their applications for language processing. It covers text generation models (like GPT) and representation models (like BERT).
Interfaces for Explaining Transformer Language Models by Jay Alammar - A gentle visual guide to Transformer models that looks at input saliency and neuron activation inside neural networks. Our understanding of why these models work so well, however, still lags behind these developments.
The GPT-3 Architecture, on a Napkin
PicoGPT: GPT in 60 Lines of NumPy
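To give a feel for why compact implementations like the ones above are possible, here is a minimal sketch of the core GPT operation, causal self-attention, in plain Python. It is a simplification written for this list and is not taken from minGPT, nanoGPT, or PicoGPT: it assumes a single head and omits the learned query/key/value projections, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head attention where position t attends only to positions <= t.

    Each argument is a list of vectors (lists of floats), one per position.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for t, query in enumerate(queries):
        # Causal mask: score the query only against current and earlier keys.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) for j in range(d_v)]
        )
    return outputs


# Toy usage with two positions and 2-dimensional vectors.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = causal_self_attention(q, k, v)
```

The first position can only see itself, so its output equals its own value vector. Later positions blend the values of everything before them, which is what lets the model predict the next token from its left context.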

Tutorials

How to train a new language model from scratch using Transformers and Tokenizers tutorial by Hugging Face. 🔥

AI Safety

Interpretability research and AI alignment research.

Transformer Circuits Thread project by Anthropic - Can we reverse engineer transformer language models into human-understandable computer programs? Interpretability research benefits a lot from interactive articles. Besides papers like "A Mathematical Framework for Transformer Circuits" and "Toy Models of Superposition", they've created several other resources as part of this effort.
Discovering Language Model Behaviors with Model-Written Evaluations (paper) by Anthropic et al. - They automatically generate evaluations with LMs. They discover new cases of inverse scaling where LMs get worse with size. They also find some of the first examples of inverse scaling in RLHF, where more RLHF makes LMs worse.
Transformers learn in-context by gradient descent (paper) by J von Oswald et al. [AI Alignment Forum]
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers (paper) by Microsoft Research.
Cognitive Biases in Large Language Models
Tracr: Compiled Transformers as a Laboratory for Interpretability (paper) (2023) by DeepMind - TRACR (TRAnsformer Compiler for RASP) is a compiler for converting RASP programs (DSL for Transformers) into weights of a GPT-like model. Usually, we train Transformers to encode algorithms in their weights. With TRACR, we go in the reverse direction; compile weights directly from explicit code. Why do this? Accelerate interpretability research. Think of it like formal methods (from software eng.) on Transformers. It can be difficult to check if the explanation an interpretability tool provides is correct. [Tweet, code]
Yann LeCun's unwavering opinion on current (auto-regressive) LLMs (Tweet)

Videos

BERTology

XLNet Explained by NLP Breakfasts.

Clear explanation. Also covers the two-stream self-attention idea.

The Future of NLP by 🤗

Dense overview of what is going on in transfer learning in NLP currently, limits, and future directions.

The Transformer neural network architecture explained by AI Coffee Break with Letitia Parcalabescu.

High-level explanation, best suited when unfamiliar with Transformers.

Attention and Transformer Networks

Sequence to Sequence Learning Animated (Inside Transformer Neural Networks and Attention Mechanisms) by learningcurve.

Official BERT Implementations

google-research/bert - TensorFlow code and pre-trained models for BERT.

Transformer Implementations By Communities

GPT and/or BERT implementations.

PyTorch and TensorFlow

🤗 Hugging Face Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. [Paper]
spacy-transformers - a library that wraps Hugging Face's Transformers in order to extract features to power NLP pipelines. It also calculates an alignment so the Transformer features can be related back to actual words instead of just wordpieces.
FasterTransformer - Transformer related optimization, including BERT and GPT. This repo provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA.
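The wordpiece-to-word alignment mentioned for spacy-transformers above can be illustrated with a small sketch. This is not the library's implementation or API: it assumes BERT-style wordpieces where continuation pieces carry a `##` prefix, and the helper names `align_wordpieces` and `pool_word_features` are hypothetical.

```python
def align_wordpieces(wordpieces, continuation_prefix="##"):
    """Group wordpiece indices by the word they belong to.

    Assumption: a piece starting with the continuation prefix extends the
    previous word; any other piece starts a new word.
    """
    alignment = []
    for index, piece in enumerate(wordpieces):
        if piece.startswith(continuation_prefix) and alignment:
            alignment[-1].append(index)
        else:
            alignment.append([index])
    return alignment


def pool_word_features(piece_features, alignment):
    """Average the wordpiece vectors of each word into one vector per word."""
    pooled = []
    for indices in alignment:
        dim = len(piece_features[indices[0]])
        pooled.append(
            [sum(piece_features[i][d] for i in indices) / len(indices) for d in range(dim)]
        )
    return pooled


# Toy usage: "transformers rock" split into four wordpieces.
pieces = ["trans", "##form", "##ers", "rock"]
features = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 9.0]]  # made-up vectors
alignment = align_wordpieces(pieces)
word_features = pool_word_features(features, alignment)
```

Averaging is only one way to combine pieces; taking the first piece or summing are common alternatives, and the right choice depends on the downstream task.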

PyTorch

codertimo/BERT-pytorch - Google AI 2018 BERT pytorch implementation.
innodatalabs/tbert - PyTorch port of BERT ML model.
kimiyoung/transformer-xl - Code repository associated with the Transformer-XL paper.
dreamgonfly/BERT-pytorch - A PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
dhlee347/pytorchic-bert - A Pytorch implementation of Google BERT.
pingpong-ai/xlnet-pytorch - A Pytorch implementation of Google Brain XLNet.
facebook/fairseq - RoBERTa: A Robustly Optimized BERT Pretraining Approach by Facebook AI Research. SoTA results on GLUE, SQuAD and RACE.
NVIDIA/Megatron-LM - Ongoing research training transformer language models at scale, including: BERT.
deepset-ai/FARM - Simple & flexible transfer learning for the industry.
NervanaSystems/nlp-architect - NLP Architect by Intel AI. Among other libraries, it provides a quantized version of Transformer models and an efficient training method.
kaushaltrivedi/fast-bert - Super easy library for BERT based NLP models. Built based on 🤗 Transformers and is inspired by fast.ai.
NVIDIA/NeMo - Neural Modules is a toolkit for conversational AI by NVIDIA. They are trying to improve speech recognition with BERT post-processing.
facebook/MMBT from Facebook AI - Multimodal transformer model that combines a transformer model and a computer vision model to classify images and text.
dbiir/UER-py from Tencent and RUC - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo (with more focus on Chinese).
lucidrains/x-transformers - A simple but complete full-attention transformer with a set of promising experimental features from various papers (good for learning purposes). There is a 2021 paper rounding up Transformer modifications, Do Transformer Modifications Transfer Across Implementations and Applications?.

Keras

Separius/BERT-keras - Keras implementation of BERT with pre-trained weights.
CyberZHG/keras-bert - Implementation of BERT that could load official pre-trained models for feature extraction and prediction.
bojone/bert4keras - Light reimplementation of BERT for Keras.

TensorFlow

guotong1988/BERT-tensorflow - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
kimiyoung/transformer-xl - Code repository associated with the Transformer-XL paper.
zihangdai/xlnet - Code repository associated with the XLNet paper.

Chainer

soskek/bert-chainer - Chainer implementation of "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

Transfer Learning in NLP

NLP finally had a way to do transfer learning probably as well as Computer Vision could.

As Jay Alammar put it:

The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (It's been referred to as NLP's ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks).
One of the latest milestones in this development is the release of BERT, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced the code of the model, and made available for download versions of the model that were already pre-trained on massive datasets. This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component – saving the time, energy, knowledge, and resources that would have gone to training a language-processing model from scratch.
BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al).
ULMFiT: Nailing down Transfer Learning in NLP
ULMFiT introduced methods to effectively utilize a lot of what the model learns during pre-training – more than just embeddings, and more than contextualized embeddings. ULMFiT introduced a language model and a process to effectively fine-tune that language model for various tasks.
NLP finally had a way to do transfer learning probably as well as Computer Vision could.

MultiFiT: Efficient Multi-lingual Language Model Fine-tuning by Sebastian Ruder et al. MultiFiT extends ULMFiT to make it more efficient and more suitable for language modelling beyond English. (EMNLP 2019 paper)

Books

Transfer Learning for Natural Language Processing - A book that is a practical primer to transfer learning techniques capable of delivering huge improvements to your NLP models.
Natural Language Processing with Transformers by Lewis Tunstall, Leandro von Werra, and Thomas Wolf - This practical book shows you how to train and scale these large models using Hugging Face Transformers. The authors use a hands-on approach to teach you how transformers work and how to integrate them in your applications.

Other Resources
