An Introduction to Transfer Learning in NLP and HuggingFace

In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of transfer learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by Hugging Face, in particular our Transformers, Tokenizers and Datasets libraries and our models.

This talk was presented at ML conf EU 2020. Check out the latest edition of this tech conference.

FAQ

The main advantages of transfer learning are data efficiency and improved performance. It allows models to perform well with fewer data points by utilizing knowledge from previously learned tasks. This approach mimics human learning behaviors, making it a powerful method in machine learning.

Sequential transfer learning involves multiple steps, starting with pre-training on a large dataset to develop a general-purpose model, followed by fine-tuning or adapting this model for specific tasks. This method is widely used due to its effectiveness in improving model performance across various tasks.

Pre-training in NLP typically involves language modeling, a self-supervised learning objective where the model predicts the next word in a sentence given the previous words. This process leverages large amounts of unannotated text, allowing the model to learn language patterns and structures effectively.
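As a rough illustration of this self-supervised setup (not code from the talk), the sketch below scores a sentence with a causal language model from the Transformers library; the text serves as its own label, and the returned loss is the average negative log-likelihood of each next token. The "gpt2" checkpoint is just a convenient public example.

```python
# Minimal sketch of the self-supervised language modeling objective:
# the text itself provides the labels, so no annotation is needed.
# Assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Transfer learning lets models reuse knowledge from past tasks."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model predict each next token
# from the previous ones; the loss is the average negative log-likelihood.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(f"per-token loss: {outputs.loss.item():.3f}")
```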

Hugging Face is pivotal in democratizing NLP technologies by developing and sharing powerful tools and libraries like Transformers, Tokenizers, and Datasets. These contributions help the community to access state-of-the-art models, promote NLP research, and apply advanced models in practical applications.
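Here is a minimal sketch of how the three libraries fit together, assuming recent versions of `transformers` and `datasets`; the "bert-base-uncased" checkpoint and the "imdb" dataset are only examples, not anything specific to the talk.

```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset("imdb", split="train[:100]")             # Datasets: load and slice a public dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Tokenizers: fast tokenizer under the hood
model = AutoModel.from_pretrained("bert-base-uncased")          # Transformers: pretrained model weights

batch = tokenizer(dataset[0]["text"], truncation=True, return_tensors="pt")
hidden_states = model(**batch).last_hidden_state
print(hidden_states.shape)  # (1, sequence_length, 768) for bert-base
```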

BERT (Bidirectional Encoder Representations from Transformers) is trained using a masked language modeling task, where random words are masked and predicted by the model. GPT (Generative Pre-trained Transformer), however, uses an auto-regressive approach where each word is predicted based on the previous words in a sentence. These training differences make BERT better at understanding context, while GPT excels at generating coherent text sequences.
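The contrast can be seen directly with the `pipeline` API from the Transformers library; this is a hedged sketch using the public "bert-base-uncased" and "gpt2" checkpoints as stand-ins for the two families.

```python
from transformers import pipeline

# BERT-style: predict a masked token from context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transfer learning is a [MASK] technique in NLP.")[0]["token_str"])

# GPT-style: generate the next tokens auto-regressively, left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning is", max_new_tokens=20)[0]["generated_text"])
```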

Transfer learning in NLP involves using knowledge gained while solving one problem and applying it to a different but related problem. For example, a model trained on one language task can be fine-tuned to perform another language task, leveraging the pre-trained knowledge rather than starting from scratch.

Thomas Wolf
32 min
02 Jul, 2021

Video Summary and Transcription

The video discusses transfer learning in NLP, focusing on its efficiency and data usage. It highlights Hugging Face's contributions, such as the Transformers, Tokenizers, and Datasets libraries, which make NLP tools accessible. The speaker explains sequential transfer learning and the use of models like BERT and GPT. BERT is trained by predicting masked tokens, while GPT uses an auto-regressive approach. The video also covers challenges like out-of-domain generalization and model size reduction methods like distillation. Hugging Face's model hub and web interface are mentioned as resources for exploring models and datasets. The dialogue agent example illustrates practical applications of these models in NLP. Researchers are working on techniques like Mixout to avoid local minima in models. The video emphasizes the importance of starting with simpler models and scaling up as needed.

1. Introduction to Transfer Learning in NLP

Short description:

Today, we're going to talk about transfer learning in NLP. In transfer learning, we reuse knowledge from past tasks to bootstrap our learning. This approach allows us to learn with just a few data points and achieve better performance. At Hugging Face, we're developing tools for transfer learning in NLP.

Hi, everyone. Welcome to my talk. Today, we're going to talk about transfer learning in NLP. I'll start by talking a little bit about the concepts and the history, then present the tools that we are developing at Hugging Face. And then I hope you have a lot of questions for me.

We'll finish with a Q&A session. OK, let's start with the concepts. What is transfer learning? That's a very good question. Here is the traditional way we do machine learning. Usually, when we're faced with a first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our dataset to get the machine learning system that we'll use in production to make predictions.

Now, when we're faced with a second task, we'll usually gather another set of data, another dataset. We'll randomly initialize our model and train it from scratch again to get the second learning system, and the same when we're faced with a third task: we'll have a third dataset and a third machine learning system, again initialized from scratch, that we'll use in production.

This is not the way we humans learn. Usually, when we're faced with a new task, we reuse all the knowledge we've learned in past tasks, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning. You can see that as having a lot of data, a lot of datasets, that we've already used to build a knowledge base. This gives us two main advantages. The first one is that we can learn with just a few data points, because we can interpolate between them, which naturally gives us some form of data augmentation, if you want. The second advantage is that we can also leverage all this knowledge to reach better performance. So humans are typically more data efficient and reach better performance than machine learning systems. Transfer learning is one way to try to do the same for statistical learning, for machine learning. Last summer, we did a very long tutorial on this, a three-hour tutorial.

2. Sequential Transfer Learning with BERT

Short description:

Today, we'll discuss sequential transfer learning, which involves pre-training a general-purpose model like BERT and then fine-tuning it. Language modeling is a self-supervised pre-training objective that maximizes the probability of the next word. This approach doesn't require annotated data and is versatile, making it useful for low-resource languages. Transformers like BERT are commonly used for transfer learning in NLP.

So you can check out these links. There are 300 slides, a lot of hands-on exercise, and an open source code base. So if you want more information, you really should go there.

So there are a lot of ways you can do transfer learning. But today I'm going to talk about sequential transfer learning, which is currently the most used flavor of transfer learning, if you want. Sequential transfer learning, like the name says, is a sequence of steps, at least two steps. The first step is called pre-training. During this step, you'll try to gather as much data as you can. We'll try to build basically some kind of knowledge base, like the knowledge base we humans build. And the idea is that we end up with a general-purpose model.

There are a lot of different general-purpose models. You've probably heard about many of them: Word2Vec and GloVe were the first models leveraging transfer learning. They were word embeddings, but today we use models which have a lot more parameters and which are fully pre-trained, like BERT, GPT or DistilBERT. These models are pre-trained as general-purpose models. They are not focused on one specific task, but they can be used for a lot of different tasks. How do we do that? We do a second step of adaptation, or fine-tuning usually, in which we select the task we want to use our model for, and we fine-tune the model on this task. Here you have a few examples: text classification, word labeling, question answering.
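As a rough sketch of this adaptation step (not code from the talk), the example below fine-tunes a pretrained BERT checkpoint on a text classification dataset with the `Trainer` API from `transformers`; the checkpoint, dataset, and training settings are illustrative choices.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Start from the pretrained encoder; only the small classification head
# on top is initialized from scratch.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    tokenizer=tokenizer,
)
trainer.train()
```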

But let's start with the first step, pre-training. The way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label. We can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. You can see that as: given some context, you try to maximize the probability of the next word, or the probability of a masked word. The nice thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to train a really high-capacity model. This is great for many things, and in particular for low-resource languages. It's also very versatile: as I told you, you can decompose this probability as a product of probabilities over various views of your text. And this is very interesting from a research point of view.
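To make that decomposition concrete, here is the standard chain-rule factorization that language modeling relies on, written in standard notation rather than taken from the talk's slides; the training objective is the corresponding negative log-likelihood.

```latex
P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\mathcal{L} = -\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})
```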

Now, what do the models look like? There are two main flavors of model, and they are both Transformers, because Transformers are kind of interesting from a scalability point of view. The first one is called BERT. To train a BERT model, you do what we call masked language modeling, which is a denoising objective.
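Going one level below the pipeline shown earlier, here is a minimal sketch of that denoising idea: corrupt the input by replacing a token with the mask token, then ask the model to recover it. It assumes the `transformers` library and the public "bert-base-uncased" checkpoint; the sentence is a made-up example.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Corrupt the input by masking one token; the model is trained to recover it.
text = f"Transfer learning is a {tokenizer.mask_token} technique in NLP."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and read off the model's top prediction for it.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```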

QnA