An Introduction to Transfer Learning in NLP and HuggingFace

In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of transfer learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by Hugging Face, in particular our Transformers, Tokenizers and Datasets libraries and our models.

This talk was presented at ML conf EU 2020. Check out the latest edition of this tech conference.

FAQ

The main advantages of transfer learning are data efficiency and improved performance. It allows models to perform well with fewer data points by utilizing knowledge from previously learned tasks. This approach mimics human learning behaviors, making it a powerful method in machine learning.

Sequential transfer learning involves multiple steps, starting with pre-training on a large dataset to develop a general-purpose model, followed by fine-tuning or adapting this model for specific tasks. This method is widely used due to its effectiveness in improving model performance across various tasks.

Pre-training in NLP typically involves language modeling, a self-supervised learning objective where the model predicts the next word in a sentence given the previous words. This process leverages large amounts of unannotated text, allowing the model to learn language patterns and structures effectively.
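As a rough illustration of this self-supervised setup (not code from the talk), the sketch below scores a sentence with a causal language model from the Transformers library; the text serves as its own label, and the returned loss is the average negative log-likelihood of each next token. The "gpt2" checkpoint is just a convenient public example.

```python
# Minimal sketch of the self-supervised language modeling objective:
# the text itself provides the labels, so no annotation is needed.
# Assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Transfer learning lets models reuse knowledge from past tasks."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model predict each next token
# from the previous ones; the loss is the average negative log-likelihood.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(f"per-token loss: {outputs.loss.item():.3f}")
```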

Hugging Face is pivotal in democratizing NLP technologies by developing and sharing powerful tools and libraries like Transformers, Tokenizers, and Datasets. These contributions help the community to access state-of-the-art models, promote NLP research, and apply advanced models in practical applications.
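Here is a minimal sketch of how the three libraries fit together, assuming recent versions of `transformers` and `datasets`; the "bert-base-uncased" checkpoint and the "imdb" dataset are only examples, not anything specific to the talk.

```python
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset("imdb", split="train[:100]")             # Datasets: load and slice a public dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # Tokenizers: fast tokenizer under the hood
model = AutoModel.from_pretrained("bert-base-uncased")          # Transformers: pretrained model weights

batch = tokenizer(dataset[0]["text"], truncation=True, return_tensors="pt")
hidden_states = model(**batch).last_hidden_state
print(hidden_states.shape)  # (1, sequence_length, 768) for bert-base
```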

BERT (Bidirectional Encoder Representations from Transformers) is trained using a masked language modeling task, where random words are masked and predicted by the model. GPT (Generative Pre-trained Transformer), however, uses an auto-regressive approach where each word is predicted based on the previous words in a sentence. These training differences make BERT better at understanding context, while GPT excels at generating coherent text sequences.
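The contrast can be seen directly with the `pipeline` API from the Transformers library; this is a hedged sketch using the public "bert-base-uncased" and "gpt2" checkpoints as stand-ins for the two families.

```python
from transformers import pipeline

# BERT-style: predict a masked token from context on both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Transfer learning is a [MASK] technique in NLP.")[0]["token_str"])

# GPT-style: generate the next tokens auto-regressively, left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning is", max_new_tokens=20)[0]["generated_text"])
```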

Transfer learning in NLP involves using knowledge gained while solving one problem and applying it to a different but related problem. For example, a model trained on one language task can be fine-tuned to perform another language task, leveraging the pre-trained knowledge rather than starting from scratch.

Thomas Wolf
32 min
02 Jul, 2021

Video Summary and Transcription

The video discusses transfer learning in NLP, focusing on its efficiency and data usage. It highlights Hugging Face's contributions, such as the Transformers, Tokenizers, and Datasets libraries, which make NLP tools accessible. The speaker explains sequential transfer learning and the use of models like BERT and GPT. BERT is trained by predicting masked tokens, while GPT uses an auto-regressive approach. The video also covers challenges like out-of-domain generalization and model size reduction methods like distillation. Hugging Face's model hub and web interface are mentioned as resources for exploring models and datasets. The dialogue agent example illustrates practical applications of these models in NLP. Researchers are working on techniques like Mixout to avoid local minima in models. The video emphasizes the importance of starting with simpler models and scaling up as needed.

1. Introduction to Transfer Learning in NLP

Short description:

Today, we're going to talk about transfer learning in NLP. In transfer learning, we reuse knowledge from past tasks to bootstrap our learning. This approach allows us to learn with just a few data points and achieve better performance. At Hugging Face, we're developing tools for transfer learning in NLP.

Hi, everyone. Welcome to my talk. Today, we're going to talk about transfer learning in NLP. I'll start by talking a little bit about the concepts and the history, then present the tools that we are developing at Hugging Face. And then I hope you have a lot of questions for me.

We'll finish with a Q&A session. OK, let's start with the concepts. What is transfer learning? That's a very good question. Here is the traditional way we do machine learning. Usually, when we're faced with a first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our dataset to get the machine learning system that we'll use in production to make predictions.

Now, when we're faced with a second task, we'll usually gather another set of data, another dataset. We'll randomly initialize our model and train it from scratch again to get the second learning system, and the same when we're faced with a third task: we'll have a third dataset and a third machine learning system, again initialized from scratch, that we'll use in production.

This is not the way we humans learn. Usually, when we're faced with a new task, we reuse all the knowledge we've learned in past tasks, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning. You can see that as having a lot of data, a lot of datasets, that we've already used to build a knowledge base. This gives us two main advantages. The first one is that we can learn with just a few data points, because we can interpolate between them, which naturally gives us some form of data augmentation, if you want. The second advantage is that we can also leverage all this knowledge to reach better performance. So humans are typically more data efficient and reach better performance than machine learning systems. Transfer learning is one way to try to do the same for statistical learning, for machine learning. Last summer, we did a very long tutorial on this, a three-hour tutorial.

2. Sequential Transfer Learning with BERT

Short description:

Today, we'll discuss sequential transfer learning, which involves pre-training a general-purpose model like BERT and then fine-tuning it. Language modeling is a self-supervised pre-training objective that maximizes the probability of the next word. This approach doesn't require annotated data and is versatile, making it useful for low-resource languages. Transformers like BERT are commonly used for transfer learning in NLP.

So you can check out these links. There are 300 slides, a lot of hands-on exercise, and an open source code base. So if you want more information, you really should go there.

So there are a lot of ways you can do transfer learning. But today I'm going to talk about sequential transfer learning, which is currently the most used flavor of transfer learning, if you want. Sequential transfer learning, like the name says, is a sequence of steps, at least two steps. The first step is called pre-training. During this step, you'll try to gather as much data as you can. We'll try to build basically some kind of knowledge base, like the knowledge base we humans build. And the idea is that we end up with a general-purpose model.

There are a lot of different general-purpose models. You've probably heard about many of them: Word2Vec and GloVe were the first models leveraging transfer learning. They were word embeddings, but today we use models which have a lot more parameters and which are fully pre-trained, like BERT, GPT or DistilBERT. These models are pre-trained as general-purpose models. They are not focused on one specific task, but they can be used for a lot of different tasks. How do we do that? We do a second step of adaptation, or fine-tuning usually, in which we select the task we want to use our model for, and we fine-tune the model on this task. Here you have a few examples: text classification, word labeling, question answering.
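As a rough sketch of this adaptation step (not code from the talk), the example below fine-tunes a pretrained BERT checkpoint on a text classification dataset with the `Trainer` API from `transformers`; the checkpoint, dataset, and training settings are illustrative choices.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Start from the pretrained encoder; only the small classification head
# on top is initialized from scratch.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    tokenizer=tokenizer,
)
trainer.train()
```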

But let's start with the first step, pre-training. The way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label. We can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. You can see that as: given some context, you try to maximize the probability of the next word, or the probability of a masked word. The nice thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to train a really high-capacity model. This is great for many things, and in particular for low-resource languages. It's also very versatile: as I told you, you can decompose this probability as a product of probabilities over various views of your text. And this is very interesting from a research point of view.
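To make that decomposition concrete, here is the standard chain-rule factorization that language modeling relies on, written in standard notation rather than taken from the talk's slides; the training objective is the corresponding negative log-likelihood.

```latex
P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\mathcal{L} = -\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})
```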

Now, what do the models look like? There are two main flavors of model, and they are both Transformers, because Transformers are kind of interesting from a scalability point of view. The first one is called BERT. To train a BERT model, you do what we call masked language modeling, which is a denoising objective.
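Going one level below the pipeline shown earlier, here is a minimal sketch of that denoising idea: corrupt the input by replacing a token with the mask token, then ask the model to recover it. It assumes the `transformers` library and the public "bert-base-uncased" checkpoint; the sentence is a made-up example.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Corrupt the input by masking one token; the model is trained to recover it.
text = f"Transfer learning is a {tokenizer.mask_token} technique in NLP."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and read off the model's top prediction for it.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```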

QnA