English versionEN

[EN] Power of Transfer Learning in NLP: Build a Text Classification Model Using BERT
[ES] El poder del aprendizaje transferido en NLP: Construye un modelo de clasificación de texto usando BERT

Power of Transfer Learning in NLP: Build a Text Classification Model Using BERT

The domain of Natural Language Processing have seen a tremendous amount of research and innovation in the past couple of years to tackle the problem of implementing high quality machine learning and AI solutions using natural text. Text Classification is one such area that is extremely important in all sectors like finance, media, product development, etc. Building up a text classification system from scratch for every use case can be challenging in terms of cost as well as resources, considering there is a good amount of dataset to begin training with.

Here comes the concept of transfer learning. Using some of the models that has been pre-trained on terates of data and fine-tuning it based on the problem at hand is the new way to efficiently implement machine learning solutions without spending months on data cleaning pipeline.

This talk with highlight ways of implementing the newly launched BERT and fine tuning the base model to build an efficient text classifying model. Basic understanding of python is desirable.

This talk has been presented at ML conf EU 2020, check out the latest edition of this Tech Conference.

FAQ

Transfer learning is a machine learning technique where a model trained on one task is repurposed for a second, related task. This method leverages the knowledge gained from the first task to improve performance on the second task.

The presenter is Jayita Pudukonda, a Senior Data Scientist at Intelliint US Inc., based in New York City.

The main challenges in NLP include handling ambiguity, synonymity, coreference, and the syntactic rules of the English language. These complexities make it difficult for a computer program to understand and process human language accurately.

BERT (Bidirectional Encoder Representations from Transformers) is the first deeply bidirectional model trained in an unsupervised way on plain text. Its significance lies in its ability to understand the context of words in a sentence by looking at both the left and right sides simultaneously, making it highly effective for various NLP tasks.

Common applications of NLP include machine translation, chatbots, text-to-audio and audio-to-text conversion, building knowledge trees, intent classification, natural text generation, topic modeling, clustering, and text classification such as sentiment analysis.

Data preprocessing in NLP is crucial because it ensures that the data is clean, grammatically correct, and semantically meaningful. This step includes handling extra spaces, tokenization, spell check, contraction mapping, stemming, lemmatization, and removing stopwords. Proper preprocessing leads to better model performance.

Transfer learning is useful because it helps when there is a scarcity of labeled data, which can be expensive and time-consuming to create. It allows the use of pre-trained models on large datasets to be repurposed for smaller, related tasks, thereby saving resources and improving efficiency.

Word embeddings are feature vector representations of words in a training corpus. They capture the context of a word in a document, its semantic and syntactic relationship with other words, and its meaning. This helps in tasks like clustering similar words and understanding word analogies.

BERT handles the context of words by being bidirectional, meaning it looks at both the left and right sides of a word simultaneously. This allows BERT to understand the meaning of a word based on its surrounding words, improving the accuracy of tasks like sentence prediction and question answering.

BERT is trained on two main tasks: masked language modeling and next sentence prediction. Masked language modeling involves predicting missing words in a sentence, which helps in understanding the language context. Next sentence prediction involves determining if one sentence logically follows another, which helps in understanding the relationship between sentences.

machine learning

Jayeeta Putatunda

35 min

02 Jul, 2021

Comments

Video Summary and Transcription

This video talk delves into the intricacies of Transfer Learning with BERT in Natural Language Processing (NLP). BERT is a deeply bidirectional model that learns from plain text in an unsupervised manner, making it adept at understanding word context. The talk highlights its applications in tasks like text generation, machine translation, chatbots, and topic modeling. It also covers essential text preprocessing techniques, such as tokenization, stemming, and lemmatization, which are crucial for effective NLP. Transfer learning is emphasized as a way to leverage pre-trained models for new tasks, particularly when labeled data is scarce. The video also discusses the importance of clean data and understanding baseline NLP modeling. BERT's ability to handle context-based words and synonyms is explored, along with its limitations in strict classification labels or conversational AI. The talk concludes with practical advice on understanding transformers and BERT by focusing on the problems they solve and gradually exploring their implementation in business scenarios.

Available in Español: El poder del aprendizaje transferido en NLP: Construye un modelo de clasificación de texto usando BERT

1. Introduction to Transfer Learning and NLP

Short description:

Hello, everyone! Welcome to this session of Transfer Learning with Burt. Today, we'll learn about Transfer Learning and NLP. NLP is a subfield of linguistics and AI. We'll discuss its challenges and how it can be used for text classification.

Hello, everyone, and welcome to this session of Transfer Learning with Burt. With me, I'm very excited that you all are here and join me in the session. So let's see what we learn about Transfer Learning and Burt today.

Before we start, a quick introduction about me. Hi, my name is Jayita Pudukonda. I work as a Senior Data Scientist at Intelliint US Inc. We are based out of New York City. And to give you an overview, Intelliint, we are a customer driven professional services company. We specialize in tailor made software and tech solutions. We work with a lot of data analytics and AI based companies and build some of our tools and softwares around those states. So you can connect me on Twitter and LinkedIn if you have any questions about the session later on and we can discuss about that.

Great. So for today's session, we're going to talk about like 20 minutes for the session and then we have some Q&A. I'd also have some code that I can share over my GitHub. So do reach out if you want to take a look at those later after the session. So now the exciting part, I say this as an NLP is hard and you would say why. Now take a look at this picture. What's the first inference that comes to your mind? I know like human beings can make like, you know, great connections and references. This is a very trivial task for us. But if you think from a perspective of a computer program, this is a very daunting challenge to kind of understand this complexity of English language. So here in the picture it says, I am a huge Metal fan. So us like humans, we would know that, okay, this is the Metal fan is like, you know, a personifying electric electric component and says that, okay, I'm a metal fan, but it can also refer to that you are a huge metal, you know, music fan. So how do you, how the computer differentiates between these two meaning of the same terminology. So there's ambiguity, there's synonymity, there's coreference. And of course the syntactic rules of English literature, that kind of hampers, or makes it more daunting task for a computer program. So for today's agenda, we'll just quickly go over what's NLP, how, where it is used, how transfer learning can be used. And we look at a simple case of utilizing part for a text classification model. So let's jump right into it. For NLP, I feel that this image very well describes it. It's a subfield of linguistics, artificial intelligence.

2. Introduction to NLP Applications and Techniques

Short description:

There's computer science and also in information engineering. NLP has had an exponential growth in the last two years. It's used in machine translation, chatbots, intent classification, text generation, topic modeling, clustering, and text classification. To work with NLP, we need to handle extra spaces, tokenize text correctly, perform spell check, and use contraction mapping.

There's computer science and also in information engineering. So basically helps all the machines to understand and, you know, communicate back and forth with human beings in a free-flowing speech without losing context or references. And if you see NLP has had an exponential growth in the last, I would say two, two, years, like the huge models that Google, OpenAI and Vidya has been working on and releasing has like humongous amount of parameters. So just in May, 2020, OpenAI came up with a 175 billion parameter model, which is, you can understand how much text has gone into the processing of that model and how much work that it can do with so much accuracy.

So where is NLP used? I know that most of you are definitely familiar with it, but I just wanted to give a quick description of where I feel that it's being used the most. And I've worked hands-on on those areas. So definitely machine translation, like text to audio, audio to text, there's chatbot, building of knowledge trees, intent classification, there's also, you know, natural text generation. I'm sure when you use Gmail, you have seen that the prompt that keeps coming up when you're writing an email, says that, oh, that this two next words, I think would be good for the sentence that you're trying to complete. That's like, you know, text completion prompts that you get. It's also used a lot in topic modeling, clustering, understanding with the context of the whole or what kind of insights can be generated from huge amount of text data. And also text classification, which is like, you know, do you want to do a sentiment analysis? Like how do you understand what the general idea, say Yelp reviews or Amazon product reviews. So those have a lot of implications and good applications by NLP problems.

So how do we do it? I know that, these can sound a lot, underlying, but this is very important that we do it in all kinds of business cases, or business problems that we're trying to solve using NLP, so just a quick idea. We need to handle when we have, say, all our texts, make sure that we handle extra spaces. Then we also need to look after how we tokenize our text. So, tokenization just using, by spacing is the very traditional norm, but we also need to take care of use cases, like say, the whole world as United States of America, if we tokenize it just by the spaces, sometimes it can happen that that doesn't make sense in the context that we're trying to work through it, right? So then we need to keep that whole United States of America as a whole phrase, rather than tokenize it by a sentence, so then the, so that the information extraction works much better than, than otherwise if it's tokenized by word. The next step would be spell check. So I'm referring here directly to a very, like a great tool that Peter Norvig from Google created. It's a spell checker. Basically the idea is to, you know, kind of compare the distance between multiple words. And see that, okay. Does this spelling make sense and how close it is to a similar spelling or a similar word meaning that's there in the, you know, the vector space of the whole NLP corpus. So you can see here that when I pass the wrong spelling with a single L, the return value would be a spelling with a double L and also for corrected with the K not with the K and the final value would be corrected with a C. So this can actually help in making sure that our data is clean. And like they say that in NLP it's like garbage in and garbage out. Sorry about that garbage in and garbage out. So we need to make sure that the clean data makes sense with a grammatical sense. Syntactic sense and also semantic meaningfulness. The next step would be contraction mapping. This might seem that why we need to do it.

3. Tokenization, Stemming, and Lemmatization

Short description:

Tokenization may not handle contractions like won't and can't correctly. Stemming and lemmatization help maintain corpus metrics and identify context.

But then the understanding that, okay, if you're using a word like won't and can't the tokenization may or may not be able to handle it very correctly. So then we need to expand it and map it back to its full words like won't, will not and can't, cannot. The next step is definitely stemming and lemmatization. So this is just to ensure that we're keeping our NLP corpus metrics as tightened as possible using similar words like gaming, games, games, has the base word as game. So the idea is to keep the base word as our word and then so that the meaning around that remains the same. And the context can be identified in a better way.

4. Stopwords and Understanding Data

Short description:

Stopwords are extra words like articles and pronouns that may not add much context to a sentence. Removing them depends on the specific use case. Legal or patent datasets may not benefit from removing stopwords. Understanding the data and its structure is crucial before modeling.

The next is stopwords. Definitely. So, you know, when we speak in a correct way, say, when you're scribbling a lot of text from Twitter, Reddit and all other news blogs, there is a lot of extra additional words like the articles, and there is uh, and all the other pronouns that may or may not sometimes add much context to the whole sentence. So in this way we can get rid of them. But this, I would also like to mention that this is very much a case to case basis. Like if you have a good amount of data, like say a legal data set or a patent data set, that may or may not work well with removing all these stop words. So we need to understand our business use cases and make sense, make sure that it makes sense to remove those stop words. And of course there's like case based things that you can do, like you can tag for what kind of, is it a verb, is it a noun, like that would help you to understand your text insights much better before jumping into, you know, start modeling. So that's why it says that you should know the data or the unstructured text is more difficult to understand. So in order to make sure that once you're jumping into the modeling sections, before that you have like your complete understanding and insights about what kind of data you have at hand that you're dealing with.

5. Understanding Transfer Learning

Short description:

Transfer learning is a machine learning technique where a model trained on one task is repurposed for a second task. It involves utilizing the knowledge from a previously created dataset and retraining the last layer to classify new data. This technique can be applied in various scenarios, such as image classification and speech recognition.

Great. So now that we have cleared all the baseline techniques, let's jump right into and think about what's transfer learning. So this is a great quote by Andrew Engie, of course, he's a huge figure in the AI ML space. And in 2016, he mentioned that transfer learning will be the great next driver of ML success. And we are right now in 2020. And I think that, and you would see that so many problems or so many models have come out to solve different kinds of models via transfer learning.

So what is it? So a quick understanding for it is just nothing but just a machine learning technique, where a model is kind of trained on one task and it's kind of repurposed for a second task. Now, we'll see what that means. So let's say you have a dataset, right? You have a dataset one, which is like, say, a general image dataset and why the target variable is a classified object. That object could be, you know, a cat or a dog, a tree, a bus, or a truck, anything, anything. Any worldly objects that's already there in the image dataset. An example of it is ImageNet. I'm sure all of you have worked a little bit or have tried your hands on with this dataset in your work. Now, this, in a general view, is like a structure of neural network with few hidden layers and input layers and an output layer. So the output layer for the dataset one problem would be the, you know, if the classification between if it's cat or dog or any other particular object from the training dataset.

Now, think about the second case. Now, the second case that you have a dataset two and that's a very small dataset. You do not have a lot of training cases, but you have few. And the target data set is if it's an image of an urban image or if it's a rural image. Like, so what do you think is the common, you know, connection between these two datasets? The datasets is like both of them deal with image, both of them have, you know, a lot of worldly objects captured into it. But the final goal is exactly not the same. So here, transfer learning comes into action where you can, you know, kind of utilize the knowledge from the data set one that you have created in your first phase of modeling and just, you know, retrain the last layer. Here, it's the layer four if you can see, I mentioned is the last layer dense layer, retrained the last layer weights using softmax and class and like, you know, create or change those a final output layer from classifying it into CAD dog ship tree, et cetera, to classifying it as urban or rural based on the new data set that we have intertwined with level four. Now this can seem that, okay, why are we doing this? We are before that. Let's see, where are we doing this? So like, say, even in speech recognition, if you're saying a scraping or voice data from different sources like YouTube, and that's an English or Hindi, which is like, you know, one of the languages that we speak back in India. And then you pre train your network. Now, say you have a very small data set in Bengali and you want to kind of, you know, translate or understand what the speech is. So you can utilize the model that you have trained with a bigger data set in English and then use like, you know, translation. And then you utilize that for speech recognition, which isn't Bengali. The second image, the second example is what we spoke about right now.

6. Introduction to Transfer Learning

Short description:

Transfer learning is used when there is a scarcity of labeled data. Instead of creating a new labeled dataset, we can utilize a pre-trained model that has already solved a similar problem. The input properties of the tasks should be similar.

So like if you have an image network, millions of tagged images of general objects of the world, you can utilize that. Or, you know, image recognition, which is like urban versus rural, and so on. So why do we use transfer learning? So that's because, is there, like, a positive gain that we want to achieve? Yes, first, is that if there is a scarcity of labeled data. We understand that sometimes creating labeled data set is like very expensive. We need to, you know, connect, spend some time and expert resources on subject matter SMEs, who kind of help us in creating those tagged datasets. So, if you do not have that to start with, and it can get expensive, what can we do in our, that's in our hand, is that we try and find out, if there's already a bigger problem that has been solved, or a pre-trained model that has been solved. And there comes, if there's always, if there's already a network that exists, I would say with a massive amounts of data and it's pre-trained that we can utilize that's like the next best scenario we can do. And of course, the task one and task two should have kind of, you know, the same input properties. Is this what I mentioned? Like in, for example, we spoke about, we had, you know, a data that had a, like cat dog and other worldly objects. And the second would be the rural versus urban, but the input features or components kind of remained the same.

7. Introduction to Transfer Learning and BERT

Short description:

In transfer learning, we leverage a knowledge base from two source tasks to achieve a target task. BERT is a key component in this process. Word embeddings create connections between words based on their context. Clusters of related words help the model understand their proximity. Burt Mountain is a term coined by Chris McCormick, who emphasizes the importance of understanding transformer attention for applying BERT.

This I'm sure all of you have seen this image before. This is just the difference between a traditional ML learning model, where you have different tasks and you create, you know, different learning system for each task. But here in the transfer learning, you have, you know, two source tasks where you create a model, you have a knowledge base for it. And the idea is to how can we leverage that knowledge base to achieve a target task in the case that we discussed was urban versus rural.

Now, BERT comes into play here. And before we understand what BERT is, let's talk a little bit very briefly about what's word embeddings. Now, word embedding is nothing but, you know, a feature vector representation of the word. That you have in your training corpus. What does that mean? So the underlying concept of utilizing context in word is to create a connection or references between multiple words. So the famous example is like, you know, man is to king. So what is woman is to what? Is to. So the answer would be queen. So that kind of relations happen when we map words, utilizing different sentences in the context of word embedding.

So the next example I just want to highlight is that if you see here, there are multiple words and you can see tiny clusters and you can see that fish, eggs and meat kind of fall under a similar cluster. This is a simplified version in 2D. And in real case, it's much better with multidimensional matrix representation. But here, if you see another example, it's like heat, electricity, energy, oil and fuel kind of is like a cluster, a mini cluster and its own. So that makes sense to the model that OK, if these words, one of these comes, then we can say that the distance between these words are similar and they are closer to each other. In this in this example, it just shows that helicopter, drone and rocket are much closer to each other versus helicopter to goose. Whereas goose, eagle and bee are much closer to each other than they are to drone or rocket. OK. So this is a very famous word, Burt Mountain, like I called it, and Chris McCormick has termed made this terminology. So what he mentions is that what has been made on top of a lot of technology that we have been seeing developing over the years. But if you are studying right now and are trying to understand LLP and Burt, don't think that it's a big domain of making. You have to do a lot of research to it and it's impossible to learn. It is not. So he says that, OK, Transformer attention, these are the two concepts that we need to understand. And that will give you a good baseline to start applications of Burt and fine tuning it. Now, sometimes this can seem like, you know, big, daunting terminology. So how I feel is that, you know, Transformers can be a heavy concept, but they are there to help us.

8. Introduction to Burt and Bidirectionality

Short description:

Burt is the first deeply bidirectional model trained in an unsupervised way and in plain text. It learns information from both sides of the input and considers the context of tokens during training. Burt's structure is token-based, segment-based, and position-based. It can understand different meanings of words based on context. Burt has been pre-trained on tasks like masked word prediction and sentence prediction. Attention is a concept that helps in understanding the context of words.

So just like Optimus Prime, and don't get derailed by it. But just think about it in the simple concept in how we can explain it is you have, say, input criteria. So there is like a layer of encoders that would take those input. And then there's decoders that will transfer those inputs into the outcome that you want. Like, say, here the translation is happening from one language and the output language is a different language.

So basically, what is Burt? So Burt is is the first of its kind. It's the first deeply bidirectional model trained in an unsupervised way and in a plain text. So when I say bidirectional, what I mean is so if you have say, multiple sentences, some of the initial models would do it sequentially. First, they would learn from left to right and then they would learn from right to left. But Burt is a single bond that does it at once. How this, or what this helps us in achieving is that it makes sure that it learns information from both the sides together and takes a particular tokens context into reference during training without losing out any additional information about what what are the tokens that's available on the left side and what are the tokens available on the right side.

So just to give you a quick understanding, this is a very famous example like if you look at these two sentences, it says that we went to the river bank and it says that I need to go to a bank to make a deposit. So the word bank here has different, grammatically, it's the same. It's a word called bank but its meaning is completely different based on the context words that it has been used with. Now Burt can quickly understand okay, which is the river versus it's a bank, it's a financial institution that it's connected to. So this is the quick layer or I would say the structure of how Burt has been arranged in. It's token based, it's segment based, and it's also position based, that which position you are in the whole sentence and if the token is in the first sentence, if it's a multi-sentence corpus and of course the token sentence. These are the two tasks I just quickly walk through over. I know we are closing on in time, but we can discuss more in the Q&A section if you have any particular questions about it.

So, you know, if the idea is for the first one to understand the strength of bidirectionality is that the man went to the dash and he went to the dash of milk, which is we are masking the words here, and the labels that the bird is trying to produce or predict is that the first sentence would be store, and the next word would be gallon. And the next use case that bird has been pre-trained on is like sentence prediction. Like, if you see the first sentence, the man went to the store. And sentence B is he bought a gallon of milk. So that makes sense, right, those two sentences fall with each other and this label would be his next sentence. But if you see, the second set is that the man went to the store and the penguins are flightless, though they're grammatically correct in their respective sentences, but they do not belong with each other. So then the label would be not next sentence. So it's predicting if the next sentence or the next token makes sense with one that you have in hand or not. So this is the concept of attention. Attention is nothing but if you look at the word the animal didn't cross the street because it was too tired. I have given the link by Jay Allwar.

QnA

Discussion on BERT and NLP Problems

Short description:

He talks about attention and the importance of utilizing pre-built AI models for small daily tasks. Clean data and understanding the baseline NLP modeling are crucial. Thank you to MLConf and the fellow speakers. Let's answer some questions. What kind of NLP problems does BERT not work well on? It depends on the business problem. Albert and Bert Light can be effective for question answering classification.

He talks about it in a very particular way. So you can look at it and understand about attention more and we can discuss more about in our Q&A section.

So I'm not going to look over the code at this point, but I just want to highlight one more factor that, you know, training a particular AI model like this is a report by MIT, and it says that during a particular I model can emit as much carbon as, you know, five cars in their lifetime. So the idea of transfer learning is to see if we can utilize it already pre-built and use that in our small daily tasks, rather than retraining again and again for particular each use cases.

So, as closing remarks, it is difficult, but we need to make sure the data is clean, understand our data better than anything else. It's not magic and always advanced neural networks are not the answer to all your problems. Start with the baseline NLP modeling, and then go step by step up. These are some of the additional resources you can look through and read through some of the blogs. Thank you to MLConf for having this brilliant event. Thank you to all the fellow speakers and you guys who joined us. Thanks. I'm ready to take any questions if you guys have, and do definitely check out the Dell Ensport portfolio. I'm sure you'll find a lot of good use cases and case studies that we've worked on in the industry. Great, I think, thank you to all of you and thanks for joining. Let's answer some questions, if you guys have any. Hey Jay, how are you doing? Hi, Ajay, how are you doing? I'm doing fine. Fantastic, fantastic, fantastic. I know, I understand that you're also in the United States. It always comforts me to know that someone who I'm talking to is very close to home, only a few thousand kilometers away. Yeah, the other side of the country. East coast, west coast, doesn't matter. It's cool. All right, so with that, let's actually jump into some Q and A's by our lovely audience. So here's actually a question from a. Here's a question. OK, so what kind of NLP problems does BERT not work well on? Any thoughts on that? Yeah. So I guess the best idea or like once you start learning step by step, sometimes utilizing or like looking at some of the state of the art measures would kind of give you an idea, but I know for a fact why I chose BERT for classification, it may or may not be the. I guess it's not the optimal algorithm that you can use for a classification technique, but I guess for all kinds of question answering, like where you have a factual kind of a data set, where it's kind of making sure that it's reading from the blocks of data that you're training up in, it kind of works better in that, but for classification, I guess, you, again, it sometimes kind of depends on what kind of business problem you're trying to solve. So what I've seen working with it is that for Albert kind of gave me a pretty good score even, and even Bert Light gave a good score for a tiny question answering classification that I was working on very recent project. And that kind of helps.

Applications and Limitations of BERT

Short description:

Some applications of BERT can be used to solve multiple problems, depending on the data and mapping. The state of the art results can provide a head start. BERT may not perform well in strict classification labels or conversational AI. There is ongoing development in the industry to improve BERT's performance.

And I always feel that there is no good, correct answer to it. Some of these applications are built for multiple problems, to solve multiple problems, and it kind of depends on what kind of data you have, how you are mapping it, and what's the end result. So looking at some of the use cases and what are the state of the art results for those models, I guess can filter those weeds out of what kind of problems you may or may not use these techniques for, but definitely can give you a head start.

So even for the classification problem, it just gave me a very baseline score of say 0.6, which is not great, but it's good to start. And that's where you get your hands on, and say, okay, how can I tune it on, and maybe add more parameters, or what kinds of feature engineering I can do to improve that on.

Yeah, that actually is a very good point, because I do know that, like you mentioned, I don't think BERT could perform well if there's very strict classification labels in question answering, and very direct answers. But I'm thinking, like BERT might not be the best. At least we're not at the stage yet where it's very good in conversational AI, where I think I've seen some video examples where BERT kind of gives a little funky answers here and there, especially if the responses tend to be very long-winded. Sometimes it doesn't make sense at times. I guess that's why the problem of the clarify or maybe answering through tension is the goal at the end. And I guess the big people in the industry are still working on it, and there's so much in process that they're still developing, so I guess one day we can say that oh, put everything to BERT, and BERT will solve it. I'm waiting for that day to come. One day, one day. Let's hope, and it will come soon.

BERT Training and NLP Challenges

Short description:

BERT is trained on next sentence prediction and mass language modeling to handle synonymous words, contextual understanding, and coherent mapping. It solves the challenge of capturing essence in bi-directional training and ensuring meaningful context between sentences. Mass language modeling helps BERT understand the language itself, while next sentence prediction focuses on contextual understanding. The main challenge in working as a data scientist in NLP is dealing with unstructured text data and addressing issues like context-based words, synonyms, and spelling errors.

Alrighty, so another question. You know, I think you mentioned towards the end of your presentation or something like that, that we will attack this in the Q&A session, so I'm going to ask you this now. Why do we train BERT on the two problems of next sentence prediction and mass language modeling specifically, and why not, like, any other tasks? Yeah, I guess so, that actually is the answer to what we are trying to solve, right? Let's start from the thought process of word2vec. So word2vec is also a good way of doing this language training model, but what it doesn't serve is that it doesn't help us in creating that contextual and understanding that how to handle synonymous words. Like sometimes it will map it correctly. Sometimes the use case example that I showed for a bank, a river bank and a financial bank institution, how do you correctly segregate those same terms based on context? So for MAST, what happens is since we are doing bi-directional training with BERT, the idea is to capture all the essence, and if we do not do it by the MAST way, what will happen is that BERT will be able to kind of, because it's reading from both sides, will be able to map it or kind of like cheat in its own way to kind of see what the next word could be. Yeah, it actually happens in that way. So I guess the MAST way is just to make sure that we are restricting it to kind of not see the solutions in all its training, bi-directional training, but then trying to infer it from some of it that it has seen versus mapping it to some of the other ways. So I guess that answers one of the core challenges that we are trying to solve, versus the other models. And the second one for the next sentence, I guess it's very important that in order to understand context and think of it from the perspective that if you want to do a summarization, if you if you're doing an abstract summarization, how do you know that when you are summarizing it getting the essence of all the words or all the sentences from the whole block of training corpus that you're passing it through? So I guess the next sentence is just to ensure that we are not mapping it incoherently, even sometimes because sentences can be grammatically correct on its own, but it may not mean or it may not create semantic meaning when placed next to each other. So when placed next to each other, it should make sense. And that continuity flow of meaning should be kept within that goal analysis. So I guess from developing from all the models that we have seen over the time, I guess part is trying to solve these two solutions in a hands on core way. And that's how I infer it and see it working in all the solutions that it's kind of applied to. Yeah. So and yeah, actually, that's also another good point, because I guess, overall, like so mass language modeling, it's like Bird is just trying to understand the language itself, whereas next sentence prediction, it's trying to understand context within sentences or between sentences for like larger inputs. Right? Even if not only the direct one, but also like how it maps into the workflow of it. Let's say a sentence one and sentence three makes sense. But then sentence two doesn't belong with sentence three. So how do you keep that conversational or meaningful? Like it all comes down to attention again, like what you're talking about in the context. Is this following a correct pattern in the way it's describing through it? So, yeah, it's one of the key, key features of Purposeful. Very fascinating.

Another question here from our audience. What is your main challenge or difficulty when you started working as a data scientist? Yeah, the challenge still remains. It doesn't go away. The challenge is like, you know, so when I started working with NLP Text data, the first was like, you know, it was too unstructured. How do you actually pre-process it to make sense of the data. Because in NLP, everything makes sense that if you put in garbage, you are going to get out garbage. How there was challenges of fixing, you know, context based words, usage, how to attach synonyms, how to connect to or fix spelling errors. Spelling errors is a big issue in unstructured NLP problems.

Understanding Transformers and BERT

Short description:

Sometimes domain-specific spellings require working with SMEs to understand the domain knowledge needed. Data preprocessing is crucial for NLP tasks and leads to better models. Understanding transformers and BERT can be achieved by focusing on the problems they solve and gradually exploring their implementation in business use cases. Core understanding, step-by-step learning, and applying the concepts help gain proficiency.

And sometimes it happens that since they are like very domain specific spellings, you may or may not be able to use any generic spell checkers that's out there. So you need to kind of work with SMEs, try and figure out what's the domain knowledge that you need. Can they give us some corpus or metadata or tags that they have or the taxonomy that they have in the back end?

Sometimes it doesn't work out that way. So it does need a lot of explorating and understanding the data yourself to kind of see how you can fix it. So yeah, I guess the data pre-processing for any NLP task is a very crucial part and that kinds of leads you to making a better model. Otherwise, it doesn't matter how much feature engineering you do or how much fine tuning you do. It will not work out and you're going to be stuck in a maze of what am I doing wrong. If your model is only as good as your data is so garbage in, garbage out. Absolutely.

Alrighty. So here's another question. So, you know, you showed this BERT pyramid and that was very interesting. And like you said, just learning all these concepts shouldn't be that daunting. But like when I look at that pyramid, I see like, OK, you know, BERT I need to know transformers. And to know that I also need to know attention. And to know that many papers on LSTMs, RNNs and understand that. And eventually you can see how you can get into this rabbit hole where like when you're just studying one topic, you have to read two hundred papers.

So how do you understand transformers, and BERT specifically, without being bogged down by these very, very hard core technical details? But you need to be understanding. So how do I do this? Yeah. So I would say the first way of doing it is to just understand what it's trying to solve. Like say that, what are the transformers trying to solve? The first thing is encode, decode, and then create that part of transformation of data. What is attention trying to do? It's just basically trying to create context within contextual data saying that, OK, this word belongs to this, this word, or this noun, or this objective that it's referencing to. These are the two general ideas. Now you need to map it up. I understand that the underlying, there has been a lot of in process ways that have been developed. But then if you understand just this core understanding and then maybe read up on some very, I've mentioned some of the blogs by GLMR and Chris McCormick, take a look at how they have implemented in their business use cases. If you understand what they do and if you don't want to get stuck down the rabbit hole to understand all the math behind it at once, I would say do it step by step. But understand what it does first, then see what kind of business scenarios it can solve, what kind of data you need for that scenario. Once you start applying it a little bit and read up step by step what it's doing, then that's how you kind of get a hang of it. If I'm stuck at this position, what are the fine-tuning options do I have? What are the additional features I can create to kind of map that problem? It happens a lot.

Exploring Challenges and Q&A

Short description:

If you look up any topic, you'll find thousands of reading materials and thousands of papers associated with it. It's trying to solve some of the challenges that Word2Vec cannot handle. The internal steps involve mask modeling and sentence prediction. Unfortunately, we're out of time for Q&As, but Jaytha will be available in her speaker room.

If you look up any topic, you'll find thousands of reading materials and thousands of papers associated with it. But that's been my learning curve. I just understand what it's trying to solve from where. It's trying to solve some of the challenges that Word2Vec cannot handle, building on top of that.

What are the parameters, inputs needed? What do the output categories look like? And then what are the internal steps? It's doing mask modeling, it's doing sentence prediction. These are the two main features. Okay, now what kind of data do I need to solve some challenges? So yeah, just to keep that process and thinking flow going without getting too stuck in the internal goals. But if you're interested, please you can always do that I guess. Definitely.

And honestly, I think a lot of us would be interested in hearing way more details. But unfortunately, at this moment, we're kind of out of time for Q&As. But Jaytha will definitely be available in her speaker room starting right now. So if you have any questions that you need to ask her, check out her. You can go talk to her right now and go to her speaker room and I'm sure she would be splendidly pleasant to answer some of your wonderful questions. Thank you so much. It was great having you here Jaytha. Thank you so much, Ajay. Of course, it was a pleasure having you here. And we'll see you very soon. Take care.

Available in other languages:

Check out more articles and videos

We constantly think of articles and videos that might spark Git people interest / skill us up or help building a stellar career

TensorFlow.js 101: ML in the Browser and Beyond

ML conf EU 2020

41 min

TensorFlow.js 101: ML in the Browser and Beyond

Jason Mayes

Web AI Lead at Google.

TensorFlow.js enables machine learning in the browser and beyond, with features like face mesh, body segmentation, and pose estimation. It offers JavaScript prototyping and transfer learning capabilities, as well as the ability to recognize custom objects using the Image Project feature. TensorFlow.js can be used with Cloud AutoML for training custom vision models and provides performance benefits in both JavaScript and Python development. It offers interactivity, reach, scale, and performance, and encourages community engagement and collaboration between the JavaScript and machine learning communities.

tensorflow machine learning innovation tensorflow react

Transformers.js: State-of-the-Art Machine Learning for the Web

JSNation 2025

27 min

Transformers.js: State-of-the-Art Machine Learning for the Web

Joshua Lochner

Hugging Face

Joshua introduces Transformers JS and Hugging Face, emphasizing community collaboration and pre-trained models. Transformers JS evolution led to 1.4 million monthly users, supporting 155 architectures. The library's browser-based capabilities offer real-time processing, cost-efficiency, and scalability. Integration enhancements include native web GPU execution and React Native implementation. Web ML implementation focuses on Onyx Runtime for device execution and web GPU for resource optimization. Browser-based ML applications cover vision, speech recognition, and text-to-speech. Advanced implementations include multimodal applications and educational tools. Interactive AI demonstrations showcase semantic search and conversational AI scenarios. Model licensing transitions to ECMAScript for efficiency and model redownloading factors are discussed.

machine learning webgpu

Using MediaPipe to Create Cross Platform Machine Learning Applications with React

React Advanced 2021

21 min

Using MediaPipe to Create Cross Platform Machine Learning Applications with React

Workshops on related topic

Can LLMs Learn? Let’s Customize an LLM to Chat With Your Own Data

C3 Dev Festival 2024

48 min

Can LLMs Learn? Let’s Customize an LLM to Chat With Your Own Data

WorkshopFree

Andreia Ocanoaia

Feeling the limitations of LLMs? They can be creative, but sometimes lack accuracy or rely on outdated information. In this workshop, we’ll break down the process of building and easily deploying a Retrieval-Augmented Generation system. This approach enables you to leverage the power of LLMs with the added benefit of factual accuracy and up-to-date information.

machine learning artificial intelligence openai

Leveraging LLMs to Build Intuitive AI Experiences With JavaScript

JSNation 2024

108 min

Leveraging LLMs to Build Intuitive AI Experiences With JavaScript

Workshop

2 authors

Today every developer is using LLMs in different forms and shapes, from ChatGPT to code assistants like GitHub CoPilot. Following this, lots of products have introduced embedded AI capabilities, and in this workshop we will make LLMs understandable for web developers. And we'll get into coding your own AI-driven application. No prior experience in working with LLMs or machine learning is needed. Instead, we'll use web technologies such as JavaScript, React which you already know and love while also learning about some new libraries like OpenAI, Transformers.js

machine learning artificial intelligence openai

Let AI Be Your Docs

JSNation 2024

69 min

Let AI Be Your Docs

Workshop

Jesse Hall

Join our dynamic workshop to craft an AI-powered documentation portal. Learn to integrate OpenAI's ChatGPT with Next.js 14, Tailwind CSS, and cutting-edge tech to deliver instant code solutions and summaries. This hands-on session will equip you with the knowledge to revolutionize how users interact with documentation, turning tedious searches into efficient, intelligent discovery.
Key Takeaways:
- Practical experience in creating an AI-driven documentation site.- Understanding the integration of AI into user experiences.- Hands-on skills with the latest web development technologies.- Strategies for deploying and maintaining intelligent documentation resources.
Table of contents:- Introduction to AI in Documentation- Setting Up the Environment- Building the Documentation Structure- Integrating ChatGPT for Interactive Docs

frameworks machine learning artificial intelligence

Hands on with TensorFlow.js

ML conf EU 2020

160 min

Hands on with TensorFlow.js

Workshop

Jason Mayes

Come check out our workshop which will walk you through 3 common journeys when using TensorFlow.js. We will start with demonstrating how to use one of our pre-made models - super easy to use JS classes to get you working with ML fast. We will then look into how to retrain one of these models in minutes using in browser transfer learning via Teachable Machine and how that can be then used on your own custom website, and finally end with a hello world of writing your own model code from scratch to make a simple linear regression to predict fictional house prices based on their square footage.

tensorflow machine learning

The Hitchhiker's Guide to the Machine Learning Engineering Galaxy

ML conf EU 2020

112 min

The Hitchhiker's Guide to the Machine Learning Engineering Galaxy

Workshop

Alyona Galyeva

Are you a Software Engineer who got tasked to deploy a machine learning or deep learning model for the first time in your life? Are you wondering what steps to take and how AI-powered software is different from traditional software? Then it is the right workshop to attend.
The internet offers thousands of articles and free of charge courses, showing how it is easy to train and deploy a simple AI model. At the same time in reality it is difficult to integrate a real model into the current infrastructure, debug, test, deploy, and monitor it properly. In this workshop, I will guide you through this process sharing tips, tricks, and favorite open source tools that will make your life much easier. So, at the end of the workshop, you will know where to start your deployment journey, what tools to use, and what questions to ask.

machine learning

Introduction to Machine Learning on the Cloud

ML conf EU 2020

146 min

Introduction to Machine Learning on the Cloud

Workshop

Dmitry Soshnikov

This workshop will be both a gentle introduction to Machine Learning, and a practical exercise of using the cloud to train simple and not-so-simple machine learning models. We will start with using Automatic ML to train the model to predict survival on Titanic, and then move to more complex machine learning tasks such as hyperparameter optimization and scheduling series of experiments on the compute cluster. Finally, I will show how Azure Machine Learning can be used to generate artificial paintings using Generative Adversarial Networks, and how to train language question-answering model on COVID papers to answer COVID-related questions.

azure machine learning