What is transfer learning and why does it make NLP easier?

Learn more about this technique and its applications

By Santiago Cisco, Data Scientist at etermax

In recent years, new deep learning architectures have sharply improved the performance of models. As language models evolved, from Word2Vec in 2013 to today's Transformers, they became usable in increasingly complex tasks. However, these gains come at a cost: the models are far more expensive to train, and training a Transformer well requires a large volume of data and solid infrastructure. One technique helps tackle this issue: transfer learning.

Language models

Language models take one or more sequences of text as input and use them to perform a task, such as making a prediction. In most cases, this involves transforming the text into a numeric representation and processing it to obtain a result.
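As a rough illustration of what "transforming text into a numeric representation" can look like, here is a minimal sketch. It assumes a toy word-level vocabulary built by hand; the function names `build_vocab` and `encode` and the special tokens `[PAD]` and `[UNK]` are illustrative choices, and real models rely on learned subword tokenizers instead.

```python
# Toy illustration: turn text into token IDs with a hand-built vocabulary.
# The word-level scheme and all names here are assumptions for the example.

def build_vocab(sentences):
    """Assign an integer ID to every distinct lowercase word."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len=6):
    """Map words to IDs, then truncate or pad to a fixed length."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


corpus = ["the cat sat on the mat", "the dog barked"]
vocab = build_vocab(corpus)

# "bird" is not in the vocabulary, so it maps to the [UNK] ID.
encoded = encode("the bird sat", vocab)
print(encoded)  # [2, 1, 4, 0, 0, 0]
```

The fixed-length list of integers is the kind of input a model can then process numerically.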

These models can perform different types of tasks, such as classifying sentences into predefined categories, translating, summarizing, or generating text from an encoded input, and extracting answers from a textual input.

The evolution of language models brought greater complexity and a larger number of parameters, which made training a model from scratch extremely expensive for many applications. The graph below shows that early pre-trained language models, such as ELMo and the first GPT, had around 100 million parameters. Two years later, the T5 and Turing models had already exceeded 10 billion. Transfer learning is a way to avoid paying this cost for every new application.

Increase in the number of parameters in language models through the years. Source: HuggingFace.

Transfer Learning

Neural networks are a family of machine learning models in which layers of nodes (neurons) are connected to one another, with nonlinear activation functions applied between them. The first layers of the network learn general characteristics, while the last ones extract task-specific characteristics and make the prediction. This makes it possible to take the first layers of a model trained on a huge amount of data and reuse them in another model that addresses a similar problem but has only a limited amount of training data. The general characteristics learned by the first model then speed up the training of the second one.

This technique is called ‘transfer learning’, and it is widely used in NLP models. The first stage is known as ‘pre-training’, while the second one is called ‘fine-tuning’.

The large amount of text available on the internet today is a cheap and accessible source of data for training a general model through self-supervised learning. In this type of training, the targets are derived automatically from the inputs, so no human labeling is required.

An example of a model that has been widely used as a base for training other NLP models is BERT (Bidirectional Encoder Representations from Transformers), developed by Google in 2018. It was trained on English Wikipedia and the BooksCorpus dataset, which together amount to about 3.3 billion words. BERT was originally trained on two tasks. In the first, a fraction of the words in each input (about 15%) was masked, and the model had to predict the missing words using the rest of the text as context. In the second, the model received pairs of sentences and had to predict whether the second sentence was the continuation of the first. This way, BERT acquired solid statistical knowledge of the general characteristics of language.
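The two pre-training tasks described above share one property: their training targets can be produced from raw text alone. The sketch below shows one simplified way to build such examples. The helper names `mask_tokens` and `next_sentence_pairs` are hypothetical, the tokens are plain words, and every selected token is replaced by `[MASK]`, which is a simplification of BERT's actual procedure (it sometimes substitutes a random token or leaves the token unchanged).

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Hide a random subset of tokens.

    Returns the masked sequence and a dict {position: original token}
    holding the targets the model would have to predict.
    """
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok
        else:
            masked.append(tok)
    if not labels:  # short inputs: make sure there is at least one target
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


def next_sentence_pairs(sentences, rng):
    """Build (first, second, is_next) examples from consecutive sentences.

    Roughly half the pairs keep the true continuation; the rest use another
    sentence from the same list. Assumes at least three distinct sentences.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


rng = random.Random(0)  # seeded so the example is reproducible

tokens = "transfer learning lets small teams reuse large pretrained models".split()
masked, labels = mask_tokens(tokens, rng)
print(masked)
print(labels)

document = [
    "BERT was released in 2018.",
    "It was trained on a large text corpus.",
    "Many NLP models are built on top of it.",
]
pairs = next_sentence_pairs(document, rng)
for first, second, is_next in pairs:
    print(is_next, "|", first, "->", second)
```

No human annotation appears anywhere in this process: the labels come from the text itself.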

The following step is known as ‘fine-tuning’. It involves keeping the first layers of the pre-trained model and replacing the last layers with randomly initialized ones that learn the specific task, for instance a classifier that predicts the category a question belongs to. Then either all layers are retrained on the new data, or only the last ones if little data is available. The idea behind this process is that the first layers already hold values that are useful for language-related tasks, so adapting them is far simpler than training the entire model from scratch, and it typically yields better performance when task-specific data is scarce.
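The variant in which only the last layers are trained can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how a real Transformer is fine-tuned: `PRETRAINED_W` is a made-up stand-in for weights learned during pre-training, the new head is a small logistic-regression classifier, and the four data points are invented for the example.

```python
import math
import random

# Hypothetical "pre-trained" first layer. These numbers are stand-ins,
# not values taken from any real model.
PRETRAINED_W = [[0.9, -0.4], [0.3, 0.8]]


def features(x, w):
    """Frozen layer: linear map followed by a tanh activation."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def predict(h, head):
    """New task head: logistic regression on top of the frozen features."""
    z = sum(wi * hi for wi, hi in zip(head["w"], h)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))


def loss(data, w, head):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(features(x, w), head)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fine_tune(data, w, head, lr=0.5, epochs=200):
    """Train only the head with SGD; the pre-trained weights w are never updated."""
    for _ in range(epochs):
        for x, y in data:
            h = features(x, w)
            err = predict(h, head) - y  # cross-entropy gradient w.r.t. the logit
            head["w"] = [wi - lr * err * hi for wi, hi in zip(head["w"], h)]
            head["b"] -= lr * err
    return head


# Small labeled dataset for the new task (invented for illustration).
data = [([1.0, 0.2], 1), ([0.8, -0.1], 1), ([-0.9, 0.1], 0), ([-0.7, -0.3], 0)]

rng = random.Random(0)
head = {"w": [rng.uniform(-0.1, 0.1) for _ in range(2)], "b": 0.0}  # random init

frozen_before = [row[:] for row in PRETRAINED_W]
loss_before = loss(data, PRETRAINED_W, head)
head = fine_tune(data, PRETRAINED_W, head)
loss_after = loss(data, PRETRAINED_W, head)

print("loss before:", round(loss_before, 3), "loss after:", round(loss_after, 3))
print("pre-trained weights unchanged:", PRETRAINED_W == frozen_before)
```

Only three numbers are learned here (two head weights and a bias), which is why a handful of examples is enough. Retraining all layers would mean also updating `PRETRAINED_W`, which generally calls for more data.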

Pre-training and fine-tuning process. Adapted from: TensorFlow ML Tech Talks: Transfer learning and Transformer models.

Conclusion

Thanks to the recent revolution in NLP models, there are more and more opportunities in this field. For a company that specializes in trivia, like etermax, the range of tools available for boosting the quality of our games is huge. However, this increase in performance came with higher costs in terms of the infrastructure and data needed to train a model from scratch. Techniques like transfer learning allow data science professionals at this type of company to make the most of state-of-the-art techniques without the cost being prohibitive.

Sources:

https://machinelearningmastery.com/transfer-learning-for-deep-learning/
https://huggingface.co/course
https://huggingface.co/blog/bert-101
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
https://towardsdatascience.com/evolution-of-language-models-n-grams-word-embeddings-attention-transformers-a688151825d2
TensorFlow ML Tech Talks: Transfer learning and Transformer models. Available at: https://www.youtube.com/watch?v=LE3NfEULV6k