Language Models

With such a variety of readily available fine-tuned models, it's difficult to choose the best one for your use case. This guide tells you about different types of models and when to use them.

What Are Language Models?

Language models are neural networks trained on large amounts of text to learn how a language works. If you think about the way you're using your native tongue, you'll observe that it's intuitive. You don't have to think about grammar rules when speaking, you just know and apply them intuitively. A language model is an attempt to enable machines to use the language in a similar way humans do.

Language models are no magic. The way a model works is that it decides how probable it is that a word is valid in a particular word sequence. By valid, we mean that it resembles the way humans talk, not that it complies with any grammar rules.

The latest advance in NLP is the transformer architecture which led to top-performing models, such as BERT. It differs from previously used architectures in that instead of processing passages of text word-by-word, transformer-based models can process whole sentences at once, which makes them much faster in training but also in inference. They can also learn the relations between different words in a sentence and how these relations affect the expected model output—a functionality called attention mechanism.

Fine-Tuned Models

A fine-tuned model is a language model that was trained on a large amount of raw text to perform a specific NLP task, such as text classification or question answering. A lot of fine-tuned models are publicly available for free on model hubs such as Hugging Face. This means that you can just grab them from there and use them in your NLP app as a starting point. You no longer need to train your own model, which takes a lot of time and a lot of data, and uses a lot of computer resources.

Model Naming Conventions

If you're new to models, it's helpful to understand the model naming convention as already the name gives you a hint about what the model can do. The first part of the name is the user that trained and uploaded the model. Then, there is the architecture that the model uses and the size of the model. The last part of the name is the dataset used to train the model.
Here's a typical model name:


deepset is the user who trained and uploaded the model, roberta is the model architecture, base is the size, and squad2 is the dataset on which the model was trained.

Model Types

You don't need to have in-depth knowledge of the architecture to be able to use a model but it's helpful to have a rough overview of the most common model types out there and their key differences. This will help you to better understand which models to try out in your own pipeline. Here's a timeline showing when the most common model architectures were developed.

A timeline showing when particular models were invented.A timeline showing when particular models were invented.

To get an idea of how different model types perform, have a look at Reader Performance Benchmarks. For a list of recommended models, see Models in deepset Cloud.

Model Size

Large models have better accuracy but it comes at a cost of speed and the resources it needs to run. Large models can be significantly slower than smaller ones and they use up a lot of your graphic card memory or even exceed its limit. Distilled models are a good compromise, as they preserve accuracy similar to the larger models but are much smaller and easier to deploy. To learn more about it, see Model Distillation.

Training Dataset

There are standard datasets used when training a model for a particular task. For example, SQuAD stands for Stanford Question Answering Dataset and is the standard dataset used for training English-language models for question answering.


When searching for a model for your task, you can check what dataset is typically used for it and then search for a model that was trained on this dataset.

Related Links