DocumentLanguageClassifier

Classify documents by language and add the language to the document's metadata.

Key Features

  • Detects the language of each document using the langdetect library
  • Adds a language metadata field with the ISO language code to each document
  • Supports all languages available in langdetect
  • Marks documents in languages not included in the configured list as unmatched
  • Defaults to English (en) when no languages are specified
  • Works well with MetadataRouter for language-based document routing
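
The labeling behavior described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the component's implementation: `Doc`, `toy_detect`, and `classify` are hypothetical names, and the keyword-based detector only stands in for a real detector such as langdetect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def toy_detect(text: str) -> str:
    """Hypothetical detector; a real setup would call a library such as langdetect."""
    words = set(text.lower().split())
    if words & {"der", "die", "das", "und", "ist"}:
        return "de"
    if words & {"le", "la", "les", "et", "est"}:
        return "fr"
    return "en"


def classify(
    docs: List[Doc],
    languages: Optional[List[str]] = None,
    detect: Callable[[str], str] = toy_detect,
) -> List[Doc]:
    """Add a `language` metadata field to each document."""
    allowed = languages or ["en"]  # fall back to English when no list is given
    for doc in docs:
        detected = detect(doc.content)
        # Languages outside the configured list are marked as unmatched.
        doc.meta["language"] = detected if detected in allowed else "unmatched"
    return docs


docs = classify(
    [Doc("This is a short note."), Doc("Das ist ein Haus."), Doc("La maison est grande.")],
    languages=["en", "de"],
)
print([d.meta["language"] for d in docs])  # ['en', 'de', 'unmatched']
```

With `languages=["en", "de"]`, the French document is labeled `unmatched` because `fr` is not in the configured list. Calling `classify` without a list applies the English-only default.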

Configuration

  1. Drag the DocumentLanguageClassifier component onto the canvas from the Component Library.
  2. Click the component to open the configuration panel.
  3. Configure the parameters as needed. Pass a list of ISO language codes you want the component to classify your documents by. If a language is not included in the list, documents in that language are classified as unmatched.

Connections

DocumentLanguageClassifier accepts a list of documents as input. It outputs the same documents with an added language metadata field. It typically receives documents from converters such as TextFileToDocument and sends classified documents to MetadataRouter, which routes them further down the pipeline based on their language. It can also connect to any component that outputs or accepts documents.

Usage Example

Using the Component in an Index

In this index, DocumentLanguageClassifier sends classified documents to MetadataRouter. MetadataRouter then routes English documents to a different document store than the German documents.

components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
      languages:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        english:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: en
        german:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: de
  DocumentWriter_English:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_German:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters:
          bm25_tokenization_regex: (?u)\b\w\w+\b
          bm25_algorithm: BM25L
          bm25_parameters:
          embedding_similarity_function: dot_product
          index: ''
          async_executor:

connections: # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: DocumentLanguageClassifier.documents
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.english
    receiver: DocumentWriter_English.documents
  - sender: MetadataRouter.german
    receiver: DocumentWriter_German.documents

inputs: # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources

max_runs_per_component: 100

metadata: {}
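
To make the routing step in this index more concrete, here is a rough plain-Python sketch of what the `english` and `german` rules express. It is a simplification under stated assumptions: plain dicts stand in for documents, `RULES` and `route` are hypothetical names, only the `==` operator is handled, and collecting leftover documents under an `unmatched` key is an assumption made for the illustration rather than a statement about MetadataRouter's internals.

```python
from typing import Dict, List

# Rules mirroring the index: each route matches on the "language" metadata field.
RULES: Dict[str, Dict[str, str]] = {
    "english": {"field": "language", "value": "en"},
    "german": {"field": "language", "value": "de"},
}


def route(docs: List[dict], rules: Dict[str, Dict[str, str]]) -> Dict[str, List[dict]]:
    """Group documents by the rules their metadata satisfies (equality only)."""
    routed: Dict[str, List[dict]] = {name: [] for name in rules}
    routed["unmatched"] = []  # assumed bucket for documents that match no rule
    for doc in docs:
        matched = False
        for name, rule in rules.items():
            if doc["meta"].get(rule["field"]) == rule["value"]:
                routed[name].append(doc)
                matched = True
        if not matched:
            routed["unmatched"].append(doc)
    return routed


classified = [
    {"content": "A short note.", "meta": {"language": "en"}},
    {"content": "Ein kurzer Text.", "meta": {"language": "de"}},
    {"content": "Un texte court.", "meta": {"language": "unmatched"}},
]
routes = route(classified, RULES)
print({name: len(items) for name, items in routes.items()})
# {'english': 1, 'german': 1, 'unmatched': 1}
```

In the index itself, the `english` and `german` outputs feed the two DocumentWriter components, so each language ends up in its own document store.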

Parameters

Inputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| documents | List[Document] | | A list of documents for language classification. |

Outputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| documents | List[Document] | | A list of documents with an added language metadata field. |

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| languages | Optional[List[str]] | None | A list of ISO language codes you want to classify your documents by. For a list of supported languages, see the langdetect documentation. If not specified, defaults to ["en"]. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| documents | List[Document] | | A list of documents for language classification. |