DocumentLanguageClassifier

Classify documents by language and add the language to the document's metadata.

Key Features

  • Detects the language of each document and stores it as a metadata field.
  • Supports a configurable list of target languages using ISO 639-1 codes.
  • Documents whose detected language is not on the configured list get the language value unmatched.
  • Defaults to English (en) if no languages are specified.
  • Works well with MetadataRouter to route documents to different stores or processors by language.

Configuration

  1. Drag the DocumentLanguageClassifier component onto the canvas from the Component Library.
  2. Click on the component to open the configuration panel.
  3. Configure the component settings:
    • Enter the list of ISO 639-1 language codes you want to classify documents by, such as en for English or de for German. For a full list of supported languages, see the langdetect documentation.
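
As a rough sketch of how this setting might appear in the pipeline YAML, assuming you want to classify English and German documents (the choice of en and de is only an illustration taken from the examples in the step above):

```yaml
DocumentLanguageClassifier:
  type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
  init_parameters:
    languages:
      - en  # English (ISO 639-1)
      - de  # German (ISO 639-1)
```

If you leave languages empty, the component falls back to its default of English only.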

Connections

DocumentLanguageClassifier receives a list of documents as input, typically from a converter such as TextFileToDocument. It outputs a list of documents enriched with a language metadata field.

Connect its output to MetadataRouter to route documents to different downstream components or document stores based on the detected language. You can also connect it to any component that accepts documents as input.
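
Documents labeled unmatched do not satisfy a rule that checks for a specific language code. The fragment below is a minimal sketch of one way to keep them, assuming MetadataRouter exposes an unmatched output for documents that match none of its rules, as it does in Haystack. DocumentWriter_Other is a hypothetical component name that you would define yourself.

```yaml
connections:
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  # Assumption: MetadataRouter sends documents that match no rule to "unmatched".
  - sender: MetadataRouter.unmatched
    receiver: DocumentWriter_Other.documents  # hypothetical writer, not part of this page's example
```

Without a connection like this, documents in other languages are classified but are not written to any document store.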

Source Code

To check this component's source code, open document_language_classifier.py in the Haystack repository.

Usage Examples

Basic Configuration

DocumentLanguageClassifier:
  type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
  init_parameters: {}

Using the Component in an Index

In this index, DocumentLanguageClassifier sends classified documents to MetadataRouter. MetadataRouter then routes English documents to one document store and German documents to another.

# haystack-pipeline
components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
      languages:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        english:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: en
        german:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: de
  DocumentWriter_English:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_German:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters:
          bm25_tokenization_regex: (?u)\b\w\w+\b
          bm25_algorithm: BM25L
          bm25_parameters:
          embedding_similarity_function: dot_product
          index: ''
          async_executor:

connections: # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: DocumentLanguageClassifier.documents
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.english
    receiver: DocumentWriter_English.documents
  - sender: MetadataRouter.german
    receiver: DocumentWriter_German.documents

inputs: # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources

max_runs_per_component: 100

metadata: {}

Parameters

Inputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents for language classification. |

Outputs

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents with an added language metadata field. |
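
To make the output shape concrete, here is a simplified, illustrative view of two output documents written as YAML. It assumes the component is configured with languages en and de, and that the detector identifies the second document as French. The document contents are made up for this sketch and are not real output.

```yaml
# Illustrative only: simplified view of the classified documents
- content: "This is an English sentence."
  meta:
    language: en          # detected language is on the configured list
- content: "Ceci est une phrase en français."
  meta:
    language: unmatched   # detected language is not on the configured list
```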

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| languages | Optional[List[str]] | None | A list of ISO language codes you want to classify your documents by. For a list of supported languages, see the langdetect documentation. If not specified, defaults to ["en"]. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Description |
| --- | --- | --- |
| documents | List[Document] | A list of documents for language classification. |