DocumentLanguageClassifier
Classify documents by language and add the language to the document's metadata.
Key Features
- Detects the language of each document and stores it as a metadata field.
- Supports a configurable list of target languages using ISO 639-1 codes.
- Documents in languages not on the configured list are classified as unmatched.
- Defaults to English (en) if no languages are specified.
- Works well with MetadataRouter to route documents to different stores or processors by language.
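The labeling rule described above can be sketched in plain Python. This is a minimal illustration, not the component's implementation: `Doc`, `toy_detect`, and `classify` are hypothetical names, and the keyword-based detector is only a stand-in for langdetect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text content plus metadata."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def toy_detect(text: str) -> Optional[str]:
    """Toy keyword detector for illustration only (assumption, not langdetect)."""
    words = set(text.lower().split())
    if words & {"the", "is", "and"}:
        return "en"
    if words & {"der", "ist", "und"}:
        return "de"
    if words & {"le", "est", "et"}:
        return "fr"
    return None


def classify(
    docs: List[Doc],
    languages: Optional[List[str]] = None,
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> List[Doc]:
    """Store the detected language in meta["language"], or "unmatched"."""
    targets = languages or ["en"]  # fall back to English when no list is given
    for doc in docs:
        detected = detect(doc.content)
        doc.meta["language"] = detected if detected in targets else "unmatched"
    return docs


docs = classify(
    [Doc("The weather is nice"), Doc("Das Wetter ist schön"), Doc("Le temps est beau")],
    languages=["en", "de"],
)
labels = [d.meta["language"] for d in docs]  # ["en", "de", "unmatched"]
```

The French document is labeled unmatched because fr is not in the configured list. With no list at all, only English documents would receive a language code.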
Configuration
- Drag the DocumentLanguageClassifier component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Enter the list of ISO 639-1 language codes you want to classify documents by, such as en for English or de for German. For a full list of supported languages, see the langdetect documentation.
Connections
DocumentLanguageClassifier receives a list of documents as input, typically from a converter such as TextFileToDocument. It outputs a list of documents enriched with a language metadata field.
Connect its output to MetadataRouter to route documents to different downstream components or document stores based on the detected language. You can also connect it to any component that accepts documents as input.
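To make the routing idea concrete, here is a simplified sketch of splitting documents by their language metadata field. It is not MetadataRouter's implementation: `route_by_language` is a hypothetical helper, documents are modeled as plain dictionaries, and the extra "unmatched" bucket for documents that fit no route is an assumption of this sketch.

```python
from typing import Dict, List


def route_by_language(docs: List[dict], routes: Dict[str, str]) -> Dict[str, List[dict]]:
    """Group documents by meta["language"].

    `routes` maps an output name (for example "english") to a language code
    (for example "en"). Documents that match no route go to "unmatched".
    """
    outputs: Dict[str, List[dict]] = {name: [] for name in routes}
    outputs["unmatched"] = []
    for doc in docs:
        language = doc["meta"].get("language")
        for name, code in routes.items():
            if language == code:
                outputs[name].append(doc)
                break
        else:
            # No route matched this document's language.
            outputs["unmatched"].append(doc)
    return outputs


docs = [
    {"content": "The weather is nice", "meta": {"language": "en"}},
    {"content": "Das Wetter ist schön", "meta": {"language": "de"}},
    {"content": "Le temps est beau", "meta": {"language": "unmatched"}},
]
routed = route_by_language(docs, {"english": "en", "german": "de"})
```

Each key of `routed` plays the role of one router output, which you would connect to its own downstream component such as a writer.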
Source Code
To check this component's source code, open document_language_classifier.py in the Haystack repository.
Usage Examples
Basic Configuration
DocumentLanguageClassifier:
  type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
  init_parameters: {}
Using the Component in an Index
In this index, DocumentLanguageClassifier sends classified documents to MetadataRouter. MetadataRouter then routes English documents to an OpenSearch document store and German documents to an in-memory document store.
# haystack-pipeline
components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
      languages:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        english:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: en
        german:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: de
  DocumentWriter_English:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_German:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters:
          bm25_tokenization_regex: (?u)\b\w\w+\b
          bm25_algorithm: BM25L
          bm25_parameters:
          embedding_similarity_function: dot_product
          index: ''
          async_executor:
connections: # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: DocumentLanguageClassifier.documents
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.english
    receiver: DocumentWriter_English.documents
  - sender: MetadataRouter.german
    receiver: DocumentWriter_German.documents
inputs: # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of documents for language classification. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of documents with an added language metadata field. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | Optional[List[str]] | None | A list of ISO 639-1 language codes to classify your documents by. For a list of supported languages, see the langdetect documentation. If not specified, defaults to ["en"]. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of documents for language classification. |