DocumentLanguageClassifier

Classify documents by language and add the language to the document's metadata.

Basic Information

  • Type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
  • Components it can connect with:
    • TextFileToDocument: DocumentLanguageClassifier receives documents from TextFileToDocument.
    • MetadataRouter: DocumentLanguageClassifier sends classified documents to MetadataRouter that routes them further down the pipeline based on their language.
    • Any component that outputs documents or accepts documents as input

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents for language classification. |

Outputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with an added language metadata field. |

Overview

When you configure the component in your pipeline, pass the list of ISO language codes that you want DocumentLanguageClassifier to detect. The component writes the detected code to each document's language metadata field. Documents in a language that is not on the list get the value unmatched instead. If you don't pass a list, DocumentLanguageClassifier uses English (en) only, so every non-English document is marked as unmatched.
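
The labeling rule described above can be sketched in plain Python. This is an illustration only, not the component's actual implementation: `label_language` and `toy_detect` are hypothetical names, and the keyword lookup is a stand-in for a real language-detection library.

```python
from typing import Callable, List, Optional


def toy_detect(text: str) -> Optional[str]:
    """Hypothetical stand-in for a real language detector (keyword lookup only)."""
    words = set(text.lower().split())
    if words & {"the", "is", "and"}:
        return "en"
    if words & {"der", "die", "und", "ist"}:
        return "de"
    return None


def label_language(
    text: str,
    languages: Optional[List[str]] = None,
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> str:
    """Return the detected ISO code if it is on the configured list, else 'unmatched'."""
    allowed = languages or ["en"]  # mirrors the documented default of ["en"]
    detected = detect(text)
    return detected if detected in allowed else "unmatched"


print(label_language("The weather is nice"))                           # en
print(label_language("Das Wetter ist schön"))                          # unmatched
print(label_language("Das Wetter ist schön", languages=["en", "de"]))  # de
```

The last two calls show why the list matters: the same German text is only labeled de once de is part of the configured languages.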

Usage Example

Initializing the Component

components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:

Using the Component in an Index

In this index, DocumentLanguageClassifier sends classified documents to MetadataRouter. MetadataRouter then routes English documents to a different document store than the German documents.

components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
      languages: # List every language that a MetadataRouter rule matches on
        - en
        - de
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        english:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: en
        german:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: de
  DocumentWriter_English:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_German:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters:
          bm25_tokenization_regex: (?u)\b\w\w+\b
          bm25_algorithm: BM25L
          bm25_parameters:
          embedding_similarity_function: dot_product
          index: ''
          async_executor:

connections: # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: DocumentLanguageClassifier.documents
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.english
    receiver: DocumentWriter_English.documents
  - sender: MetadataRouter.german
    receiver: DocumentWriter_German.documents

inputs: # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources

max_runs_per_component: 100

metadata: {}
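
To make the routing in this index concrete, here is a minimal plain-Python sketch of what the english and german rules do to already classified documents. It does not call the MetadataRouter API: `route_by_language` is a hypothetical helper, documents are modeled as plain dictionaries, and the unmatched bucket is an assumption about where documents end up when no rule applies.

```python
from typing import Any, Dict, List

# Simplified stand-in for a real Document: a dict with "content" and "meta".
Doc = Dict[str, Any]

# Rule name -> language code, mirroring the english/german rules in the index.
RULES: Dict[str, str] = {"english": "en", "german": "de"}


def route_by_language(documents: List[Doc], rules: Dict[str, str] = RULES) -> Dict[str, List[Doc]]:
    """Group documents by the rule whose language code equals meta['language']."""
    routed: Dict[str, List[Doc]] = {name: [] for name in rules}
    routed["unmatched"] = []  # assumed bucket for documents that match no rule
    for doc in documents:
        language = doc.get("meta", {}).get("language")
        targets = [name for name, code in rules.items() if language == code]
        for name in targets or ["unmatched"]:
            routed[name].append(doc)
    return routed


docs = [
    {"content": "Hello world", "meta": {"language": "en"}},
    {"content": "Hallo Welt", "meta": {"language": "de"}},
    {"content": "Bonjour le monde", "meta": {"language": "unmatched"}},
]
routed = route_by_language(docs)
print({name: len(group) for name, group in routed.items()})
```

In the index, each named group corresponds to one MetadataRouter output, which is then connected to its own DocumentWriter.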

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | Optional[List[str]] | None | A list of ISO language codes you want to classify your documents by. For a list of supported languages, see the langdetect documentation. If not specified, defaults to ["en"]. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents for language classification. |