DocumentLanguageClassifier
Classify documents by language and add the language to the document's metadata.
Basic Information
- Type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
- Components it can connect with:
  - TextFileToDocument: DocumentLanguageClassifier receives documents from TextFileToDocument.
  - MetadataRouter: DocumentLanguageClassifier sends classified documents to MetadataRouter, which routes them further down the pipeline based on their language.
  - Any component that outputs documents or accepts documents as input.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents for language classification. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with an added language metadata field. |
Overview
When you configure the component in your pipeline, pass a list of ISO language codes that you want DocumentLanguageClassifier to use for classifying your documents. Documents in a language that is not on the list are classified as unmatched. By default, DocumentLanguageClassifier recognizes only English (en) and marks all other documents as unmatched.
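The matching rule described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the component's actual code: `tag_language`, `toy_detect`, and the `detect` callback are hypothetical names, and the callback stands in for a real language detector such as langdetect.

```python
from typing import Callable, Dict, List, Optional


def tag_language(
    texts: List[str],
    detect: Callable[[str], str],
    languages: Optional[List[str]] = None,
) -> List[Dict[str, str]]:
    """Return one record per text with a 'language' field.

    Hypothetical helper: mirrors the documented rule that a detected
    language outside `languages` is reported as "unmatched".
    """
    allowed = languages or ["en"]  # documented default: English only
    tagged = []
    for text in texts:
        code = detect(text)
        language = code if code in allowed else "unmatched"
        tagged.append({"content": text, "language": language})
    return tagged


def toy_detect(text: str) -> str:
    """Toy detector for this demo only; it is not a real language model."""
    return "de" if "ist" in text.split() else "en"
```

Calling `tag_language(["This is English", "Das ist Deutsch"], toy_detect)` with the default list tags the German text as unmatched. Passing `["en", "de"]` as `languages` tags it as `de` instead.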
Usage Example
Initializing the Component
components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
Using the Component in an Index
In this index, DocumentLanguageClassifier sends classified documents to MetadataRouter. MetadataRouter then routes English documents to a different document store than the German documents.
components:
  DocumentLanguageClassifier:
    type: haystack.components.classifiers.document_language_classifier.DocumentLanguageClassifier
    init_parameters:
      languages:
  TextFileToDocument:
    type: haystack.components.converters.txt.TextFileToDocument
    init_parameters:
      encoding: utf-8
      store_full_path: false
  MetadataRouter:
    type: haystack.components.routers.metadata_router.MetadataRouter
    init_parameters:
      rules:
        english:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: en
        german:
          operator: OR
          conditions:
            - field: language
              operator: ==
              value: de
  DocumentWriter_English:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: ''
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  DocumentWriter_German:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      policy: NONE
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters:
          bm25_tokenization_regex: (?u)\b\w\w+\b
          bm25_algorithm: BM25L
          bm25_parameters:
          embedding_similarity_function: dot_product
          index: ''
          async_executor:
connections:  # Defines how the components are connected
  - sender: TextFileToDocument.documents
    receiver: DocumentLanguageClassifier.documents
  - sender: DocumentLanguageClassifier.documents
    receiver: MetadataRouter.documents
  - sender: MetadataRouter.english
    receiver: DocumentWriter_English.documents
  - sender: MetadataRouter.german
    receiver: DocumentWriter_German.documents
inputs:  # Define the inputs for your pipeline
  files:
    - TextFileToDocument.sources
max_runs_per_component: 100
metadata: {}
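To picture what the routing step in this index does with the language field, here is a minimal Python sketch. It is a simplification, not MetadataRouter's implementation: `route_by_language` is a hypothetical helper, documents are plain dictionaries, each rule is reduced to a single equality check on the language field, and the `unmatched` bucket for documents that satisfy no rule is an assumption about how leftovers are handled.

```python
from typing import Dict, List


def route_by_language(
    documents: List[dict], rules: Dict[str, str]
) -> Dict[str, List[dict]]:
    """Group documents by output name based on their 'language' metadata.

    `rules` maps an output name (for example "english") to the language
    code it accepts. Documents matching no rule land in "unmatched".
    """
    routed: Dict[str, List[dict]] = {name: [] for name in rules}
    routed["unmatched"] = []
    for doc in documents:
        language = doc.get("meta", {}).get("language")
        targets = [name for name, code in rules.items() if code == language]
        if targets:
            for name in targets:
                routed[name].append(doc)
        else:
            routed["unmatched"].append(doc)
    return routed


# Sample documents as the classifier might have tagged them.
docs = [
    {"content": "Hello", "meta": {"language": "en"}},
    {"content": "Hallo", "meta": {"language": "de"}},
    {"content": "Bonjour", "meta": {"language": "unmatched"}},
]
routed = route_by_language(docs, {"english": "en", "german": "de"})
```

In the index above, the `english` and `german` outputs correspond to the `MetadataRouter.english` and `MetadataRouter.german` connections, each feeding its own DocumentWriter.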
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | Optional[List[str]] | None | A list of ISO language codes you want to classify your documents by. For a list of supported languages, see the langdetect documentation. If not specified, defaults to ["en"]. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents for language classification. |