DocumentLengthRouter
Categorize documents based on the length of the content field and route them to the appropriate output.
Basic Information
- Type: haystack.components.routers.DocumentLengthRouter
- Components it can connect with:
  - DocumentSplitter: DocumentLengthRouter can receive documents from DocumentSplitter.
  - LLMDocumentContentExtractor: DocumentLengthRouter can send short documents to LLMDocumentContentExtractor for content extraction.
  - SentenceTransformersDocumentImageEmbedder: DocumentLengthRouter can send short documents to SentenceTransformersDocumentImageEmbedder for image embedding.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents to be categorized based on their content length. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| short_documents | List[Document] | | A list of documents where content is None or the length of content is less than or equal to the threshold. |
| long_documents | List[Document] | | A list of documents where the length of content is greater than the threshold. |
Overview
DocumentLengthRouter categorizes documents based on the length of the content field and routes them to the appropriate output. A common use case for DocumentLengthRouter is handling documents obtained from PDFs that contain non-text content, such as scanned pages or images.
This component can detect empty or low-content documents and route them to components that perform OCR, generate captions, or compute image embeddings. Documents where content is None or whose character count is less than or equal to the threshold will be routed to the short_documents output. Otherwise, they are routed to the long_documents output.
To route only documents with None content to short_documents, set the threshold to a negative number.
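The routing rule described above can be sketched in plain Python. This is an illustration only, not the component's actual implementation: `Doc` is a hypothetical stand-in for a Haystack `Document`, and `route_by_length` is a hypothetical helper that mirrors the documented behavior, including the negative-threshold case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document; only `content` matters here."""
    content: Optional[str] = None


def route_by_length(documents: List[Doc], threshold: int = 10) -> Dict[str, List[Doc]]:
    """Split documents into short and long lists using the documented rule."""
    short_documents: List[Doc] = []
    long_documents: List[Doc] = []
    for doc in documents:
        # None content, or a character count <= threshold, counts as "short".
        if doc.content is None or len(doc.content) <= threshold:
            short_documents.append(doc)
        else:
            long_documents.append(doc)
    return {"short_documents": short_documents, "long_documents": long_documents}


docs = [Doc(None), Doc(""), Doc("0123456789"), Doc("a longer piece of text")]

# Default threshold: None, empty, and exactly-10-character content are all "short".
default_split = route_by_length(docs, threshold=10)

# Negative threshold: only None content is "short", because len("") == 0
# is already greater than -1.
none_only_split = route_by_length(docs, threshold=-1)

print(len(default_split["short_documents"]), len(default_split["long_documents"]))
print(len(none_only_split["short_documents"]), len(none_only_split["long_documents"]))
```

Note that a document whose content length equals the threshold exactly is treated as short, and an empty string is routed to long_documents once the threshold is negative.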
Usage Example
Initializing the Component
components:
  DocumentLengthRouter:
    type: haystack.components.routers.document_length_router.DocumentLengthRouter
    init_parameters:
      threshold: 10
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | int | 10 | The threshold for the number of characters in the document content field. Documents where content is None or whose character count is less than or equal to the threshold are routed to the short_documents output. Otherwise, they are routed to the long_documents output. To route only documents with None content to short_documents, set the threshold to a negative number. |
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents to be categorized based on their content length. |