
DocumentLengthRouter

Categorize documents based on the length of the content field and route them to the appropriate output.

Basic Information

  • Type: haystack.components.routers.DocumentLengthRouter
  • Components it can connect with:
    • DocumentSplitter: DocumentLengthRouter can receive documents from DocumentSplitter.
    • LLMDocumentContentExtractor: DocumentLengthRouter can send short documents to LLMDocumentContentExtractor for content extraction.
    • SentenceTransformersDocumentImageEmbedder: DocumentLengthRouter can send short documents to SentenceTransformersDocumentImageEmbedder for image embedding.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | A list of documents to be categorized based on their content length. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| short_documents | List[Document] | | A list of documents where content is None or the length of content is less than or equal to the threshold. |
| long_documents | List[Document] | | A list of documents where the length of content is greater than the threshold. |

Overview

Work in Progress

Bear with us while we add pipeline examples and the most common component connections.

DocumentLengthRouter categorizes documents based on the length of the content field and routes them to the appropriate output. A common use case for DocumentLengthRouter is handling documents obtained from PDFs that contain non-text content, such as scanned pages or images.

This component can detect empty or low-content documents and route them to components that perform OCR, generate captions, or compute image embeddings. Documents where content is None or whose character count is less than or equal to the threshold will be routed to the short_documents output. Otherwise, they are routed to the long_documents output.

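To route only documents with None content to short_documents, set the threshold to a negative number. No string has a negative length, so every document whose content is not None, including an empty string, is then routed to long_documents.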
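The routing rule described above can be sketched in plain Python. This is a minimal illustration, not the component's actual source: `Doc` is a hypothetical stand-in for a Haystack Document, and `route_by_length` is a hypothetical helper name. It shows how the threshold comparison works, including the negative-threshold case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document; only `content` matters here."""

    content: Optional[str] = None


def route_by_length(documents: List[Doc], threshold: int = 10) -> Dict[str, List[Doc]]:
    """Split documents into short and long lists using the rule described above."""
    short_documents: List[Doc] = []
    long_documents: List[Doc] = []
    for doc in documents:
        # None content always counts as short; otherwise compare the character count.
        if doc.content is None or len(doc.content) <= threshold:
            short_documents.append(doc)
        else:
            long_documents.append(doc)
    return {"short_documents": short_documents, "long_documents": long_documents}


docs = [Doc(None), Doc(""), Doc("0123456789"), Doc("A longer piece of text")]

# threshold=10: the None document, the empty string, and the 10-character
# string are all short, because the comparison is "less than or equal to".
default_result = route_by_length(docs, threshold=10)

# threshold=-1: only the None document is short, since len("") is 0, which is > -1.
negative_result = route_by_length(docs, threshold=-1)
```

Note that a document whose content length equals the threshold lands in short_documents, which is easy to overlook when choosing a value.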

Usage Example

Initializing the Component

components:
  DocumentLengthRouter:
    type: haystack.components.routers.document_length_router.DocumentLengthRouter
    init_parameters:
      threshold: 10

Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | int | 10 | The threshold for the number of characters in the document content field. Documents where content is None or whose character count is less than or equal to the threshold are routed to the short_documents output. Otherwise, they are routed to the long_documents output. To route only documents with None content to short_documents, set the threshold to a negative number. |

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | A list of documents to be categorized based on their content length. |