DocumentSplitter
Splits long documents into smaller chunks.
Basic Information
- Type: haystack.components.preprocessors.document_splitter.DocumentSplitter
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | Required | The documents to split. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | A list of documents with the split texts. Each document includes a source_id metadata field to track the original document, a page_number metadata field to track the original page number, and all other metadata copied from the original document. |
Overview
Bear with us while we're working on adding pipeline examples and the most common component connections.
Splits long documents into smaller chunks.
This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.
The DocumentSplitter is compatible with the following DocumentStores:
- Astra
- Chroma (limited support: overlapping information is not stored)
- Elasticsearch
- OpenSearch
- Pgvector
- Pinecone (limited support: overlapping information is not stored)
- Qdrant
- Weaviate
Usage Example
```yaml
components:
  DocumentSplitter:
    type: components.preprocessors.document_splitter.DocumentSplitter
    init_parameters:
```
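If you want to try the component outside Pipeline Builder, a minimal sketch using the open-source Haystack 2.x Python API looks like this (the document text is a placeholder):

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Default settings: split by word, 200 words per chunk, no overlap.
splitter = DocumentSplitter()

docs = [Document(content="A very long text that should be broken into smaller chunks ...")]
result = splitter.run(documents=docs)

# Each split keeps a reference to its original document and page.
for chunk in result["documents"]:
    print(chunk.meta["source_id"], chunk.meta["page_number"], chunk.content[:40])
```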
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | Literal['function', 'page', 'passage', 'period', 'word', 'line', 'sentence'] | word | The unit for splitting your documents. Choose from: - word for splitting by spaces (" ") - period for splitting by periods (".") - page for splitting by form feed ("\f") - passage for splitting by double line breaks ("\n\n") - line for splitting each line ("\n") - sentence for splitting by NLTK sentence tokenizer |
| split_length | int | 200 | The maximum number of units in each split. |
| split_overlap | int | 0 | The number of overlapping units for each split. |
| split_threshold | int | 0 | The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split. |
| splitting_function | Optional[Callable[[str], List[str]]] | None | Necessary when split_by is set to "function". This is a function which must accept a single str as input and return a list of str as output, representing the chunks after splitting. |
| respect_sentence_boundary | bool | False | Choose whether to respect sentence boundaries when splitting by "word". If True, uses NLTK to detect sentence boundaries, ensuring splits occur only between sentences. |
| language | Language | en | Choose the language for the NLTK tokenizer. The default is English ("en"). |
| use_split_rules | bool | True | Choose whether to use additional split rules when splitting by sentence. |
| extend_abbreviations | bool | True | Choose whether to extend NLTK's PunktTokenizer abbreviations with a list of curated abbreviations, if available. This is currently supported for English ("en") and German ("de"). |
| skip_empty_documents | bool | True | Choose whether to skip documents with empty content. Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text from non-textual documents. |
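As an illustration of how these options combine, the sketch below configures word-based splitting that keeps whole sentences together. It assumes a Haystack 2.x release that supports the sentence-boundary options listed above; warm_up() loads the NLTK tokenizer before the first run:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(
    split_by="word",
    split_length=50,
    split_overlap=5,
    respect_sentence_boundary=True,  # never cut a sentence in half
    language="en",
)
splitter.warm_up()  # loads the NLTK sentence tokenizer used for boundary detection

docs = [Document(content="First sentence. Second sentence. Third sentence.")]
chunks = splitter.run(documents=docs)["documents"]
for chunk in chunks:
    print(len(chunk.content.split()), repr(chunk.content))
```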
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | Required | The documents to split. |
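In the open-source Haystack Python API, the equivalent is to pass documents when the pipeline runs rather than when it is configured. The sketch below is illustrative only: the component names, the in-memory store (standing in for any of the document stores listed above), and the sample text are assumptions, not part of this page:

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("splitter", DocumentSplitter(split_by="passage", split_length=1))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("splitter.documents", "writer.documents")

# `documents` is supplied at run time, addressed by the splitter's component name.
indexing.run({"splitter": {"documents": [Document(content="First passage.\n\nSecond passage.")]}})
print(document_store.count_documents())
```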