RecursiveDocumentSplitter
Recursively chunk text into smaller chunks.
Basic Information
- Type:
haystack.components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents with smaller chunks of text corresponding to the input documents, returned under the "documents" key of the output dictionary. |
Overview
Bear with us while we work on adding pipeline examples and the most common component connections.
This component splits text into smaller chunks by recursively applying a list of separators. The separators are applied in the order they are provided, from the most general to the most specific, with the last separator being the most specific one. After applying a separator, the component checks each resulting chunk: chunks that fit within split_length are kept, and chunks that are still too long are split again with the next separator in the list. This continues until all chunks are smaller than split_length.
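To make the recursion concrete, here is a deliberately simplified sketch of the idea. It is not the component's actual implementation: the real splitter treats separators as regular expressions, keeps them attached to the chunks, and measures length in the configured split_unit, while this sketch counts characters and drops the separators.

```python
def recursive_split(text: str, separators: list[str], split_length: int) -> list[str]:
    """Simplified illustration of recursive splitting (character-based)."""
    if len(text) <= split_length or not separators:
        return [text]
    first, *remaining = separators
    chunks: list[str] = []
    for part in text.split(first):
        if len(part) <= split_length:
            chunks.append(part)
        else:
            # Still too long: retry with the next, more specific separator.
            chunks.extend(recursive_split(part, remaining, split_length))
    return chunks
```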
Example:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

chunker = RecursiveDocumentSplitter(split_length=260, split_overlap=0, separators=["\n\n", "\n", ".", " "])
text = ('''Artificial intelligence (AI) - Introduction

AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.
AI technology is widely used throughout industry, government, and science. Some high-profile applications include advanced web search engines; recommendation systems; interacting via human speech; autonomous vehicles; generative and creative tools; and superhuman play and analysis in strategy games.''')

chunker.warm_up()
doc = Document(content=text)
doc_chunks = chunker.run([doc])
print(doc_chunks["documents"])
```

Output:

```
[
Document(id=..., content: 'Artificial intelligence (AI) - Introduction\n\n', meta: {'original_id': '...', 'split_id': 0, 'split_idx_start': 0, '_split_overlap': []}),
Document(id=..., content: 'AI, in its broadest sense, is intelligence exhibited by machines, particularly computer systems.\n', meta: {'original_id': '...', 'split_id': 1, 'split_idx_start': 45, '_split_overlap': []}),
Document(id=..., content: 'AI technology is widely used throughout industry, government, and science.', meta: {'original_id': '...', 'split_id': 2, 'split_idx_start': 142, '_split_overlap': []}),
Document(id=..., content: ' Some high-profile applications include advanced web search engines; recommendation systems; interac...', meta: {'original_id': '...', 'split_id': 3, 'split_idx_start': 216, '_split_overlap': []})
]
```
Usage Example

```yaml
components:
  RecursiveDocumentSplitter:
    type: components.preprocessors.recursive_splitter.RecursiveDocumentSplitter
    init_parameters:
```
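As a rough sketch of how the splitter might sit in an indexing pipeline: the wiring below is illustrative, but all classes used are part of Haystack.

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("splitter", RecursiveDocumentSplitter(split_length=200, split_overlap=0))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("splitter.documents", "writer.documents")

# Split the document into chunks and write them to the store.
indexing.run({"splitter": {"documents": [Document(content="A long text to index...")]}})
print(document_store.count_documents())
```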
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_length | int | 200 | The maximum length of each chunk. By default this is measured in words, but it can also be in characters or tokens. See the split_unit parameter. |
| split_overlap | int | 0 | The number of characters to overlap between consecutive chunks. |
| split_unit | Literal['word', 'char', 'token'] | word | The unit of the split_length parameter. It can be either "word", "char", or "token". If "token" is selected, the text will be split into tokens using the tiktoken tokenizer (o200k_base). |
| separators | Optional[List[str]] | None | An optional list of separator strings to use for splitting the text. The string separators are treated as regular expressions, unless the separator is "sentence", in which case the text is split into sentences using a custom sentence tokenizer based on NLTK. See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter. If no separators are provided, the default separators ["\n\n", "sentence", "\n", " "] are used. |
| sentence_splitter_params | Optional[Dict[str, Any]] | None | Optional parameters to pass to the sentence tokenizer. See: haystack.components.preprocessors.sentence_tokenizer.SentenceSplitter for more information. |
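For example, to split on sentence boundaries before falling back to finer-grained separators, you might configure the component like this. The parameter values are illustrative, not recommendations:

```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(
    split_length=150,                             # measured in words, the default split_unit
    split_overlap=20,
    split_unit="word",
    separators=["\n\n", "sentence", "\n", " "],   # "sentence" uses the NLTK-based sentence tokenizer
    sentence_splitter_params={"language": "en"},  # illustrative; see SentenceSplitter for all options
)
splitter.warm_up()  # loads the sentence tokenizer before the first run()
```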
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | List of Documents to split. |
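A minimal run() call might look like the sketch below; each returned chunk records its position in the source document in its metadata. The split_length value here is illustrative.

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(split_length=15, split_unit="word")
splitter.warm_up()

result = splitter.run(documents=[Document(content="First paragraph.\n\nSecond paragraph.")])
for chunk in result["documents"]:
    # split_id is the chunk's index; split_idx_start is its character offset in the source.
    print(chunk.meta["split_id"], chunk.meta["split_idx_start"], repr(chunk.content))
```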