DocumentSplitter Parameters

Customize how DocumentSplitter divides long text documents into shorter ones.

YAML Init Parameters

These are the parameters you can pass to this component in the pipeline YAML configuration:

Parameter

Type

Possible values

Description

split_by

Literal

word
sentence
page
passage
Default: word

The unit by which the document should be split. Choose from word (splitting by " "), sentence (splitting by "."), page (splitting by "\f"), or passage (splitting by "\n\n").
Required.

split_length

Integer

Default: 200

The maximum number of units in each split. For example, if you set split_by: word and split_length: 20, each document will be no longer than 20 words.
Required.

split_overlap

Integer

Default: 0

The number of units that consecutive splits share. For example, if you set split_overlap: 3 and split_by: word, each document will share three words with the previous document.
Required.

split_threshold

Integer

Default: 0

The minimum number of units a split must have. A split with fewer units than the threshold is attached to the previous split.
Required.
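
To make the interaction between split_length, split_overlap, and split_threshold easier to picture, here is a minimal Python sketch of the behavior described in the table above. It is an illustration only, not the component's actual implementation: the function name split_text and the SEPARATORS mapping are hypothetical, and the real component may handle separators and edge cases differently.

```python
from typing import List

# Separators as listed in the split_by description above.
SEPARATORS = {"word": " ", "sentence": ".", "page": "\f", "passage": "\n\n"}


def split_text(
    text: str,
    split_by: str = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
) -> List[str]:
    """Hypothetical helper that mimics the documented parameter semantics."""
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in the range [0, split_length)")

    sep = SEPARATORS[split_by]
    units = text.split(sep)
    step = split_length - split_overlap  # how far each window advances

    chunks: List[List[str]] = []
    start = 0
    while start < len(units):
        window = units[start:start + split_length]
        if chunks and len(window) < split_threshold:
            # Too short: attach it to the previous split, skipping the
            # units the previous split already contains via the overlap.
            chunks[-1].extend(window[split_overlap:])
        else:
            chunks.append(window)
        if start + split_length >= len(units):
            break  # this window already reached the end of the text
        start += step

    return [sep.join(chunk) for chunk in chunks]


words = "a b c d e f g h i j"
print(split_text(words, split_length=4, split_overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
print(split_text(words, split_length=4, split_threshold=3))
# ['a b c d', 'e f g h i j']
```

In the first call, each split repeats the last word of the previous one because split_overlap is 1. In the second call, the trailing two-word split falls below the threshold of 3, so it is attached to the previous split.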


REST API Runtime Parameters

There are no runtime parameters you can pass to this component when making a request to the Search REST API endpoint.