InterleaveDocuments
InterleaveDocuments interleaves documents that come from multiple retriever nodes. It's particularly useful in document search pipelines that pre-filter documents for labeling. When you label data to prepare annotated datasets for your document search pipelines, you need a way of pre-filtering the documents that you'll label. By using a pipeline that employs diverse retrieval methods, along with InterleaveDocuments, you ensure there's a diverse selection of documents for labeling. You can set the interleaving mode that best suits your needs.
InterleaveDocuments seems similar to JoinDocuments, but the purposes of these two nodes differ. JoinDocuments aims to optimize the retrieval by combining the output of each retriever to achieve the best results, while InterleaveDocuments is designed explicitly for labeling.
Basic Information
- Pipeline type: Used in query pipelines.
- Nodes that can precede it in a pipeline: Retriever, Ranker
- Nodes that can follow it in a pipeline: Retriever, Ranker, Reader, PromptNode
- Input: Documents
- Output: Documents (a single list of interleaved documents)
- Available node classes: InterleaveDocuments
Usage Example
This is an example of a document search pipeline used for pre-filtering documents for labeling:
```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params:
      embedding_dim: 1024
      similarity: cosine
  - name: BM25Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20
  - name: EmbeddingRetriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: intfloat/multilingual-e5-large
      model_format: sentence_transformers
      top_k: 20
      scale_score: false
  - name: InterleaveDocuments
    type: InterleaveDocuments
    params:
      interleaving_mode: random
      top_k_join: 15
      score_mode: none
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_respect_sentence_boundary: true
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextConverter
    type: TextConverter
  - name: PDFConverter
    type: PDFToTextConverter

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever # sparse
        inputs: [Query]
      - name: EmbeddingRetriever # dense
        inputs: [Query]
      - name: InterleaveDocuments
        inputs:
          - BM25Retriever
          - EmbeddingRetriever
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
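To make the query pipeline concrete, here is a minimal Python sketch of what `interleaving_mode: random` with `top_k_join: 15` could do with the two retriever outputs. It is not the node's actual implementation. The function name `random_interleave`, the plain string document IDs, and the rule of drawing the next document from a randomly chosen input list are all assumptions made for illustration. Duplicate handling is not modeled.

```python
import random
from typing import List, Optional, Sequence


def random_interleave(
    result_lists: Sequence[Sequence[str]],
    top_k_join: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical sketch: merge ranked lists by repeatedly taking the next
    document from a randomly chosen input list that still has documents."""
    rng = random.Random(seed)
    # One cursor per input list, pointing at its next unused document.
    cursors = [0] * len(result_lists)
    interleaved: List[str] = []
    while top_k_join is None or len(interleaved) < top_k_join:
        # Only lists that still have documents left can be drawn from.
        open_lists = [
            i for i, docs in enumerate(result_lists) if cursors[i] < len(docs)
        ]
        if not open_lists:
            break
        source = rng.choice(open_lists)
        interleaved.append(result_lists[source][cursors[source]])
        cursors[source] += 1
    return interleaved


# Stand-ins for the 20 documents each retriever returns (top_k: 20).
bm25_docs = [f"bm25-{rank}" for rank in range(1, 21)]
embedding_docs = [f"emb-{rank}" for rank in range(1, 21)]

# Mirrors top_k_join: 15 from the example configuration.
for_labeling = random_interleave([bm25_docs, embedding_docs], top_k_join=15, seed=0)
print(len(for_labeling))  # 15
```

In this sketch, each retriever's own ranking order is preserved within the merged list. Only the choice of which retriever contributes the next document is random.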
Parameters
Here are the parameters you can pass to InterleaveDocuments in the pipeline YAML:
| Parameter | Type | Possible Values | Description |
|---|---|---|---|
| `interleaving_mode` | Literal | `team_draft`<br>`random`<br>`balanced`<br>`optimized`<br>`probabilistic`<br>`pairwise_preference`<br>Default: `team_draft` | The interleaving mode to use. You can choose one of the following modes:<br>- `team_draft`: Alternates between picking documents from each retriever in turn. For example, if you have RetrieverA and RetrieverB, this mode picks a document from RetrieverA, then a document from RetrieverB, then a document from RetrieverA, and so on.<br>- `random`: Randomly interleaves the documents the retrievers return.<br>- `balanced`: Works similarly to the `team_draft` mode with the difference that it ensures documents coming from each retriever have an equal chance of appearing in each position in the final list. This means that if a document from RetrieverA is chosen first for the top position, the next time (for a different query) a document from RetrieverB is chosen first for the top position. Can work with a maximum of two input lists.<br>- `optimized`: Interleaves documents so that the differences in the effectiveness of the retrievers are most evident, based on the Multileaved Comparisons for Fast Online Evaluation paper.<br>- `probabilistic`: Uses statistical probability when choosing a document from a particular retriever.<br>- `pairwise_preference`: Interleaves pairs of documents in a single list. Each pair consists of documents, each from a different retriever, that are similar or closely ranked.<br>Required. |
| `top_k_join` | Integer | Default: `None` | The maximum number of documents to return in an interleaved list. If set to `None`, returns all documents.<br>Optional. |
| `score_mode` | Literal | `none`<br>`keep`<br>`mock`<br>Default: `none` | Sets the score of the documents. Possible modes:<br>- `none`: Sets the score to None.<br>- `keep`: Keeps the original score the document got from the retriever.<br>- `mock`: Sets the score to 1.<br>Required. |
| `shortest_cap` | Boolean | `True`<br>`False`<br>Default: `False` | It's useful when the lists of documents coming from different retrievers significantly differ in length. When set to `True`, it only interleaves as many items as the length of the shortest input list. This way, you can avoid a situation where one of the retrievers returns a longer list of documents, and then the interleaved list contains mostly documents from this retriever.<br>If the shortest list is empty, you get no results.<br>Required. |
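The following Python sketch illustrates how the alternation described for `team_draft` might combine with `shortest_cap`, `score_mode`, and `top_k_join`. It follows only the plain turn-taking described in the table and is not the node's actual code. The `Doc` class, the `alternate` and `apply_score_mode` functions, and the choice to apply `top_k_join` after interleaving are hypothetical names and assumptions introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    score: Optional[float] = None


def apply_score_mode(doc: Doc, score_mode: str) -> Doc:
    """Return a copy of the document with its score set per score_mode."""
    if score_mode == "none":
        return replace(doc, score=None)
    if score_mode == "mock":
        return replace(doc, score=1.0)
    if score_mode == "keep":
        return doc
    raise ValueError(f"Unknown score_mode: {score_mode}")


def alternate(
    result_lists: Sequence[Sequence[Doc]],
    top_k_join: Optional[int] = None,
    score_mode: str = "none",
    shortest_cap: bool = False,
) -> List[Doc]:
    """Sketch of turn-taking: one document from each retriever per round."""
    lists = [list(docs) for docs in result_lists]
    if shortest_cap:
        # Trim every list to the length of the shortest one (assumed behavior).
        cap = min((len(docs) for docs in lists), default=0)
        lists = [docs[:cap] for docs in lists]
    rounds = max((len(docs) for docs in lists), default=0)
    interleaved: List[Doc] = []
    for rank in range(rounds):
        for docs in lists:
            # Exhausted lists simply skip their turn.
            if rank < len(docs):
                interleaved.append(apply_score_mode(docs[rank], score_mode))
    if top_k_join is not None:
        interleaved = interleaved[:top_k_join]
    return interleaved


retriever_a = [Doc("a1", 0.9), Doc("a2", 0.8), Doc("a3", 0.7), Doc("a4", 0.6)]
retriever_b = [Doc("b1", 0.5)]

print([d.id for d in alternate([retriever_a, retriever_b])])
# ['a1', 'b1', 'a2', 'a3', 'a4']
print([d.id for d in alternate([retriever_a, retriever_b], shortest_cap=True)])
# ['a1', 'b1']
```

Without `shortest_cap`, the longer list from RetrieverA dominates the result. With it, both retrievers contribute equally, at the cost of returning fewer documents.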