InterleaveDocuments

InterleaveDocuments interleaves documents that come from multiple retriever nodes. It's particularly useful in document search pipelines that pre-filter documents for labeling. When you label data to prepare annotated datasets for your document search pipelines, you need a way of pre-filtering the documents that you'll label. By using a pipeline that employs diverse retrieval methods, along with InterleaveDocuments, you ensure there's a diverse selection of documents for labeling. You can set the interleaving mode that best suits your needs.

InterleaveDocuments seems similar to JoinDocuments, but the purposes of these two nodes differ. JoinDocuments aims to optimize the retrieval by combining the output of each retriever to achieve the best results, while InterleaveDocuments is designed explicitly for labeling.

Basic Information

  • Pipeline type: Used in query pipelines.
  • Nodes that can precede it in a pipeline: Retriever, Ranker
  • Nodes that can follow it in a pipeline: Retriever, Ranker, Reader, PromptNode
  • Input: Documents
  • Output: Documents (a single list of interleaved documents)
  • Available node classes: InterleaveDocuments

Usage Example

This is an example of a document search pipeline used for pre-filtering documents for labeling:


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
    params:
      embedding_dim: 1024
      similarity: cosine

  - name: BM25Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20
  
  - name: EmbeddingRetriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: intfloat/multilingual-e5-large
      model_format: sentence_transformers
      top_k: 20
      scale_score: false

  - name: InterleaveDocuments
    type: InterleaveDocuments
    params:
      interleaving_mode: random
      top_k_join: 15
      score_mode: none
  
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 200
      split_overlap: 0
      split_respect_sentence_boundary: true

  - name: FileTypeClassifier
    type: FileTypeClassifier

  - name: TextConverter
    type: TextConverter

  - name: PDFConverter
    type: PDFToTextConverter

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever # sparse
        inputs: [Query]
      - name: EmbeddingRetriever # dense
        inputs: [Query]
      - name: InterleaveDocuments
        inputs:
          - BM25Retriever
          - EmbeddingRetriever

  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
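
To make the InterleaveDocuments settings in this example concrete, here is a minimal Python sketch of what interleaving_mode: random, top_k_join: 15, and score_mode: none could amount to for two lists of 20 documents. It is not the node's actual implementation. The Doc class, the interleave_random function, the seed argument, and the choice to skip documents that both retrievers return are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    score: Optional[float] = None


def interleave_random(
    lists: List[List[Doc]],
    top_k_join: Optional[int] = None,
    score_mode: str = "none",
    seed: Optional[int] = None,
) -> List[Doc]:
    """Illustrative sketch only: shuffle the pooled documents, then cap and rescore."""
    rng = random.Random(seed)

    # Pool the documents from every retriever. Skipping ids that were
    # already seen is an assumption; the source does not describe deduplication.
    seen = set()
    pool: List[Doc] = []
    for docs in lists:
        for doc in docs:
            if doc.id not in seen:
                seen.add(doc.id)
                pool.append(doc)

    rng.shuffle(pool)

    # top_k_join caps the length of the final list; None returns everything.
    if top_k_join is not None:
        pool = pool[:top_k_join]

    # score_mode follows the three documented options.
    if score_mode == "none":
        return [replace(doc, score=None) for doc in pool]
    if score_mode == "mock":
        return [replace(doc, score=1.0) for doc in pool]
    if score_mode == "keep":
        return pool
    raise ValueError(f"Unknown score_mode: {score_mode}")


# Mirrors the example: two retrievers with top_k: 20 each, joined down to 15.
bm25_docs = [Doc(f"bm25-{i}", 0.5) for i in range(20)]
dense_docs = [Doc(f"dense-{i}", 0.9) for i in range(20)]
result = interleave_random([bm25_docs, dense_docs], top_k_join=15, score_mode="none", seed=0)
print(len(result))
```

With these settings, annotators see at most 15 documents per query, drawn from both retrievers, with no retriever score attached to them.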

Parameters

Here are the parameters you can pass to InterleaveDocuments in the pipeline YAML:

Each parameter is listed with its type, possible values, and description.

interleaving_mode
  • Type: Literal
  • Possible values: team_draft, random, balanced, optimized, probabilistic, pairwise_preference
  • Default: team_draft
  • Required.
  • Description: The interleaving mode to use. You can choose one of the following modes:
    - team_draft: Alternates between picking documents from each retriever in turn. For example, if you have RetrieverA and RetrieverB, this mode picks a document from RetrieverA, then a document from RetrieverB, then a document from RetrieverA, and so on.
    - random: Randomly interleaves the documents the retrievers return.
    - balanced: Works similarly to the team_draft mode with the difference that it ensures documents coming from each retriever have an equal chance of appearing in each position in the final list. This means that if a document from RetrieverA is chosen first for the top position, the next time (for a different query) a document from RetrieverB is chosen first for the top position. Can work with a maximum of two input lists.
    - optimized: Interleaves documents so that the differences in the effectiveness of the retrievers are most evident, based on the Multileaved Comparisons for Fast Online Evaluation paper.
    - probabilistic: Uses statistical probability when choosing a document from a particular retriever.
    - pairwise_preference: Interleaves pairs of documents in a single list. Each pair consists of documents, each from a different retriever, that are similar or closely ranked.

top_k_join
  • Type: Integer
  • Default: None
  • Optional.
  • Description: The maximum number of documents to return in an interleaved list. If set to None, returns all documents.

score_mode
  • Type: Literal
  • Possible values: none, keep, mock
  • Default: none
  • Required.
  • Description: Sets the score of the documents. Possible modes:
    - none: Sets the score to None.
    - keep: Keeps the original score the document got from the retriever.
    - mock: Sets the score to 1.

shortest_cap
  • Type: Boolean
  • Possible values: True, False
  • Default: False
  • Required.
  • Description: It's useful when the lists of documents coming from different retrievers significantly differ in length. When set to True, it only interleaves as many items as the length of the shortest input list. This way, you can avoid a situation where one of the retrievers returns a longer list of documents, and then the interleaved list contains mostly documents from this retriever. If the shortest list is empty, you get no results.
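
The interaction between strict alternation, shortest_cap, and top_k_join is easier to see on a small example. The Python sketch below is an illustration, not the node's implementation. The function name interleave_alternating is hypothetical, documents are represented by plain strings, and it reads shortest_cap as "take only as many documents from each list as the shortest list contains", which is one possible interpretation of the description. It performs the simple turn-taking described for team_draft and does not deduplicate.

```python
from itertools import zip_longest
from typing import List, Optional


def interleave_alternating(
    lists: List[List[str]],
    top_k_join: Optional[int] = None,
    shortest_cap: bool = False,
) -> List[str]:
    """Illustrative sketch: take one document from each list in turn."""
    if shortest_cap:
        # Assumed reading: truncate every list to the length of the shortest one.
        cap = min((len(docs) for docs in lists), default=0)
        lists = [docs[:cap] for docs in lists]

    missing = object()  # marks positions where a shorter list has run out
    merged: List[str] = []
    for round_of_picks in zip_longest(*lists, fillvalue=missing):
        for doc in round_of_picks:
            if doc is not missing:
                merged.append(doc)

    # top_k_join caps the final list; None returns all documents.
    return merged if top_k_join is None else merged[:top_k_join]


retriever_a = ["a1", "a2", "a3", "a4", "a5"]
retriever_b = ["b1", "b2"]

# Without the cap, the tail of the result comes only from the longer list.
print(interleave_alternating([retriever_a, retriever_b]))
# With the cap, both retrievers contribute the same number of documents.
print(interleave_alternating([retriever_a, retriever_b], shortest_cap=True))
# An empty input list combined with the cap yields no results.
print(interleave_alternating([retriever_a, []], shortest_cap=True))
```

In the uncapped call, the last three documents all come from retriever_a, which is the imbalance shortest_cap is meant to prevent.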