JoinDocuments

This node joins the documents fetched by different nodes back together. Use it to merge separate pipeline branches, for example the output of two different retrievers.

JoinDocuments is useful when you want to use both a keyword-based and a vector-based retriever in the same pipeline. It combines the output of the two retrievers into a single list of documents, so the nodes that follow can benefit from both retrieval methods.

JoinDocuments takes your documents as input and returns the combined documents as output. You can choose the way the documents are joined.
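
As a rough illustration of what joining means here, the sketch below combines two result lists in plain Python. The `Doc` class and the `concatenate` function are hypothetical stand-ins, not the node's actual implementation, and the rule of keeping the highest-scored copy of a duplicate document is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    id: str
    content: str
    score: Optional[float] = None


def score_or_min(doc: Doc) -> float:
    # Documents without a score are treated as the lowest possible score.
    return doc.score if doc.score is not None else float("-inf")


def concatenate(*branches: List[Doc]) -> List[Doc]:
    """Combine several result lists, keeping one copy of each document ID."""
    best: Dict[str, Doc] = {}
    for branch in branches:
        for doc in branch:
            kept = best.get(doc.id)
            # Assumption: for duplicates, keep the copy with the higher score.
            if kept is None or score_or_min(doc) > score_or_min(kept):
                best[doc.id] = doc
    # Highest score first.
    return sorted(best.values(), key=score_or_min, reverse=True)


keyword_hits = [Doc("a", "Berlin is in Germany.", 0.9), Doc("b", "Paris is in France.", 0.4)]
vector_hits = [Doc("b", "Paris is in France.", 0.7), Doc("c", "Rome is in Italy.", 0.6)]

joined = concatenate(keyword_hits, vector_hits)
print([(doc.id, doc.score) for doc in joined])
```

Document "b" appears in both lists but only once in the joined result, which is the behavior you would typically want before passing documents to a Ranker or Reader.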

Basic Information

  • Pipeline type: Used in query pipelines.
  • Nodes that can precede it in a pipeline: Retriever
  • Nodes that can follow it in a pipeline: PromptNode, Ranker, Reader
  • Node input: Documents
  • Node output: Documents
  • Available node classes: JoinDocuments

Usage Example

In this example, JoinDocuments combines the documents retrieved by the BM25Retriever and the EmbeddingRetriever.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The dense retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
      top_k_join: null # No limit. Set an integer to return only that many top-scoring joined documents
  - name: Reader # The component that extracts answers from the documents returned by JoinResults
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-base-squad2-distilled # An optimized variant of BERT, a strong all-round model
      context_window_size: 700 # The size of the window around the answer span
      batch_size: 50
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a dense retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 50 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reader
        inputs: [JoinResults]
  - name: indexing
    nodes:
    # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever # Calculates embeddings for the documents at indexing time
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
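
The pipeline above uses the reciprocal_rank_fusion join mode. To give an idea of how rank-based scoring works in general, here is a minimal Python sketch of the standard reciprocal rank fusion formula, where each document receives 1 / (k + rank) from every result list it appears in. The function name, the document IDs, and the constant k = 60 are assumptions for illustration. The node's own constants and implementation details may differ.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked lists of document IDs into one score per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a better (lower) rank contributes a larger share.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return the documents ordered from highest to lowest fused score.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical rankings from a keyword-based and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(list(fused))
```

Because only the ranks matter, this approach does not require the scores of the two retrievers to be on the same scale, which makes it a reasonable choice when combining BM25 scores with embedding similarities.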

Parameters

Use these parameters to specify how you want the JoinDocuments node to work in the pipeline YAML:

| Parameter | Type | Possible Values | Description |
|---|---|---|---|
| join_mode | String | concatenate (default), merge, reciprocal_rank_fusion | Specifies how the documents are combined. Possible options: concatenate combines documents from multiple retrievers; merge aggregates the scores of individual documents; reciprocal_rank_fusion applies rank-based scoring. Mandatory. |
| weights | List | Default: none | A list of weights for adjusting document scores when using the merge join mode. The number of entries in the list must be equal to the number of input nodes (retrievers). By default, each retriever score gets equal weight. This parameter is not compatible with the concatenate join mode. Optional. |
| top_k_join | Integer | Default: none | Limits the number of documents that JoinDocuments returns. Optional. |
| sort_by_score | Boolean | True (default), False | Specifies whether the incoming documents are sorted by their score. If all your documents have score values, set this to True. Otherwise, set this to False. Mandatory. |
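
To make the interplay of the merge join mode, weights, and top_k_join more concrete, here is a small Python sketch. It is not the node's implementation: the `merge_scores` function and the sample scores are hypothetical, and the default of 1/n per input is an assumption about how "equal weight" is applied.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def merge_scores(
    branches: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
    top_k_join: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Aggregate weighted scores per document ID across all input branches."""
    if weights is None:
        # Assumption: equal weight means 1/n for each of the n inputs.
        weights = [1.0 / len(branches)] * len(branches)
    if len(weights) != len(branches):
        raise ValueError("The number of weights must equal the number of input nodes.")

    merged: Dict[str, float] = defaultdict(float)
    for branch, weight in zip(branches, weights):
        for doc_id, score in branch.items():
            merged[doc_id] += weight * score

    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    # Slicing with None keeps every document, mirroring "Default: none".
    return ranked[:top_k_join]


# Hypothetical scores per document ID from two retrievers.
bm25_scores = {"a": 0.8, "b": 0.2}
embedding_scores = {"b": 0.9, "c": 0.5}

top = merge_scores([bm25_scores, embedding_scores], weights=[0.3, 0.7], top_k_join=2)
everything = merge_scores([bm25_scores, embedding_scores])
print(top)
print(everything)
```

With weights of 0.3 and 0.7, the dense retriever dominates the final order. Leaving out weights and top_k_join returns all three documents with each input contributing equally.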