JoinDocuments
This node joins the documents fetched by different nodes back together. Use it to merge separate pipeline branches, for example the output of two different retrievers.
JoinDocuments is useful, for example, when you want to use both a sparse and a dense retriever in one pipeline. It combines the output of the two retrievers, so the results benefit from both keyword matching and semantic matching.
JoinDocuments takes documents as input and returns the combined documents as output. You choose how the documents are joined with the join_mode parameter.
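To make the idea of joining concrete, here is a minimal, library-free sketch of two joining strategies: concatenation and score merging. It is an illustration only and not Haystack's implementation. Documents are plain dictionaries, and the helper names `concatenate` and `merge`, the `id` and `score` keys, and the equal default weights are all assumptions made for this example.

```python
from collections import defaultdict


def concatenate(*result_lists):
    """Combine result lists, keeping one copy of each document id."""
    unique = {}
    for results in result_lists:
        for doc in results:
            # Keep the first occurrence of every document id.
            unique.setdefault(doc["id"], doc)
    return list(unique.values())


def merge(*result_lists, weights=None):
    """Sum the (weighted) scores of documents that share an id."""
    if weights is None:
        # Assumption: every input list counts equally by default.
        weights = [1 / len(result_lists)] * len(result_lists)
    if len(weights) != len(result_lists):
        raise ValueError("Provide exactly one weight per input list.")
    scores = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for doc in results:
            scores[doc["id"]] += weight * doc["score"]
    # Order the documents from highest to lowest merged score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [{"id": doc_id, "score": score} for doc_id, score in ranked]


sparse_results = [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.4}]
dense_results = [{"id": "b", "score": 0.8}, {"id": "c", "score": 0.6}]

joined = concatenate(sparse_results, dense_results)
merged = merge(sparse_results, dense_results)
print([doc["id"] for doc in joined])  # ['a', 'b', 'c']
print([doc["id"] for doc in merged])  # ['b', 'a', 'c']
```

In this toy data, document b is found by both result lists, so merging moves it to the top even though neither list ranked it as a clear winner on its own.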
Usage
Initializing JoinDocuments
To initialize the node, run:
from haystack.nodes import JoinDocuments
join_documents = JoinDocuments(
    join_mode="concatenate",
    top_k_join=10
)
Or in YAML:
components:
  - name: JoiningNode
    type: JoinDocuments
    params:
      join_mode: "concatenate"
      top_k_join: 10
Adding JoinDocuments to a Pipeline
In this example, JoinDocuments combines the documents retrieved by the BM25Retriever and the EmbeddingRetriever.
This code only works if the pipeline is connected to the correct DeepsetCloudDocumentStore. If you're a new user, first check how to connect to the DeepsetCloudDocumentStore.
from haystack.document_stores import DeepsetCloudDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, JoinDocuments
from haystack.pipelines import Pipeline
document_store = DeepsetCloudDocumentStore(index="Hybrid_ExtractiveQA")
document_store.name = "DocumentStore"
bm25retriever = BM25Retriever(document_store=document_store, top_k=20)
embedding_retriever = EmbeddingRetriever(document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1", model_format="sentence_transformers", top_k=20)
join_results = JoinDocuments(join_mode="reciprocal_rank_fusion")
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled", context_window_size=700, batch_size=50)
query_pipeline = Pipeline()
query_pipeline.add_node(component=bm25retriever, name="BM25Retriever", inputs=["Query"])
query_pipeline.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["Query"])
query_pipeline.add_node(component=join_results, name="JoinResults", inputs=["BM25Retriever", "EmbeddingRetriever"])
query_pipeline.add_node(component=reader, name="Reader", inputs=["JoinResults"])
query_pipeline.run("What is the name of a Jazz musician?")
And here's the YAML version of the pipeline:
# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline-using-a-yaml-file.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press control + space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.
version: '1.14.0'
name: 'JoinDocuments_pipeline'
# This section defines the nodes you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you. You can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component.
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The dense retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
      top_k_join: null # Optional limit on the number of joined documents, ranked by the scoring that join_mode defines. null applies no limit
  - name: Reader # The component that actually fetches answers from among the documents returned by JoinResults
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-base-squad2-distilled # An optimized variant of BERT, a strong all-round model
      context_window_size: 700 # The size of the window around the answer span
      batch_size: 50
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a dense retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 50 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en
# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reader
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever # Must match a name declared under components
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
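Because every node lists its inputs by name, a misspelled or undeclared name breaks the pipeline. The following sketch shows one way to check the wiring before deploying. It is a hypothetical helper, not part of Haystack or deepset Cloud: the function name `find_wiring_errors`, the plain-Python representation of the YAML, and the assumption that Query and File are the only root inputs are all made up for this example.

```python
def find_wiring_errors(components, pipeline_nodes, roots=("Query", "File")):
    """Return a list of human-readable problems found in a pipeline definition."""
    errors = []
    defined = set(components)
    seen = set(roots)
    for node in pipeline_nodes:
        name = node["name"]
        if name not in defined:
            errors.append(f"{name} is not declared under components")
        for source in node["inputs"]:
            # Drop an output suffix such as ".output_1" before the lookup.
            base = source.split(".")[0]
            if base not in seen:
                errors.append(f"{name} reads from unknown input {source}")
        seen.add(name)
    return errors


components = ["BM25Retriever", "EmbeddingRetriever", "JoinResults", "Reader"]
query_nodes = [
    {"name": "BM25Retriever", "inputs": ["Query"]},
    {"name": "EmbeddingRetriever", "inputs": ["Query"]},
    {"name": "JoinResults", "inputs": ["BM25Retriever", "EmbeddingRetriever"]},
    {"name": "Reader", "inputs": ["JoinResults"]},
]
# A deliberately broken node list: "JoinResult" is misspelled.
broken_nodes = [{"name": "Reader", "inputs": ["JoinResult"]}]

print(find_wiring_errors(components, query_nodes))  # []
print(find_wiring_errors(components, broken_nodes))
```

The check walks the nodes in order, so a node can only read from the root input or from a node listed before it, which mirrors how the query pipeline above is laid out.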
Arguments
Use these parameters to specify how you want the JoinDocuments node to work:
| Parameter | Type | Possible Values | Description |
|---|---|---|---|
| join_mode | String (mandatory) | concatenate (default), merge, reciprocal_rank_fusion | Specifies how the documents should be combined. Possible options: concatenate (combines documents from multiple retrievers), merge (aggregates scores of individual documents), reciprocal_rank_fusion (applies rank-based scoring). |
| weights | List (optional) | - | A list of weights for adjusting document scores when using the merge join mode. The number of entries in the list must be equal to the number of input nodes (retrievers). By default, each retriever score gets an equal weight. This parameter is not compatible with the concatenate join mode. |
| top_k_join | Integer (optional) | - | Limits the number of documents that JoinDocuments returns. |
| sort_by_score | Boolean (mandatory) | True (default), False | Decides if the incoming documents should be sorted by their score. If all your documents have score values, set this to True. Otherwise, set this to False. |
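The reciprocal_rank_fusion mode ignores the raw scores and works only with each document's position in the incoming lists. The sketch below illustrates the general technique together with a top_k_join-style cutoff. It is a simplified stand-in and not Haystack's code: the function name, the use of bare document ids, and the constant k=60 (a value commonly used for this technique) are assumptions, and the constant Haystack uses may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*result_lists, k=60, top_k_join=None):
    """Score each document by summing 1 / (k + rank) over the input lists."""
    scores = defaultdict(float)
    for results in result_lists:
        # rank starts at 1 for the best document in each list.
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # top_k_join=None keeps every joined document.
    return ranked[:top_k_join]


bm25_ids = ["a", "b", "c"]       # ranking from the sparse retriever
embedding_ids = ["b", "d", "a"]  # ranking from the dense retriever

fused = reciprocal_rank_fusion(bm25_ids, embedding_ids)
top_two = reciprocal_rank_fusion(bm25_ids, embedding_ids, top_k_join=2)
print(fused)    # ['b', 'a', 'd', 'c']
print(top_two)  # ['b', 'a']
```

Because only ranks matter, this approach sidesteps the problem that sparse and dense retrievers produce scores on different scales, which is why it suits the hybrid retrieval setup shown earlier.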