JoinDocuments is useful when you want to use both a keyword-based and a vector-based retriever in one pipeline. It combines the output of the two retrievers, so you get both exact keyword matches and semantically similar results.
JoinDocuments takes your documents as input and returns the combined documents as output. You can choose the way the documents are joined.
- Pipeline type: Used in query pipelines.
- Position in a pipeline: After the retrievers.
- Node input: Documents
- Node output: Documents
- Available node classes: JoinDocuments
In this example, JoinDocuments combines the documents retrieved by the BM25Retriever and the EmbeddingRetriever.
version: '1.22.0'

components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # The dense retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
      top_k_join: null # Returns only the top_k joined documents based on scoring defined by join_mode
  - name: Reader # The component that actually fetches answers from among the documents passed on by JoinResults
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-base-squad2-distilled # An optimized variant of BERT, a strong all-round model
      context_window_size: 700 # The size of the window around the answer span
      batch_size: 50
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a dense retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 50 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en

pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reader
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever # Must match a component defined above
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
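To make the `reciprocal_rank_fusion` join mode used in this example more concrete, here is a minimal Python sketch of the general technique. It is not the node's actual implementation: the function name, the document IDs, and the smoothing constant `k=60` are assumptions chosen for illustration, and the constant used by JoinDocuments may differ. Each document receives `1 / (k + rank)` from every result list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document IDs (hypothetical helper, k is an assumed constant)."""
    scores = defaultdict(float)
    for results in result_lists:
        # Ranks are 1-based: the first hit of each retriever contributes 1 / (k + 1)
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up rankings standing in for the two retrievers' outputs
bm25_hits = ["doc_a", "doc_b", "doc_c"]
embedding_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Because only ranks matter, this approach does not require the keyword-based and the dense retriever to produce scores on a comparable scale. Documents found by both retrievers tend to rise to the top.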
Use these parameters to specify how you want the JoinDocuments node to work:
|Parameter|Type|Possible values|Description|
|---|---|---|---|
|`join_mode`|String||Specifies how the documents should be combined. Possible options:|
|`weights`|List|-|A list of weights for adjusting document scores when using the |
|`top_k_join`|Integer|-|Limits the number of documents that JoinDocuments returns.|
|`sort_by_score`|Boolean||Sorts the incoming documents by their score. If all your documents have |
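The interplay of weights, score sorting, and the limit on returned documents can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the node's code: it assumes that weights are applied as a weighted sum of each retriever's score per document, and the function name `merge_scores`, the default of equal weights, and the sample scores are all hypothetical.

```python
def merge_scores(result_lists, weights=None, top_k_join=None):
    """Weighted-sum join of {doc_id: score} mappings (hypothetical helper)."""
    if weights is None:
        # Assumption: without explicit weights, every input counts equally
        weights = [1.0 / len(result_lists)] * len(result_lists)
    merged = {}
    for results, weight in zip(result_lists, weights):
        for doc_id, score in results.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + weight * score
    # Sort by merged score, highest first; ties broken by document ID
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    # Keep everything when no limit is set, otherwise only the best top_k_join
    return ranked if top_k_join is None else ranked[:top_k_join]


# Made-up scores standing in for the two retrievers' outputs
bm25_scores = {"doc_a": 0.9, "doc_b": 0.4}
embedding_scores = {"doc_b": 0.8, "doc_c": 0.6}

top_two = merge_scores(
    [bm25_scores, embedding_scores], weights=[0.25, 0.75], top_k_join=2
)
print(top_two)
```

With these weights the dense retriever dominates, so `doc_b` and `doc_c` outrank `doc_a` even though `doc_a` has the highest keyword score.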