DocumentJoiner

Join multiple lists of documents into a single list. This component is useful in hybrid retrieval pipelines that combine results from different retrieval strategies (for example, BM25 and embedding-based retrieval).

Basic Information

  • Type: haystack.components.joiners.document_joiner.DocumentJoiner
  • Components it can connect with:
    • Retrievers: Receives documents from multiple retrievers to combine their results.
    • PromptBuilder: Sends the merged documents to build a prompt for generation.

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | Variadic[List[Document]] | | The lists of documents to merge. |
| top_k | Optional[int] | None | The maximum number of documents to return. Overrides the instance's top_k if provided. |

Outputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | List[Document] | | The merged list of documents. The component returns a dictionary that holds this list under the documents key. |

Overview

DocumentJoiner merges multiple lists of documents into a single list using one of these join modes:

  • concatenate: Keeps the highest-scored document in case of duplicates.
  • merge: Calculates a weighted sum of scores for duplicates and merges them.
  • reciprocal_rank_fusion: Merges and assigns scores based on reciprocal rank fusion.
  • distribution_based_rank_fusion: Merges and assigns scores based on the score distribution of each retriever's results.
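
To make the reciprocal_rank_fusion mode more concrete, here is a minimal sketch of the general technique in plain Python. It is not the DocumentJoiner implementation: the function name, the use of document IDs instead of Document objects, and the constant k=60 are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs into one list of (doc_id, score).

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. k=60 is a commonly used constant and an
    assumption here, not a value taken from DocumentJoiner.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from a BM25 retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print([doc_id for doc_id, _ in fused])
```

Documents that rank well in both lists (doc_b and doc_a here) end up ahead of documents that appear in only one list. Because only ranks are used, the raw BM25 and embedding scores do not need to be on the same scale.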

Usage Example

This example shows a hybrid retrieval pipeline that combines BM25 and embedding-based retrieval, then joins the results.

components:
  TextEmbedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
  DocumentJoiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: reciprocal_rank_fusion
      top_k: 10
      sort_by_score: true
  PromptBuilder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: "Given the following documents, answer the question.\n\nDocuments:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}\n\nQuestion: {{ query }}"
  OpenAIGenerator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - OPENAI_API_KEY
        strict: false
      model: gpt-4o-mini
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true

  OpenSearchEmbeddingRetriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      filters:
      top_k: 10
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 384  # matches all-MiniLM-L6-v2, which outputs 384-dimensional vectors
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  OpenSearchBM25Retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 384  # matches all-MiniLM-L6-v2, which outputs 384-dimensional vectors
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:
  - sender: DocumentJoiner.documents
    receiver: PromptBuilder.documents
  - sender: PromptBuilder.prompt
    receiver: OpenAIGenerator.prompt
  - sender: OpenAIGenerator.replies
    receiver: AnswerBuilder.replies
  - sender: DocumentJoiner.documents
    receiver: AnswerBuilder.documents

  - sender: TextEmbedder.embedding
    receiver: OpenSearchEmbeddingRetriever.query_embedding
  - sender: OpenSearchEmbeddingRetriever.documents
    receiver: DocumentJoiner.documents

  - sender: OpenSearchBM25Retriever.documents
    receiver: DocumentJoiner.documents

max_runs_per_component: 100

metadata: {}

inputs:
  query:
    - TextEmbedder.text
    - PromptBuilder.query
    - AnswerBuilder.query
    - OpenSearchBM25Retriever.query

outputs:
  answers: AnswerBuilder.answers


Parameters

Init Parameters

These are the parameters you can configure in Pipeline Builder:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| join_mode | Union[str, JoinMode] | JoinMode.CONCATENATE | Specifies the join mode to use. Available modes: concatenate (keeps the highest-scored document in case of duplicates), merge (calculates a weighted sum of scores for duplicates and merges them), reciprocal_rank_fusion (merges and assigns scores based on reciprocal rank fusion), distribution_based_rank_fusion (merges and assigns scores based on the score distribution of each retriever's results). |
| weights | Optional[List[float]] | None | Assigns an importance to each list of documents to influence how they're joined. Ignored for the concatenate and distribution_based_rank_fusion join modes. The number of weights must match the number of input lists. |
| top_k | Optional[int] | None | The maximum number of documents to return. |
| sort_by_score | bool | True | If True, sorts the documents by score in descending order. A document without a score is treated as if its score were -infinity. |
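
The interaction between weights, sort_by_score, and top_k can be sketched in plain Python. This is an illustration of the behavior described in the table, not the component's code: the function names, the (doc_id, score) tuples, and the equal-weight default when no weights are given are assumptions.

```python
import math


def merge_scores(result_lists, weights=None):
    """Return {doc_id: weighted sum of scores} across all input lists."""
    if weights is None:
        # Assumption: with no weights, every list counts equally.
        weights = [1.0 / len(result_lists)] * len(result_lists)
    if len(weights) != len(result_lists):
        raise ValueError("Provide exactly one weight per input list.")
    merged = {}
    for weight, results in zip(weights, result_lists):
        for doc_id, score in results:
            merged[doc_id] = merged.get(doc_id, 0.0) + weight * score
    return merged


def sort_and_truncate(scored_docs, top_k=None):
    """Sort by score descending, treating a missing score as -infinity."""
    ordered = sorted(
        scored_docs,
        key=lambda doc: doc[1] if doc[1] is not None else -math.inf,
        reverse=True,
    )
    return ordered if top_k is None else ordered[:top_k]


# Hypothetical (doc_id, score) results from two retrievers.
bm25_results = [("doc_a", 0.9), ("doc_b", 0.4)]
embedding_results = [("doc_b", 0.8), ("doc_c", 0.6)]

merged = merge_scores([bm25_results, embedding_results], weights=[0.3, 0.7])
top_docs = sort_and_truncate(list(merged.items()), top_k=2)
print([doc_id for doc_id, _ in top_docs])
```

With weights of 0.3 and 0.7, doc_b collects contributions from both lists and ranks first, while doc_a, which only the lower-weighted list returned, falls outside the top two.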

Run Method Parameters

These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| documents | Variadic[List[Document]] | | The lists of documents to merge. |
| top_k | Optional[int] | None | The maximum number of documents to return. Overrides the instance's top_k if provided. |
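
The override rule for top_k can be pictured with a short sketch. The helper names below are hypothetical, and treating only None as "not provided" is an assumption about the edge cases rather than a statement about the component's internals.

```python
from typing import List, Optional


def resolve_top_k(init_top_k: Optional[int], run_top_k: Optional[int]) -> Optional[int]:
    """Prefer the run-time value; fall back to the init value when it is None."""
    return run_top_k if run_top_k is not None else init_top_k


def limit_documents(doc_ids: List[str], init_top_k: Optional[int], run_top_k: Optional[int]) -> List[str]:
    """Truncate the joined list to the effective top_k, if one is set."""
    top_k = resolve_top_k(init_top_k, run_top_k)
    return doc_ids if top_k is None else doc_ids[:top_k]


joined = ["doc_b", "doc_a", "doc_d", "doc_c"]
print(limit_documents(joined, init_top_k=3, run_top_k=None))  # uses the init value
print(limit_documents(joined, init_top_k=3, run_top_k=1))     # run-time value wins
```

Setting top_k in init_parameters gives the pipeline a default, and a value passed at query time changes the limit for that single request only.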