DocumentJoiner
Join multiple lists of documents into a single list. This component is useful in hybrid retrieval pipelines that combine results from different retrieval strategies (for example, BM25 and embedding-based retrieval).
Basic Information
- Type: haystack.components.joiners.document_joiner.DocumentJoiner
- Components it can connect with:
  - Retrievers: Receives documents from multiple retrievers to combine their results.
  - PromptBuilder: Sends the merged documents to build a prompt for generation.
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | Variadic[List[Document]] | | List of lists of documents to be merged. |
| top_k | Optional[int] | None | The maximum number of documents to return. Overrides the instance's top_k if provided. |
Outputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | List[Document] | | The merged list of documents, returned in a dictionary under the `documents` key. |
Overview
DocumentJoiner merges multiple lists of documents into a single list using one of these join modes:
- concatenate: Keeps the highest-scored document in case of duplicates.
- merge: Calculates a weighted sum of scores for duplicates and merges them.
- reciprocal_rank_fusion: Merges and assigns scores based on reciprocal rank fusion.
- distribution_based_rank_fusion: Merges and assigns scores based on the score distribution in each retriever.
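To make the reciprocal_rank_fusion mode more concrete, here is a minimal sketch of the general technique in plain Python. It is not the component's actual implementation: the function name, the use of bare document IDs, and the smoothing constant `k=60` are assumptions chosen for illustration, and the component may scale or weight scores differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs into (doc_id, score) pairs, best first.

    k is a smoothing constant; 60 is a commonly used value and is an
    assumption here, not a value taken from DocumentJoiner.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a BM25 retriever and an embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
embedding_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear in both rankings accumulate two contributions, so they tend to rise above documents returned by only one retriever, regardless of how the original scores were scaled.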
Usage Example
This example shows a hybrid retrieval pipeline that combines BM25 and embedding-based retrieval, then joins the results.
components:
  TextEmbedder:
    type: haystack.components.embedders.sentence_transformers_text_embedder.SentenceTransformersTextEmbedder
    init_parameters:
      model: sentence-transformers/all-MiniLM-L6-v2
  DocumentJoiner:
    type: haystack.components.joiners.document_joiner.DocumentJoiner
    init_parameters:
      join_mode: reciprocal_rank_fusion
      top_k: 10
      sort_by_score: true
  PromptBuilder:
    type: haystack.components.builders.prompt_builder.PromptBuilder
    init_parameters:
      template: "Given the following documents, answer the question.\n\nDocuments:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}\n\nQuestion: {{ query }}"
  OpenAIGenerator:
    type: haystack.components.generators.openai.OpenAIGenerator
    init_parameters:
      api_key:
        type: env_var
        env_vars:
          - OPENAI_API_KEY
        strict: false
      model: gpt-4o-mini
  AnswerBuilder:
    type: haystack.components.builders.answer_builder.AnswerBuilder
    init_parameters:
      pattern:
      reference_pattern:
      last_message_only: false
      return_only_referenced_documents: true
  OpenSearchEmbeddingRetriever:
    type: haystack_integrations.components.retrievers.opensearch.embedding_retriever.OpenSearchEmbeddingRetriever
    init_parameters:
      filters:
      top_k: 10
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      efficient_filtering: true
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
  OpenSearchBM25Retriever:
    type: haystack_integrations.components.retrievers.opensearch.bm25_retriever.OpenSearchBM25Retriever
    init_parameters:
      filters:
      fuzziness: AUTO
      top_k: 10
      scale_score: false
      all_terms_must_match: false
      filter_policy: replace
      custom_query:
      raise_on_failure: true
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: 'Standard-Index-English'
          max_chunk_bytes: 104857600
          embedding_dim: 768
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:

connections:
  - sender: DocumentJoiner.documents
    receiver: PromptBuilder.documents
  - sender: PromptBuilder.prompt
    receiver: OpenAIGenerator.prompt
  - sender: OpenAIGenerator.replies
    receiver: AnswerBuilder.replies
  - sender: DocumentJoiner.documents
    receiver: AnswerBuilder.documents
  - sender: TextEmbedder.embedding
    receiver: OpenSearchEmbeddingRetriever.query_embedding
  - sender: OpenSearchEmbeddingRetriever.documents
    receiver: DocumentJoiner.documents
  - sender: OpenSearchBM25Retriever.documents
    receiver: DocumentJoiner.documents

max_runs_per_component: 100

metadata: {}

inputs:
  query:
    - TextEmbedder.text
    - PromptBuilder.query
    - AnswerBuilder.query
    - OpenSearchBM25Retriever.query

outputs:
  answers: AnswerBuilder.answers
Parameters
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| join_mode | Union[str, JoinMode] | JoinMode.CONCATENATE | Specifies the join mode to use. Available modes: concatenate (keeps the highest-scored document in case of duplicates); merge (calculates a weighted sum of scores for duplicates and merges them); reciprocal_rank_fusion (merges and assigns scores based on reciprocal rank fusion); distribution_based_rank_fusion (merges and assigns scores based on the score distribution in each retriever). |
| weights | Optional[List[float]] | None | Assigns importance to each list of documents to influence how they're joined. Provide one weight per input list; the number of weights must match the number of inputs. This parameter is ignored for the concatenate and distribution_based_rank_fusion join modes. |
| top_k | Optional[int] | None | The maximum number of documents to return. |
| sort_by_score | bool | True | If True, sorts the documents by score in descending order. If a document has no score, it is handled as if its score is -infinity. |
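The effect of `weights` in the merge mode can be illustrated with a small weighted-sum sketch in plain Python. This is not the component's code: the function name, the `(doc_id, score)` tuples, the equal-weight fallback when no weights are given, and treating a missing score as 0 are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_merge(scored_lists, weights=None):
    """Combine duplicate documents by a weighted sum of their scores.

    scored_lists: one list of (doc_id, score) pairs per retriever.
    weights: one weight per list; equal weights are assumed if omitted.
    """
    if weights is None:
        weights = [1.0 / len(scored_lists)] * len(scored_lists)
    if len(weights) != len(scored_lists):
        raise ValueError("Provide exactly one weight per input list.")

    merged = defaultdict(float)
    for results, weight in zip(scored_lists, weights):
        for doc_id, score in results:
            # A missing score contributes nothing (assumption for this sketch).
            merged[doc_id] += weight * (score if score is not None else 0.0)
    return dict(merged)


# Hypothetical retriever outputs; doc_b is returned by both retrievers.
bm25_results = [("doc_a", 0.8), ("doc_b", 0.4)]
embedding_results = [("doc_b", 0.9), ("doc_c", 0.5)]

merged_scores = weighted_merge([bm25_results, embedding_results], weights=[0.25, 0.75])
print(merged_scores)
```

With these weights, the embedding retriever dominates: `doc_b` ends up with the highest combined score because it receives contributions from both lists.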
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| documents | Variadic[List[Document]] | | List of lists of documents to be merged. |
| top_k | Optional[int] | None | The maximum number of documents to return. Overrides the instance's top_k if provided. |
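The interaction between the init-time `top_k`, the run-time `top_k`, and `sort_by_score` can be sketched as follows. This is an illustrative approximation in plain Python, not the component's source: the `finalize` helper and the dictionary-based documents are hypothetical, and only the behaviors described in the tables above (run-time override, unscored documents treated as -infinity) are modeled.

```python
import math


def finalize(docs, instance_top_k=None, run_top_k=None, sort_by_score=True):
    """Sort documents by score and truncate them to the effective top_k."""
    if sort_by_score:
        # Documents without a score sort last, as if their score were -infinity.
        docs = sorted(
            docs,
            key=lambda d: d["score"] if d["score"] is not None else -math.inf,
            reverse=True,
        )
    # A top_k passed at run time takes precedence over the init-time value.
    top_k = run_top_k if run_top_k is not None else instance_top_k
    return list(docs) if top_k is None else list(docs[:top_k])


# Hypothetical joined documents; doc "b" has no score.
docs = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": None},
    {"id": "c", "score": 0.9},
]

with_override = finalize(docs, instance_top_k=3, run_top_k=2)
without_override = finalize(docs, instance_top_k=3)
print([d["id"] for d in with_override])
print([d["id"] for d in without_override])
```

Passing `top_k` at query time is useful when the same deployed pipeline serves requests that need different result counts, since the init-time value then acts only as a default.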