Document Search Pipelines

Document search pipelines, also called document retrieval pipelines, return whole documents as results. They are also the first stage in other types of pipelines, for example in retrieval augmented generation (RAG) pipelines.

📘

API Key

To reuse these pipelines, first make sure you have the API key needed to access the models. You'll need an API key for OpenAI, Cohere, and Hugging Face models. You can add them in the Connections tab in deepset Cloud.

Semantic Document Search Pipeline

This pipeline uses a vector-based retriever to fetch relevant documents based on their semantic similarity to the query.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Semantic Document Search with a Ranker

This document search pipeline searches for documents based on semantic similarity. It uses a vector-based search followed by re-ranking with a powerful cross-encoder model. This means that the resulting documents are ordered by the most relevant one.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 20  # Try to keep this number equal or larger to the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reranker
        inputs: [Retriever]
  - name: indexing
    nodes:
      # Depending on the file type we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Keyword Document Search

This pipeline is a good starting point for a document search pipeline. It returns documents as answers based on keyword matches with your query. It uses the BM25 algorithm to prioritize the keywords.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 500 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines: 
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Hybrid Document Search

These pipelines combine the advantages of keyword-based and vector-based searches. Such a combination usually yields the best results without having to train the model.

Hybrid Document Search with a Ranker

This pipeline uses a Ranker to rank the documents according to how relevant they are to the query.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
      similarity: cosine
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 40  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Hybrid Document Search with Fuzzy Matching

This pipeline accommodates typos the user may make when typing the query. It does this by using a custom OpenSearch query with the BM25Retriever.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store in deepset Cloud
  - name: BM25Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 5 # The number of results to return
      all_terms_must_match: true
      custom_query: >
      	{"query": {
        	"multi_match": { 
          	"query": $query,  
            "fields": ["content"], 
            "fuzziness": "AUTO", 
            "operator": "or"
            }
           },
           "highlight": {
           	"fields": {
            	"content": {	
              }
             }
            }
           }
  - name: EmbeddingRetriever # The vector-based retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 5 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: reciprocal_rank_fusion # Applies rank-based scoring to the results
  - name: Ranker
    type: SentenceTransformersRanker
    params:
      model_name_or_path: intfloat/simlm-msmarco-reranker # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 40  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: MarkdownConverter # Converts PDFs into documents
    type: MarkdownConverter
    params:
      add_frontmatter_to_meta: false 
      extract_headlines: true
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 50 # The max number of words in a document
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines: 
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Ranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: MarkdownConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: Preprocessor
        inputs: [MarkdownConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Hybrid Document Search for Multilingual Documents

This pipeline uses a Retriever model optimized for search on over 90 languages. You can use it for document search if your documents are multilingual.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      similarity: cosine # Recommended similarity for intfloat/multilingual-e5-base
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/multilingual-e5-base # Model optimized for semantic search on over 90 languages. Check https://huggingface.co/intfloat/multilingual-e5-base for more information.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: jeffwan/mmarco-mMiniLMv2-L12-H384-v1 # Model optimized for reranking on 14 languages. Check https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 for more information.
      top_k: 20 # The number of results to return
      batch_size: 40  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Choose from either de, en, fr, it, nl, pt, ru. Used by NLTK to best detect the sentence boundaries for that language.

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Hybrid Document Search for German Documents

This pipeline is optimized for German. It uses a Retriever model that performs semantic search on German documents.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      embedding_dim: 768
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: PM-AI/bi-encoder_msmarco_bert-base_german # Model optimized for semantic search.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: svalabs/cross-electra-ms-marco-german-uncased # Model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 40  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
      model_kwargs:  # Additional keyword arguments for the model
        torch_dtype: torch.float16
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: de # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Pipelines Prioritizing Documents Based on Their Metadata

Prioritizing the Newest Documents with RecentnessRanker

You can add RecentnessRanker to your query pipeline to prioritize documents based on the criteria you specify. This pipeline uses RecentnessRanker to prioritize the newest documents. It does so by using the document's metadata field containing the date when the document was created.

components:
	- name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store so that the LLM can base its generation on it. 
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2  # Model optimized for semantic search 
      model_format: sentence_transformers
      top_k: 1 # The number of documents to return
  - name: PromptNode # The component that generates the answer based on the documents it gets from the retriever 
    type: PromptNode
    params:
      default_prompt_template: question-answering # A default prompt for question answering. 
      model_name_or_path: google/flan-t5-large # A free large language model for PromptNode. For production scenarios, we recommend a paid model.
      top_k: 3 # The number of answers to generate, you can change this value.
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language
	- name: Ranker
  	type: RecentnessRanker
    params:
    	date_identifier: updated_at # this is the name of the document's metadata field containing the date
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Ranker
        inputs: [Retriever]
      - name: PromptNode
        inputs: [Ranker]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Prioritizing the Newest Documents with an OpenSearch Query

You can pass a custom query to BM25Retriever to configure how you want it to fetch documents from the Document Store. Here's an example of a query that makes the retriever fetch the newest documents:


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
        			"query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "gauss": {
                "_file_created_at": {
        					"origin": "now",
                  "offset": "30d",
                  "scale": "180d"
                }
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

For an in-depth explanation, see Boosting Retrieval with OpenSearch Queries.

Prioritizing Documents Based on Textual Values

You can prioritize documents whose metadata fields contain a particular text string. This pipeline gives the highest priority to documents with the metadata field file_type="article" and slightly lower priority to documents with metadata field file_type="paper" and the lowest priority to documents with metadata field file_type="comment".


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
	        		"query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "functions": [
                {
                  "filter": { 
                    "terms": { 
                      "file_type": ["article", "paper"]
                    }
                  },
                  "weight": 2.0
                },
                {
                  "filter": { 
                    "terms": { 
                      "file_type": ["comment"]
                    }
                  },
                  "weight": 1.5
                },
				        {
                  "filter": { 
                    "terms": { 
                      "file_type": ["archive"]
                    }
                  },
                  "weight": 0.5
                }
              ]
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

Prioritizing Documents Based on Numerical Values Such As "Likes"

You can create a query to prioritize documents with a metadata field containing numerical values. Say you collect popularity metrics for your documents, such as likes, and you want to favor documents that are the most popular.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      custom_query: >
        {
          "query": {
            "function_score": {
			        "query": {
                "bool": {
                  "must": {
                    "match": {
                      "content": $query
                    }
                  },
                  "filter": $filters
                }
              },
              "field_value_factor": {
                "field": "likes_last_month",
                "factor": 0.1,
                "modifier": "log1p",
                "missing": 0
              }
            }
          }
        }
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]

Prioritizing Documents with a Certain Metadata Field

This pipeline prioritizes documents with the company metadata field during retrieval. You simply pass the name of the metadata field you want to prioritize in the embed_meta_field parameter of EmbeddingRetriever.

You can also pass this parameter for SentenceTransformersRanker and CohereRanker, but then the prioritization happens after the retrieval. This means, first, the retriever fetches the documents from the DocumentStore, and only then the ranker prioritizes the documents containing the fields you specified.

In this pipeline, the Retriever prioritizes documents with embed_meta_field: company and passes these documents on to the LLM in the prompt.


components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2  # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 4 # The number of results to return
      embed_meta_fields:
        [company]
  - name: qa_template
    type: PromptTemplate
    params:
      output_parser:
        type: AnswerParser
      prompt: "You are a legal expert. \
        You answer questions truthfully based on provided documents. \
        For each document check whether it is related to the question. \
        Only use documents that are related to the question to answer it. \
        Ignore documents that are not related to the question. \
        If the answer exists in several documents, summarize them. \
        Only answer based on the documents provided. Don't make things up. \
        Always use references in the form [NUMBER OF DOCUMENT] when using information from a document. e.g. [3], for Document[3]. \
        The reference must only refer to the number that comes in square brackets after passage. \
        Otherwise, do not use brackets in your answer and reference ONLY the number of the passage without mentioning the word passage. \
        If the documents can't answer the question or you are unsure say: 'The answer can't be found in the text'. \
        {new_line}\
        These are the documents:\
        {join(documents, delimiter=new_line, pattern=new_line+'Document[$idx]:'+new_line+'$content')}\
        {new_line}\
        Question: {query}\
        {new_line}\
        Answer:\
        {new_line}"
  - name: PromptNode
    type: PromptNode
    params:
      model_name_or_path: gpt-35-turbo
      api_key: API_KEY 
      default_prompt_template: qa_template
      max_length: 400
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 10 # Enables the sliding window approach
      language: en
      split_respect_sentence_boundary: false
      add_page_number: true

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: PromptNode
        inputs: [EmbeddingRetriever]
  - name: indexing
    nodes:
    # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]