Sample Pipelines

We created a couple of ready-to-use NLP pipelines for you. They're available as templates in deepset Cloud. You can take them as they are, or adjust them for your use case.

All these pipelines are available as YAML templates in deepset Cloud. You can choose them when creating a pipeline with the YAML editor. They're listed here for your reference.

Generative Question Answering Pipeline

This pipeline uses a large language model to generate the answer based on its general knowledge of the world and the documents you feed to it. The result is a novel text. Bear in mind that in generative QA pipelines, it's often hard to tell which document the answer comes from.

Generative Question Answering with GPT-3.5

This template uses OpenAI's GPT-3.5 model gpt-3.5-turbo to generate the answer, with a keyword-based and a vector-based retriever supplying the relevant documents. You need an API key from an active OpenAI account to use this model.
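
If you want to experiment with the generative part on its own, here's a minimal local sketch using farm-haystack, the open source framework these templates build on. The OPENAI_API_KEY environment variable is an assumption; use whatever key management suits you.

import os

from haystack.nodes import PromptNode

prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ["OPENAI_API_KEY"],  # requires an active OpenAI account
    max_length=400,  # the maximum number of tokens in the generated answer
    model_kwargs={"temperature": 0},  # lower temperature suits fact-based QA
)

# Prompt the model directly; in the template below, the qa_template wraps
# the query and the retrieved documents into a prompt like this automatically
print(prompt_node("Explain what a generative QA pipeline does."))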

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a Generative Question Answering pipeline for English. It combines a keyword-based and a vector-based Retriever with OpenAI's GPT-3.5 model as a PromptNode
version: '1.20.0'
name: 'GenerativeQuestionAnswering_GPT-3.5'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 10 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 10 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-6-v2 # Fast model optimized for reranking
      top_k: 4 # The number of results to return
      batch_size: 20  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
  - name: qa_template
    type: PromptTemplate
    params:
      output_parser:
        type: AnswerParser
      prompt: "You are a technical expert. \
        You answer questions truthfully based on provided documents. \
        For each document check whether it is related to the question. \
        Only use documents that are related to the question to answer it. \
        Ignore documents that are not related to the question. \
        If the answer exists in several documents, summarize them. \
        Only answer based on the documents provided. Don't make things up. \
        Always use references in the form [NUMBER OF DOCUMENT] when using information from a document. e.g. [3], for Document[3]. \
        The reference must only refer to the number that comes in square brackets after Document. \
        Otherwise, do not use brackets in your answer and reference ONLY the number of the document without mentioning the word document. \
        If the documents can't answer the question or you are unsure say: 'The answer can't be found in the text'. \
        {new_line}\
        These are the documents:\
        {join(documents, delimiter=new_line, pattern=new_line+'Document[$idx]:'+new_line+'$content')}\
        {new_line}\
        Question: {query}\
        {new_line}\
        Answer:\
        {new_line}"
  - name: PromptNode
    type: PromptNode
    params:
      default_prompt_template: qa_template
      max_length: 400 # The maximum number of tokens the generated answer can have
      model_kwargs: # Specifies additional model settings
        temperature: 0 # Lower temperature works best for fact-based qa
      model_name_or_path: gpt-3.5-turbo
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 20 # Enables the sliding window approach
      language: en
      split_respect_sentence_boundary: True # Retains complete sentences in split documents

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
      - name: PromptNode
        inputs: [Reranker]
  - name: indexing
    nodes:
    # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures that this converter receives txt files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures that this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
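
To try a template like this outside of deepset Cloud, you can load the saved YAML with farm-haystack and run the query pipeline. This is a sketch under the assumption that the template is saved as pipeline.yaml and that DeepsetCloudDocumentStore can reach your workspace (by default, it reads your API key from the DEEPSET_CLOUD_API_KEY environment variable).

from pathlib import Path

from haystack import Pipeline

# Load only the query pipeline from the template file
query_pipeline = Pipeline.load_from_yaml(Path("pipeline.yaml"), pipeline_name="query")

result = query_pipeline.run(query="How do I create a pipeline in deepset Cloud?")
print(result["answers"][0].answer)  # the generated answer, with [n] references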

Document Search Pipelines

Document search pipelines, also called document retrieval pipelines, return whole documents as answers.

Semantic Document Search Pipeline

This is a sample pipeline that returns documents as answers. It searches for documents based on semantic similarity, using vector-based search.
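
To build an intuition for what vector-based search does here, the sketch below uses the sentence-transformers library with the same embedding model as the template. The example texts are made up.

from sentence_transformers import SentenceTransformer, util

# Encode the query and the documents into vectors, then compare them
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
query_embedding = model.encode("How do I reset my password?")
doc_embeddings = model.encode([
    "To reset your password, open Settings and choose Account.",
    "Our offices are closed on public holidays.",
])

# This model is tuned for dot-product similarity; a higher score means more relevant
print(util.dot_score(query_embedding, doc_embeddings))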

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a document search pipeline that searches for documents based on semantic similarity. It uses a vector-based search.
version: '1.20.0'
name: 'SemanticDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses one Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters
    type: FileTypeClassifier
  - name: TextConverter # Converts TXT files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      # Depending on the file type we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Keyword Document Search Pipeline

This pipeline is a good starting point for a document search pipeline. It returns documents as answers, based on keyword matches with your query. It uses the BM25 algorithm to rank the matching documents.
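
For intuition about how BM25 scores keyword matches, here's a small sketch using the rank_bm25 package. In deepset Cloud, BM25 runs inside the document store, so this is for illustration only; the tiny corpus and whitespace tokenization are simplifications.

from rank_bm25 import BM25Okapi

corpus = [
    "deepset Cloud supports YAML pipeline templates",
    "BM25 ranks documents by keyword overlap with the query",
    "Vector search encodes text into embeddings",
]

# BM25 works on tokenized text; real document stores use proper analyzers
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# One score per document; the second document matches the most keywords
print(bm25.get_scores("bm25 keyword query".split()))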

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# A pipeline for document search that uses a traditional, keyword-based retriever with the BM25 ranking algorithm.
# It relies on matching keywords between query and document and is often a solid baseline to start with.
version: '1.20.0'
name: 'KeywordDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store
    type: BM25Retriever # The keyword-based retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits files into smaller documents and cleans them up
    type: PreProcessor
    params:
      # With a keyword-based retriever, you can keep slightly longer documents
      split_by: word # The unit by which you want to split the documents
      split_length: 500 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines: 
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Hybrid Document Search Pipeline

This pipeline combines the advantages of keyword-based and vector-based searches. Such a combination usually yields the best results without any training.
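
To illustrate the reranking step that follows the two retrievers, the sketch below shows conceptually what the Reranker node does, using the same cross-encoder model through the sentence-transformers library. The query and documents are made up.

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and a document together and outputs a relevance score
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to split documents"
docs = [
    "The PreProcessor splits documents into smaller ones.",
    "BM25 is a keyword-based ranking function.",
]

# Score every (query, document) pair and sort by relevance, best first
scores = reranker.predict([(query, doc) for doc in docs])
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)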

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a document search pipeline that combines vector-based and keyword-based searches. Such a combination usually yields the best results without any training.
version: '1.20.0'
name: 'HybridDocumentSearch'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: BM25Retriever # The keyword-based retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 20 # The number of results to return
  - name: EmbeddingRetriever # Selects the most relevant documents from the document store
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: JoinResults # Joins the results from both retrievers
    type: JoinDocuments
    params:
      join_mode: concatenate # Combines documents from multiple retrievers
  - name: Reranker # Uses a cross-encoder model to rerank the documents returned by the two retrievers
    type: SentenceTransformersRanker
    params:
      model_name_or_path: cross-encoder/ms-marco-MiniLM-L-6-v2 # Fast model optimized for reranking
      top_k: 20 # The number of results to return
      batch_size: 30  # Try to keep this number equal to or greater than the sum of the top_k of the two retrievers so all docs are processed at once
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, useful if you have different file types
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query]
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
  - name: indexing
    nodes:
      # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter gets TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter gets PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]

Extractive Question Answering Pipelines

These pipelines return highlighted text passages as answers. They're a good choice if you need to extract the answer from your documents and want to know the exact place where the answer is.
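
To see why extractive QA can point at the exact place of the answer, here's a sketch using the Hugging Face transformers library with the same Reader model as the English template below. The prediction includes the character offsets of the answer span in the context, which is what enables highlighting.

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/deberta-v3-large-squad2")

result = qa(
    question="What does the Reader do?",
    context="The Reader scans the retrieved documents and extracts the exact answer span.",
)
# result includes 'answer', 'score', and 'start'/'end' character offsets
print(result)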

English Question Answering Pipeline

This is a good starting point for a question answering system. It uses a vector-based search and a reader node that highlights the answers in text passages.

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a baseline Question Answering pipeline for English. It has a good vector-based Retriever and a strong, accurate Reader.
version: '1.20.0'
name: 'QuestionAnswering_en'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
  - name: Retriever # Selects the most relevant documents from the document store and passes them on to the Reader
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: Reader # The component that actually fetches answers from among the 20 documents returned by the Retriever
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/deberta-v3-large-squad2 # A strong all-round model based on DeBERTa, an improved successor of BERT
      context_window_size: 700 # The size of the window around the answer span
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
    # Depending on the file type, we use a Text or PDF converter
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

German Question Answering Pipeline

This pipeline is a good starting point for German question answering. It uses a vector-based search and a German question answering model, and highlights answers within text passages thanks to its Reader node.
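
The template below sets similarity: cosine on the document store because the multilingual embedding model it uses is trained for cosine similarity. The sketch below shows this comparison with the sentence-transformers library; the example sentences are made up.

from sentence_transformers import SentenceTransformer, util

# The multilingual model maps sentences from different languages into one vector space
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
embeddings = model.encode([
    "Wie erstelle ich eine Pipeline?",
    "How do I create a pipeline?",
])

# Compare with cosine similarity, matching the document store setting
print(util.cos_sim(embeddings[0], embeddings[1]))  # cross-lingual pairs score high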

# If you need help with the YAML format, have a look at https://docs.cloud.deepset.ai/docs/create-a-pipeline#create-a-pipeline-using-yaml.
# This is a friendly editor that helps you create your pipelines with autosuggestions. To use them, press Control + Space on your keyboard.
# Whenever you need to specify a model, this editor helps you out as well. Just type your Hugging Face organization and a forward slash (/) to see available models.

# This is a baseline Question Answering pipeline for German. It uses a vector-based search and a German QA model.
version: '1.20.0'
name: 'QuestionAnswering_de'

# This section defines nodes that you want to use in your pipelines. Each node must have a name and a type. You can also set the node's parameters here.
# The name is up to you, you can give your component a friendly name. You then use components' names when specifying their order in the pipeline.
# Type is the class name of the component. 
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # The only supported document store in deepset Cloud
    params:
      similarity: cosine # Recommended similarity for paraphrase-multilingual-mpnet-base-v2
  - name: Retriever # Selects the most relevant documents from the document store and then passes them on to the Reader
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
      model_format: sentence_transformers
      scale_score: false
      top_k: 20 # The number of results to return
  # A "Reader" model that goes through those 20 candidate documents and identifies the exact answer
  - name: Reader # The component that actually fetches answers from the 20 documents returned by the Retriever
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/gelectra-large-germanquad
      context_window_size: 700 # The size of the window around the answer span
  - name: FileTypeClassifier # Routes files based on their extension to appropriate converters, by default txt, pdf, md, docx, html
    type: FileTypeClassifier
  - name: TextConverter # Converts files into documents
    type: TextConverter
  - name: PDFConverter # Converts PDFs into documents
    type: PDFToTextConverter
  - name: Preprocessor # Splits documents into smaller ones and cleans them up
    type: PreProcessor
    params:
      # With a vector-based (dense) retriever, it's good to split your documents into smaller ones
      split_by: word # The unit by which you want to split the documents
      split_length: 250 # The max number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: de # Used by NLTK to best detect the sentence boundaries for that language

# Here you define how the nodes are organized in the pipelines
# For each node, specify its input
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # Ensures this converter receives TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # Ensures this converter receives PDFs
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]