Create a Pipeline

Use the guided workflow, a YAML editor, or the REST API to design and create your pipeline.

📘 You must be an Admin to perform this task.

About This Task

Each pipeline file defines two pipelines:

  • An indexing pipeline that defines how your files are preprocessed. Whenever you add a file, it is preprocessed by all deployed pipelines.
  • A query pipeline that defines how the query is run. You can see a minimal sketch of this two-pipeline structure right after this list.
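
A minimal sketch of how both pipelines sit in one pipeline file (the node names here are placeholders; the full format is described in Pipeline YAML Format below):

pipelines:
  - name: query # defines how the query is run
    nodes:
      - name: MyRetriever
        inputs: [Query]
  - name: indexing # defines how your files are preprocessed
    nodes:
      - name: MyPreprocessor
        inputs: [File]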

There are multiple ways to create a pipeline:

  • Using the guided workflow: Choose this way if you're new to pipelines and you'd like us to guide you through the process of creating one. Just tell us what you want to do and we'll create a pipeline that matches your goals best. You'll be able to use it right away.
  • Using YAML: Choose this way if you want to create your own pipeline from one of the ready-made templates or from scratch using a YAML editor. The YAML editor comes with a Pipeline Visualizer that shows you the pipeline structure as a diagram. This makes it easier to understand how data flows through your pipeline.
  • Using the API: Choose this way if you already have a pipeline YAML file and want to programmatically upload it to deepset Cloud.

Create a Pipeline Using YAML

If you already know what your pipeline should look like or want to use one of the ready-made templates, that's the method for you. It's recommended that you have a basic understanding of YAML.

If you want to use a private model, you must host it on Hugging Face and then connect deepset Cloud with Hugging Face. When creating a pipeline, copy the model name from Hugging Face and paste it into your pipeline definition. deepset Cloud downloads and loads the model.
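
For example, a retriever that uses a private model could look like this. This is a minimal sketch, and my-org/my-private-model is a placeholder for the model name you copy from Hugging Face:

components:
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: my-org/my-private-model # placeholder: paste the model name copied from Hugging Face
      model_format: sentence_transformers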

Pipeline YAML Format

Your pipeline definition file should have the following format. Ensure that you follow the same indentation structure as in this example:

#This is not a complete pipeline! It's just an example to show you the pipeline format.
version: '1.22.0' #always set version to this value

components: #This part defines your pipeline components and their settings
  - name: MyPreprocessor #this is the name that you want to give to the pipeline node
    type: Preprocessor #this is the node type (class). For more information, see "Pipeline Nodes"
    params: #these are the node's settings
      split_by: passage
      split_length: 1
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #currently only this document store type is supported
    # Continue until you define all components
 
#After you define all the components that you want to use, define your query and indexing pipelines:
pipelines:
  - name: query
    nodes: #here list the nodes that you want to use in this pipeline, each node must have a name and input
      - name: ESRetriever #this is the name of the node that you specified in the "components" section above
        inputs: [Query] #here you specify the input for this node, this is the name of the node that you specified in the components section
        #and you go on defining the nodes
        
  #Next, specify the indexing pipeline:
  - name: indexing
    nodes:
      - name: MyPreprocessor
        inputs: [File]
      - name: ESRetriever
        inputs: [MyPreprocessor]
      ...

Create a Pipeline

  1. Log in to deepset Cloud and go to Pipelines > Create Pipeline.
  2. Give your pipeline a name.
  3. Choose if you want to create a pipeline from scratch or use a template.
    There are pipeline templates available for different types of systems. All of them work out of the box but you can also use them as a starting point for your pipeline.
  4. In the components section of the YAML, configure all the nodes you want to use for indexing and query pipelines. Each node should have the following parameters:
    1. name - This is a custom name you give to the node.
    2. type - This is the node's class. You can check it in Pipeline Nodes if you're not sure.
    3. params - This section is the node's configuration. It lists the node's parameters and their values. If you don't configure any parameters, the node runs with its default parameter settings. Here's an example:
      components:
        - name: Retriever
          type: EmbeddingRetriever
          params:
            document_store: DocumentStore
            embedding_model: intfloat/e5-base-v2
            model_format: sentence_transformers
            top_k: 10
      
  5. In the pipelines section, define your query and indexing pipelines:
    1. For the query pipeline, set the name to query.
    2. For the indexing pipeline, set the name to indexing.
    3. For each pipeline, add the nodes section to define the order of the nodes in your pipeline. Each node has a name (the custom name you gave it in the components section) and inputs (the names of the nodes whose output it takes for further processing; this can be one or more nodes).
      The input of the first node in the indexing pipeline is always File.
      The input of the first node in the query pipeline is always Query.
      Example:
      pipelines:
        - name: query
          nodes:
            - name: BM25Retriever
              inputs: [Query] #Query is always the input of the first node in a query pipeline
            - name: EmbeddingRetriever
              inputs: [Query]
            - name: JoinResults
              inputs: [BM25Retriever, EmbeddingRetriever]
            - name: Reranker
              inputs: [JoinResults]
            - name: PromptNode
              inputs: [Reranker]
        - name: indexing
          nodes:
            - name: FileTypeClassifier
              inputs: [File]
            - name: TextConverter
              inputs: [FileTypeClassifier.output_1]
            - name: PDFConverter
              inputs: [FileTypeClassifier.output_2]
            - name: Preprocessor
              inputs: [TextConverter, PDFConverter]
            - name: EmbeddingRetriever
              inputs: [Preprocessor]
            - name: DocumentStore
              inputs: [EmbeddingRetriever]
      

Tip: Use ctrl + space to see autosuggestions. To see a list of available models, type the Hugging Face organization + / (slash).
Tip: To revert your changes to the last saved version, click Reset.

  6. Save your pipeline.
    deepset Cloud checks if your pipeline design is correct.
  7. To use your pipeline for search, you must first deploy it. Click Deploy.
An Explained Example of a Pipeline

First, define the components that you want to use in your pipelines. For each component, specify its name, type, and any parameters that you want to use.

After you define your components, define your pipelines. For each pipeline, specify its name (query or indexing) and the nodes that it consists of. For each node, specify its input.

version: '1.22.0'

components:   # This section defines nodes that we want to use in our pipelines
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store
  - name: Retriever # Selects the most relevant documents from the document store and then passes them on to the Reader
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: Reader # The component that actually fetches answers            
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-large-squad2 # An optimized variant of BERT, a strong all-round model
      context_window_size: 700 # The size of the window around the answer span
  - name: TextFileConverter # Converts files to documents
    type: TextConverter
  - name: Preprocessor # Splits documents into smaller ones, and cleans them up
    type: PreProcessor
    params:
      split_by: word # The unit by which you want to split your documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language

pipelines: # Here you define the pipelines. For each component, specify its input.
  - name: query 
    nodes:
      - name: Retriever
        inputs: [Query] # The input for the first node is always a query
      - name: Reader
        inputs: [Retriever] # Input is the name of the component that you defined in the "components" section
  - name: indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Create a Pipeline with REST API

This method works well if you have a pipeline YAML ready and want to upload it to deepset Cloud. You need to Generate an API Key first.

Use the following code:

curl --request POST \
     --url https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE>/pipelines \
     --header 'Accept: application/json' \
     --header 'Authorization: Bearer <YOUR_API_KEY>' \
     --data-binary "@path/to/pipeline.yaml"

See the REST API endpoint documentation.

What To Do Next

  • If you want to use your newly created pipeline for search, you must deploy it.
  • To view pipeline details, such as statistics or feedback, click the pipeline name. This opens the Pipeline Details page.
  • To let others test your pipeline, share your pipeline prototype.