Create a Pipeline

Use a YAML file, a Jupyter notebook, or the REST API to design and create your pipeline. The option to create a pipeline with a visual designer is coming soon.

📘

You must be an Admin to perform this task.

About This Task

Currently, deepset Cloud supports two pipeline types: question answering and document retrieval.

Each pipeline file defines two pipelines:

  • An indexing pipeline that describes how your files are preprocessed.
    Whenever you add a file, it is preprocessed by the indexing pipeline of every deployed pipeline.
  • A query pipeline that describes how the query is run.

Create a Pipeline Using YAML

If you already know what your pipeline should look like or want to use one of the ready-made templates, that's the method for you. It's recommended that you have a basic understanding of YAML.

If you want to use a private model, you must host it on Hugging Face and then connect deepset Cloud with Hugging Face. When creating a pipeline, copy the model name from Hugging Face and paste it into your pipeline definition. deepset Cloud downloads and loads the model.

Pipeline YAML Format

Your pipeline definition file should have the following format. Ensure that you follow the same indentation structure as in this example:

#This is not a complete pipeline! It's just an example to show you the pipeline format.
version: "1.10.0" #always set version to this value
name: "pipeline_name" #here, specify the name for your pipeline

components:
  - name: MyPreprocessor #this is the name that you want to give to the pipeline node
    type: Preprocessor #this is the node type (class). For more information, see "Pipeline Nodes"
    params: 
      split_by: passage
      split_length: 1
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #currently only this document store type is supported
 
#After you define all the components that you want to use, define your query and indexing pipelines:
pipelines:
  - name: query
    nodes: #here, list the nodes that you want to use in this pipeline; each node must have a name and an input
      - name: ESRetriever #this is the name of the node that you specified in the "components" section above
        inputs: [Query] #here you specify the input for this node
        #and you go on defining the nodes
        
  #Next, specify the indexing pipeline:
  - name: indexing
    nodes:
      - name: MyPreprocessor
        inputs: [File]
        ...

Create a Pipeline

  1. Log in to deepset Cloud and go to Pipelines > Create Pipeline.
  2. In YAML Editor, click Create Pipeline.
  3. Choose whether you want to use a template or create a pipeline from scratch.
    There are pipeline templates available for question answering and document retrieval systems. You can select a template as a starting point for your pipeline and update it to suit your needs.
  4. Give your pipeline a name that doesn't contain spaces.
  5. Specify the query and indexing pipelines as well as their components.
    Tip: Use ctrl + space to see autosuggestions. To see a list of available models, type the Hugging Face organization + / (slash).
    Tip: To revert your changes to the last saved version, click Reset.
  6. Save your pipeline.
    deepset Cloud validates your pipeline definition.
  7. To use your pipeline for search, you must first deploy it. Click Deploy.
An Explained Example of a Pipeline

First, define the components that you want to use in your pipelines. For each component, specify its name, type, and any parameters that you want to use.

After you define your components, define your pipelines. For each pipeline, specify its name (query or indexing) and the nodes that it consists of. For each node, specify its input.

version: '1.10.0'
name: en_QA_pipeline

components:   # This section defines nodes that we want to use in our pipelines
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #this is the only supported document store
  - name: Retriever #selects the most relevant documents from the document store and then passes them on to the Reader
    type: EmbeddingRetriever #uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 #model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 #the number of results to return
  - name: Reader #the component that actually fetches answers            
    type: FARMReader #Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-large-squad2 #an optimized variant of BERT, a strong all-round model
      context_window_size: 700 #the size of the window around the answer span
  - name: TextFileConverter #converts files to documents
    type: TextConverter
  - name: Preprocessor #splits documents into smaller ones, and cleans them up
    type: PreProcessor
    params:
      split_by: word #the unit by which you want to split your documents
      split_length: 250 #the maximum number of words in a document
      split_overlap: 30 #enables the sliding window approach
      split_respect_sentence_boundary: True #retains complete sentences in split documents
      language: en #used by NLTK to best detect the sentence boundaries for that language

pipelines: #Here you define the pipelines. For each node, specify its input.
  - name: query 
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
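The wiring rule the example relies on is that every node takes its input from Query, from File, or from a node defined earlier in the same pipeline. The following Python sketch is illustrative only (the check_wiring helper is hypothetical, not part of deepset Cloud or its SDK); it verifies that rule for the example above:

```python
def check_wiring(nodes, root):
    """Check that each node's inputs refer to the pipeline root or an earlier node."""
    seen = {root}  # the root input: Query for query pipelines, File for indexing
    for node in nodes:
        for inp in node["inputs"]:
            if inp not in seen:
                raise ValueError(f"{node['name']} takes input from undefined node {inp}")
        seen.add(node["name"])
    return True

# The node wiring from the example above, expressed as plain data:
query_nodes = [
    {"name": "Retriever", "inputs": ["Query"]},
    {"name": "Reader", "inputs": ["Retriever"]},
]
indexing_nodes = [
    {"name": "TextFileConverter", "inputs": ["File"]},
    {"name": "Preprocessor", "inputs": ["TextFileConverter"]},
    {"name": "Retriever", "inputs": ["Preprocessor"]},
    {"name": "DocumentStore", "inputs": ["Retriever"]},
]

check_wiring(query_nodes, root="Query")
check_wiring(indexing_nodes, root="File")
```

If a node references a name that isn't defined yet, the check fails, which mirrors the validation error you'd get when saving an incorrectly wired pipeline.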

Create a Pipeline with Jupyter Notebooks

If you want to experiment to find the right nodes for your pipeline, this is the method for you. Don't worry if you're not an expert in Python; we created a Notebook template to walk you through all the steps.

  1. Log in to deepset Cloud and go to Pipelines > Create Pipeline.
  2. Under Jupyter Notebook, click Create Pipeline.
  3. Select the type of server that you want to use:
    • CPU: this server is enough if you just want to work on your pipelines without necessarily running them.
    • GPU: this server ensures decent speed and is recommended if you want to run your pipelines.
      The server type that you choose stays the same for as long as the server is active. The server is deactivated after an hour, or when you stop it. To stop the server, in Jupyter Lab, go to File > Hub Control Panel.
  4. When the server is successfully created, click Create a Notebook.
  5. If you want to use the template, double-click the examples folder and open the 01_getting_started_sdk.ipynb file.
  6. Follow the Notebook template for hints, adjusting it to your own needs. These are the steps that you must take:
    1. Import the necessary components, such as the Pipeline object, pipeline nodes, and document store that you want to use.
    2. Set up the API key as an environment variable (DEEPSET_CLOUD_API_KEY). See Generate the API key.
    3. Import your file corpus to deepset Cloud. These are the files your pipeline will search. For instructions on importing files, see Work with Your Data in SDK.
    4. Create a pipeline. First, give your pipeline a name that doesn't contain spaces. Then, define all the nodes that you want to use and their parameters. After that, define the order of the nodes in the pipeline. You can find an example pipeline in the Notebook template.
    5. Save your pipeline to deepset Cloud using save_to_deepset_cloud(). For more information, see Pipeline Methods.

Sharing Your Pipeline

You can share your notebook with your organization. This way, your colleagues get access to your code and can help you troubleshoot or enhance your work.

To share a notebook, drag it to the folder named <your organization>-shared, for example deepset-shared. All members of your organization can access and edit the notebooks in the shared folder.

Create a Pipeline with REST API

This method works well if you have a pipeline YAML ready and want to upload it to deepset Cloud. You need to Generate an API Key first.

Use the following code:

curl --request POST \
     --url https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE>/pipelines \
     --header 'Accept: application/json' \
     --header 'Authorization: Bearer <YOUR_API_KEY>' \
     --data-binary "@path/to/pipeline.yaml"

See also the REST API endpoint documentation.
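If you prefer Python over curl, the same request can be built with the standard library. This sketch only constructs the request and doesn't send it (the urlopen call is left out so it runs without credentials); the workspace name, API key, and inline YAML body are placeholders you'd replace:

```python
import urllib.request

workspace = "<YOUR_WORKSPACE>"  # replace with your workspace name
api_key = "<YOUR_API_KEY>"      # replace with your API key
# Normally you'd read this from your pipeline.yaml file instead:
yaml_body = b'version: "1.10.0"\nname: "pipeline_name"\n'

request = urllib.request.Request(
    url=f"https://api.cloud.deepset.ai/api/v1/workspaces/{workspace}/pipelines",
    data=yaml_body,
    headers={
        "Accept": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it.
```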

What To Do Next

If you want to use your newly created pipeline for search, you must deploy it.


Related Links

For sample pipelines, see: