You must be an Admin to perform this task.
Each pipeline file defines two pipelines:
- An indexing pipeline that defines how your files are preprocessed. Whenever you add a file, it is preprocessed by all deployed pipelines.
- A query pipeline that describes how the query is run.
There are multiple ways to create a pipeline:
- Using the guided workflow: Choose this way if you're new to pipelines and you'd like us to guide you through the process of creating one. Just tell us what you want to do and we'll create a pipeline that matches your goals best. You'll be able to use it right away.
- Using YAML: Choose this way if you want to create your own pipeline from one of the ready-made templates or from scratch using a YAML editor. The YAML editor comes with a Pipeline Visualizer that shows you the pipeline structure as a diagram, which makes it easier to understand how data flows through your pipeline.
- Using API: Choose this way if you already have a pipeline YAML file and want to programmatically upload it to deepset Cloud.
The YAML method works well if you already know what your pipeline should look like or want to start from one of the ready-made templates. It's recommended that you have a basic understanding of YAML.
If you want to use a private model, you must host it on Hugging Face and then connect deepset Cloud with Hugging Face. When creating a pipeline, copy the model name from Hugging Face and paste it into your pipeline. deepset Cloud downloads and loads the model for you.
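For example, a retriever node can reference the private model by the name you copied from Hugging Face. This is a minimal sketch, not a complete pipeline, and `my-org/my-private-model` is a hypothetical placeholder for your own model name:

```yaml
components:
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: my-org/my-private-model # hypothetical private model name copied from Hugging Face
      model_format: sentence_transformers
```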
Your pipeline definition file should have the following format. Ensure that you follow the same indentation structure as in this example:
```yaml
#This is not a complete pipeline! It's just an example to show you the pipeline format.
version: '1.22.0' #always set version to this value
components: #This part defines your pipeline components and their settings
  - name: MyPreprocessor #this is the name that you want to give to the pipeline node
    type: PreProcessor #this is the node type (class). For more information, see "Pipeline Nodes"
    params: #these are the node's settings
      split_by: passage
      split_length: 1
  - name: DocumentStore
    type: DeepsetCloudDocumentStore #currently only this document store type is supported
  # Continue until you define all components
#After you define all the components that you want to use, define your query and indexing pipelines:
pipelines:
  - name: query
    nodes: #here list the nodes that you want to use in this pipeline, each node must have a name and input
      - name: ESRetriever #this is the name of the node that you specified in the "components" section above
        inputs: [Query] #here you specify the input for this node; for the first node in a query pipeline, it's always Query
      #and you go on defining the nodes
#Next, specify the indexing pipeline:
  - name: indexing
    nodes:
      - name: MyPreprocessor
        inputs: [File]
      - name: ESRetriever
        inputs: [MyPreprocessor]
...
```
- Log in to deepset Cloud and go to Pipelines > Create Pipeline.
- Give your pipeline a name.
- Choose whether you want to create a pipeline from scratch or use a template.
There are pipeline templates available for different types of systems. All of them work out of the box, but you can also use them as a starting point for your own pipeline.
- In the `components` section of the YAML, configure all the nodes you want to use for the indexing and query pipelines. Each node should have the following parameters:
  - `name` - This is a custom name you give to the node.
  - `type` - This is the node's class. You can check it in Pipeline Nodes if you're not sure.
  - `params` - This section is the node's configuration. It lists the parameters for the node and their settings. If you don't configure any parameters, the node uses its default settings for the mandatory parameters. Here's an example:
```yaml
components:
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2
      model_format: sentence_transformers
      top_k: 10
```
- In the `pipelines` section, define your query and indexing pipelines:
  - For the query pipeline, set the `name` to `query`.
  - For the indexing pipeline, set the `name` to `indexing`.
  - For each pipeline, add the `nodes` section to define the order of the nodes in your pipeline. Each node has a `name` (that's the custom name you gave it in the `components` section) and `inputs` (that's the name of the node or nodes whose output it takes for further processing; it can be one or more nodes).
    The input of the first node in the indexing pipeline is always `File`.
    The input of the first node in the query pipeline is always `Query`.
```yaml
pipelines:
  - name: query
    nodes:
      - name: BM25Retriever
        inputs: [Query] #Query is always the input of the first node in a query pipeline
      - name: EmbeddingRetriever
        inputs: [Query]
      - name: JoinResults
        inputs: [BM25Retriever, EmbeddingRetriever]
      - name: Reranker
        inputs: [JoinResults]
      - name: PromptNode
        inputs: [Reranker]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter]
      - name: EmbeddingRetriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [EmbeddingRetriever]
```
Tip: Use Ctrl + Space to see autosuggestions. To see a list of available models, type the Hugging Face organization name followed by a slash (/).
Tip: To revert your changes to the last saved version, click Reset.
- Save your pipeline.
deepset Cloud validates whether your pipeline design is correct.
- To use your pipeline for search, you must first deploy it. Click Deploy.
An explained example of a pipeline
First, define the components that you want to use in your pipelines. For each component, specify its name, type, and any parameters that you want to use.
After you define your components, define your pipelines. For each pipeline, specify its name (query or indexing) and the nodes that it consists of. For each node, specify its input.
```yaml
version: '1.22.0'
components: # This section defines nodes that we want to use in our pipelines
  - name: DocumentStore
    type: DeepsetCloudDocumentStore # This is the only supported document store
  - name: Retriever # Selects the most relevant documents from the document store and then passes them on to the Reader
    type: EmbeddingRetriever # Uses a Transformer model to encode the document and the query
    params:
      document_store: DocumentStore
      embedding_model: intfloat/e5-base-v2 # Model optimized for semantic search
      model_format: sentence_transformers
      top_k: 20 # The number of results to return
  - name: Reader # The component that actually fetches answers
    type: FARMReader # Transformer-based reader, specializes in extractive QA
    params:
      model_name_or_path: deepset/roberta-large-squad2 # An optimized variant of BERT, a strong all-round model
      context_window_size: 700 # The size of the window around the answer span
  - name: TextFileConverter # Converts files to documents
    type: TextConverter
  - name: Preprocessor # Splits documents into smaller ones, and cleans them up
    type: PreProcessor
    params:
      split_by: word # The unit by which you want to split your documents
      split_length: 250 # The maximum number of words in a document
      split_overlap: 30 # Enables the sliding window approach
      split_respect_sentence_boundary: True # Retains complete sentences in split documents
      language: en # Used by NLTK to best detect the sentence boundaries for that language
pipelines: # Here you define the pipelines. For each component, specify its input.
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query] # The input for the first node is always a query
      - name: Reader
        inputs: [Retriever] # Input is the name of the component that you defined in the "components" section
  - name: indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
This method works well if you have a pipeline YAML ready and want to upload it to deepset Cloud. You need to Generate an API Key first.
Follow the step-by-step code explanation, or use the following code:
```bash
curl --request POST \
  --url https://api.cloud.deepset.ai/api/v1/workspaces/<YOUR_WORKSPACE>/pipelines \
  --header 'Accept: application/json' \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --data-binary "@path/to/pipeline.yaml"
```
See the REST API endpoint documentation.