About Pipelines

Pipelines contain the processing stages needed to execute a query and index your files. These stages are pipeline components, also called nodes, that are connected in series so that the output of one node is used by the next node in the pipeline.

How Do Pipelines Work?

A pipeline defines how data flows through its nodes to produce search results. For example, a basic pipeline can consist of a retriever and a reader. The retriever goes through all the documents you want to search and selects the ones most relevant to the query. The reader then takes the documents selected by the retriever and highlights the word, phrase, sentence, or paragraph that answers your query.
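
To make the retriever-then-reader flow concrete, here is a minimal Python sketch. It is a toy stand-in, not the Haystack or deepset Cloud API: the functions `tokenize`, `retrieve`, and `read` are hypothetical, the sample documents are made up, and simple word overlap replaces the real ranking and answer-extraction models.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by how many query words they share."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def read(query, documents):
    """Toy reader: return the sentence sharing the most words with the query."""
    query_words = tokenize(query)
    sentences = [
        sentence.strip()
        for doc in documents
        for sentence in re.split(r"(?<=[.!?])\s+", doc)
        if sentence.strip()
    ]
    return max(sentences, key=lambda s: len(query_words & tokenize(s)))


# Made-up sample documents, for illustration only.
documents = [
    "Khartoum is the capital of Sudan. It lies on the Nile.",
    "Cairo is the capital of Egypt.",
    "The Nile is a river in Africa.",
]

query = "What is the capital of Sudan?"
candidates = retrieve(query, documents, top_k=2)  # the retriever narrows the set
answer = read(query, candidates)                  # the reader picks the answer
print(answer)
```

The reader only sees the documents the retriever passes on, which is why the retriever's `top_k` setting affects both speed and answer quality.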

Nodes are like building blocks that you can mix, match, or replace. They can be connected as a Directed Acyclic Graph (DAG), which allows for more complex workflows, such as decision nodes or combining the output of multiple nodes.
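
As a rough illustration of what a DAG of nodes means in practice, the sketch below runs a small graph in which two retrievers branch off the query and a join node combines their output. It uses Python's standard `graphlib` module (Python 3.9 or later). The node names, the `run_pipeline` helper, and the hard-coded node outputs are all hypothetical; this is not how deepset Cloud executes pipelines internally.

```python
from graphlib import TopologicalSorter

# Each key is a node; its value is the set of nodes whose output it consumes.
graph = {
    "Query": set(),
    "KeywordRetriever": {"Query"},
    "DenseRetriever": {"Query"},
    "Join": {"KeywordRetriever", "DenseRetriever"},
    "Reader": {"Join"},
}

# Hypothetical node behavior: each function receives a list of upstream outputs.
node_functions = {
    "KeywordRetriever": lambda inputs: ["doc_a", "doc_b"],
    "DenseRetriever": lambda inputs: ["doc_b", "doc_c"],
    "Join": lambda inputs: sorted({doc for docs in inputs for doc in docs}),
    "Reader": lambda inputs: f"answer from {len(inputs[0])} documents",
}


def run_pipeline(graph, node_functions, query):
    """Run every node after all of its inputs and collect the outputs."""
    outputs = {"Query": query}
    order = [n for n in TopologicalSorter(graph).static_order() if n != "Query"]
    for node in order:
        upstream = [outputs[name] for name in sorted(graph[node])]
        outputs[node] = node_functions[node](upstream)
    return outputs


outputs = run_pipeline(graph, node_functions, "What is the capital of Sudan?")
print(outputs["Join"])    # documents from both retrievers, duplicates removed
print(outputs["Reader"])
```

Because the graph has no cycles, a topological order always exists, so every node can run after the nodes it depends on.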

Pipelines run on the files you add to deepset Cloud and turn them into documents. A document is a piece of text stored in the document store. Multiple documents may come from one file.
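
To see how one file can yield several documents, consider this simplified Python sketch of word-based splitting. It mirrors the `split_by: word` and `split_length: 250` settings used in the example pipeline on this page, but the function `split_into_documents` is hypothetical, and the actual PreProcessor node may clean text and handle boundaries differently.

```python
def split_into_documents(text, split_length=250):
    """Split text into documents of at most split_length words each."""
    words = text.split()
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


file_text = " ".join(f"word{i}" for i in range(600))  # a made-up 600-word file
docs = split_into_documents(file_text, split_length=250)

print(len(docs))                             # documents produced from one file
print([len(doc.split()) for doc in docs])    # word count of each document
```

Here a single 600-word file becomes three documents: two full ones and a shorter remainder.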

deepset Cloud currently supports two types of pipelines: question answering and information retrieval.

Indexing and Query Pipelines

To run a search in deepset Cloud, you must define two pipelines in your pipeline file:

  • A query pipeline that contains a recipe for how to execute a query.
    [Image: a query pipeline. On the left, the question "What is the capital of Sudan" passes through three nodes connected with arrows. The nodes communicate with the document store and fetch the answer: Khartoum.]
  • An indexing pipeline that defines how you want to preprocess your files before running a search on them.
    [Image: an indexing pipeline. A file, shown on the left, moves through the nodes of the pipeline and ends up as a document in the document store.]

In deepset Cloud, you define the indexing and query pipelines in one file, which you later deploy to use for search.

When you deploy your pipeline, it indexes the files, turns them into documents, and stores them in the document store, from which they're retrieved at search time. The exact steps involved in indexing depend on the retrieval method you choose.
Your files are indexed once; they don't get indexed every time a pipeline runs. If you add a new file after you deploy your pipeline, only this file is indexed. The same is true for conversion. If you're using a Converter node in your pipeline, it converts the files only once; it doesn't convert them every time you run your search.
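
The index-once behavior can be pictured with a small Python sketch. The `ToyIndex` class is hypothetical and only tracks file names; it is meant to show the idea that already-indexed files are skipped, not how deepset Cloud implements it.

```python
class ToyIndex:
    """Keeps track of which files were already converted and indexed."""

    def __init__(self):
        self.documents = {}   # file name -> list of documents
        self.conversions = 0  # how many files were actually processed

    def index(self, files):
        """Index only files not seen before; return the newly indexed names."""
        newly_indexed = []
        for name, text in files.items():
            if name in self.documents:
                continue  # already indexed: skip conversion and indexing
            self.documents[name] = [text]  # stand-in for convert + preprocess
            self.conversions += 1
            newly_indexed.append(name)
        return newly_indexed


index = ToyIndex()
files = {"a.txt": "first file", "b.txt": "second file"}
first = index.index(files)     # at deployment, both files are indexed
files["c.txt"] = "third file"
second = index.index(files)    # later, only the new file is indexed
print(first, second, index.conversions)
```

Running a search never calls `index` again in this sketch, which matches the statement that neither conversion nor indexing is repeated at query time.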

Example of an indexing and a query pipeline
version: '1.8.0'
name: "my_sample_pipeline"

components:    # define all the nodes that make up your pipeline:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: ElasticsearchRetriever
    params:
      document_store: DocumentStore    # params can reference other Components defined in the YAML
      top_k: 20
  - name: Reader       # custom-name for the component; helpful for visualization & debugging (coming soon)
    type: FARMReader    # Haystack class name for the Component
    params:
      model_name_or_path: deepset/roberta-base-squad2-distilled
      context_window_size: 500
      return_no_answer: true
  - name: TextFileConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 250
      language: en # Specify the language of your documents

pipelines:
# this is the query pipeline:
  - name: query    # a sample extractive-qa Pipeline
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
# this is the indexing pipeline:
  - name: indexing
    type: Indexing
    nodes:
      - name: TextFileConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
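
One way to sanity-check a definition like the one above is to confirm that every node used in a pipeline is declared under `components` and that each input is either a root (`Query` or `File`) or a node listed earlier. The Python sketch below does this on a hand-written copy of the example. The `check_pipeline` helper is hypothetical, it assumes nodes are listed after their inputs as in the example, and it is not the validation deepset Cloud performs.

```python
# Hand-written mirror of the YAML example: component names and pipeline nodes.
components = [
    "DocumentStore",
    "Retriever",
    "Reader",
    "TextFileConverter",
    "Preprocessor",
]

pipelines = {
    "query": [
        ("Retriever", ["Query"]),
        ("Reader", ["Retriever"]),
    ],
    "indexing": [
        ("TextFileConverter", ["File"]),
        ("Preprocessor", ["TextFileConverter"]),
        ("Retriever", ["Preprocessor"]),
        ("DocumentStore", ["Retriever"]),
    ],
}

ROOTS = {"Query", "File"}  # entry points used by the example pipelines


def check_pipeline(nodes, components):
    """Return a list of problems found in one pipeline's node list."""
    problems = []
    seen = set()
    for name, inputs in nodes:
        if name not in components:
            problems.append(f"{name} is not declared under components")
        for source in inputs:
            if source not in ROOTS and source not in seen:
                problems.append(
                    f"{name} reads from {source}, which is not defined earlier"
                )
        seen.add(name)
    return problems


for pipeline_name, nodes in pipelines.items():
    print(pipeline_name, check_pipeline(nodes, components))
```

An empty list for both pipelines means each node is declared and every input resolves to a root or an earlier node.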

Pipeline Nodes

Nodes are the components that make up your pipeline. Choosing the right nodes for your pipeline is crucial to achieving the most relevant search results. Check the available nodes and find out about their superpowers.

Each node comes in different types, and each type is designed with a particular task in mind. For example, if you are looking for a retrieval method that doesn't need a neural network for indexing, you can use ElasticsearchRetriever (BM25). You can also specify parameters for your nodes to make them work exactly as you need.

When choosing a node for your pipeline, ensure that:

  • It is optimal for the type of data that you want to run your search on
  • It is compatible with the datastore that you want to use
  • It is already supported by deepset Cloud

This table lists some of the nodes that you can use in your search system:

| Node | Available types (classes) | What's it best for? |
| --- | --- | --- |
| FileTypeClassifier | Only one type | Routes files with different extensions to appropriate file converters. Useful if you have different types of files. |
| TextConverter | Only one type | Converts a file to a document object. |
| PDFToTextConverter | Only one type | Converts a PDF file to plain text. |
| PreProcessor | Only one type | Cleans files and splits them into documents. Dealing with long documents can be a problem for some nodes. Long documents slow down the reader. Also, dense retrievers can only read about 500 words of a document. Use a preprocessor to get around it. |
| Retriever | ElasticsearchRetriever (BM25), ElasticsearchFilterOnlyRetriever, TfidfRetriever, EmbeddingRetriever, DensePassageRetriever (DPR) | Filters documents from the document store to retrieve a collection of documents relevant to the query. When combined with a reader, it speeds up a query. When used on its own, returns whole documents as answers. |
| Reader | FARMReader, TransformersReader | The core component that fetches the right answers. Use a reader if you want your answers highlighted. |

Additionally, deepset Cloud currently supports the following document stores:

The Pipelines Page

All the pipelines created by your organization are listed on the Pipelines page. The pipelines listed under Deployed are the ones that you can run your search with. The pipelines under In Development are drafts that you must deploy before you can use them for your search.

Pipeline Status

When you deploy a pipeline, it changes its status as follows:

  • Not indexed: The pipeline is being deployed, but the files have not yet been indexed.
  • Indexing: Your files are being indexed. You can see how many files have already been indexed if you hover your mouse over the Indexing label.
  • Indexed: Your pipeline is deployed, all the files are indexed, and you can use your pipeline for search.
  • Indexing failed: The files could not be indexed. This may be an NLP-related problem, a problem with your file, or a problem with a node in the pipeline.
  • Unhealthy: It's a temporary state. Your pipeline was not deployed, and the files are not indexed. You can wait a couple of seconds for this to resolve.
  • Failed to deploy: It's a fatal state. Your pipeline was not deployed, and your files are not indexed. For ideas on how to fix it, see Troubleshoot Pipeline Deployment.
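
If you track pipeline status in your own scripts, the list above can be modeled as a small enumeration. This Python sketch is an assumption-laden illustration: the `PipelineStatus` enum and the `next_step` helper are hypothetical names, not part of any deepset Cloud client, and the suggested actions simply restate the descriptions above.

```python
from enum import Enum


class PipelineStatus(Enum):
    NOT_INDEXED = "Not indexed"
    INDEXING = "Indexing"
    INDEXED = "Indexed"
    INDEXING_FAILED = "Indexing failed"
    UNHEALTHY = "Unhealthy"
    FAILED_TO_DEPLOY = "Failed to deploy"


# Statuses that are expected to resolve on their own.
TRANSIENT = {
    PipelineStatus.NOT_INDEXED,
    PipelineStatus.INDEXING,
    PipelineStatus.UNHEALTHY,
}


def next_step(status):
    """Suggest what to do for a given status, following the list above."""
    if status is PipelineStatus.INDEXED:
        return "search"        # deployed and fully indexed
    if status in TRANSIENT:
        return "wait"          # deployment or indexing still in progress
    return "troubleshoot"      # indexing failed or deployment failed


for status in PipelineStatus:
    print(f"{status.value}: {next_step(status)}")
```

Only the Indexed status maps to "search" here, reflecting that a pipeline is usable once it is deployed and all files are indexed.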
