Pipeline Nodes

Nodes are the components that make up your pipeline. Choosing the right nodes for your pipeline is crucial to achieving the most relevant search results.

Nodes define how data flows through your pipeline. Some nodes have more than one type. For example, Retrievers can be keyword-based or vector-based. You can choose the type that best fits the task at hand. You can also specify parameters for your nodes to make them work exactly as you need.
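As a minimal sketch of node types and parameters, the fragment below declares one keyword-based and one vector-based Retriever. The component names, the `top_k` values, and the embedding model are assumptions chosen for illustration, not recommended settings. Check each node's documentation page for the parameters it actually supports.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: KeywordRetriever # hypothetical name for a keyword-based Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 10 # assumed value: number of documents to return
  - name: VectorRetriever # hypothetical name for a vector-based Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # assumed model
      model_format: sentence_transformers
      top_k: 10 # assumed value
```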

When choosing a node for your pipeline, make sure it's optimal for the type of data you want to run your search on.

How Do The Nodes Combine?

To combine two nodes in a pipeline, the output of the first node must match the input that the next node expects. For example, as in the picture below, TextConverter takes Files as input and returns Documents as output. You can follow it with PreProcessor, which takes Documents as input, so the output and input of these two nodes are compatible.

Figure: Files flow into TextConverter, which outputs Documents. The Documents then flow into PreProcessor.

When connecting the nodes, you pass the name of a compatible node as the input for the node that follows it. For example:

components:
  - name: Converter # here you give your node a name
    type: TextConverter
  - name: Processor
    type: PreProcessor
    params:
      ....

pipelines:
  - name: indexing
    nodes:
      - name: Converter
        inputs: [File]
      - name: Processor
        inputs: [Converter] # this means Processor takes the output of Converter as its input

See a node's documentation page for information about compatible nodes. For instructions on creating pipelines, see Create a Pipeline.

Nodes Used in Indexing Pipelines for Processing Data

These are the nodes you can use to perform tasks on your data in an indexing pipeline:

  • CNAzureConverter
    Extracts text and tables from PDF, JPEG, PNG, BMP, and TIFF files using Microsoft Azure Form Recognizer. You must have an active Azure account and a Form Recognizer or Cognitive Services resource to use it.
  • EntityExtractor
    Extracts entities from all documents in the Document Store and stores them in the documents' metadata.
  • FileTypeClassifier
    Useful if you have different types of files, for example PDF and TXT. It classifies the files by type and then routes them to appropriate file converters which further prepare them for search.
  • TextConverter
    Necessary in an indexing pipeline if you have TXT files. It converts them to Document objects that deepset Cloud pipelines search on.
  • PDFToTextConverter
    Necessary in an indexing pipeline if you have PDF files. It converts the files to Document objects that deepset Cloud pipelines search on.
  • PreProcessor
    Cleans and splits Documents into smaller chunks to make Readers' and Retrievers' work easier and faster. Used after file converters.
  • Vector-Based Retrievers
    Vector-based retrievers in indexing pipelines calculate vector representations (embeddings) of Documents and store these embeddings in DocumentStore.
Flowchart of an indexing pipeline combining these nodes: Files enter a FileTypeClassifier, which routes text and tables to an AzureConverter and PDFs to a PDFToTextConverter. Both converters feed a PreProcessor, followed by a Retriever and then the DocumentStore. The output is Documents.
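The indexing flow can be sketched in YAML roughly as follows. This is a sketch under assumptions, not a definitive template: it uses TextConverter and PDFToTextConverter for the two branches, the component names and PreProcessor values are illustrative, the embedding model is an arbitrary choice, and it assumes that FileTypeClassifier sends TXT files to `output_1` and PDF files to `output_2` with its default settings.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: TextConverter
    type: TextConverter
  - name: PDFConverter # hypothetical name
    type: PDFToTextConverter
  - name: Preprocessor # hypothetical name
    type: PreProcessor
    params:
      split_by: word
      split_length: 250 # assumed chunk size in words
      split_overlap: 30 # assumed overlap between chunks
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # assumed model
      model_format: sentence_transformers

pipelines:
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextConverter
        inputs: [FileTypeClassifier.output_1] # assumed: TXT files
      - name: PDFConverter
        inputs: [FileTypeClassifier.output_2] # assumed: PDF files
      - name: Preprocessor
        inputs: [TextConverter, PDFConverter] # both converters output Documents
      - name: Retriever
        inputs: [Preprocessor] # calculates embeddings for the split Documents
      - name: DocumentStore
        inputs: [Retriever]
```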

See also Data Preparation with Pipeline Nodes.

Nodes Used in Query Pipelines

Here are all the nodes you can use in your query pipelines, grouped by their function.

Semantic Search Nodes

  • EntityExtractor
    Extracts entities from documents fetched by the Retriever and stores them in the documents' metadata.
  • Retriever
    Goes through the documents in the DocumentStore and fetches the ones that are most relevant to the query. You can use it on its own for document retrieval. It then returns whole documents as answers.
    You can combine it with a Reader for question answering to highlight the answer in the document.
  • Ranker
    Prioritizes documents based on the criteria you specify. For example, you can prioritize the newest documents.
  • Reader
    The core component that fetches the answers by highlighting them in the documents.
  • RetrievalScoreAdjuster
    Adjusts the scores Ranker or Retriever assigned to the retrieved documents.

Extractive QA pipeline

Here's a basic extractive question answering pipeline:

Flowchart: the Query goes to the Retriever, which fetches documents from the DocumentStore and passes them to the Reader. The Reader returns the Answer.
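A query pipeline with this shape could look roughly like the sketch below. The choice of BM25Retriever and FARMReader, the reader model, and the `top_k` value are assumptions for illustration. Any compatible Retriever and Reader would follow the same wiring.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever # assumed retriever type
    params:
      document_store: DocumentStore
      top_k: 20 # assumed number of documents passed to the Reader
  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: deepset/roberta-base-squad2 # assumed model

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever] # Reader highlights answers in the retrieved documents
```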

RAG with a Ranker

Here's a RAG pipeline using a hybrid document search and a Ranker:

Flowchart: the Query goes to a QueryClassifier, which sends natural language queries to a Vector-Based Retriever and keyword queries to a Keyword-Based Retriever. Both retrievers fetch documents from the DocumentStore. JoinDocuments combines their results, a Ranker orders the joined documents, and a PromptNode returns the Generated Answer.

Nodes Using LLMs

  • PromptNode
    A versatile node that uses an LLM to perform a variety of NLP tasks, such as retrieval augmented generation (RAG) question answering, summarization, and translation. It comes with out-of-the-box prompts for the most common tasks.
Flowchart of a basic RAG pipeline with a PromptNode: the Query goes to the Retriever, which fetches documents from the DocumentStore and passes them to the PromptNode. The PromptNode returns the Answer.
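A basic RAG query pipeline of this kind might be configured along these lines. Treat it as a sketch: the model name, the prompt template name, and the `max_length` and `top_k` values are assumptions, and a hosted model typically also needs credentials that are not shown here.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: Retriever
    type: BM25Retriever # assumed retriever type
    params:
      document_store: DocumentStore
      top_k: 5 # assumed number of documents inserted into the prompt
  - name: PromptNode
    type: PromptNode
    params:
      model_name_or_path: gpt-3.5-turbo # assumed model
      default_prompt_template: deepset/question-answering # assumed out-of-the-box prompt
      max_length: 400 # assumed maximum length of the generated answer

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: PromptNode
        inputs: [Retriever] # retrieved documents become the context for the LLM
```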
Flowchart of a basic document search pipeline: the Query leads to the Retriever and to the DocumentStore, and both paths end at the Documents. Ranker and RetrievalScoreAdjuster are each connected to the Retriever by a dashed line.

Routing Nodes

  • QueryClassifier
    Distinguishes between keyword queries and natural language queries and routes them to the node that can handle them best. For example, you can use it to route keyword queries to a keyword-based retriever, like BM25Retriever, and natural language queries to a vector-based retriever, like EmbeddingRetriever.
Flowchart of a RAG pipeline with QueryClassifier in a query pipeline: the Query goes to the QueryClassifier, which sends natural language queries to a Vector-Based Retriever and keyword queries to a Keyword-Based Retriever. Both retrievers fetch documents from the DocumentStore. JoinDocuments combines their results and passes them to a PromptNode, which returns the Answer.
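The routing part of such a pipeline can be sketched as shown below. Several details are assumptions: the classifier type `TransformersQueryClassifier` and its model, the convention that natural language queries leave through `output_1` and keyword queries through `output_2`, the component names, and the `join_mode` value. A PromptNode or Reader would then take `JoinDocuments` as its input.

```yaml
components:
  - name: DocumentStore
    type: DeepsetCloudDocumentStore
  - name: QueryClassifier
    type: TransformersQueryClassifier # assumed classifier type
    params:
      model_name_or_path: shahrukhx01/bert-mini-finetune-question-detection # assumed model
  - name: VectorRetriever # hypothetical name
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1 # assumed model
      model_format: sentence_transformers
  - name: KeywordRetriever # hypothetical name
    type: BM25Retriever
    params:
      document_store: DocumentStore
  - name: JoinDocuments
    type: JoinDocuments
    params:
      join_mode: concatenate # assumed join mode

pipelines:
  - name: query
    nodes:
      - name: QueryClassifier
        inputs: [Query]
      - name: VectorRetriever
        inputs: [QueryClassifier.output_1] # assumed: natural language queries
      - name: KeywordRetriever
        inputs: [QueryClassifier.output_2] # assumed: keyword queries
      - name: JoinDocuments
        inputs: [VectorRetriever, KeywordRetriever]
```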

Utility Nodes

  • AnswerDeduplication
    In extractive question answering pipelines, used after the FARMReader to get rid of duplicate answers the Reader returns.
  • JoinDocuments
    Combines the output of two or more retrievers. Useful if you want to use a keyword-based and a dense retriever in one pipeline.
  • Shaper
    Modifies values by renaming them or changing their type. Used with PromptNode to ensure it receives or outputs a specific value.
  • ReferencePredictor
    Used in retrieval-augmented generation (RAG) pipelines to predict references of the answers the LLM generates.
  • ReturnError
    Attaches an error message to the answer's metadata and ends the pipeline. Frequently used in RAG pipelines as a branch where prompt injection attempts are redirected. On receiving a prompt injection, ReturnError stops the pipeline, ensuring the prompt never reaches the PromptNode.
  • InterleaveDocuments
    Interleaves documents coming from different retrievers into a single list. Used for pre-filtering documents for labeling.
  • SnowflakeExecutor
    Establishes a connection to a Snowflake database. This way, you can query your data in Snowflake with your deepset Cloud pipeline.
Flowchart of a RAG pipeline combining JoinDocuments and ReferencePredictor: the Query goes to a QueryClassifier, which sends natural language queries to a Vector-Based Retriever and keyword queries to a Keyword-Based Retriever. Both retrievers fetch documents from the DocumentStore. JoinDocuments combines their results and passes them to a PromptNode, followed by a ReferencePredictor, which returns the Answer.
Flowchart of a RAG pipeline with ReturnError: the Query goes to a QueryClassifier. Genuine queries go to a Retriever, which is connected to the DocumentStore, and prompt injections go to ReturnError. The documents then go to a Ranker and on to a PromptNode, which returns the Generated Answer.
Flowchart of an extractive question answering pipeline with AnswerDeduplication: the Query goes to the Retriever, which fetches documents from the DocumentStore and passes them to the Reader. The Reader's answers go through AnswerDeduplication, which returns the Highlighted Answer.