Indexes

Indexes preprocess your files, preparing them for search, and store the resulting documents in a document store of your choice. You can reuse indexes across your query pipelines. Learn how they work.

What's an Index

An index is a data structure that provides your pipelines with fast, efficient access to your large-scale datasets. Think of it as a book index that helps you find specific topics without reading the entire book. This is what makes search over large datasets fast and practical.

How Indexes Work

Indexes work on the files uploaded to deepset AI Platform. An index consists of configurable components connected together, each performing a single task on your files, such as conversion, cleaning, or splitting. The components clean the files and chunk them into smaller passages called documents. The index then writes the resulting documents to a database called a document store, from which the pipeline retrieves them at query time.
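The stages above can be sketched in plain Python. This is an illustrative toy, not the platform's actual components; the function names and the list-based store are made up for the example:

```python
# Toy sketch of the indexing stages (convert, clean, split, write).
# All names here are invented for illustration; real indexes are built
# from configurable components in Pipeline Builder.

def convert(raw_file: str) -> str:
    """Extract plain text from a raw file (trivial in this sketch)."""
    return raw_file

def clean(text: str) -> str:
    """Normalize whitespace."""
    return " ".join(text.split())

def split(text: str, words_per_chunk: int = 5) -> list[str]:
    """Chunk the text into smaller passages ('documents')."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

document_store: list[dict] = []   # stand-in for a real document store

def write(documents: list[str]) -> None:
    document_store.extend({"content": d} for d in documents)

# Run the index on one uploaded file.
write(split(clean(convert("Indexes  preprocess files\nfor search."))))
print(len(document_store))  # number of documents produced from one file
```

At query time, the pipeline reads from `document_store` instead of re-processing the files.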

Indexes are optional, meaning that if you have a query pipeline that queries files in an existing database, like Snowflake, or uses the model's knowledge, you don't need an index.

Indexes are specific to a workspace. Deleting a workspace deletes all the indexes in that workspace.

Documents

A document is a Haystack data class with specific properties you can access. One file may produce multiple documents, and documents inherit metadata from the file they come from. For details, see the Haystack documentation for data classes.
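As a simplified sketch, a document pairs a text passage with metadata inherited from its file. The real Haystack `Document` class has more fields (for example, an id and an embedding); see the Haystack documentation for the authoritative definition:

```python
# Simplified sketch of a Document-style data class. The real Haystack
# Document has additional fields; this only illustrates content + metadata.
from dataclasses import dataclass, field

@dataclass
class Document:
    content: str                              # the text passage produced by the index
    meta: dict = field(default_factory=dict)  # metadata inherited from the file

# One file can produce multiple documents that share the file's metadata.
file_meta = {"file_name": "report.pdf"}
docs = [Document(content=chunk, meta=dict(file_meta))
        for chunk in ["First passage.", "Second passage."]]
print(len(docs))  # 2
```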

Document Stores

A document store is where your index writes the data and where your pipelines then access it. When setting up a pipeline, you connect it to a document store and choose an index directly on the Document Store component card.

For details, see Document Stores.
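Conceptually, a document store has a write path used by the index and a read path used by the query pipeline. The sketch below uses a plain in-memory list with made-up method names; real stores such as OpenSearch, Pinecone, or Weaviate add persistence, filtering, and vector search:

```python
# Minimal in-memory stand-in for a document store: the index writes
# documents, the query pipeline reads them back. Class and method names
# are invented for this illustration.

class InMemoryStore:
    def __init__(self):
        self._documents: list[dict] = []

    def write_documents(self, documents: list[dict]) -> int:
        """Called by the index; returns how many documents were written."""
        self._documents.extend(documents)
        return len(documents)

    def filter_documents(self, keyword: str) -> list[dict]:
        """Called at query time; naive keyword match."""
        return [d for d in self._documents if keyword in d["content"]]

store = InMemoryStore()
store.write_documents([{"content": "Indexes preprocess files."},
                       {"content": "Retrievers fetch documents."}])
print(store.filter_documents("Retrievers"))
```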

Indexing

The files in deepset AI Platform are indexed once when an index is enabled. New files uploaded after you enable an index are indexed individually and added to the enabled index.

Core and Integration Indexes

Core indexes write files into the OpenSearchDocumentStore, which is the core document store of deepset AI Platform. This means deepset manages the infrastructure and authorization and has access to the indexing information.

Integration indexes use one of the integrated document stores, such as Pinecone, Weaviate, or others. For these document stores, you manage the infrastructure yourself, and deepset AI Platform doesn't have access to the indexing information. For details, see Document Stores.

Indexes and Pipelines

To run searches on files in deepset AI Platform, a pipeline must be connected to an index. You do this by adding a document store to a query pipeline and choosing the index you want this document store to use.

The document store card with an index selected

Multiple pipelines can use a single index; one pipeline can use multiple indexes. An index must be enabled to be used in a query pipeline.

Building Indexes

deepset AI Platform provides a set of curated and maintained index templates for various file types. You can use one of the templates to build your index or you can start from scratch.

Indexes are built the same way as query pipelines: you drag components onto the canvas in Pipeline Builder and connect them so that the output type of each component matches the input type of the next.
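The matching rule can be illustrated with a toy type check in plain Python. The class and component names are invented for the example; in the platform, connections are drawn visually in Pipeline Builder:

```python
# Toy sketch: a connection is only valid when the upstream component's
# output type matches the downstream component's input type. All names
# here are invented for illustration.

class Component:
    def __init__(self, name: str, input_type: type, output_type: type):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type

def connect(upstream: Component, downstream: Component) -> None:
    """Refuse connections whose types don't line up."""
    if upstream.output_type is not downstream.input_type:
        raise TypeError(
            f"Cannot connect {upstream.name} ({upstream.output_type.__name__}) "
            f"to {downstream.name} ({downstream.input_type.__name__})"
        )

cleaner = Component("cleaner", input_type=str, output_type=str)
splitter = Component("splitter", input_type=str, output_type=list)

connect(cleaner, splitter)      # OK: str output feeds a str input
try:
    connect(splitter, cleaner)  # list output cannot feed a str input
except TypeError as err:
    print(err)
```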

For details, check How do pipelines work in Pipelines.

Inputs

Indexes always start with files as input. In Pipeline Builder, add the FilesInput component at the beginning of an index.

The FilesInput component from the Inputs group in Studio added as the first component of an indexing pipeline

Outputs

Indexes return a list of Document objects as output, usually written into the document store by the DocumentWriter component, which is often the last component in an index.

Body

The index body defines what happens to the files you uploaded to deepset. It's up to you how you process them. For hints and best practices, see PreProcessing Data with Pipeline Components.

Enabling Indexes

To start indexing, you enable an index from the Indexes page. An enabled index is in view-only mode, so you can't edit it. To make changes, you disable the index first. However, you can't disable an index that's used by a deployed pipeline. In this case, you can duplicate the index and update the copy.

To deploy a query pipeline, you must enable all indexes used by this pipeline.

Using Indexes in Query Pipelines

Indexes are linked to a document store, where documents are written and stored. Pipelines that work with your data include Retrievers, which fetch data from the document store. When setting up a Retriever, you need to connect it to a document store. After that, you can select an index from the document store card. This index is the one your query pipeline will use to search and retrieve data from the document store.

To use multiple indexes in a single query pipeline, create a separate document store for each index. Then, assign a retriever to each document store, since a retriever can only be connected to one document store at a time.
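As an illustrative sketch in plain Python (store contents and names are made up for the example), a query pipeline that uses two indexes pairs each document store with its own retriever and merges the results:

```python
# Illustrative sketch: one retriever per document store, results merged.
# Stores are plain lists of documents here; all names are invented.

invoices_store = [{"content": "Invoice 42: 300 EUR"}]
manuals_store = [{"content": "Manual: resetting the device"}]

def make_retriever(store: list[dict]):
    """Each retriever is bound to exactly one document store."""
    def retrieve(query: str) -> list[dict]:
        return [d for d in store if query.lower() in d["content"].lower()]
    return retrieve

invoice_retriever = make_retriever(invoices_store)
manual_retriever = make_retriever(manuals_store)

def search(query: str) -> list[dict]:
    # The query pipeline runs both retrievers and merges their results.
    return invoice_retriever(query) + manual_retriever(query)

print(search("invoice"))
```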

The Indexes Page

The Indexes page lists all indexes that exist in a workspace and their status. There are two tabs on the page:

  • Active, for enabled indexes
  • Drafts, for indexes that were saved but not yet enabled

The Index Details Page

Click an index name to open the Index Details Page where you can check:

  • The status of files this index processes
  • Details of pipelines connected to this index
  • Index logs

Index Status

  • Not indexed: The pipeline is being deployed, but the files have not yet been indexed.
  • Indexing: Your files are being indexed. You can see how many files have already been indexed if you hover your mouse over the Indexing label.
  • Indexed: Your pipeline is deployed, all the files are indexed, and you can use your pipeline for search.
  • Partially indexed: At least one of the files wasn't indexed. This may be a problem with your file or a component in the pipeline. You can still run a search if at least some files were indexed. Check the index logs for details.