Use Unstructured to Process Documents

Convert files to documents using the Unstructured API.

Unstructured provides tools to extract content from files and transform it into clean documents ready to be chunked and embedded. For a list of supported formats, see Unstructured documentation. You can use free Unstructured API or paid Unstructured Serverless API.

Prerequisites

You need an API key to your Unstructured account.

Use Unstructured

First, connect deepset AI Platform to Unstructured through the Integrations page. You can set up a connection for a single workspace or for the whole organization:

Add Workspace-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Workspace>Integrations.
  3. Find the provider you want to connect and click Connect next to them.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in the current workspace.

Add Organization-Level Integration

  1. Click your profile icon and choose Settings.
  2. Go to Organization>Integrations.
  3. Find the provider you want to connect and click Connect next to them.
  4. Enter the API key and any other required details.
  5. Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.

Then, add the UnstructuredFileConverter component to your index.

Usage Examples

This is an example of an index that uses Unstructured API to process files:

components:
  ...
  unstructured_converter:
    type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters: {}

  splitter:
    type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: True
      language: en

  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"

  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: OVERWRITE

connections:  # Defines how the components are connected
- sender: unstructured_converter.documents
  receiver: splitter.documents
- sender: splitter.documents
  receiver: document_embedder.documents
- sender: document_embedder.documents
  receiver: writer.documents

max_loops_allowed: 100

inputs:  # Define the inputs for your index
  files: "file_classifier.sources"  # This component will receive the files to index as input