Use Unstructured to Process Documents
Convert files to documents using the Unstructured API.
Unstructured provides tools to extract content from files and transform it into clean documents ready to be chunked and embedded. For a list of supported formats, see the Unstructured documentation. You can use either the free Unstructured API or the paid Unstructured Serverless API.
Prerequisites
You need an API key for your Unstructured account.
Use Unstructured
First, connect deepset AI Platform to Unstructured through the Integrations page. You can set up a connection for a single workspace or for the whole organization:
Add Workspace-Level Integration
- Click your profile icon and choose Settings.
- Go to Workspace > Integrations.
- Find Unstructured in the list of providers and click Connect next to it.
- Enter the API key and any other required details.
- Click Connect. You can use this integration in pipelines and indexes in the current workspace.
Add Organization-Level Integration
- Click your profile icon and choose Settings.
- Go to Organization > Integrations.
- Find Unstructured in the list of providers and click Connect next to it.
- Enter the API key and any other required details.
- Click Connect. You can use this integration in pipelines and indexes in all workspaces in the current organization.
Then, add the UnstructuredFileConverter component to your index.
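The converter can be added with empty init_parameters, in which case it runs with its defaults. As a minimal sketch, assuming the parameter names of the Haystack UnstructuredFileConverter integration (api_url, document_creation_mode, unstructured_kwargs) are accepted unchanged in deepset AI Platform, a customized definition might look like this. The URL is a placeholder and the values are illustrative, not recommendations:

```yaml
components:
  unstructured_converter:
    type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters:
      # Assumption: only needed when you use the Serverless API.
      # Replace the placeholder with the URL shown in your Unstructured account.
      api_url: "<your-unstructured-serverless-api-url>"
      # Assumption: one of one-doc-per-file, one-doc-per-page, one-doc-per-element.
      document_creation_mode: one-doc-per-page
      # Extra options forwarded to the Unstructured API (illustrative value).
      unstructured_kwargs:
        strategy: hi_res
```

The API key is not set here because the connection created on the Integrations page is expected to supply it.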
Usage Examples
This is an example of an index that uses the Unstructured API to process files:
components:
  ...
  unstructured_converter:
    type: haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter
    init_parameters: {}
  splitter:
    type: deepset_cloud_custom_nodes.preprocessors.document_splitter.DeepsetDocumentSplitter
    init_parameters:
      split_by: word
      split_length: 250
      split_overlap: 30
      respect_sentence_boundary: True
      language: en
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: "intfloat/e5-base-v2"
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 768
          similarity: cosine
      policy: OVERWRITE

connections:  # Defines how the components are connected
  - sender: unstructured_converter.documents
    receiver: splitter.documents
  - sender: splitter.documents
    receiver: document_embedder.documents
  - sender: document_embedder.documents
    receiver: writer.documents

max_loops_allowed: 100

inputs:  # Define the inputs for your index
  files: "file_classifier.sources"  # This component will receive the files to index as input
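
In the example, embedding_dim: 768 in the document store matches the vector size produced by intfloat/e5-base-v2. As an illustrative sketch of what would need to change together if you swapped the embedding model (the choice of intfloat/e5-large-v2 here is an assumption for demonstration, not a recommendation), the two settings might be updated like this:

```yaml
components:
  document_embedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: "intfloat/e5-large-v2"  # Illustrative swap; this model produces 1024-dimensional vectors
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          embedding_dim: 1024  # Must match the output size of the embedding model
          similarity: cosine
      policy: OVERWRITE
```

The converter and splitter definitions stay the same, since they only produce and split text and do not depend on the embedding size.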