ImageFileToDocument
Convert references to image files into empty Document objects with associated metadata.
ImageFileToDocument doesn't extract any content from the image files. Instead, it creates Document objects with None as their content and attaches metadata such as the file path and any user-provided values. Use it in pipelines where image file paths must be wrapped in Document objects so that downstream components can process them.
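
The behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Haystack's implementation: SketchDocument and wrap_image_paths are hypothetical stand-ins, and file_path is assumed to be the metadata key that holds the path.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: no content, only metadata."""
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


def wrap_image_paths(sources: List[Union[str, Path]]) -> List[SketchDocument]:
    """Wrap each path in a document; this sketch never opens or decodes the image."""
    # "file_path" is an assumed metadata key for this sketch.
    return [SketchDocument(content=None, meta={"file_path": str(src)}) for src in sources]


docs = wrap_image_paths(["cat.jpg", Path("photos") / "dog.png"])
for doc in docs:
    print(doc.content, doc.meta)
```

Each resulting object carries only a reference to the image, which is why a downstream embedder or content extractor is needed to do anything with the image itself.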
Key Features
- Wraps image file paths in Document objects for downstream processing.
- Supports file paths (str or Path) and ByteStream objects as input.
- Optional metadata attachment to resulting documents.
- Works with image embedding components like SentenceTransformersDocumentEmbedder (with an image-capable model) and content extraction components like LLMDocumentContentExtractor.
Configuration
- Drag the ImageFileToDocument component onto the canvas from the Component Library.
- Click on the component to open the configuration panel.
- Configure the component settings:
  - Set Store Full Path to control whether the full file path or just the file name is stored in document metadata.
Connections
ImageFileToDocument accepts a list of file paths or ByteStream objects through its sources input. It outputs a list of Document objects with empty content and file path metadata.
It typically connects with:
- FilesInput: receives image file paths.
- SentenceTransformersDocumentEmbedder (with a CLIP model) or LLMDocumentContentExtractor: sends document references for embedding or content extraction.
- DocumentWriter: sends documents for storage after embedding.
Source Code
To check this component's source code, open file_to_document.py in the Haystack repository.
Usage Examples
Basic Configuration
image_file_to_document:
  type: haystack.components.converters.image.file_to_document.ImageFileToDocument
  init_parameters:
    store_full_path: true
Using the Component in an Index
Here's an example of ImageFileToDocument used in an index. It converts image file paths into Document objects that can then be embedded and stored for later retrieval:
# haystack-pipeline
components:
  image_file_to_document:
    type: haystack.components.converters.image.file_to_document.ImageFileToDocument
    init_parameters:
      store_full_path: true
  SentenceTransformersImageDocumentEmbedder:
    type: haystack.components.embedders.sentence_transformers_document_embedder.SentenceTransformersDocumentEmbedder
    init_parameters:
      model: clip-ViT-B-32
      device:
      token:
      prefix: ''
      suffix: ''
      batch_size: 32
      progress_bar: true
      normalize_embeddings: false
      meta_fields_to_embed:
      embedding_separator: "\\n"
  DocumentWriter:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack_integrations.document_stores.opensearch.document_store.OpenSearchDocumentStore
        init_parameters:
          hosts:
          index: image-index
          max_chunk_bytes: 104857600
          embedding_dim: 512
          return_embedding: false
          method:
          mappings:
          settings:
          create_index: true
          http_auth:
          use_ssl:
          verify_certs:
          timeout:
      policy: OVERWRITE
connections:
  - sender: image_file_to_document.documents
    receiver: SentenceTransformersImageDocumentEmbedder.documents
  - sender: SentenceTransformersImageDocumentEmbedder.documents
    receiver: DocumentWriter.documents
inputs:
  files:
    - image_file_to_document.sources
max_runs_per_component: 100
metadata: {}
Parameters
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | List of image file paths or ByteStream objects to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output documents. |
Outputs
| Parameter | Type | Description |
|---|---|---|
| documents | List[Document] | A list of Document objects with empty content and associated metadata. |
Init Parameters
These are the parameters you can configure in Pipeline Builder:
| Parameter | Type | Default | Description |
|---|---|---|---|
| store_full_path | bool | False | If True, stores the full path of the file in the metadata of the document. If False, stores only the file name. |
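
As a rough illustration of what store_full_path changes, the sketch below shows the two values that could end up in a document's metadata for the same source. The helper stored_file_path is hypothetical and only mirrors the rule in the table; it is not the component's own code.

```python
from pathlib import Path


def stored_file_path(path: str, store_full_path: bool = False) -> str:
    """Return the value kept in metadata for a given source path (illustrative)."""
    # True keeps the path as given; False keeps only the final file name.
    return path if store_full_path else Path(path).name


print(stored_file_path("/data/images/cat.jpg", store_full_path=True))   # /data/images/cat.jpg
print(stored_file_path("/data/images/cat.jpg", store_full_path=False))  # cat.jpg
```

Storing only the file name avoids leaking directory layout into the index, while the full path may be needed if a downstream component has to open the file again.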
Run Method Parameters
These are the parameters you can configure for the component's run() method. This means you can pass these parameters at query time through the API, in Playground, or when running a job. For details, see Modify Pipeline Parameters at Query Time.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sources | List[Union[str, Path, ByteStream]] | | List of image file paths or ByteStream objects to convert. |
| meta | Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] | None | Optional metadata to attach to the documents. This value can be a list of dictionaries or a single dictionary. If it's a single dictionary, its content is added to the metadata of all produced documents. If it's a list, its length must match the number of sources as they're zipped together. For ByteStream objects, their meta is added to the output documents. |
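
The rule for meta (one dictionary shared by all sources, or a list with exactly one dictionary per source) can be sketched as follows. The function expand_meta is a hypothetical helper written for this illustration, and the error message is an assumption; the component's actual validation may differ in detail.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def expand_meta(meta: Optional[Union[Meta, List[Meta]]], n_sources: int) -> List[Meta]:
    """Turn the meta argument into one metadata dict per source (illustrative)."""
    if meta is None:
        return [{} for _ in range(n_sources)]
    if isinstance(meta, dict):
        # A single dictionary is applied to every source.
        return [dict(meta) for _ in range(n_sources)]
    if len(meta) != n_sources:
        # A list is zipped with the sources, so the lengths must match.
        raise ValueError("The meta list must have one entry per source.")
    return [dict(m) for m in meta]


sources = ["a.jpg", "b.png"]
shared = expand_meta({"collection": "pets"}, len(sources))
per_source = expand_meta([{"label": "cat"}, {"label": "dog"}], len(sources))
print(list(zip(sources, shared)))
print(list(zip(sources, per_source)))
```

Passing a list whose length differs from the number of sources raises an error in this sketch, which matches the requirement that the two are zipped together.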