Multimodal Embeddings
Multimodal embedding models transform unstructured data from multiple modalities into a shared vector space. Voyage multimodal embedding models support text and content-rich images -- such as figures, photos, slide decks, and document screenshots -- eliminating the need for complex text extraction or ETL pipelines. Unlike traditional multimodal models like CLIP that process text and images separately, Voyage multimodal embedding models can directly vectorize inputs containing interleaved text + images and exhibit near-perfect performance on mixed modality searches.
Model Choices
Voyage currently provides the following multimodal embedding models.
| Model | Context Length (tokens) | Embedding Dimension | Description |
|---|---|---|---|
| voyage-multimodal-3 | 32000 | 1024 | Rich multimodal embedding model that handles images, interleaved text and images, document screenshots, slides, and charts. |
Python API
Voyage multimodal embeddings are accessible in Python through the voyageai package. Please first install the voyageai package and set up the API key, then use the multimodal_embed() function of voyageai.Client to embed your inputs.
voyageai.Client.multimodal_embed(inputs: List[dict] or List[List[Union[str, PIL.Image.Image]]], model: str, input_type: Optional[str] = None, truncation: Optional[bool] = True)
Parameters
- inputs (List[dict] or List[List[Union[str, PIL.Image.Image]]]) - A list of multimodal inputs to be vectorized. An input is a sequence of texts and images, which can be represented in either of the following two ways:
  1. A list of text strings and/or PIL image objects (List[Union[str, PIL.Image.Image]]), where each image is an instance of the Pillow Image class. For example: ["This is a banana.", PIL.Image.open("banana.jpg")]. Pillow is a widely used Python library for image processing; in this example, PIL.Image.open() opens an image file and returns a PIL Image. Please see our FAQ for more details about Pillow.
  2. A dictionary that contains a single key "content", whose value represents a sequence of texts and images. The format is exactly the same as the inputs in the REST API (see the sketch after this parameter list).

  The following constraints apply to the inputs list:
  - The maximum length of the list is 1000.
  - Each image contains no more than 16 million pixels.
  - Each input contains no more than 32,000 tokens, with 560 pixels counted as a token.
- model (str) - Name of the model. Currently, the only supported model is voyage-multimodal-3.
- input_type (str, optional, defaults to None) - Type of the input. Defaults to None. Other options: query, document.
  - When input_type is set to None, the input data will be directly encoded by our embedding model. Alternatively, when the inputs are queries or documents, users can specify input_type to be query or document, respectively. In such cases, Voyage will prepend a prompt to the input data and send the extended inputs to the embedding model. Since inputs can be multimodal, queries and documents can be text, images, or an interleaving of both modalities.
  - For retrieval/search use cases, we recommend specifying this argument when encoding queries or documents to enhance retrieval quality. Embeddings generated with and without the input_type argument are compatible.
  - For transparency, the prompts the backend will prepend to your input are below.
    - For query, the prompt is "Represent the query for retrieving supporting documents: ".
    - For document, the prompt is "Represent the document for retrieval: ".
- truncation (bool, optional, defaults to True) - Whether to truncate the input tokens to fit within the context length.
  - If True, an over-length input will be truncated to fit within the context length before being vectorized by the embedding model. If the truncation happens in the middle of an image, the entire image will be discarded.
  - If False, an error will be raised if any given input exceeds the context length.
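For illustration, here is a minimal sketch that embeds the same document expressed both ways: as a list of text strings and PIL images, and as a dictionary with a single "content" key. The field names inside "content" ("type", "text", "image_base64") are assumed to mirror the REST API content format; please check the Multimodal Embeddings API Reference for the exact schema. The sketch also sets input_type for a retrieval use case.
import base64

import voyageai
from PIL import Image

vo = voyageai.Client()  # reads the VOYAGE_API_KEY environment variable

# (1) An input as a list of text strings and PIL Image objects.
list_style_input = ["This is a banana.", Image.open("banana.jpg")]

# (2) The same input as a dictionary with a single "content" key.
#     The item fields below are assumed to follow the REST API format;
#     consult the API Reference for the authoritative schema.
with open("banana.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

dict_style_input = {
    "content": [
        {"type": "text", "text": "This is a banana."},
        {"type": "image_base64", "image_base64": f"data:image/jpeg;base64,{image_b64}"},
    ]
}

# Embed documents and queries with matching input_type values so the
# backend prepends the appropriate retrieval prompt.
doc_result = vo.multimodal_embed(
    [list_style_input], model="voyage-multimodal-3", input_type="document"
)
query_result = vo.multimodal_embed(
    [dict_style_input], model="voyage-multimodal-3", input_type="query"
)

print(len(doc_result.embeddings[0]))    # 1024-dimensional vector
print(len(query_result.embeddings[0]))  # 1024-dimensional vector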
Returns
- A MultimodalEmbeddingsObject, containing the following attributes (see the usage note after the example below):
  - embeddings (List[List[float]]) - A list of embeddings for the corresponding list of inputs, where each embedding is a vector represented as a list of floats.
  - text_tokens (int) - The total number of text tokens in the inputs.
  - image_pixels (int) - The total number of image pixels in the inputs.
  - total_tokens (int) - The combined total of text and image tokens, where image tokens are calculated by converting 560 pixels into a token.
Example
import voyageai
from PIL import Image
vo = voyageai.Client()
# This will automatically use the environment variable VOYAGE_API_KEY.
# Alternatively, you can use vo = voyageai.Client(api_key="<your secret key>")
# Example of multimodal data as a list of text and PIL Image
inputs = [
["This is a banana.", Image.open('banana.jpg')]
]
# Embed document
result = vo.multimodal_embed(inputs, model="voyage-multimodal-3", input_type="document")
print(result.embeddings)
# Output (truncated):
# [
#     [0.0281982421875, -0.008056640625, 0.018798828125, ...]
# ]
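The returned MultimodalEmbeddingsObject also carries the token accounting described under Returns. Continuing from the example above:
# Inspect the token accounting reported by the MultimodalEmbeddingsObject.
print(result.text_tokens)    # total number of text tokens in the inputs
print(result.image_pixels)   # total number of image pixels in the inputs
print(result.total_tokens)   # text tokens plus image tokens (560 pixels per token)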
REST API
Voyage multimodal embeddings can be accessed by calling the endpoint POST https://api.voyageai.com/v1/multimodalembeddings. Please refer to the Multimodal Embeddings API Reference for the specification.
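As a rough illustration, the sketch below calls the endpoint directly with the Python requests library. The JSON body shown (inputs with a "content" list, model, input_type) is assumed to mirror the Python parameters documented above; the Multimodal Embeddings API Reference is the authoritative source for the request schema.
import os

import requests

# Minimal sketch of a direct REST call; the body fields are an assumption
# based on the Python parameters above -- see the Multimodal Embeddings
# API Reference for the exact specification.
response = requests.post(
    "https://api.voyageai.com/v1/multimodalembeddings",
    headers={
        "Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": [
            {"content": [{"type": "text", "text": "This is a banana."}]}
        ],
        "model": "voyage-multimodal-3",
        "input_type": "document",
    },
)
print(response.json())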
TypeScript Library
Voyage multimodal embeddings are accessible in TypeScript through the Voyage TypeScript Library, which exposes all the functionality of our multimodal embeddings endpoint (see Multimodal Embeddings API Reference).