Multimodal Embeddings
Multimodal embedding models transform unstructured data from multiple modalities into a shared vector space. Voyage multimodal embedding models support text and content-rich images, such as figures, photos, slide decks, and document screenshots, eliminating the need for complex text extraction or ETL pipelines. Unlike traditional multimodal models such as CLIP, which process text and images through separate encoders, Voyage multimodal embedding models can directly vectorize inputs containing interleaved text and images. The CLIP architecture also limits its usefulness in mixed-modality search, because text and image vectors tend to align with irrelevant items of the same modality. Voyage multimodal embedding models eliminate this bias by processing all inputs through a single backbone.
Model Choices
Voyage currently provides the following multimodal embedding models:
Model | Context Length (tokens) | Embedding Dimension | Description |
---|---|---|---|
voyage-multimodal-3 | 32,000 | 1024 | Rich multimodal embedding model that can vectorize interleaved text and content-rich images, such as screenshots of PDFs, slides, tables, figures, and more. See blog post for details. |
Python API
Voyage multimodal embeddings are accessible in Python through the `voyageai` package. Please install the `voyageai` package, set up the API key, and use the `voyageai.Client.multimodal_embed()` function to vectorize your inputs.
```python
voyageai.Client.multimodal_embed(
    inputs: List[dict] or List[List[Union[str, PIL.Image.Image]]],
    model: str,
    input_type: Optional[str] = None,
    truncation: Optional[bool] = True,
)
```
Parameters
- inputs (List[dict] or List[List[Union[str, PIL.Image.Image]]]) - A list of multimodal inputs to be vectorized.
  - Each input is a sequence of text and images, which can be represented in either of the following two ways (both are illustrated in the sketch after this parameter list):
    1. A list containing text strings and/or PIL image objects (List[Union[str, PIL.Image.Image]]), where each image is an instance of the Pillow Image class. For example: `["This is a banana.", PIL.Image.open("banana.jpg")]`.
       - PIL Image Object: Pillow is a widely used Python library for image processing. In the above example, `PIL.Image.open()` opens an image file and returns a PIL Image. Please see our FAQ for more details about Pillow.
    2. A dictionary that contains a single key `"content"`, whose value represents a sequence of text and images. The dictionary schema is identical to that of an input in the `inputs` parameter of the REST API.
  - The following constraints apply to the `inputs` list:
    - The list must not contain more than 1,000 inputs.
    - Each image must not contain more than 16 million pixels or be larger than 20 MB in size.
    - With every 560 pixels of an image counting as one token, each input in the list must not exceed 32,000 tokens, and the total number of tokens across all inputs must not exceed 320,000.
- model (str) - Name of the model. Currently, the only supported model is `voyage-multimodal-3`.
- input_type (str, optional, defaults to None) - Type of the input. Options: None, `query`, `document`.
  - When input_type is None, the embedding model directly converts the inputs into numerical vectors. For retrieval/search purposes, where a "query" (which in this case can be text or an image) is used to search for relevant information among a collection of data referred to as "documents," we recommend specifying whether your inputs are intended as queries or documents by setting input_type to `query` or `document`, respectively. In these cases, Voyage automatically prepends a prompt to your inputs before vectorizing them, creating vectors better tailored for retrieval/search tasks. Since inputs can be multimodal, "queries" and "documents" can be text, images, or an interleaving of both modalities. Embeddings generated with and without the input_type argument are compatible.
  - For transparency, the following prompts are prepended to your input:
    - For `query`, the prompt is "Represent the query for retrieving supporting documents: ".
    - For `document`, the prompt is "Represent the document for retrieval: ".
- truncation (bool, optional, defaults to True) - Whether to truncate the inputs to fit within the context length.
  - If True, an over-length input will be truncated to fit within the context length before being vectorized by the embedding model. If the truncation falls in the middle of an image, the entire image is discarded.
  - If False, an error will be raised if any input exceeds the context length.
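As referenced above, here is a minimal sketch of both input representations. The file name and URL are placeholders, and the field names inside the `"content"` items (`"type"`, `"text"`, `"image_url"`) are an assumption based on the REST API schema; consult the Multimodal Embeddings API Reference for the authoritative format.

```python
import PIL.Image

# Representation (1): each input is a list of text strings and/or PIL image objects.
inputs_as_lists = [
    ["This is a banana.", PIL.Image.open("banana.jpg")],  # placeholder file name
    ["A purely textual input is also valid."],
]

# Representation (2): each input is a dictionary with a single "content" key,
# mirroring the REST API schema. The item fields below are an assumption;
# see the API reference for the exact specification.
inputs_as_dicts = [
    {
        "content": [
            {"type": "text", "text": "This is a banana."},
            {"type": "image_url", "image_url": "https://example.com/banana.jpg"},  # placeholder URL
        ]
    }
]
```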
Returns
- A `MultimodalEmbeddingsObject`, containing the following attributes:
  - embeddings (List[List[float]]) - A list of embeddings for the corresponding list of inputs, where each embedding is a vector represented as a list of floats.
  - text_tokens (int) - The total number of text tokens in the list of inputs.
  - image_pixels (int) - The total number of image pixels in the list of inputs.
  - total_tokens (int) - The combined total of text and image tokens. Every 560 pixels counts as one token.
Example
```python
import voyageai
import PIL.Image

vo = voyageai.Client()
# This will automatically use the environment variable VOYAGE_API_KEY.
# Alternatively, you can use vo = voyageai.Client(api_key="<your secret key>")

# Example input containing a text string and a PIL image object
inputs = [
    ["This is a banana.", PIL.Image.open("banana.jpg")]
]

# Vectorize the inputs
result = vo.multimodal_embed(inputs, model="voyage-multimodal-3")
print(result.embeddings)
```

Sample output:

```
[
    [0.027587890625, -0.021240234375, 0.018310546875, ...]
]
```
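The following is a minimal retrieval-style sketch showing the `input_type` parameter and the usage attributes of the returned `MultimodalEmbeddingsObject`. The document contents, file name, and the cosine-similarity helper are illustrative, not part of the `voyageai` package.

```python
import numpy as np
import PIL.Image
import voyageai

vo = voyageai.Client()  # uses the VOYAGE_API_KEY environment variable

# Illustrative "documents": a slide screenshot and a short text passage.
documents = [
    [PIL.Image.open("q3_report_slide.png")],               # placeholder file name
    ["Bananas are rich in potassium and vitamin B6."],
]
doc_result = vo.multimodal_embed(
    documents, model="voyage-multimodal-3", input_type="document"
)

# A text query, embedded with input_type="query" so the retrieval prompt is prepended.
query_result = vo.multimodal_embed(
    [["Which fruit is a good source of potassium?"]],
    model="voyage-multimodal-3",
    input_type="query",
)

def cosine_similarity(a, b):
    """Illustrative helper; not part of the voyageai package."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = query_result.embeddings[0]
scores = [cosine_similarity(d, query_embedding) for d in doc_result.embeddings]
print("Most similar document index:", scores.index(max(scores)))

# Usage statistics reported on the returned object.
print("Text tokens:", doc_result.text_tokens)
print("Image pixels:", doc_result.image_pixels)
print("Total tokens:", doc_result.total_tokens)
```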
REST API
Voyage multimodal embeddings can be accessed by calling the endpoint `POST https://api.voyageai.com/v1/multimodalembeddings`. Please refer to the Multimodal Embeddings API Reference for the specification and an example.
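For orientation, here is a minimal sketch of calling the endpoint from Python with the `requests` library. The Bearer authentication header and the `"content"` item fields reflect the schema assumed earlier on this page; the API Reference remains the authoritative specification.

```python
import os
import requests

response = requests.post(
    "https://api.voyageai.com/v1/multimodalembeddings",
    headers={
        "Authorization": f"Bearer {os.environ['VOYAGE_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "voyage-multimodal-3",
        "input_type": "document",
        "inputs": [
            {
                "content": [
                    {"type": "text", "text": "This is a banana."},
                    # The "image_url" item type and field name are an assumption;
                    # see the API reference for the exact schema.
                    {"type": "image_url", "image_url": "https://example.com/banana.jpg"},
                ]
            }
        ],
    },
    timeout=30,
)
print(response.json())
```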
TypeScript Library
Voyage multimodal embeddings are accessible in TypeScript through the Voyage TypeScript Library, which exposes all the functionality of our multimodal embeddings endpoint (see Multimodal Embeddings API Reference).