Multimodal embedding models

POST https://api.voyageai.com/v1/multimodalembeddings

The Voyage multimodal embedding endpoint returns vector representations for a given list of multimodal inputs consisting of text, images, or an interleaving of both modalities.

Important: Starting December 8, 2025, the following constraints apply to all URL parameters (e.g., image_url):
  - Limit the number of redirects.
  - Require that responses include a content-length header.
  - Respect robots.txt to prevent unauthorized scraping.

Body Params

inputs

array

required

A list of multimodal inputs to be vectorized.

A single input in the list is a dictionary containing a single key "content", whose value represents a sequence of text, images, and videos.

The value of "content" is a list of dictionaries, each representing a single piece of text, image, or video. The dictionaries have six possible keys:
1. type: Specifies the type of the piece of the content. Allowed values are text, image_url, image_base64, video_url, or video_base64.
2. text: Only present when type is text. The value should be a text string.
3. image_base64: Only present when type is image_base64. The value should be a Base64-encoded image in the data URL format data:[<mediatype>];base64,<data>. Currently supported mediatypes are: image/png, image/jpeg, image/webp, and image/gif.
4. image_url: Only present when type is image_url. The value should be a URL linking to the image. We support PNG, JPEG, WEBP, and GIF images. The following constraints apply to the URL:
  - Limit the number of redirects.
  - Require that responses include a content-length header.
  - Respect robots.txt to prevent unauthorized scraping.
5. video_base64: Only present when type is video_base64. The value should be a Base64-encoded video in the data URL format data:[<mediatype>];base64,<data>. Currently supported mediatypes are: video/mp4.
6. video_url: Only present when type is video_url. The value should be a URL linking to the video. We support MP4 videos. The following constraints apply to the URL:
  - Limit the number of redirects.
  - Require that responses include a content-length header.
  - Respect robots.txt to prevent unauthorized scraping.

Note: Only one of the keys, base64 or url, should be present in each dictionary for image and video data. Consistency is required within a request, meaning each request should use either image_base64/video_base64 or image_url/video_url exclusively, not both.
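The structure described above can be sketched with a few small helpers. This is only an illustration: the function names (text_piece, url_piece, base64_piece, check_consistency) are hypothetical and not part of any Voyage client library. The last helper checks the consistency rule from the note, namely that a request does not mix URL and Base64 media pieces.

```python
import base64

URL_TYPES = {"image_url", "video_url"}
BASE64_TYPES = {"image_base64", "video_base64"}


def text_piece(text):
    """Build a content piece of type "text"."""
    return {"type": "text", "text": text}


def url_piece(kind, url):
    """Build an image_url or video_url piece; kind is "image" or "video"."""
    key = f"{kind}_url"
    return {"type": key, key: url}


def base64_piece(kind, raw_bytes, mediatype):
    """Build an image_base64 or video_base64 piece in data URL format."""
    key = f"{kind}_base64"
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"type": key, key: f"data:{mediatype};base64,{encoded}"}


def check_consistency(inputs):
    """Raise ValueError if the inputs mix URL and Base64 media pieces."""
    used = {piece["type"] for item in inputs for piece in item["content"]}
    if used & URL_TYPES and used & BASE64_TYPES:
        raise ValueError("use either *_url or *_base64 pieces in a request, not both")


# One input made of a piece of text followed by an image given as a URL.
inputs = [
    {
        "content": [
            text_piece("This is a banana."),
            url_piece("image", "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"),
        ]
    }
]
check_consistency(inputs)  # passes: only URL media pieces are used
```

Running the check on the client side is optional; it simply surfaces a mixed request before it is sent.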

Example payload where inputs contains an image and a video as URLs

The inputs list contains a single input, which consists of a piece of text, an image, and a video (the image and video are both provided via URLs).


      {
        "inputs": [
          {   
            "content": [
              {   
                "type": "text",
                "text": "This is a banana."
              },
              {   
                "type": "image_url",
                "image_url": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
              },
              {
                "type": "video_url",
                "video_url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/360/Big_Buck_Bunny_360_10s_1MB.mp4"
              }
            ]   
          }   
        ],  
        "model": "voyage-multimodal-3.5"
      }

Example payload where inputs contains a Base64 image and video

Below is an equivalent example to the one above where the image and video content are Base64-encoded instead of URLs. (Base64 data can be lengthy, so the example only shows shortened versions.)

  
      {
        "inputs": [
          {   
            "content": [
              {   
                "type": "text",
                "text": "This is a banana."
              },
              {   
                "type": "image_base64",
                "image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAA..."
              },
              {
                "type": "video_base64",
                "video_base64": "data:video/mp4;base64,AAAAIGZ0eXBpc29tAAACA..."
              }  
            ]   
          }   
        ],  
        "model": "voyage-multimodal-3.5"
      }
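A payload like the ones above could be sent with Python's standard library roughly as follows. This is a minimal sketch, not an official client: the Bearer-token Authorization header and the VOYAGE_API_KEY environment variable are assumptions that this page does not state, and build_request is a hypothetical helper name. The request is only constructed here; the commented lines show how it might be sent.

```python
import json
import os
import urllib.request

ENDPOINT = "https://api.voyageai.com/v1/multimodalembeddings"


def build_request(payload, api_key):
    """Wrap a JSON payload in a POST request for the endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the API key is passed as a Bearer token.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")


payload = {
    "inputs": [
        {
            "content": [
                {"type": "text", "text": "This is a banana."},
                {
                    "type": "image_url",
                    "image_url": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg",
                },
            ]
        }
    ],
    "model": "voyage-multimodal-3.5",
}

# Assumption: the key is stored in an environment variable named VOYAGE_API_KEY.
request = build_request(payload, os.environ.get("VOYAGE_API_KEY", "<your-api-key>"))

# Sending requires network access and a valid key:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```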

The following constraints apply to the inputs list:

The list must not contain more than 1,000 inputs.
Each image must not contain more than 16 million pixels or be larger than 20 MB in size.
Each video must not be larger than 20 MB in size.
Every 560 pixels of an image and every 1,120 pixels of a video count as one token. Each input in the list must not exceed 32,000 tokens, and the total number of tokens across all inputs must not exceed 320,000.
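As a rough illustration of the image limits, the sketch below estimates the token count of an image from its dimensions and checks a list of per-input token counts against the stated limits. Several details are assumptions: rounding up to a whole token, treating "16 million" as exactly 16,000,000, and the helper names image_tokens and check_budget. Text tokens depend on the model's tokenizer and are not estimated here.

```python
import math

MAX_INPUTS = 1_000
MAX_IMAGE_PIXELS = 16_000_000  # assumption: "16 million" taken literally
PIXELS_PER_IMAGE_TOKEN = 560
MAX_TOKENS_PER_INPUT = 32_000
MAX_TOKENS_TOTAL = 320_000


def image_tokens(width, height):
    """Estimate the tokens consumed by an image of the given size."""
    pixels = width * height
    if pixels > MAX_IMAGE_PIXELS:
        raise ValueError(f"image has {pixels} pixels, limit is {MAX_IMAGE_PIXELS}")
    # Assumption: partial tokens are rounded up.
    return math.ceil(pixels / PIXELS_PER_IMAGE_TOKEN)


def check_budget(tokens_per_input):
    """Validate per-input token counts against the limits; return the total."""
    if len(tokens_per_input) > MAX_INPUTS:
        raise ValueError("more than 1,000 inputs")
    for index, tokens in enumerate(tokens_per_input):
        if tokens > MAX_TOKENS_PER_INPUT:
            raise ValueError(f"input {index} exceeds 32,000 tokens")
    total = sum(tokens_per_input)
    if total > MAX_TOKENS_TOTAL:
        raise ValueError("total exceeds 320,000 tokens")
    return total


# A 1000 x 1000 image has 1,000,000 pixels, so about 1,000,000 / 560 tokens.
one_image = image_tokens(1000, 1000)
total = check_budget([one_image] * 3)
```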

model

string

required

Name of the model. Recommended options: voyage-multimodal-3.5, voyage-multimodal-3.

input_type

string | null

enum

Defaults to null

Type of the input. Defaults to null. Other options: query, document.

When input_type is null, the embedding model converts the inputs directly into numerical vectors. For retrieval/search purposes, a "query" is used to search for relevant information among a collection of data referred to as "documents". In that case, we recommend specifying whether your inputs are intended as queries or documents by setting input_type to query or document, respectively. Voyage then automatically prepends a prompt to your inputs before vectorizing them, creating vectors more tailored for retrieval/search tasks. Since inputs can be multimodal, "queries" and "documents" can be text, images, or an interleaving of both modalities. Embeddings generated with and without the input_type argument are compatible.
For transparency, the following prompts are prepended to your input.

For query, the prompt is "Represent the query for retrieving supporting documents: ".
For document, the prompt is "Represent the document for retrieval: ".
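In a retrieval setup, the same model is typically called once with input_type set to document for the collection and once with query for the search input. The sketch below builds both payloads; make_payload is a hypothetical helper, and the prompts are prepended by the service, so the client only sets the parameter.

```python
ALLOWED_INPUT_TYPES = (None, "query", "document")


def make_payload(inputs, input_type=None, model="voyage-multimodal-3.5"):
    """Build a request body, adding input_type only when it is set."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {ALLOWED_INPUT_TYPES}")
    payload = {"inputs": inputs, "model": model}
    if input_type is not None:
        # Omitting the key relies on the documented default of null.
        payload["input_type"] = input_type
    return payload


document_payload = make_payload(
    [{"content": [{"type": "text", "text": "This is a banana."}]}],
    input_type="document",
)
query_payload = make_payload(
    [{"content": [{"type": "text", "text": "yellow fruit"}]}],
    input_type="query",
)
```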

truncation

boolean

Defaults to true

Whether to truncate the inputs to fit within the context length. Defaults to true.

If true, an over-length input will be truncated to fit within the context length before being vectorized by the embedding model. If the truncation happens in the middle of an image, the entire image will be discarded.
If false, an error will be raised if any input exceeds the context length.

output_encoding

string | null

enum

Defaults to null

Format in which the embeddings are encoded. Defaults to null.

If null, the embeddings are represented as a list of floating-point numbers.
If base64, the embeddings are represented as a Base64-encoded NumPy array of single-precision floats.
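A Base64-encoded embedding could be decoded along these lines. This sketch assumes the decoded bytes are the raw array buffer of little-endian 32-bit floats with no header, which this page does not spell out; decode_embedding is a hypothetical helper. It uses only the standard library. With NumPy installed, numpy.frombuffer with dtype float32 on the decoded bytes would be the usual equivalent.

```python
import base64
import struct


def decode_embedding(encoded):
    """Decode a Base64 string into a list of single-precision floats."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4")
    count = len(raw) // 4
    # Assumption: little-endian float32 values, no header bytes.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with a made-up three-dimensional vector (not real model output).
sample = base64.b64encode(struct.pack("<3f", 1.0, -0.5, 0.25)).decode("ascii")
vector = decode_embedding(sample)
```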

Responses

5XX

Server Error

This indicates our servers are experiencing high traffic or having an unexpected issue. Please see our Error Codes guide.

200

Success

4XX

Client Error

This indicates an issue with the request format or frequency. Please see our Error Codes guide.
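Client code might branch on these three response classes as sketched below. The function name and the returned labels are illustrative only; the retry hint for 5XX reflects the description of server errors as traffic-related or unexpected, not a documented retry policy.

```python
def classify_status(status_code):
    """Map an HTTP status code to the response classes listed above."""
    if status_code == 200:
        return "success"
    if 400 <= status_code <= 499:
        # Issue with the request format or frequency: fix or slow down the request.
        return "client_error"
    if 500 <= status_code <= 599:
        # High traffic or an unexpected server issue: retrying later may help.
        return "server_error"
    return "unexpected"


labels = [classify_status(code) for code in (200, 400, 429, 503)]
```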