AWS Marketplace Model Package
AWS Marketplace model packages are containerized solutions that include the model and inference code, designed to be deployed in a customer account and virtual private cloud (VPC). Model packages offer several benefits to customers:
- Data flow and access control. Because models are deployed into their account and VPC, customers maintain full control over data flow and API access, effectively addressing data privacy risks associated with third-party or multi-tenant serving.
- Reliability and compliance backed by AWS. AWS will be the customer sub-processor, and customers inherit all the reliability and compliance guarantees of AWS.
- Billing and payment through AWS. By transacting through a marketplace listing, customers can utilize their existing AWS billing information and credits to procure Voyage models. This streamlined process eliminates the need to manage a separate third-party payment and billing system.
A model package can be deployed in two ways: as a real-time inference API endpoint or as a batch transform job. A real-time inference API endpoint is a persistent, fully managed API endpoint designed for request-by-request inference. In contrast, a batch transform job is a finite execution process intended for bulk inference on a dataset, with predictions written to a file. In both cases, the model package runs on underlying AWS compute (typically GPU-backed machines), referred to as instances.
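To make the difference between the two deployment modes concrete, the sketch below builds the request parameters that the Boto3 SageMaker client methods `create_endpoint_config`, `create_endpoint`, and `create_transform_job` accept. It is an illustration under assumptions, not the method used in the example notebook: it assumes a SageMaker model named `model_name` has already been created from the subscribed model package, and every name, instance type, and S3 URI shown is a hypothetical placeholder.

```python
def endpoint_requests(model_name: str, instance_type: str, instance_count: int = 1):
    """Build kwargs for create_endpoint_config and create_endpoint.

    The endpoint stays up (and is billed) until it is explicitly deleted.
    Reusing model_name for the config and endpoint names is an arbitrary choice.
    """
    config = {
        "EndpointConfigName": model_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }
    endpoint = {"EndpointName": model_name, "EndpointConfigName": model_name}
    return config, endpoint


def transform_job_request(job_name: str, model_name: str, instance_type: str,
                          input_s3_uri: str, output_s3_uri: str,
                          instance_count: int = 1) -> dict:
    """Build kwargs for create_transform_job.

    The job reads its input from S3, writes predictions to S3, then stops.
    """
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            }
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Hypothetical placeholder values, for illustration only.
config, endpoint = endpoint_requests("my-voyage-model", "ml.g5.xlarge")
job = transform_job_request(
    "my-voyage-batch-job", "my-voyage-model", "ml.g5.xlarge",
    "s3://my-bucket/input/", "s3://my-bucket/output/",
)
```

With a configured client (`boto3.client("sagemaker")`), these dictionaries would be passed as `client.create_endpoint_config(**config)`, `client.create_endpoint(**endpoint)`, and `client.create_transform_job(**job)`.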
Pricing. Pricing for the use of a Voyage model package consists of software pricing and infrastructure pricing, both at an hourly rate. Software pricing covers the cost of model usage, and infrastructure pricing covers the AWS instance the model runs on. Your total hourly cost is the sum of the software pricing (e.g., $5.71 per hour for voyage-multilingual-2) and the infrastructure pricing (e.g., $1.408 per hour for a single ml.g5.xlarge). Pricing rates vary based on deployment type (i.e., real-time inference API endpoint versus batch transform job), instance type, and region. All Voyage AI models come with a free trial.
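As a worked example of the pricing arithmetic, the following sketch adds the two example hourly rates quoted above and scales the result by running time. The rates are only the examples from this page; real rates depend on deployment type, instance type, and region, and the sketch ignores the free trial.

```python
# Example hourly rates in USD, taken from the text above.
SOFTWARE_RATE = 5.71         # voyage-multilingual-2 software pricing
INFRASTRUCTURE_RATE = 1.408  # a single ml.g5.xlarge


def hourly_cost(software_rate: float = SOFTWARE_RATE,
                infrastructure_rate: float = INFRASTRUCTURE_RATE) -> float:
    """Total hourly cost: software pricing plus infrastructure pricing."""
    return round(software_rate + infrastructure_rate, 3)


def cost_for_hours(hours: float) -> float:
    """Cost of keeping one instance running for the given number of hours."""
    return round(hourly_cost() * hours, 3)


print(f"Per hour: ${hourly_cost():.3f}")
print(f"Per day:  ${cost_for_hours(24):.3f}")
```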
Below, we present a short tutorial on subscribing to and deploying the models. Then, we will discuss advanced deployment options and provide information on latency and throughput.
Model Package Subscription
You will need the following AWS Identity and Access Management (IAM) permissions to subscribe to an AWS Marketplace listing. To add them, sign in to your AWS account console and see this page for instructions.
- AmazonSageMakerFullAccess (AWS Managed Policy)
- aws-marketplace:ViewSubscriptions
- aws-marketplace:Subscribe
- aws-marketplace:Unsubscribe
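The three `aws-marketplace` actions above can be expressed as a single IAM policy document. The sketch below builds one in Python and prints it as JSON; the `Sid` value and the wildcard `Resource` are illustrative assumptions, and AmazonSageMakerFullAccess, being an AWS managed policy, would be attached separately rather than included here.

```python
import json


def marketplace_subscription_policy() -> dict:
    """Return an IAM policy document allowing the Marketplace actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "MarketplaceSubscription",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "aws-marketplace:ViewSubscriptions",
                    "aws-marketplace:Subscribe",
                    "aws-marketplace:Unsubscribe",
                ],
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


print(json.dumps(marketplace_subscription_policy(), indent=2))
```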
To subscribe via the AWS Marketplace:
- Select the Voyage model package you would like to subscribe to in the AWS Marketplace.
- Click Continue to Subscribe.
- Click Accept Offer.
- Confirm you have successfully subscribed (see confirmation toast in the figure below). You can now safely close the window.
You can also confirm and manage your AWS Marketplace subscriptions through the console’s manage subscription page. You can cancel your subscription at any time, but note that canceling your subscription does not terminate your existing real-time inference endpoints or batch transform jobs (see Delete Real-Time Inference Endpoints).
Model package deployment requires specific SageMaker instances (e.g., ml.g5.xlarge). The exact quota names for these instances end with “endpoint usage” and “transform job usage” (e.g., “ml.g5.2xlarge for endpoint usage” and “ml.g5.2xlarge for transform job usage”). These quotas are often set to zero by default. If needed, go to the SageMaker Service Quotas console to request quota increases.
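The quota names follow a predictable pattern, so they can be generated and looked up programmatically. The sketch below is one possible approach, assuming Boto3 is installed and credentials are configured: it derives the two quota names for an instance type and reads their current values through the Service Quotas `list_service_quotas` paginator. The helper names are hypothetical, and the client is injectable so the lookup can be exercised without AWS access.

```python
def quota_names(instance_type: str) -> dict:
    """Return the endpoint and transform-job quota names for an instance type."""
    return {
        "endpoint": f"{instance_type} for endpoint usage",
        "transform": f"{instance_type} for transform job usage",
    }


def find_quota_values(instance_type: str, client=None) -> dict:
    """Map each matching SageMaker quota name to its current value."""
    if client is None:
        import boto3  # imported lazily so the helper above works without Boto3
        client = boto3.client("service-quotas")
    wanted = set(quota_names(instance_type).values())
    found = {}
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="sagemaker"):
        for quota in page["Quotas"]:
            if quota["QuotaName"] in wanted:
                found[quota["QuotaName"]] = quota["Value"]
    return found


print(quota_names("ml.g5.2xlarge"))
```

A value of zero for either name indicates that a quota increase is needed before deploying on that instance type.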
Model Package Deployment
This section covers the mechanics of how to deploy a model package using Amazon SageMaker Studio and our example Jupyter Notebooks.
Amazon SageMaker Studio
Amazon SageMaker Studio is a web-based interface for ML and AI development that includes a hosted notebook environment already authenticated to your AWS account. You can skip this section if you have another preferred Jupyter notebook execution environment, such as your local machine, and you know how to properly authenticate to your AWS account from that environment. Follow the SageMaker documentation to first launch SageMaker Studio, and then launch a JupyterLab environment.
Jupyter Notebook
We provide an example Jupyter notebook to get started with Python using the AWS SDK (Boto3) and the Amazon SageMaker Python SDK. You can clone the notebook into SageMaker Studio, or your preferred Jupyter notebook execution environment, by cloning the Voyage AI AWS repo (i.e., git clone https://github.com/voyage-ai/voyageai-aws.git).
Alternatively, you can directly download the notebook from GitHub and move it to your notebook execution environment (e.g., upload it to SageMaker Studio).
Once accessible to SageMaker Studio or your preferred execution environment, you can open the notebook and follow the steps in it to deploy the models.
Delete Real-Time Inference Endpoints
Be careful not to leave real-time inference endpoints running unnecessarily. An endpoint is billed at its hourly rate for as long as it exists, whether or not it is serving requests, which can lead to unexpected charges. If you are using a provided example Jupyter notebook, be sure to run the “Clean-up” section, which deletes the endpoint and associated endpoint configuration. You can manage and delete endpoints through SageMaker Studio or the SageMaker console (see here for instructions).
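Outside the notebook, the same clean-up can be done with Boto3. The sketch below is a minimal version, assuming configured credentials: it looks up the endpoint configuration attached to an endpoint with `describe_endpoint`, then calls `delete_endpoint` and `delete_endpoint_config`. The function name and the endpoint name in the usage comment are hypothetical, and this is not necessarily how the notebook's “Clean-up” section is written.

```python
def delete_endpoint_and_config(endpoint_name: str, client=None) -> str:
    """Delete an endpoint and its endpoint configuration.

    Returns the name of the deleted endpoint configuration.
    """
    if client is None:
        import boto3  # imported lazily; requires configured AWS credentials
        client = boto3.client("sagemaker")
    # Look up the configuration first, since it cannot be read from the
    # endpoint once the endpoint is gone.
    description = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Usage (hypothetical endpoint name):
# delete_endpoint_and_config("my-voyage-endpoint")
```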
Advanced Deployment
The Jupyter notebook described in the previous section is meant to get you started and build your understanding of deploying model packages. There are several other ways to deploy model packages, such as CloudFormation, the SageMaker Console, and the AWS CLI, though the details are beyond the scope of this documentation. These alternative methods may better suit your existing production workflow: CloudFormation for declarative infrastructure specification, the SageMaker Console for interactive UI-based deployment, and the AWS CLI for programmatic shell orchestration. The Configure and launch page of a subscribed listing provides instructions and resources for these deployment methods. You can return to it by:
- Returning to the product listing page for your subscribed model of interest from the AWS Marketplace.
- Clicking on Continue to Subscribe (upper right), which will take you to the Subscribe to this software page.
- In the Subscribe to this software page, you should notice an “Already Subscribed” indicator. Click on the Continue to configuration button (upper right).
- You are now back at the Configure and launch page. You can select your desired launch method, and the user interface will change with the appropriate instructions and resources for that method.
Latency and Throughput
The following table provides representative latency and throughput numbers for a deployed real-time inference endpoint running on an ml.g6.xlarge.
Model | Latency | Throughput |
---|---|---|
voyage-3 | 75 ms for a single query with at most 200 tokens. | 57M tokens per hour. |
voyage-3-lite | 20 ms for a single query with at most 200 tokens. | 182M tokens per hour. |
voyage-large-2, voyage-large-2-instruct, voyage-code-2, voyage-law-2, voyage-multilingual-2, voyage-finance-2 | 90 ms for a single query with at most 100 tokens. 185 ms for 500 tokens. | 12.6M tokens per hour. |
voyage-2 | 75 ms for a single query with at most 200 tokens. 90 ms for 500 tokens. | 36M tokens per hour. |
rerank-2 | For 25K tokens: 1.5 s (1 GPU), 415 ms (4 GPUs), and 245 ms (8 GPUs). | 60M tokens per hour. |
rerank-2-lite | For 25K tokens: 565 ms (1 GPU), 170 ms (4 GPUs), and 120 ms (8 GPUs). | 160M tokens per hour. |
rerank-lite-1 | For 25K tokens: 445 ms (1 GPU), 135 ms (4 GPUs), and 90 ms (8 GPUs). | 202M tokens per hour. |
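The throughput column can be turned into a rough capacity estimate. The sketch below copies a few throughput figures from the table and converts them into tokens per second and into the hours a single endpoint would need to process a corpus of a given size. It assumes the endpoint sustains the representative throughput for the whole run, which real workloads may not; the function names are hypothetical.

```python
# Representative throughput in tokens per hour, copied from the table above.
THROUGHPUT_TOKENS_PER_HOUR = {
    "voyage-3": 57_000_000,
    "voyage-3-lite": 182_000_000,
    "voyage-2": 36_000_000,
    "rerank-2": 60_000_000,
}


def tokens_per_second(model: str) -> float:
    """Convert the hourly throughput figure into tokens per second."""
    return THROUGHPUT_TOKENS_PER_HOUR[model] / 3600


def hours_to_process(model: str, total_tokens: int) -> float:
    """Estimate hours for one endpoint to process total_tokens at table throughput."""
    return total_tokens / THROUGHPUT_TOKENS_PER_HOUR[model]


# Example: a hypothetical corpus of one billion tokens.
for name in ("voyage-3", "voyage-3-lite"):
    print(f"{name}: {tokens_per_second(name):,.0f} tokens/s, "
          f"{hours_to_process(name, 1_000_000_000):.1f} hours for 1B tokens")
```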
If you need assistance subscribing to or deploying a Voyage model package from the AWS Marketplace, please send an email to [email protected] or join our Discord.