Tokenization
Given an input text as a string, the first step of the embedding/reranking process is to dissect it into a list of tokens. This tokenization step is automatically performed on our servers when you call the API. We open-source the tokenizers so that you can preview the tokenized results and verify the number of tokens used by the API.
Voyage's Tokenizers on Hugging Face🤗
Voyage's tokenizers are available on Hugging Face🤗. You can access the tokenizer associated with a particular model using the following code:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-3')
```
Update on Voyage tokenizer
Our earlier models, including the embedding models `voyage-01`, `voyage-lite-01`, `voyage-lite-01-instruct`, `voyage-lite-02-instruct`, `voyage-2`, `voyage-large-2`, `voyage-code-2`, `voyage-law-2`, and `voyage-large-2-instruct`, and the reranking model `rerank-lite-1`, use the same tokenizer as Llama 2. However, our newer models have adopted different tokenizers for optimized performance. Therefore, please specify the model you are using when calling the tokenizer.
In our Python package, we provide two functions in `voyageai.Client` which allow you to try the tokenizer before calling the API:

`voyageai.Client.tokenize(texts: List[str], model: str)`
Parameters
- texts (List[str]) - A list of texts to be tokenized.
- model (str) - Name of the model for which the texts are tokenized. For example, `voyage-3`, `voyage-3-lite`, `rerank-1`, `rerank-lite-1`.

Note
The "model" argument was added in June 2024. Our earlier models, including the embedding models `voyage-01`, `voyage-lite-01`, `voyage-lite-01-instruct`, `voyage-lite-02-instruct`, `voyage-2`, `voyage-large-2`, `voyage-code-2`, `voyage-law-2`, and `voyage-large-2-instruct`, and the reranking model `rerank-lite-1`, used the same tokenizer. However, our newer models have adopted different tokenizers.
Please specify the "model" when using this function. If "model" is unspecified, the old tokenizer will be loaded, which may produce mismatched results if you are using our latest models.
Returns
- A list of `tokenizers.Encoding` objects, each of which represents the tokenized results of an input text string.
Example
```python
# vo is a voyageai.Client instance; texts is a list of input strings
tokenized = vo.tokenize(texts, model="voyage-3")
for i in range(len(texts)):
    print(tokenized[i].tokens)
```

Output:

```
['<s>', '▁The', '▁Mediter', 'rane', 'an', '▁di', 'et', '▁emphas', 'izes', '▁fish', ',', '▁o', 'live', '▁oil', ',', '▁...']
['<s>', '▁Ph', 'otos', 'yn', 'thesis', '▁in', '▁plants', '▁converts', '▁light', '▁energy', '▁into', '▁...']
['<s>', '▁', '2', '0', 'th', '-', 'century', '▁innov', 'ations', ',', '▁from', '▁rad', 'ios', '▁to', '▁smart', 'ph', 'ones', '▁...']
['<s>', '▁R', 'ivers', '▁provide', '▁water', ',', '▁ir', 'rig', 'ation', ',', '▁and', '▁habitat', '▁for', '▁...']
['<s>', '▁Apple', '’', 's', '▁conference', '▁call', '▁to', '▁discuss', '▁fourth', '▁fis', 'cal', '▁...']
['<s>', '▁Shakespeare', "'", 's', '▁works', ',', '▁like', "▁'", 'H', 'am', 'let', "'", '▁and', '▁...']
```
`voyageai.Client.count_tokens(texts: List[str], model: str)`
Parameters
- texts (List[str]) - A list of texts to count the tokens for.
- model (str) - Name of the model for which the tokens are counted. For example, `voyage-3`, `voyage-3-lite`, `rerank-1`, `rerank-lite-1`.

Note
The "model" argument was added in June 2024. Our earlier models, including the embedding models `voyage-01`, `voyage-lite-01`, `voyage-lite-01-instruct`, `voyage-lite-02-instruct`, `voyage-2`, `voyage-large-2`, `voyage-code-2`, `voyage-law-2`, and `voyage-large-2-instruct`, and the reranking model `rerank-lite-1`, used the same tokenizer. However, our newer models have adopted different tokenizers.
Please specify the "model" when using this function. If "model" is unspecified, the old tokenizer will be loaded, which may produce mismatched results if you are using our latest models.
Returns
- The total number of tokens in the input texts, as an integer.
```python
total_tokens = vo.count_tokens(texts, model="voyage-3")
print(total_tokens)
```

Output:

```
86
```
Our embedding models have context length limits. If your text exceeds the limit, you will need to truncate it before calling the API, or specify the `truncation` argument so that we can do it for you.
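If you prefer to truncate locally, one approach is to binary-search for the longest character prefix that fits a token budget. The sketch below is not an official helper: `count_tokens_fn` stands in for a wrapper around `voyageai.Client.count_tokens`, and the search assumes the token count grows (approximately) monotonically with prefix length.

```python
def truncate_to_token_budget(text, count_tokens_fn, max_tokens):
    """Trim text to the longest character prefix whose token count
    does not exceed max_tokens (binary search over prefix lengths)."""
    if count_tokens_fn(text) <= max_tokens:
        return text
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens_fn(text[:mid]) <= max_tokens:
            lo = mid  # this prefix fits; try a longer one
        else:
            hi = mid - 1  # this prefix is too long; shrink
    return text[:lo]

# Demo with a stand-in counter (one token per whitespace-separated word):
word_count = lambda t: len(t.split())
print(truncate_to_token_budget("one two three four five", word_count, 3))
```

In practice you would pass a real counter, e.g. `lambda t: vo.count_tokens([t], model="voyage-3")`, at the cost of one API call per binary-search step.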
Tokens, words, and characters
Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple," are tokens by themselves. In contrast, rare or long words are broken into multiple tokens, e.g., "uncharacteristically" is split into four tokens: "▁un", "character", "ist", and "ically". One word roughly corresponds to 1.2 - 1.5 tokens on average, depending on the complexity of the domain. The tokens produced by our tokenizer have an average of 5 characters, so you can roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, please use the `count_tokens()` function.
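Based on the averages above, a quick character-based estimate can be sketched as follows. This is a rough heuristic, not part of the voyageai package; the 5-characters-per-token figure is an average, not a guarantee.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: the tokenizer averages ~5 characters per token.
    Use voyageai.Client.count_tokens() when you need the exact count."""
    return max(1, round(len(text) / 5))

print(estimate_tokens("uncharacteristically"))  # 20 characters -> estimate of 4
```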
tiktoken
`tiktoken` is the open-source version of OpenAI's tokenizer. Voyage models use different tokenizers, which can be accessed from Hugging Face🤗. Therefore, our tokenizer may generate a different list of tokens for a given text compared to `tiktoken`. Statistically, the number of tokens produced by our tokenizer is on average 1.1 - 1.2 times that of `tiktoken`. To determine the exact number of tokens, please use the `count_tokens()` function.
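If you already track token counts with `tiktoken`, you can convert them into a rough Voyage estimate using the 1.1 - 1.2 ratio above. This sketch is a heuristic, not an official conversion; the 1.15 midpoint is our assumption.

```python
def estimate_voyage_tokens(tiktoken_count: int, ratio: float = 1.15) -> int:
    """Scale a tiktoken count by the average Voyage-to-tiktoken ratio
    (1.1 - 1.2, per the documentation). Estimate only; use
    voyageai.Client.count_tokens() for the exact number."""
    return round(tiktoken_count * ratio)

print(estimate_voyage_tokens(100))  # roughly 115
```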