Tokenization

Given an input text as a string, the first step of the embedding/reranking process is to split it into a list of tokens. This tokenization step is performed automatically on our servers when you call the API. We open-source the tokenizers so that you can preview the tokenized results and verify the number of tokens your API calls will consume.

📘

Voyage's Tokenizers on Hugging Face🤗

Voyage's tokenizers are available on Hugging Face🤗. You can access the tokenizer associated with a particular model using the following code:

from transformers import AutoTokenizer

# Load the tokenizer for a specific Voyage model from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-large-2-instruct')
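
Once loaded, the tokenizer can be used to preview tokenization locally. For example (the sample string below is illustrative):

sample = "The Mediterranean diet emphasizes fish, olive oil, and vegetables."
print(tokenizer.tokenize(sample))     # list of token strings
print(len(tokenizer.encode(sample)))  # token count, including special tokens
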
Update on Voyage tokenizer

Our earlier models, including the embedding models voyage-01, voyage-lite-01, voyage-lite-01-instruct, voyage-lite-02-instruct, voyage-2, voyage-large-2, voyage-code-2, voyage-law-2, and voyage-large-2-instruct, and the reranking model rerank-lite-1, use the same tokenizer as Llama 2. However, our newer models have adopted different tokenizers for optimized performance. Therefore, please specify the model you are using when loading or calling a tokenizer.
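If you load tokenizers from Hugging Face directly, a small helper can keep the tokenizer in sync with the model name. This sketch assumes the repo naming follows the voyageai/<model-name> pattern shown above; confirm that the repo exists for your model on Hugging Face.

from transformers import AutoTokenizer

def load_voyage_tokenizer(model: str):
    # Assumes a tokenizer repo named voyageai/<model-name>
    # (an assumption, not guaranteed for every model).
    return AutoTokenizer.from_pretrained(f"voyageai/{model}")

tokenizer = load_voyage_tokenizer("voyage-large-2")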

In our Python package, we provide two functions in voyageai.Client that allow you to try the tokenizer before calling the API:

voyageai.Client.tokenize(texts: List[str], model: str)

Parameters

  • texts (List[str]) - A list of texts to be tokenized.
  • model (str) - Name of the model whose tokenizer will be used. For example, voyage-large-2-instruct, voyage-large-2, rerank-1, rerank-lite-1.
    Note
    The "model" argument was added in June 2024. Our earlier models, including embedding models voyage-01, voyage-lite-01, voyage-lite-01-instruct, voyage-lite-02-instruct, voyage-2, voyage-large-2, voyage-code-2, voyage-law-2, voyage-large-2-instruct, and reranking model rerank-lite-1, used the same tokenizer. However, our new models have adopted different tokenizers.

    Please specify the "model" when using this function. If "model" is unspecified, the old tokenizer will be loaded, which may produce mismatched results if you are using our latest models.

Returns

  • A list of tokenizers.Encoding, each of which represents the tokenized results of an input text string.

Example

import voyageai

vo = voyageai.Client()  # API key read from the VOYAGE_API_KEY environment variable
tokenized = vo.tokenize(texts, model="voyage-large-2-instruct")  # texts: List[str]
for encoding in tokenized:
    print(encoding.tokens)
['<s>', '▁The', '▁Mediter', 'rane', 'an', '▁di', 'et', '▁emphas', 'izes', '▁fish', ',', '▁o', 'live', '▁oil', ',', '▁...']
['<s>', '▁Ph', 'otos', 'yn', 'thesis', '▁in', '▁plants', '▁converts', '▁light', '▁energy', '▁into', '▁...']
['<s>', '▁', '2', '0', 'th', '-', 'century', '▁innov', 'ations', ',', '▁from', '▁rad', 'ios', '▁to', '▁smart', 'ph', 'ones', '▁...']
['<s>', '▁R', 'ivers', '▁provide', '▁water', ',', '▁ir', 'rig', 'ation', ',', '▁and', '▁habitat', '▁for', '▁...']
['<s>', '▁Apple', '’', 's', '▁conference', '▁call', '▁to', '▁discuss', '▁fourth', '▁fis', 'cal', '▁...']
['<s>', '▁Shakespeare', "'", 's', '▁works', ',', '▁like', "▁'", 'H', 'am', 'let', "'", '▁and', '▁...']

voyageai.Client.count_tokens(texts: List[str], model: str)

Parameters

  • texts (List[str]) - A list of texts to count the tokens for.
  • model (str) - Name of the model whose tokenizer will be used for counting. For example, voyage-large-2-instruct, voyage-large-2, rerank-1, rerank-lite-1.
    Note
    The "model" argument was added in June 2024. Our earlier models, including embedding models voyage-01, voyage-lite-01, voyage-lite-01-instruct, voyage-lite-02-instruct, voyage-2, voyage-large-2, voyage-code-2, voyage-law-2, voyage-large-2-instruct, and reranking model rerank-lite-1, used the same tokenizer. However, our new models have adopted different tokenizers.

    Please specify the "model" when using this function. If "model" is unspecified, the old tokenizer will be loaded, which may produce mismatched results if you are using our latest models.

Returns

  • The total number of tokens in the input texts, as an integer.

Example

total_tokens = vo.count_tokens(texts, model="voyage-large-2-instruct")
print(total_tokens)
86

Our embedding models have context length limits. If your text exceeds the limit, you will need to truncate it before calling the API, or specify the truncation argument so that we can truncate it for you.
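
Below is a minimal sketch of client-side truncation using the Hugging Face tokenizer loaded earlier. The 4,000-token budget is a placeholder, not a documented limit; check the context length of your specific model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage-large-2-instruct')

def truncate_to_token_limit(text: str, max_tokens: int = 4000) -> str:
    # Encode without special tokens so the count reflects the raw text.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return text
    # Keep the first max_tokens tokens and decode back to a string.
    return tokenizer.decode(token_ids[:max_tokens])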

📘

Tokens, words, and characters

Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple," will be tokens by themselves. In contrast, rare or long words will be broken into multiple tokens, e.g., "uncharacteristically" is dissected into four tokens, "▁un", "character", "ist", and "ically". One word roughly corresponds to 1.2 - 1.5 tokens on average, depending on the complexity of the domain. The tokens produced by our tokenizer have an average of 5 characters, suggesting that you could roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, please use the count_tokens() function.
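
As a quick sanity check of the characters-divided-by-5 heuristic (the sample string below is ours; the actual ratio varies by domain):

import voyageai

vo = voyageai.Client()  # API key read from the VOYAGE_API_KEY environment variable
text = "uncharacteristically long documents need exact token counts"
estimated = len(text) / 5  # rough heuristic: ~5 characters per token
exact = vo.count_tokens([text], model="voyage-large-2-instruct")
print(f"estimated: {estimated:.0f}, exact: {exact}")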

📘

tiktoken

tiktoken is the open-source version of OpenAI's tokenizer. Voyage models use different tokenizers, which can be accessed from Hugging Face🤗. Therefore, our tokenizer may generate a different list of tokens for a given text compared to tiktoken. Statistically, the number of tokens produced by our tokenizer is on average 1.1 - 1.2 times that of tiktoken. To determine the exact number of tokens, please use the count_tokens() function.
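
If you are migrating token budgets from an OpenAI-based pipeline, a rough side-by-side count might look like the following. cl100k_base is one of tiktoken's encodings; the exact ratio you observe will vary with the text.

import tiktoken
import voyageai

enc = tiktoken.get_encoding("cl100k_base")
vo = voyageai.Client()  # API key read from the VOYAGE_API_KEY environment variable

text = "Photosynthesis in plants converts light energy into chemical energy."
tiktoken_count = len(enc.encode(text))
voyage_count = vo.count_tokens([text], model="voyage-large-2-instruct")
print(tiktoken_count, voyage_count)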