Tokenization

Voyage uses the same tokenizer as Llama 2. Given an input text string, the first step of the embedding process is to split it into a list of tokens. This step is performed automatically on our servers when you call the API.

📘

Voyage's Tokenizer on Hugging Face🤗

Voyage's tokenizer is available on Hugging Face🤗. You can access it using the following code:

from transformers import AutoTokenizer

# Load Voyage's tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('voyageai/voyage')
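
Once loaded, the tokenizer can also be run locally to inspect how a string is split, without calling the API. For example (the word and its split below are the same ones discussed in the note on tokens, words, and characters later in this section):

print(tokenizer.tokenize('uncharacteristically'))
['▁un', 'character', 'ist', 'ically']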

In our Python package, we provide two functions in voyageai.Client that allow you to preview the tokenization results before calling the API:

voyageai.Client.tokenize(texts: List[str])

Parameters

  • texts (List[str]) - A list of texts to be tokenized. Currently, all Voyage embedding models use the same tokenizer.

Returns

  • A list of tokenizers.Encoding, each of which represents the tokenized results of an input text string.

Example

import voyageai

vo = voyageai.Client()  # reads the API key from the VOYAGE_API_KEY environment variable

# texts is a list of input strings, e.g., the six examples whose tokens are shown below
tokenized = vo.tokenize(texts)
for encoding in tokenized:
    print(encoding.tokens)
['<s>', '▁The', '▁Mediter', 'rane', 'an', '▁di', 'et', '▁emphas', 'izes', '▁fish', ',', '▁o', 'live', '▁oil', ',', '▁...']
['<s>', '▁Ph', 'otos', 'yn', 'thesis', '▁in', '▁plants', '▁converts', '▁light', '▁energy', '▁into', '▁...']
['<s>', '▁', '2', '0', 'th', '-', 'century', '▁innov', 'ations', ',', '▁from', '▁rad', 'ios', '▁to', '▁smart', 'ph', 'ones', '▁...']
['<s>', '▁R', 'ivers', '▁provide', '▁water', ',', '▁ir', 'rig', 'ation', ',', '▁and', '▁habitat', '▁for', '▁...']
['<s>', '▁Apple', '’', 's', '▁conference', '▁call', '▁to', '▁discuss', '▁fourth', '▁fis', 'cal', '▁...']
['<s>', '▁Shakespeare', "'", 's', '▁works', ',', '▁like', "▁'", 'H', 'am', 'let', "'", '▁and', '▁...']

voyageai.Client.count_tokens(texts: List[str])

Parameters

  • texts (List[str]) - A list of texts to count the tokens for.

Returns

  • The total number of tokens in the input texts, as an integer.

Example

total_tokens = vo.count_tokens(texts)
print(total_tokens)
86

Our embedding models have context length limits. If your text exceeds the limit, you will need to either truncate the text before calling the API or specify the truncation argument so that we can truncate it for you.
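
For a concrete sketch, assume a hypothetical CONTEXT_LENGTH holding your model's limit and the model name 'voyage-3' (substitute your own model); the truncation argument is the one mentioned above:

CONTEXT_LENGTH = 32000  # hypothetical value; check the limit for your model

# Option 1: pass the truncation argument and let the API truncate for you
result = vo.embed(texts, model='voyage-3', truncation=True)

# Option 2: check token counts yourself and shorten before calling the API
for text in texts:
    if vo.count_tokens([text]) > CONTEXT_LENGTH:
        ...  # truncate or split the text first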

📘

Tokens, words, and characters

Modern NLP models typically convert a text string into a list of tokens. Frequent words, such as "you" and "apple," are tokens by themselves. In contrast, rare or long words are broken into multiple tokens, e.g., "uncharacteristically" is split into four tokens: "▁un", "character", "ist", and "ically". One word roughly corresponds to 1.2 - 1.5 tokens on average, depending on the complexity of the domain. The tokens produced by our tokenizer average 5 characters each, so you can roughly estimate the number of tokens by dividing the number of characters in the text string by 5. To determine the exact number of tokens, please use the Client.count_tokens() function.
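
As a quick sketch of this heuristic (the sample string is illustrative):

text = 'Photosynthesis in plants converts light energy into chemical energy.'

estimate = len(text) / 5         # rough heuristic: ~5 characters per token
exact = vo.count_tokens([text])  # exact count from the API client
print(estimate, exact)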

📘

tiktoken

tiktoken is the open-source version of OpenAI's tokenizer. Voyage uses a different tokenizer, the same one as Llama 2, so it may generate a different list of tokens for a given text than tiktoken does. Statistically, our tokenizer produces on average 1.1 - 1.2 times as many tokens as tiktoken for the same text. To determine the exact number of tokens, please use the Client.count_tokens() function.
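
To compare the two directly, here is a minimal sketch (assuming tiktoken is installed; 'cl100k_base' is one of OpenAI's encodings, chosen here only for illustration):

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')

# Token counts for the same inputs under each tokenizer
tiktoken_count = sum(len(enc.encode(text)) for text in texts)
voyage_count = vo.count_tokens(texts)
print(tiktoken_count, voyage_count)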