Unicode character support

We're evaluating Voyage for generating embeddings for later use in semantic search, but at least some Unicode characters appear to be unsupported: they cause the embedding API call to return a 400 error with the body {"detail":"There was an error parsing the body"}. For example, the input "Pyrex Spring Blossom Light Green Beaded Edge Nested Mixing Bowl 402 1 ½ Qt" fails, but "Pyrex Spring Blossom Light Green Beaded Edge Nested Mixing Bowl 402 1 Qt" succeeds.

The same code succeeds for both inputs against Gemini's text embedding API, so the basic handling of HTTP requests, content types and encodings appears to be correct. The input is JSON encoded as UTF-8, which is the default encoding for application/json, and declaring the charset explicitly in the Content-Type header does not change the behaviour.

Does the embedding API require some transformation of the input text to restrict it to an allowed subset of Unicode, or am I somehow issuing the request incorrectly?
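
One way to narrow this down is to serialize the same payload twice, once as raw UTF-8 and once with non-ASCII characters escaped as \uXXXX sequences, and confirm that both decode to identical JSON. The sketch below does only that and makes no network call. The `input` and `model` field names and the `example-model` value are assumptions for illustration, not confirmed details of the embedding API, and `build_bodies` is a hypothetical helper.

```python
import json
from typing import Tuple


def build_bodies(text: str, model: str = "example-model") -> Tuple[bytes, bytes]:
    """Serialize one payload two ways: raw UTF-8 and ASCII-escaped JSON.

    The payload shape ("input", "model") is an assumption for illustration.
    """
    payload = {"input": [text], "model": model}
    # Non-ASCII characters are written literally, then encoded as UTF-8 bytes.
    raw_utf8 = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    # Non-ASCII characters are written as \uXXXX escapes, so the body is pure ASCII.
    ascii_escaped = json.dumps(payload, ensure_ascii=True).encode("ascii")
    return raw_utf8, ascii_escaped


text = "Pyrex Spring Blossom Light Green Beaded Edge Nested Mixing Bowl 402 1 ½ Qt"
raw_utf8, ascii_escaped = build_bodies(text)

# Both bodies are valid JSON and decode to the same payload.
assert json.loads(raw_utf8) == json.loads(ascii_escaped)

# The raw body carries "½" (U+00BD) as the two UTF-8 bytes C2 BD.
assert b"\xc2\xbd" in raw_utf8

# The escaped body carries the same character as the six ASCII bytes \u00bd.
assert b"\\u00bd" in ascii_escaped
```

Sending each body in turn with Content-Type: application/json would show whether the server rejects the character itself or only its raw UTF-8 form. If the escaped body is accepted and the raw one is not, the likely culprit is how the bytes are transmitted or parsed rather than an unsupported subset of Unicode, though that remains a guess until confirmed against the API.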