VoyageAI embeddings seem to be very similar for dissimilar documents

I've been experimenting with VoyageAI embeddings for a project where we use cosine similarity as a first step in matching semantically equivalent documents.

I've noticed that, compared to other embedding models I've tried, such as OpenAI and Bedrock, the cosine similarities computed from VoyageAI embeddings fall in a much narrower range.

As an example, the documents in the Quickstart tutorial (https://docs.voyageai.com/docs/quickstart-tutorial) all have very similar cosine similarities to the query, even though the documents are quite different from one another.

I'm not sure if I'm doing something wrong, but I also ran the reranker code for that example, and the reranked relevance scores match those shown on that page.

The cosine similarities I get for this query and these documents are shown below.

query = "When is Apple's conference call scheduled?"

documents = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen.",
    "20th-century innovations, from radios to smartphones, centered on electronic advancements.",
    "Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.",
    "Apple’s conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.",
    "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature."
]
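
For reference, here is a minimal sketch of how cosines like these can be computed once the embeddings have been fetched from each provider. It assumes the embeddings are plain lists of floats; the function name and the toy 2-D vectors are made up for illustration and are not part of either provider's API.

```python
import math

def cosine_similarities(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each document vector."""
    q_norm = math.sqrt(sum(x * x for x in query_vec))
    sims = []
    for doc in doc_vecs:
        dot = sum(q * d for q, d in zip(query_vec, doc))
        d_norm = math.sqrt(sum(x * x for x in doc))
        sims.append(dot / (q_norm * d_norm))
    return sims

# Toy 2-D vectors standing in for real embeddings (illustrative only).
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(cosine_similarities(toy_query, toy_docs))
```

Dividing by both norms means the result does not depend on whether the provider returns unit-length vectors.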

VoyageAI voyage-2

array([0.57205128, 0.5865394 , 0.62985496, 0.56841758, 0.84377816,
       0.56752833])

OpenAI text-embedding-3-small

array([-0.00529196,  0.02914636,  0.14654271, -0.02232341,  0.78637504,
       -0.00315503])

Obviously the scores are all relative, but such a narrow spread feels odd.
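
To make "relative" concrete, here is a rough sketch that ranks the documents under each model and min-max rescales the scores reported above. The helper names are my own, and min-max rescaling is just one assumed way to put the two ranges side by side, not something either provider recommends.

```python
# Cosine similarities copied from the two arrays above.
voyage_scores = [0.57205128, 0.5865394, 0.62985496, 0.56841758, 0.84377816, 0.56752833]
openai_scores = [-0.00529196, 0.02914636, 0.14654271, -0.02232341, 0.78637504, -0.00315503]

def ranking(scores):
    """Document indices sorted from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def min_max(scores):
    """Rescale scores linearly so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

for name, scores in [("voyage-2", voyage_scores),
                     ("text-embedding-3-small", openai_scores)]:
    print(name, ranking(scores), [round(s, 3) for s in min_max(scores)])
```

On these numbers, both models rank the Apple document (index 4) first, then the electronics document (index 2), then the photosynthesis document (index 1); the orderings only differ among the three lowest-scoring documents.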

Is this just the nature of the VoyageAI embeddings, or am I possibly doing something wrong?