A list of lists, where each inner list contains a query, a document, or document chunks to be vectorized.
Each inner list in inputs represents a set of text elements that will be embedded together. Each element is encoded not only on its own but also with context from the other elements in the same list.
inputs = [["text_1_1", "text_1_2", ..., "text_1_n"],
["text_2_1", "text_2_2", ..., "text_2_m"]]
Document Chunks. Most commonly, each inner list contains chunks from a single document, ordered by their position in the document. In this case:
inputs = [["doc_1_chunk_1", "doc_1_chunk_2", ..., "doc_1_chunk_n"],
["doc_2_chunk_1", "doc_2_chunk_2", ..., "doc_2_chunk_m"]]
Each chunk is encoded in context with the others from the same document, resulting in more context-aware embeddings. We recommend that supplied chunks not have any overlap.
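As an illustration of the structure above, the following is a minimal sketch of how non-overlapping chunks might be assembled into inputs, one inner list per document. The helper names (split_into_chunks, build_document_inputs) and the word-based chunking are assumptions made for this example, not part of the API; any chunking strategy that keeps chunks in document order and free of overlap would fit the same shape.

```python
def split_into_chunks(text: str, words_per_chunk: int) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of whole words.

    Hypothetical helper: word-based chunking is only a stand-in for
    whatever chunking strategy you actually use.
    """
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    # Step by words_per_chunk so no word appears in two chunks.
    return [
        " ".join(words[i : i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def build_document_inputs(documents: list[str], words_per_chunk: int) -> list[list[str]]:
    """Return one inner list per document, with chunks in document order."""
    chunked = (split_into_chunks(doc, words_per_chunk) for doc in documents)
    # Documents with no words yield no chunks and are skipped here,
    # so the result may be shorter than the documents list.
    return [chunks for chunks in chunked if chunks]


docs = ["alpha beta gamma delta epsilon", "one two three"]
inputs = build_document_inputs(docs, words_per_chunk=2)
```

Stepping through the word list in fixed strides is what keeps the chunks from overlapping, in line with the recommendation above.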
Context-Agnostic Behavior for Queries and Documents. If there is one element per inner list, each text is embedded independently, similar to standard (context-agnostic) embeddings:
inputs = [["query_1"], ["query_2"], ..., ["query_k"]]
inputs = [["doc_1"], ["doc_2"], ..., ["doc_k"]]
Therefore, if the inputs are queries, each inner list should contain exactly one query (a length of one), as shown above, and input_type should be set to query.
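A flat list of queries can be reshaped into this one-element-per-inner-list form with a short comprehension. The sketch below assumes a hypothetical helper name, as_single_element_inputs, and only builds the inputs value; it does not call the API.

```python
def as_single_element_inputs(texts: list[str]) -> list[list[str]]:
    """Wrap each text in its own inner list so it is embedded independently.

    Hypothetical helper for illustration only.
    """
    return [[text] for text in texts]


queries = ["what is a contextualized embedding?", "should chunks overlap?"]
query_inputs = as_single_element_inputs(queries)
```

The same helper applies to whole documents that should be embedded without shared context.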
The following constraints apply to the inputs list:
- The list must not contain more than 1,000 inputs (inner lists).
- The total number of tokens across all inputs must not exceed 120K.
- The total number of chunks across all inputs must not exceed 16K.
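These limits can be checked on the client before a request is sent. The sketch below is one possible pre-flight check under stated assumptions: "120K" and "16K" are read as 120,000 and 16,000, each inner list counts as one input, and the default token counter is a rough whitespace split rather than the model's real tokenizer. The names check_input_limits and approx_token_count are hypothetical.

```python
from typing import Callable

# Limits as read from the constraints above (assumed: K = 1,000).
MAX_INPUTS = 1_000
MAX_TOTAL_TOKENS = 120_000
MAX_TOTAL_CHUNKS = 16_000


def approx_token_count(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def check_input_limits(
    inputs: list[list[str]],
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if len(inputs) > MAX_INPUTS:
        problems.append(f"{len(inputs)} inputs exceeds {MAX_INPUTS}")
    total_chunks = sum(len(inner) for inner in inputs)
    if total_chunks > MAX_TOTAL_CHUNKS:
        problems.append(f"{total_chunks} chunks exceeds {MAX_TOTAL_CHUNKS}")
    total_tokens = sum(count_tokens(text) for inner in inputs for text in inner)
    if total_tokens > MAX_TOTAL_TOKENS:
        problems.append(f"{total_tokens} tokens exceeds {MAX_TOTAL_TOKENS}")
    return problems


ok_problems = check_input_limits([["first chunk", "second chunk"], ["a query"]])
too_many_problems = check_input_limits([["x"]] * 1001)
```

Because the whitespace count only approximates real token counts, passing in a counter backed by the model's own tokenizer through count_tokens gives a more reliable check.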