OpenAI API Purchase Embeddings

According to the OpenAI documentation, embeddings measure the relatedness of text strings in numeric (vector) form, and are commonly used for:

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
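As a rough illustration of what "measuring relatedness" means, here is a minimal sketch that ranks a few items against a query by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings (text-embedding-ada-002 returns 1536 numbers per string), and the function names are my own rather than part of any OpenAI library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, items):
    """Return item names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Toy vectors, invented for illustration only.
toy_embeddings = {
    "dog food review": [0.9, 0.1, 0.0],
    "cat food review": [0.8, 0.3, 0.1],
    "laptop review": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, toy_embeddings)
print(ranking)
```

Search, recommendations, and clustering all build on this kind of pairwise score; only what is done with the scores differs.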

The documentation has three parts: getting embeddings, embedding models, and use cases. One question before jumping in: if we already use the file upload -> fine-tuning service, do we still need to purchase embeddings?

To get an embedding, send your text string to the embeddings API endpoint along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.
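To make "extract, save, and use" concrete, here is a small sketch that pulls the vectors out of a response and serialises them. The `mock_response` dict is hand-written to mirror the response layout (a `data` list whose entries carry an `embedding` and an `index`); its values are made up and truncated to three numbers, and `extract_embeddings` is a hypothetical helper name.

```python
import json

def extract_embeddings(response):
    """Return the embedding vectors from a response, ordered by input index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# Hand-written stand-in for an API response; real vectors are much longer.
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "text-embedding-ada-002",
}

vectors = extract_embeddings(mock_response)
saved = json.dumps(vectors)    # save as JSON text...
restored = json.loads(saved)   # ...and load it back for later use
print(restored[0])
```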

OpenAI recommends using text-embedding-ada-002 for nearly all use cases, describing it as better, cheaper, and simpler to use.

OpenAI uses the Amazon fine-food reviews dataset to demonstrate some use cases. One is using embeddings as a text feature encoder for ML algorithms. According to the documentation, incorporating embeddings will improve the performance of any machine learning model if some of the relevant inputs are free text. The embedding representation is generally very rich and information dense; for example, reducing the dimensionality of the inputs using SVD or PCA, even by 10%, generally results in worse downstream performance on specific tasks. The other is classification using the embedding features. The embeddings for the reviews are computed and saved like this:

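import openai  # legacy openai<1.0 SDK, which exposes openai.Embedding

def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before sending the text to the API
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

# df is assumed to be a pandas DataFrame of reviews with a `combined` text column
df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
df.to_csv('output/embedded_1k_reviews.csv', index=False)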
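Once embeddings are available as feature vectors, any ordinary classifier can be trained on them. The sketch below is not the method from OpenAI's example; it swaps in a simple nearest-centroid rule so that it runs without extra dependencies. The two-dimensional vectors and labels are invented toy data, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], vec))

# Toy "embeddings" standing in for real review embeddings.
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.3])
print(prediction)
```

With real data, the `ada_embedding` column would supply the feature vectors and a review score or sentiment column would supply the labels.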

Other use cases include code search using embeddings and recommendations using embeddings, both of which work much like text search:

from openai.embeddings_utils import get_embedding, cosine_similarity

# df is assumed to hold source code snippets in its `code` column
df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

def search_functions(df, code_query, n=3, pprint=True, n_lines=7):
    # pprint and n_lines are not used in this trimmed version
    embedding = get_embedding(code_query, model='text-embedding-ada-002')
    df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))

    # Keep the n rows whose code is most similar to the query
    res = df.sort_values('similarities', ascending=False).head(n)
    return res

res = search_functions(df, 'Completions API tests', n=3)
