
Similarity Search in PostgreSQL Using OpenAI Embedding Models and pgvector: AI Reading Summary (包阅AI)

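Reading Guide Summary

1. Keywords: OpenAI embedding models, similarity search, pgvector, RAG applications, vector embeddings

2. Summary: This article introduces vector embeddings, including their definition, key concepts, and how they are generated. It explains what OpenAI embedding models are used for and how they support similarity search and RAG applications. It also walks through using the OpenAI embeddings API and choosing a suitable model, including setting up a Python environment, getting an API key, and calling the API.

3. Main content:

– Vector embedding concepts
  – Definition: numerical representations of data
  – Key concepts: numerical representation, dimensionality, semantic relationships
  – Generating models: BGE, Sentence Transformers, and others

– OpenAI embedding models
  – Available models: text-embedding-3-large, text-embedding-3-small, text-embedding-ada-002
  – Uses: support similarity search, going beyond traditional keyword search
  – Role in RAG: retrieve documents through similarity search to give the LLM more context for more accurate responses

– Using the OpenAI embeddings API
  – Setting up the Python environment
  – Getting an API key
  – Calling the API and its parameters
  – List of available models and considerations for choosing one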


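Article URL: https://www.timescale.com/blog/similarity-search-on-postgresql-using-openai-embeddings-and-pgvector/

Source: timescale.com

Author: Team Timescale

Published: 2024/8/22 13:58

Language: English

Word count: 2,601

Estimated reading time: 11 minutes

Score: 83

Tags: embedding models, PostgreSQL, similarity search, OpenAI, retrieval-augmented generation (RAG)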


The original article follows.


What Are Embeddings?

Vector embeddings, often referred to simply as embeddings, are numerical representations of data such as words, sentences, images, audio, time-series data, or even molecular structures. Vector embeddings are useful because they help capture the semantic or contextual relationships between data points.

Here are some key concepts in vector embeddings:

  • Numerical representation: Each object (sentence, image, etc.) is represented as a point in a high-dimensional vector space. This point is defined by a vector of numbers (i.e., a list of numerical values).
  • Dimensionality: The number of dimensions in the vector space, that is, the length of each vector. Higher dimensions allow for more detailed and nuanced representations of data.
  • Semantic relationships: Semantically similar objects are represented by vectors that are close together in the high-dimensional vector space. For example, in word embeddings, the words “king” and “queen” might be close together, while “king” and “car” would be far apart.
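To make “close together” concrete, here is a minimal sketch that uses cosine similarity, a common measure of closeness between embeddings. The three-dimensional vectors are hand-made toy values chosen for illustration, not the output of any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", made up for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_car = cosine_similarity(toy_vectors["king"], toy_vectors["car"])

print(f"king vs queen: {king_queen:.3f}")
print(f"king vs car:   {king_car:.3f}")
```

With these toy values, the “king” and “queen” vectors score much higher than “king” and “car”, which is the pattern a trained embedding model is expected to produce for related and unrelated words.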

Embeddings are generated using pre-trained AI models. Some examples of the models you can use are BGE and Sentence Transformers for text, CLIP for images, and Wav2Vec for audio.

OpenAI, for instance, offers developers access to its powerful text embedding models via an API to generate embeddings. The currently available models are:

  • text-embedding-3-large: OpenAI’s best-performing embedding model for tasks in both English and non-English languages.
  • text-embedding-3-small: Smaller in size and a cost-effective embedding model, it improves upon its predecessor, text-embedding-ada-002.
  • text-embedding-ada-002: This embedding model is the second-generation model that replaced the 16 first-generation models that OpenAI had released.

In this article, we will explore how OpenAI’s Embedding Models generate vector embeddings, why these embeddings are useful for similarity search, and how you can utilize them to build retrieval-augmented generation (RAG) applications.

How Are Embedding Models Useful?

Embedding models help transform data into vector embeddings, capturing semantic relationships within the data. This enables similarity search because related data points are closer to each other in vector space.

Similarity search goes beyond traditional keyword-based searches and is particularly useful for document retrieval.

Consider a scenario where a user wants to search for documents related to “machine learning.” A keyword-based search will only retrieve documents containing the exact phrase “machine learning” (or, at best, other documents that are tagged with the phrase). An embedding search, on the other hand, will provide a more comprehensive result set by including documents that mention similar concepts like “artificial intelligence,” “neural networks,” or “deep learning.”
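The limitation of exact-phrase matching is easy to reproduce. The sketch below is only an illustration: the document titles and the keyword_search helper are invented for this example.

```python
# Hypothetical document titles, invented for this example.
documents = [
    "An introduction to machine learning for beginners.",
    "How neural networks learn from data.",
    "Deep learning advances in computer vision.",
    "A recipe for sourdough bread.",
]

def keyword_search(query, docs):
    """Return the documents that contain the exact query phrase (case-insensitive)."""
    needle = query.lower()
    return [doc for doc in docs if needle in doc.lower()]

hits = keyword_search("machine learning", documents)
print(hits)
```

Only the first document matches, even though the second and third are clearly related. Those are the documents an embedding-based search is meant to surface.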

How Are Embeddings Used in RAG?

When building AI applications using large language models (LLMs), you often need to provide additional context or knowledge to the LLM so it can respond accurately to queries not covered in its training data.

These use cases are widespread in enterprise applications, which need to leverage internal company documents or knowledge bases. This architecture, where the LLM’s generation is augmented with data retrieved from a data store, is known as retrieval-augmented generation or RAG.

In RAG systems, when a user poses a query, the application first retrieves relevant documents or data from an external dataset using similarity search. The retrieved information is then incorporated into the prompt given to the LLM, which enables it to generate more accurate and contextually grounded responses.

A typical RAG application has the following steps:

  • Document ingestion: Documents are broken down into smaller chunks and transformed into vector embeddings (using an embedding model). These embeddings are then stored in a vector store, such as PostgreSQL with the pgvector extension.
  • Query embedding: When a user submits a query, it is also converted into an embedding using the same embedding model.
  • Document retrieval: Using similarity search through the vector space, documents similar to the query are retrieved.
  • Input augmentation: The top retrieved documents are combined with the original query to create an augmented input (prompt). This provides the LLM with additional context to generate a more comprehensive and accurate response.
  • Response generation: The LLM combines the retrieved information with its knowledge to generate the final response.
A diagram representing the steps of the typical RAG application: from the docs embedded as vectors to the query itself, going over the top k retrieved results, the LLM, and finally, the answer.

Let’s use a typical example of a RAG system, such as a customer service chatbot for a tech company, to understand how this works. In this instance, the company’s repository of product manuals, troubleshooting guides, and support tickets is converted to vector embeddings and stored in a vector database. When a user asks a question (e.g., “How do I reset my router?”), the relevant information is retrieved using similarity search and sent to the LLM, which then generates its response based on the context provided.
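The chatbot flow can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production pipeline: toy_embed is a bag-of-words stand-in for a real embedding model, the support snippets are invented, and the final LLM call is left out, so the sketch stops once the augmented prompt is built.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector (hypothetical)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Document ingestion: embed each chunk and keep it next to its text.
chunks = [
    "To reset your router, hold the reset button for ten seconds.",
    "Our warranty covers hardware defects for two years.",
    "Firmware updates are installed from the admin dashboard.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """2. Embed the query, then 3. return the k most similar chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    """4. Input augmentation: combine retrieved context with the original question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "How do I reset my router?"
top_chunks = retrieve(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # 5. In a real system, this prompt would be sent to the LLM.
```

Swapping toy_embed for a real embedding model and the in-memory list for a vector store gives the architecture described above; the control flow stays the same.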

Embedding models are essential for building RAG applications. By leveraging OpenAI Embeddings models and vector stores like pgvector, we can create powerful RAG systems.

Let’s explore how you can do this yourself.

Using OpenAI Embeddings API

Below, we will look at how the OpenAI Embeddings API endpoint can be used to generate embeddings. Then, we will showcase how to build a RAG application using the generated embeddings.

Setting up Python environment

First, you should set up your Python environment. Pyenv is a great tool to manage multiple Python installations and virtual environments on your machine.

Once you have the Python environment, you can follow these steps to launch a Jupyter Notebook:

$ pip install jupyterlab
$ jupyter lab

Getting an API key

To call the OpenAI Embeddings API, you must first obtain an API key. For this, you need to sign up for the OpenAI developer platform. You can find it here: https://platform.openai.com/api-keys.

The API Keys page in the OpenAI platform

Once you have generated your API key, you can save it as an environment variable in your Jupyter Notebook.

import os
os.environ["OPEN_AI_API_KEY"] = 'YOUR-API-KEY'

Calling the OpenAI embeddings API

We need to install the openai Python module to work with OpenAI embeddings.

pip install openai

In Python, you can create an embedding of a text input in the following manner.

from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPEN_AI_API_KEY'))

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The capital of France is Paris",
           "An apple a day keeps the doctor away"],
    encoding_format="float"
)

If these input parameters are unfamiliar to you, here’s an overview:

  • model (required): ID of the embedding model to use. We’ll shortly discuss the models available on the platform.
  • input (required): This is the input text that you want to convert to embeddings. To create embeddings for multiple inputs in a single request, you can provide an array of strings or an array of token arrays.
  • encoding_format (optional): The format of the returned embeddings. Can be either ‘float’ or ‘base64’.
  • dimensions (optional): The number of dimensions of the resulting embeddings. The upper limit is the maximum dimension supported by the model. This feature is only supported in ‘text-embedding-3’ models.
  • user (optional): A unique identifier for your end user, which assists OpenAI in monitoring and detecting abuse.

The API response will be a JSON, where the data key contains a list of embedding objects:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.03246759,
        0.010273109,
        ....
        0.004335752
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 21
  }
}

  • response.data: This will give you a list of embedding objects.
    • object: Type of data object.
    • embedding: An array of floating-point numbers representing the vector embedding. Each number in the array is a dimension in the high-dimensional space, capturing semantic information about the input text.
    • index: The index of the input text corresponding to this embedding.
  • response.model: Name of the embedding model used.
  • response.object: Type of the data returned. In this case, it would be a ‘list’ of embedding objects.
  • response.usage: The number of tokens used in the process. Useful if you want to calculate the expenses of each call.

So, in the above example, if you want to get the vector embeddings of the first string, you can do so in the following manner:

embedding_1 = response.data[0].embedding
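The usage field can be turned into a rough cost estimate. The sketch below is plain arithmetic on the token count from the sample response; the per-1,000-token price is the text-embedding-3-small figure quoted later in this article and may have changed, so treat it as an assumption. The helper name estimate_cost_usd is made up for this example. With a live response object, the token count would come from response.usage.total_tokens.

```python
def estimate_cost_usd(total_tokens, price_per_1k_tokens):
    """Approximate cost of one embeddings call, given a price per 1,000 tokens."""
    return total_tokens / 1000 * price_per_1k_tokens

# Assumed price for text-embedding-3-small, as quoted in this article.
# Check OpenAI's pricing page for current values.
PRICE_PER_1K_TOKENS_USD = 0.00002

# Token counts copied from the sample JSON response above.
usage = {"prompt_tokens": 21, "total_tokens": 21}

cost = estimate_cost_usd(usage["total_tokens"], PRICE_PER_1K_TOKENS_USD)
print(f"Estimated cost: ${cost:.10f}")
```

At this price a single small request costs a fraction of a cent, so the estimate mostly matters when embedding large document collections.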

List of available OpenAI embedding models

On January 25, 2024, OpenAI launched two new embedding models: text-embedding-3-large and text-embedding-3-small. Currently, they list the following embedding models in their docs.

A description of OpenAI's embedding models in the OpenAI page

Of the above models, ‘text-embedding-3-large’ is the best performing. The ‘text-embedding-3-small’ is the most affordable model, with a price of $0.00002 per one thousand tokens, which is 5x lower than the price of ‘text-embedding-ada-002’ ($0.0001 per one thousand tokens).

Choosing the embedding model

A powerful capability of the ‘text-embedding-3’ series of embedding models is its ability to shorten output dimensions with minimal performance loss, a particularly useful tactic for reducing an application’s memory footprint.

The ‘text-embedding-3-large’ model outperforms the full-size (1,536-dimension) ‘text-embedding-ada-002’ model even when its output is shortened to 256 dimensions.

Here’s how you can use the same model but reduce dimensions:

from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPEN_AI_API_KEY'))

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The capital of France is Paris",
           "An apple a day keeps the doctor away"],
    encoding_format="float",
    dimensions=256,  # Generate embeddings with 256 dimensions
)

# Check the length of the first embedding, not the number of embeddings returned
print(len(response.data[0].embedding))  # 256

Dimension shortening happens by simply truncating numbers from the end of the vector. Doing this with an embedding model that was not trained for it will cause the output to lose some, if not all, of its semantic meaning.

However, this is not the case with the third-generation OpenAI embedding models, as they have been trained with a technique that allows embeddings to be shortened without losing their semantic-representing properties. This technique is called Matryoshka Representation Learning (MRL). Named after the Russian nesting dolls (Matryoshkas), MRL enables embeddings to adapt to various dimensions without requiring multiple separate models.

Training an embedding model with the MRL technique begins with smaller (coarser) sub-vectors and gradually moves to larger (finer) dimensions, making each sub-vector meaningful. These sub-vectors usually double in size each time, following a pattern like 256, 512, and 1,024. This means that different vector sizes will have similar meanings, with only a slight loss of detail. Therefore, it is generally safe to reduce the dimensions of your embeddings, ideally to one of the sizes in this doubling pattern.
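If full-length vectors are already stored, the same shortening can be done on the client side. The sketch below assumes embeddings are plain Python lists; the helper name shorten_embedding and the 8-value input are made up for illustration. It re-normalizes the truncated vector to unit length, which OpenAI’s embeddings guide recommends when you change the dimension manually instead of using the dimensions parameter.

```python
import math

def shorten_embedding(embedding, dimensions):
    """Keep the first `dimensions` values, then rescale the result to unit length."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        return truncated  # an all-zero vector cannot be normalized
    return [x / norm for x in truncated]

# Toy 8-dimensional vector standing in for a real 1,536-dimensional embedding.
full = [0.5, 0.5, 0.1, 0.1, 0.5, 0.3, 0.2, 0.3]
short = shorten_embedding(full, 4)

print(len(short))
print(math.sqrt(sum(x * x for x in short)))
```

Keeping vectors at unit length means cosine-based comparisons behave the same way before and after shortening.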

Even with a reduced size, MRL-trained embeddings maintain high accuracy and effectiveness, comparable to larger, independently trained embeddings. Therefore, although the second-generation ‘text-embedding-ada-002’ model is not deprecated, it is advisable to opt for the third-generation models instead.

Let’s now look at how OpenAI Embeddings can be used in conjunction with PostgreSQL and pgvector to power similarity search.

Similarity Search Using OpenAI Embeddings Models and Pgvector

PostgreSQL is one of the most popular open-source databases in the market. With pgvector, pgvectorscale, and pgai extensions, developers can turn PostgreSQL into a high-performance vector database.

Pgvector is an open-source extension that brings vector-handling capabilities to PostgreSQL. It enables efficient vector storage and retrieval directly within the database.

Pgvectorscale extends pgvector’s capabilities by adding features like the StreamingDiskANN index and Statistical Binary Quantization to optimize vector search and storage in PostgreSQL.

Pgai is an open-source PostgreSQL extension that brings AI workflows, such as embedding creation, reranking, and LLM completions, directly into the database. It makes it easier to build AI capabilities, like semantic search or RAG.

In the steps below, we will use PostgreSQL in conjunction with pgvector, pgai, and pgvectorscale.

Setting up a PostgreSQL database

First, you need a working setup of PostgreSQL with the extensions. You can install these manually, use a pre-built Docker container, or simply use Timescale Cloud, which comes preinstalled with pgai, pgvector, and pgvectorscale.

Let’s use Timescale Cloud to create a free PostgreSQL database. Once done, obtain your service URL from the dashboard.

The Connect to your service page in the Timescale Cloud console

You should also add your password to the service URL and save it as an environment variable.

import os
os.environ["DB_CONNECTION_STRING"] = "postgres://tsdbadmin:<password>@dahhp8x0y3.p191xw8e4w.tsdb.cloud.timescale.com:32162/tsdb?sslmode=require"

We can now set up the database and enable the three extensions.

import os
import psycopg2  # install with: pip install psycopg2-binary

conn = psycopg2.connect(os.getenv('DB_CONNECTION_STRING'))

def setup_database():
    cursor = conn.cursor()
    # Enable pgvector, pgai, and pgvectorscale
    pgvector = """CREATE EXTENSION IF NOT EXISTS vector"""
    pgai = """CREATE EXTENSION IF NOT EXISTS ai CASCADE"""
    pgvectorscale = """CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE"""
    cursor.execute(pgvector)
    cursor.execute(pgai)
    cursor.execute(pgvectorscale)
    conn.commit()

setup_database()

Creating the dataset

As a test dataset, we’ll use a list of sentences about the history of technology.

history_of_technology = [
    "Early humans used simple stone tools for hunting and gathering.",
    "The invention of the wheel around 3500 BCE revolutionized transportation.",
    "Ancient Egyptians developed techniques for building the pyramids.",
    "The Greeks made significant advancements in engineering and architecture.",
    "Romans introduced aqueducts and concrete to improve infrastructure.",
    "Islamic scholars preserved ancient knowledge during the Middle Ages.",
    "The printing press, invented by Johannes Gutenberg in the 15th century, revolutionized information dissemination.",
    "The Industrial Revolution in the 18th century marked a significant leap in manufacturing and production technologies.",
    "Advances in artificial intelligence are transforming industries and everyday life.",
    "The development of blockchain technology has the potential to revolutionize finance and data security.",
    "Quantum computing is emerging as the next frontier in computational power.",
    "Autonomous vehicles are being developed to change the landscape of transportation."
]

Let’s now create a table in PostgreSQL.

def create_table():
    with conn.cursor() as cur:
        # Create the table
        cur.execute("""
        CREATE TABLE IF NOT EXISTS history_of_tech (
            id bigserial primary key,
            content text,
            embedding vector(1536)
        )
        """)
    conn.commit()

create_table()
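The walkthrough does not show the step that loads the sentences into the table. One way to do it is sketched below, assuming that conn is the psycopg2 connection opened earlier and history_of_technology is the list defined above. The helper names to_rows and insert_sentences are made up for this example and are not part of the original article.

```python
def to_rows(sentences):
    """Wrap each sentence in a 1-tuple, the shape cursor.executemany expects."""
    return [(sentence,) for sentence in sentences]

def insert_sentences(conn, sentences):
    """Insert one row per sentence; the embedding column is left NULL for now."""
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO history_of_tech (content) VALUES (%s)",
            to_rows(sentences),
        )
    conn.commit()

# Usage with the objects from this walkthrough (requires a live database):
# insert_sentences(conn, history_of_technology)

sample_rows = to_rows([
    "Early humans used simple stone tools for hunting and gathering.",
    "Quantum computing is emerging as the next frontier in computational power.",
])
print(sample_rows)
```

Passing the sentences as query parameters, rather than formatting them into the SQL string, lets the driver handle quoting and avoids SQL injection.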

The embedding column is currently NULL. In the next step, we will generate embeddings and populate the column.

Converting data into embeddings

We will now convert the data from the content column of the table into embeddings using the openai_embed function of pgai.

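To do this, you will need your OpenAI API key, which we saved earlier in the OPEN_AI_API_KEY environment variable.

# Read the key from the same environment variable that was set earlier
OPENAI_API_KEY = os.getenv('OPEN_AI_API_KEY')

def generate_embeddings():
    with conn.cursor() as cur:
        # Embed every row that does not have an embedding yet
        cur.execute("""
        UPDATE history_of_tech
        SET embedding = openai_embed(
            'text-embedding-3-small'
            , content
            , _api_key=>%s) WHERE embedding is NULL;
        """, (OPENAI_API_KEY,))
    conn.commit()

generate_embeddings()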

With pgai, you can generate embeddings from popular AI models using a simple SQL query.

Creating a StreamingDiskANN index

Pgvectorscale adds a third approximate nearest neighbor (ANN) search algorithm, called StreamingDiskANN, alongside the IVFFlat and HNSW indexes that pgvector already provides. StreamingDiskANN uses a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, potentially even traversing the entire graph!

For larger datasets, you should create a StreamingDiskANN index to speed up the search. Here’s how you can do it:

def create_index():
    with conn.cursor() as cur:
        cur.execute("""
        CREATE INDEX document_embedding_idx ON history_of_tech USING diskann (embedding);
        """)
    conn.commit()

create_index()

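In Timescale’s benchmark on 50 million Cohere embeddings, PostgreSQL with pgvector and pgvectorscale achieved 28x lower p95 latency and 16x higher query throughput than Pinecone’s storage-optimized index (s1) for approximate nearest neighbor queries at 99% recall.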

We are now ready to perform a similarity search on our table using an SQL query. To do that, we will first convert the query text into an embedding and then use that embedding to rank the rows.

def similarity_search(query):
    sql = """
    WITH query_embedding AS (
        SELECT openai_embed(
            'text-embedding-3-small'
            , %s
            , _api_key=>%s
        ) AS embedding
    )
    SELECT content
    FROM history_of_tech, query_embedding
    ORDER BY history_of_tech.embedding <=> query_embedding.embedding
    LIMIT 5
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query, OPENAI_API_KEY,))
        return cur.fetchall()

results = similarity_search("What were some key advancements in communication technology?")
print(results)

Here are the results:

The query results

As you can see, the results don’t necessarily contain the same keywords as the query, but they do convey the same overall semantic meaning.
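In pgvector, <=> is the cosine distance operator (1 minus cosine similarity), so ORDER BY ... <=> ... LIMIT 5 returns the five rows nearest to the query. The sketch below reproduces that ranking in plain Python as an exact, brute-force scan. The two-dimensional vectors are made-up toy values, and an ANN index such as StreamingDiskANN approximates this result rather than computing it exhaustively.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the quantity computed by pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, rows, k=5):
    """rows: list of (content, embedding) pairs; returns k contents, nearest first."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [content for content, _ in ranked[:k]]

# Hypothetical toy rows: (content, 2-dimensional "embedding").
rows = [
    ("printing press", [0.9, 0.1]),
    ("stone tools", [0.1, 0.9]),
    ("quantum computing", [0.6, 0.6]),
]
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, rows, k=2)
print(nearest)
```

Sorting every row works for a dozen sentences; on large tables, that full scan is what the index is there to avoid.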

Final Words

In this article, we have demonstrated how OpenAI Embeddings models can transform data into high-dimensional numerical vectors. The third-generation models, in particular, offer great performance and flexibility, including dimension shortening with minimal loss of semantic meaning.

By leveraging OpenAI Embeddings models for embedding creation and PostgreSQL with pgvector, pgai, and pgvectorscale for scalable vector search, you can create powerful applications that take advantage of similarity search.

Pgai and pgvectorscale are open source under the PostgreSQL License and available for you to use in your AI projects today. To install pgai and pgvectorscale, check out the GitHub repos of pgai and pgvectorscale. You can also access them on any database service on Timescale’s cloud PostgreSQL platform.
