Image by Igor Omilaev on Unsplash
Implementing semantic search within corporate databases can be challenging and requires significant effort. However, does it have to be this way? In this article, I demonstrate how you can use PostgreSQL along with OpenAI embeddings to implement semantic search on your data. If you prefer not to use the OpenAI Embeddings API, I will provide links to free embedding models.
At a very high level, vector databases combined with LLMs allow you to run semantic search over available data (stored in databases, documents, etc.). Thanks to the “Efficient Estimation of Word Representations in Vector Space” paper (also known as the “Word2Vec paper”), co-authored by the legendary Jeff Dean, we know how to represent words as real-valued vectors. Word embeddings are dense vector representations of words in a vector space where words with similar meanings are close to each other. Word embeddings capture semantic relationships between words, and there is more than one technique for creating them.
Image by the author
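To make “closer to each other” concrete, here is a tiny NumPy sketch of cosine similarity, the measure used later in this article, computed on made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); values near 1.0 mean "similar direction".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors invented purely for illustration.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_sim(king, queen))  # high: related words end up close together
print(cosine_sim(king, apple))  # lower: unrelated words are farther apart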
Let’s practice and use OpenAI’s text-embedding-ada-002 model! The choice of distance function typically doesn’t matter much; OpenAI recommends cosine similarity. If you don’t want to use the OpenAI embeddings and prefer running a different model locally instead of making API calls, I suggest considering one of the SentenceTransformers pretrained models. Choose your model wisely.
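If you go the local route, a minimal sketch with the sentence-transformers package might look like this (the all-MiniLM-L6-v2 model is just an example choice; note that it produces 384-dimensional vectors, not 1536):
from sentence_transformers import SentenceTransformer, util

# Example model; pick one that fits your quality and latency needs.
model = SentenceTransformer("all-MiniLM-L6-v2")

local_embedding = model.encode("good ride")
print(local_embedding.shape)  # (384,)

# util.cos_sim computes cosine similarity between two embeddings.
print(util.cos_sim(model.encode("good ride"), model.encode("I really enjoyed the trip!")))
The rest of this article uses the OpenAI API: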
import os
import openai
from openai.embeddings_utils import cosine_similarity

openai.api_key = os.getenv("OPENAI_API_KEY")

def get_embedding(text: str) -> list:
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response["data"][0]["embedding"]
good_ride = "good ride"
good_ride_embedding = get_embedding(good_ride)
print(good_ride_embedding)
# [0.0010935445316135883, -0.01159335020929575, 0.014949149452149868, -0.029251709580421448, -0.022591838613152504, 0.006514389533549547, -0.014793967828154564, -0.048364896327257156, -0.006336577236652374, -0.027027441188693047, …]
len(good_ride_embedding)
# 1536
Now that we have developed an understanding of what an embedding is, let’s utilize it to sort some reviews.
good_ride_review_1 = "I really enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop."
good_ride_review_1_embedding = get_embedding(good_ride_review_1)
cosine_similarity(good_ride_review_1_embedding, good_ride_embedding)
# 0.8300454513797334
good_ride_review_2 = "The drive was exceptionally comfortable. I felt secure throughout the journey and greatly appreciated the on-board entertainment, which allowed me to have some fun while the car was in motion."
good_ride_review_2_embedding = get_embedding(good_ride_review_2)
cosine_similarity(good_ride_review_2_embedding, good_ride_embedding)
# 0.821774476808789
bad_ride_review = "A sudden hard brake at the intersection really caught me off guard and stressed me out. I wasn't prepared for it. Additionally, I noticed some trash left in the cabin from a previous rider."
bad_ride_review_embedding = get_embedding(bad_ride_review)
cosine_similarity(bad_ride_review_embedding, good_ride_embedding)
# 0.7950041130579355
While the absolute difference may appear small, consider a sorting function with thousands and thousands of reviews. In such cases, we can prioritize highlighting only the positive ones at the top.
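As a sketch of that idea, reusing the get_embedding helper and the variables defined above, a ranking pass over an in-memory list could look like this:
reviews = [good_ride_review_1, good_ride_review_2, bad_ride_review]

# Rank reviews by cosine similarity to the "good ride" embedding, most similar first.
ranked = sorted(
    reviews,
    key=lambda review: cosine_similarity(get_embedding(review), good_ride_embedding),
    reverse=True,
)
for review in ranked:
    print(review[:60])
In practice you would compute each review’s embedding once and store it, which is exactly where a database comes in.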
Once a word or a document has been transformed into an embedding, it can be stored in a database. That alone, however, does not make the database a vector database. It is only when the database supports fast operations on vectors, such as nearest-neighbor search, that we can rightfully call it a vector database.
There are numerous commercial and open-source vector databases, making this a highly discussed topic. I will demonstrate how a vector database works using pgvector, an open-source PostgreSQL extension that adds vector similarity search to arguably the most popular database.
Let’s run the PostgreSQL container with pgvector:
docker pull ankane/pgvector
docker run --env "POSTGRES_PASSWORD=postgres" --name "postgres-with-pgvector" --publish 5432:5432 --detach ankane/pgvector
Let’s start pgcli to connect to the database (pgcli postgres://postgres:postgres@localhost:5432) and create a table, insert the embeddings we computed above, and then select similar items:
-- Enable the pgvector extension.
CREATE EXTENSION vector;
-- Create a vector column with 1536 dimensions.
-- The `text-embedding-ada-002` model has 1536 dimensions.
CREATE TABLE reviews (text TEXT, embedding vector(1536));
-- Insert the three reviews from above. The embeddings are truncated here for convenience.
INSERT INTO reviews (text, embedding) VALUES ('I really enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop.', '[-0.00533589581027627, -0.01026702206581831, 0.021472081542015076, -0.04132508486509323, …');
INSERT INTO reviews (text, embedding) VALUES ('The drive was exceptionally comfortable. I felt secure throughout the journey and greatly appreciated the on-board entertainment, which allowed me to have some fun while the car was in motion.', '[0.0001858668401837349, -0.004922827705740929, 0.012813017703592777, -0.041855424642562866, …');
INSERT INTO reviews (text, embedding) VALUES ('A sudden hard brake at the intersection really caught me off guard and stressed me out. I was not prepared for it. Additionally, I noticed some trash left in the cabin from a previous rider.', '[0.00191772251855582, -0.004589076619595289, 0.004269456025213003, -0.0225954819470644, …');
-- Sanity check.
SELECT count(1) FROM reviews;
-- +-------+
-- | count |
-- |-------|
-- | 3     |
-- +-------+
We are now ready to search for similar documents. I have again shortened the embedding for “good ride”, because printing all 1536 dimensions would be excessive.
-- The embedding we use here is for "good ride".
SELECT substring(text, 0, 80) FROM reviews ORDER BY embedding <-> '[0.0010935445316135883, -0.01159335020929575, 0.014949149452149868, -0.029251709580421448, …';
-- +--------------------------------------------------------------------------+
-- | substring                                                                |
-- |--------------------------------------------------------------------------|
-- | I really enjoyed the trip! The ride was incredibly smooth, the pick-u... |
-- | The drive was exceptionally comfortable. I felt secure throughout the... |
-- | A sudden hard brake at the intersection really caught me off guard an... |
-- +--------------------------------------------------------------------------+
SELECT 3
Time: 0.024s
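Two notes on the query above. The <-> operator is pgvector's Euclidean (L2) distance; cosine distance is <=>. Because OpenAI embeddings are normalized to length 1, both produce the same ordering here. Also, in an application you would normally run this query from code rather than from pgcli. Below is a minimal sketch using psycopg2 (my choice for illustration; any PostgreSQL driver works), with connection details matching the Docker container started above:
import psycopg2

# Connect to the container started above (credentials from the docker run command).
conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")

query_embedding = get_embedding("good ride")
# pgvector expects a vector literal such as '[0.1,0.2,...]'.
vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

with conn.cursor() as cur:
    # <=> is cosine distance; <-> (used in the query above) is Euclidean distance.
    cur.execute(
        "SELECT text FROM reviews ORDER BY embedding <=> %s::vector LIMIT 5",
        (vector_literal,),
    )
    for (text,) in cur.fetchall():
        print(text[:80])
For larger tables, pgvector also supports approximate nearest-neighbor indexes such as IVFFlat, which keep these ORDER BY queries fast as the data grows.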
Completed! As you can observe, we have computed embeddings for multiple documents, stored them in the database, and conducted vector similarity searches. The potential applications are vast, ranging from corporate searches to features in medical record systems for identifying patients with similar symptoms. Furthermore, this method is not restricted to texts; similarity can also be calculated for other types of data such as sound, video, and images.
Enjoy!