Photo by Roman Kraft on Unsplash
Taking LLMs into production environments using Sentence Transformers and Qdrant
Large language models (LLMs) generated a global buzz in the machine learning community with recent releases of generative AI tools such as ChatGPT, Bard, and others. One of the core ideas behind these solutions is to compute a numerical representation of unstructured data (such as texts and images) and find similarities between these representations.
However, taking all of these concepts into a production environment has its own set of machine learning engineering challenges:
- How to generate these representations quickly?
- How to store them in a proper database?
- How to quickly compute similarities for production environments?
In this article, I introduce two open-source solutions that aim to solve these questions:
- Sentence Transformers [1]: an embedding generation technique based on textual information; and
- Qdrant: a vector database capable of storing embeddings and providing an easy interface to query them.
These tools are applied to NPR [2], a News Portal Recommendation dataset (openly available on Kaggle) which was built to help the academic community develop recommendation algorithms. By the end of the article, you’ll see how to:
- Generate news embeddings with Sentence Transformers
- Store embeddings with Qdrant
- Query embeddings to recommend news articles
All code for this article is available on GitHub.
1. Generating embeddings with Sentence Transformers
First of all, we need to find a way to convert input data into vectors, which we’ll call embeddings (if you want to dig deeper into the embedding concept, I recommend Boykis’ article What Are Embeddings? [3]).
So let’s take a look at what kind of data we can work on with the NPR dataset:
import pandas as pd
df = pd.read_parquet("articles.parquet")
df.tail()
Sample data from NPR (image generated by author)
We can see that NPR has some interesting textual data, such as each article’s title and body content. We can use them in an embedding-generation process, as shown in the following image:
Embedding generation process (image by author)
So once we define the textual features from our input data, we need to choose an embedding model to generate our numerical representation. Lucky for us, there are websites like HuggingFace where you can look for pre-trained models suitable for specific languages or tasks. In our example, we can use the neuralmind/bert-base-portuguese-cased model, which was trained on Brazilian Portuguese and evaluated on the following tasks:
- Named Entity Recognition
- Sentence Textual Similarity
- Recognizing Textual Entailment
Code-wise, this is how we translate the embedding-generation process:
from sentence_transformers import SentenceTransformer
model_name = "neuralmind/bert-base-portuguese-cased"
encoder = SentenceTransformer(model_name_or_path=model_name)

title = """
Paraguaios vão às urnas neste domingo (30) para escolher novo presidente
"""
sentence = title
sentence_embedding = encoder.encode(sentence)
print(sentence_embedding)
# output: np.array([-0.2875876, 0.0356041, 0.31462672, 0.06252239, …])
So, given an example input, we can concatenate the chosen text fields (only the title here; other fields such as tags could be appended the same way) into a single text and pass it to an encoder to generate the text embedding.
We can apply the same process for all other articles in the NPR dataset:
def generate_item_sentence(item: pd.Series, text_columns=("title",)) -> str:
    """Join the selected text columns of one article into a single string."""
    return " ".join(item[column] for column in text_columns)

# Build one sentence per article, then encode each sentence into an embedding
df["sentence"] = df.apply(generate_item_sentence, axis=1)
df["sentence_embedding"] = df["sentence"].apply(encoder.encode)
Note: bear in mind that this process might take a while, depending on your machine’s processing power.
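Encoding one sentence at a time is the simplest option, but encoders are usually faster when they receive several sentences per call. Below is a minimal sketch of a batching helper. The names `iter_batches` and `encode_in_batches` are hypothetical (they are not part of Sentence Transformers), and `encode_fn` is assumed to be any callable that takes a list of strings and returns one embedding per string.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    sentences: Sequence[str],
    encode_fn: Callable[[List[str]], Sequence],
    batch_size: int = 64,
) -> List:
    """Encode `sentences` batch by batch and return one embedding per sentence.

    `encode_fn` is an assumption: a callable that maps a list of strings
    to a sequence of embeddings in the same order.
    """
    embeddings: List = []
    for batch in iter_batches(sentences, batch_size):
        embeddings.extend(encode_fn(list(batch)))
    return embeddings
```

With the encoder from this article, a call such as `encode_in_batches(df["sentence"].tolist(), encoder.encode)` should produce one embedding per article. Note that `encoder.encode` also accepts a list of sentences and a `batch_size` argument directly, so the helper is mainly useful when you want to control or log the batches yourself.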
Once we have the embeddings for all news articles, let’s define a strategy to store them.
2. Storing embeddings
Since generating embeddings might be an expensive process, we can use a vector database to store these embeddings and execute queries based on diverse strategies.
Several vector databases can handle this task, but I’ll use Qdrant for this article, which is an open-source solution with APIs available for popular programming languages like Python, Go, and TypeScript. For a better comparison between these vector databases, check this article [4].
Setting up Qdrant
To deal with all Qdrant operations, we need to create a client object that points to a vector database. Qdrant lets you create a free-tier service to test a remote connection to a database but, for the sake of simplicity, I’ll create and persist the database locally:
from qdrant_client import QdrantClient
client = QdrantClient(path="./qdrant_data")
Once this connection is established, we can create a collection in the database that will store the news article embeddings:
from qdrant_client import models

client.create_collection(
    collection_name="news-articles",
    vectors_config=models.VectorParams(
        # vector size must match the encoder's output dimension
        size=encoder.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE,
    ),
)
print(client.get_collections())
# output: CollectionsResponse(collections=[CollectionDescription(name='news-articles')])
Notice that vector configuration parameters are used to create the collection. These parameters tell Qdrant some properties of the vectors, like their size and the distance metric to be used when comparing vectors (I’ll use cosine similarity, but you can also use other metrics like the dot product or Euclidean distance).
Generating Vector Points
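To make the difference between these metrics concrete, here is a small pure-Python sketch that computes all three for toy 2-dimensional vectors. The function names are illustrative only; Qdrant computes these metrics internally, so this is not how the database is queried.

```python
import math


def dot(a, b):
    """Dot (inner) product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector length.

    Assumes neither vector is all zeros (otherwise this divides by zero).
    """
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b))   # 1.0 (same direction)
print(cosine_similarity(a, c))   # 0.0 (orthogonal)
print(dot(a, b))                 # 2.0 (affected by length)
print(euclidean_distance(a, b))  # 1.0
```

Cosine similarity treats `a` and `b` as identical because they point in the same direction, while the dot product and Euclidean distance are both sensitive to their different lengths.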
Prior to finally populating the database, we need to create proper objects to be uploaded. In Qdrant, vectors can be stored using a PointStruct class, which you can use to define the following properties:
- id: the vector’s ID (in the NPR case, the newsId)
- vector: a 1-dimensional array representing the vector (generated by the embedding model)
- payload: a dictionary containing any other relevant metadata that can later be used to query vectors in a collection (in the NPR case, the article’s title, body, and tags)
from qdrant_client.http.models import PointStruct

# Every column except the ID, the sentence, and the embedding becomes payload metadata
metadata_columns = df.drop(["newsId", "sentence", "sentence_embedding"], axis=1).columns

def create_vector_point(item: pd.Series) -> PointStruct:
    """Turn vectors into PointStruct"""
    return PointStruct(
        id=item["newsId"],
        vector=item["sentence_embedding"].tolist(),
        payload={
            field: item[field]
            for field in metadata_columns
            # skip missing values
            if str(item[field]) not in ["None", "nan"]
        },
    )

points = df.apply(create_vector_point, axis=1).tolist()
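The payload above drops empty fields by comparing their string form with 'None' and 'nan'. A slightly more explicit variant, sketched below, checks for `None` and floating-point NaN directly. The helper name `clean_payload` is hypothetical, and it assumes each article has first been converted to a plain dictionary (for example with `item.to_dict()`).

```python
import math


def clean_payload(record: dict, metadata_columns) -> dict:
    """Keep only the metadata fields that carry a real value."""
    payload = {}
    for field in metadata_columns:
        value = record.get(field)
        if value is None:
            continue  # missing field or explicit None
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN is how pandas usually marks missing numbers
        payload[field] = value
    return payload
```

Unlike the string comparison, this keeps a legitimate text value such as the literal title "nan", at the cost of a few extra lines.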
Uploading Vectors
Finally, after all items are turned into point structures, we can upload them in chunks to the database:
CHUNK_SIZE = 500

# Upload the points in fixed-size slices so that each request stays small
for start in range(0, len(points), CHUNK_SIZE):
    client.upsert(
        collection_name="news-articles",
        wait=True,
        points=points[start:start + CHUNK_SIZE],
    )
3. Querying Vectors
Now that collections are finally populated with vectors, we can start querying the database. There are many ways we can input information to query the database, but I think there are 2 very useful inputs we can use:
- An input text
- An input vector ID
3.1 Querying vectors with an input text
Let’s say we built this vector database to be used in a search engine. In this case, we expect the user’s input to be an input text and we have to return the most relevant items.
Since all operations in a vector database are done with... VECTORS, we first need to transform the user’s input text into a vector so we can find similar items based on that input. Recall that we used Sentence Transformers to encode textual data into embeddings, so we can use the very same encoder to generate a numerical representation for the user’s input text.
Since the NPR contains news articles, let’s say the user typed “Donald Trump” to learn about US elections:
query_text = "Donald Trump"
query_vector = encoder.encode(query_text).tolist()
print(query_vector)
# output: [-0.048, -0.120, 0.695, …]
Once the input query vector is computed, we can search for the closest vectors in the collection and define what sort of output we want from those vectors, like their newsId, title, and topics:
from qdrant_client.models import Filter  # only needed when query_filter is set

client.search(
    collection_name="news-articles",
    query_vector=query_vector,
    with_payload=["newsId", "title", "topics"],
    query_filter=None,
)
Note: by default, Qdrant uses Approximate Nearest Neighbors to scan for embeddings quickly, but you can also do a full scan and bring the exact nearest neighbors. Just bear in mind this is a much more expensive operation.
After running this operation, here are the generated output titles (translated into English for better comprehension):
Input Sentence: Donald Trump
Output 1: Paraguayans go to the polls this Sunday (30) to choose a new president
Output 2: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
Output 3: Writer accuses Trump of sexually abusing her in the 1990s
Output 4: Mike Pence, former vice president of Donald Trump, gives testimony in court that could complicate the former president
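To illustrate what a "full scan" means, here is a minimal pure-Python sketch of exact nearest-neighbor search: it scores the query against every stored vector and keeps the best k. The function names and toy data are hypothetical, and this is not Qdrant’s implementation. In the Python client, an exact search is requested, as far as I know, by passing `search_params=models.SearchParams(exact=True)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


def exact_top_k(query, vectors, k=3):
    """Return the k (id, score) pairs most similar to `query`.

    `vectors` maps an ID to its embedding. Every vector is scored,
    so the cost grows linearly with the collection size.
    """
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy collection with 2-dimensional embeddings
collection = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
top = exact_top_k([1.0, 0.0], collection, k=2)
print([vid for vid, _ in top])  # ['a', 'b']
```

Approximate indexes avoid this linear scan by visiting only a small part of the collection, which is why they are faster but may occasionally miss a true nearest neighbor.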
It seems that besides bringing news related to Trump himself, the embedding model also managed to represent topics related to presidential elections. Notice that in the first output, there is no direct reference to the input term “Donald Trump” other than the presidential election.
Also, I left out the query_filter parameter. This is a very useful tool if you want to specify that the output must satisfy some given condition. For instance, in a news portal, it is often important to return only the most recent articles (say, those published in the past 7 days). Therefore, you could query for news articles that satisfy a minimum publication timestamp.
Note: in the news recommendation context, there are several important aspects to consider, like fairness and diversity. This is an open topic of discussion but, should you be interested in this area, take a look at the articles from the NORMalize Workshop.
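As a rough sketch of such a condition, the helper below builds a "published in the last N days" filter as a plain dictionary that mirrors the JSON shape of Qdrant filters (a `must` list with a `range` condition). The field name `publishedAt` is hypothetical, and the sketch assumes the payload stores the publication time as a Unix timestamp. In the Python client, the same structure is expressed with `models.Filter`, `models.FieldCondition`, and `models.Range`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def recent_articles_filter(
    days: int = 7,
    now: Optional[datetime] = None,
    field: str = "publishedAt",  # hypothetical payload field name
) -> dict:
    """Build a filter that keeps articles published in the last `days` days.

    Assumes the payload field holds a Unix timestamp (seconds).
    """
    reference = now or datetime.now(timezone.utc)
    min_timestamp = (reference - timedelta(days=days)).timestamp()
    return {"must": [{"key": field, "range": {"gte": min_timestamp}}]}


print(recent_articles_filter(days=7))
```

Passing `now` explicitly makes the helper easy to test and keeps the cutoff stable across several queries issued in the same request.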
3.2 Querying vectors with an input vector ID
Lastly, we can ask the vector database to “recommend” items that are closer to some desired vector IDs but far from undesired vector IDs. The desired and undesired IDs are called positive and negative examples, respectively, and they are thought of as seeds for the recommendation.
For instance, let’s say we have the following positive ID:
seed_id = '8bc22460-532c-449b-ad71-28dd86790ca2'
# title (translated): 'Learn why Joe Biden launched his bid for re-election this Tuesday'
We can then ask for items similar to this example:
client.recommend(
    collection_name="news-articles",
    positive=[seed_id],
    negative=None,
    with_payload=["newsId", "title", "topics"],
)
After running this operation, here are the translated output titles:
Input item: Learn why Joe Biden launched his bid for re-election this Tuesday
Output 1: Biden announces he will run for re-election
Output 2: USA: the 4 reasons that led Biden to run for re-election
Output 3: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
Output 4: Biden’s advisor’s gaffe that raised doubts about a possible second government after the election
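One way to picture how positive and negative examples can act as seeds is to combine them into a single query vector. The sketch below follows my understanding of the "average vector" strategy described in Qdrant’s documentation: average the positive vectors, then push the result away from the average of the negative vectors. The function names are hypothetical, and the actual implementation inside Qdrant may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recommendation_query(positive, negative=None):
    """Combine seed vectors into one query vector.

    With no negatives, the query is the mean of the positives.
    Otherwise it is avg_positive + (avg_positive - avg_negative).
    """
    if not positive:
        raise ValueError("at least one positive example is required")
    avg_positive = mean_vector(positive)
    if not negative:
        return avg_positive
    avg_negative = mean_vector(negative)
    return [2 * p - n for p, n in zip(avg_positive, avg_negative)]


print(recommendation_query([[1.0, 0.0], [0.0, 1.0]]))                # [0.5, 0.5]
print(recommendation_query([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]))  # [1.0, 0.0]
```

In the second call, the negative example removes the component it shares with the positives, so the resulting query points only along the first dimension.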
Conclusion
This article demonstrates how to combine LLMs and vector databases to serve recommendations. In particular, Sentence Transformers were used to generate numerical representations (embeddings) from textual news articles in the NPR dataset. Once these embeddings are computed, they can populate a vector database such as Qdrant which facilitates querying vectors based on several strategies.
A whole lot of improvements can be made after the examples in this article, such as:
- testing other embedding models
- testing other distance metrics
- testing other vector databases
- using compiled programming languages like Go for better performance
- creating an API to serve recommendations
In other words, many ideas may come up to improve the machine learning engineering of recommendations with LLMs. So, if you feel like sharing your ideas about these improvements, don’t hesitate to send me a message here 🙂
About Me
I am a senior data scientist at Globo, a Brazilian media-tech company. Working on the company’s recommendation team, I am surrounded by amazing and talented people who put a lot of effort into delivering personalized content to millions of users through digital products like G1, GE, Globoplay, and many others. This article wouldn’t be possible without their indispensable knowledge.
References
[1] N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), Association for Computational Linguistics.
[2] J. Pinho, J. Silva and L. Figueiredo, NPR: a News Portal Recommendations dataset (2023), ACM Conference on Recommender Systems.
[3] V. Boykis, What are embeddings?, personal blog.
[4] M. Ali, The Top 5 Vector Databases (2023), DataCamp blog.
Large Language Models and Vector Databases for News Recommendations was originally published in Towards Data Science on Medium.