How to Measure the Success of Your RAG-based LLM System

Use the machines to grade the machines

Including a novel method for judging answers with a qualitative score and detailed explanation.

Image generated by Stable Diffusion XL

Retrieval Augmented Generation, or RAG, is easily the most common use case for Large Language Models (LLMs) that has emerged this year. While text summarization and generation are often the focus of individual users, businesses have realized that they need the ability to use their own data to take advantage of this technology. Reflecting on how I still use LLMs, text generation is high on the list. I want to ask Bard questions and have it search the web; I want Claude to rewrite emails or blog posts to punch up my content. But the most exciting use I have encountered is piping my own data into the LLM. I want to search my notes, emails, calendar, and Slack messages and have Llama function as another me (but a me that can remember details of things that happened before today).

This isn’t a post about how to build a RAG (there are many of those out there already…and I am working on that post for another day). What we will explore today is how to evaluate RAG systems.

How do we get data out of a RAG?

Let’s level set before we get into the weeds. When we talk about a RAG, there are two parts of the system we mean.

The knowledge source

A knowledge source can be a vector database, a search engine, a few text files loaded into memory, SQL data, or anything where our data is stored.

The LLM

Once we have our data, we pipe that into the LLM. This is done through the context window. So, ultimately, we search, get some text, stuff that found text into a prompt, and pass our question to the LLM. The model then takes everything from that context window and provides an answer.
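To make that flow concrete, here is a minimal sketch of the search, stuff, and ask steps. Everything in it is an assumption for illustration: the in-memory `KNOWLEDGE_SOURCE`, the word-overlap `search` function standing in for a real vector database or search engine, and the prompt wording. The call to the LLM itself is left out.

```python
import re
from typing import List

# Hypothetical knowledge source: a few text snippets held in memory.
KNOWLEDGE_SOURCE = [
    "The minimum balance for a basic checking account is $100.",
    "You can make a loan payment online, by phone, or at any branch.",
    "Branches are open from 9am to 5pm on weekdays.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by the number of words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Stuff the retrieved text into the prompt ahead of the question."""
    context_block = "\n".join(context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


question = "How do I make a payment on my loan?"
context = search(question, KNOWLEDGE_SOURCE)
prompt = build_prompt(question, context)  # this string is what goes to the LLM
print(prompt)
```

The two functions map onto the two parts of the system: `search` is the knowledge source, and whatever consumes `prompt` is the LLM. That split is what lets us evaluate each piece separately.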

Why does this matter?

When we talk about evaluating a RAG system, we have to know what we will evaluate before defining how to evaluate it. We can see now that two pieces need to be examined. The initial data retrieval is the most critical piece here. LLMs are generally good at summarizing and answering questions from data supplied in the context. What might be lacking is the search functionality itself.

These knowledge sources have some built-in limitations. For instance, when using vector databases to store large text files, you have to ‘chunk’ your data going in. What does this mean? Let’s say you have a 100-page document, but the database can only handle saving 1 page at a time. Once you load your documents up and go to search, the database can only examine a single page at a time (ok, this is a little reductionist, but bear with me; it’s close enough for government work). When we find data that matches our search, there is a genuine possibility that the entire answer to our question doesn’t live on that single page. Too bad! We only get a single page back! This is a good illustration of why there is a need to examine this part of the system before worrying about the output from the LLM.
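The chunking problem is easy to reproduce in a few lines. This is a toy sketch, assuming a fixed-size word splitter (`chunk_text` is a made-up helper, and real systems usually split on tokens or sentences with some overlap), but it shows how an answer can end up spread across chunks.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


document = (
    "The minimum balance is $100 for checking accounts. "
    "However, students are exempt from this minimum balance requirement."
)

chunks = chunk_text(document, chunk_size=8)
for number, chunk in enumerate(chunks):
    print(number, chunk)

# A search that returns only the first chunk finds the $100 rule
# but never sees the student exemption that lives in the next chunk.
best_match = chunks[0]
print("exempt" in best_match)
```

If a student asks about the minimum balance and only `chunks[0]` comes back, the LLM will answer confidently and incompletely, and no amount of prompt tuning will fix it.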

What do we need to evaluate?

Evaluating the initial search

This isn’t going to be the answer that most technologists want to hear. Some level of human evaluation will be required to assess the results out of your knowledge source.
Why? Well, if a business is using its data and it is private, it will be hard to automate tests to verify the search results are wholly accurate. Don’t fret, it doesn’t have to be 100% manual; we can automate parts of it. Let’s dig a little deeper.

There are two implementations I see for this initial validation and evaluation.

The first option is to have a set of common and expected questions for the data set and have a human QA team verify the search results. For example, if your team is tasked with building a customer service Q&A bot for a bank, some common questions might be, ‘What is the minimum amount I am required to keep in my account?’, ‘How do I make a payment on my loan?’, ‘What time is my branch open?’. It’s ideal if your QA team can supply both the questions and the expected answers in something like a CSV file that can be read programmatically; then, we can use some of the automated tests that we will cover a bit further down in this post.
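As a rough sketch of what reading that file programmatically could look like: the column names `question` and `expected_answer`, the inline CSV text, and the `fake_search` stub are all assumptions for illustration. In practice you would read the QA team's real file and call your actual knowledge source.

```python
import csv
import io
from typing import Callable, Dict, List

# Stand-in for a CSV file supplied by the QA team (hypothetical columns).
QA_CSV = """question,expected_answer
What is the minimum amount I am required to keep in my account?,$100
What time is my branch open?,9am to 5pm
"""


def load_qa_pairs(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of question/expected_answer rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def retrieval_hit_rate(
    qa_pairs: List[Dict[str, str]],
    search_fn: Callable[[str], List[str]],
) -> float:
    """Fraction of questions whose expected answer appears in the search results."""
    hits = 0
    for pair in qa_pairs:
        retrieved = " ".join(search_fn(pair["question"])).lower()
        if pair["expected_answer"].lower() in retrieved:
            hits += 1
    return hits / len(qa_pairs)


def fake_search(question: str) -> List[str]:
    """Stub search: answers the balance question and misses everything else."""
    if "minimum" in question:
        return ["The minimum balance for a basic checking account is $100."]
    return ["Our mobile app is available on iOS and Android."]


qa_pairs = load_qa_pairs(QA_CSV)
hit_rate = retrieval_hit_rate(qa_pairs, fake_search)
print(f"Retrieval hit rate: {hit_rate:.0%}")
```

An exact substring match is a blunt instrument, so treat the hit rate as a way to decide where the human reviewers should look first, not as a final verdict.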

If the time or resources are not available for this, the second method has a QA team search and review in real time. This is an option for early POCs and prototypes, but beware, this won’t scale for actual production workloads.

Evaluating the LLM responses

Once we have a level of comfort that the data from our knowledge source is reliable, we must ensure that the final answers are accurate. RAG systems are great for reducing the possibility of hallucinations, and this can be extended by tweaking the underlying prompt. However, the LLM may still leave out information, misunderstand the data fed to it, or try to bring in a priori knowledge from its training.

Evaluating this step is similar to assessing the search before it. If QA teams can provide questions and expected answers, we can attempt to gauge the answers programmatically.

Let’s look at some of those options now.

Evaluation frameworks

It’s essential to remember that LLMs and RAGs are very early in their maturity cycle. It has only been a year since ChatGPT debuted, and every day brings more advancements, models, frameworks, and research in this field. That being said, a handful of metrics are becoming the standard way to measure the performance of these systems.

We won’t cover ways to evaluate the base LLM. There are benchmarks like ARC, MMLU, HellaSwag, etc., that all target the underlying language model. There isn’t a need to run these measures yourself; you can check sites such as https://llm-leaderboard.streamlit.app/ and https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard to see how different models fare. We are only interested in measuring the results we get out of the RAG systems.

That leads us to look at algorithms like ROUGE, BLEU, BLEURT, and METEOR. Let’s take a closer look at each. I’ll also include a small code snippet showing how to call each metric and what the output score looks like. To get started, I import the evaluate framework and define the reference and answer I want to score.

!pip install evaluate --quiet
!pip install rouge_score --quiet
!pip install importlib-metadata --quiet
!pip install datasets==2.10.1 --quiet
!pip install git+https://github.com/google-research/bleurt.git --quiet
!pip install sacrebleu --quiet
!pip install --no-cache-dir bert_score==0.3.9 --quiet
!pip install sacremoses --quiet
!pip install jiwer==2.5.1 --quiet
!pip install Cython

import evaluate

# The answer from the RAG system (prediction) and the source text (reference) to score:
predictions = ["In Dungeons & Dragons, the metallic dragons come in brass, bronze, copper, gold, and silver varieties. Each has scales in hues matching their name – brass dragons have brassy scales, bronze have bronze scales, etc. The metallic dragons are generally more benign than the chromatic dragons in D&D lore."]

references = ["""The five basic chromatic dragons (red, blue, green, black, and white) and metallic dragons (copper, brass, silver, gold, and bronze) appeared in the fifth edition Monster Manual (2014) in wyrmling, young, adult, and ancient. Gem dragons and other new-to-fifth-edition dragons appeared in Fizban's Treasury of Dragons (2021)"""]

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is a set of metrics for evaluating automatic summarization and machine-translation output. It is based on the count of overlapping n-grams between the system output and the reference summaries.
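To see what "overlapping n-grams" means in practice, here is a hand-rolled sketch of the unigram case (ROUGE-1). It assumes plain whitespace tokenization and lowercasing, which is simpler than what the `rouge_score` package does, so its numbers will not match the library exactly. The function name `rouge1` is my own.

```python
from collections import Counter
from typing import Dict


def rouge1(prediction: str, reference: str) -> Dict[str, float]:
    """Unigram overlap precision, recall, and F1 between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

Five of the six words in each sentence overlap, so precision, recall, and F1 all come out to 5/6. Swap "sat" for "is" and the score is perfect, which already hints at why word overlap alone can't tell you whether an answer is right.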

#predictions (list): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
#references (list or list[list]): list of references for each prediction, or a list of several references per prediction. Each reference should be a string with tokens separated by spaces.
#rouge_types (list): A list of rouge types to calculate. Defaults to ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'].
#Valid rouge types:
##"rouge1": unigram (1-gram) based scoring
##"rouge2": bigram (2-gram) based scoring
##"rougeL": longest common subsequence based scoring
##"rougeLsum": splits text using "\n"
#use_aggregator (boolean): If True, returns aggregates. Defaults to True.
#use_stemmer (boolean): If True, uses Porter stemmer to strip word suffixes. Defaults to False.

rouge = evaluate.load('rouge')

results = rouge.compute(predictions=predictions, references=references, use_aggregator=False)
print(results)

# Output:
{
'rouge1': [0.3636363636363636],
'rouge2': [0.06185567010309278],
'rougeL': [0.22222222222222224],
'rougeLsum': [0.22222222222222224]
}

BLEU (Bilingual Evaluation Understudy)

BLEU is a metric for the automatic evaluation of machine-translation output. It is based on the n-gram precision of the candidate translation against a set of reference translations.
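The n-gram precision at the heart of BLEU is "clipped": a candidate word only gets credit as many times as it appears in the reference. Here is a small sketch of just that piece, assuming whitespace tokenization and a single reference. It leaves out the brevity penalty and the geometric mean across n-gram orders that the full metric applies, and `modified_precision` is a name I picked for illustration.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate.split(), n)
    ref_counts = ngrams(reference.split(), n)

    # Each candidate n-gram is credited at most as often as the reference has it.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


reference = "the cat is on the mat"
unigram_precision = modified_precision("the the the the", reference, n=1)
bigram_precision = modified_precision("the the the the", reference, n=2)
print(unigram_precision, bigram_precision)
```

Without clipping, the degenerate candidate "the the the the" would score a perfect unigram precision. With clipping it gets 2 out of 4, and its bigram precision drops to zero.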

#predictions (list of strs): Translations to score.
#references (list of lists of strs): references for each translation.
#max_order (int): Maximum n-gram order to use when computing BLEU score. Defaults to 4.
#smooth (boolean): Whether or not to apply Lin et al. 2004 smoothing. Defaults to False.

#bleu (float): bleu score
#precisions (list of floats): geometric mean of n-gram precisions,
#brevity_penalty (float): brevity penalty,
#length_ratio (float): ratio of lengths,
#translation_length (int): translation_length,
#reference_length (int): reference_length

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references, max_order=4)
print(results)

# Output:
{
'bleu': 0.07342349837092484,
'precisions': [0.4262295081967213, 0.11666666666666667, 0.03389830508474576, 0.017241379310344827],
'brevity_penalty': 1.0,
'length_ratio': 20.333333333333332,
'translation_length': 61,
'reference_length': 3
}

BLEURT (BLEU Regression with Transformers)

BLEURT is an evaluation metric for Natural Language Generation (NLG). It is based on BERT, which allows BLEURT to learn the statistical relationships between words and phrases and to identify patterns in NLG output.

BLEURT has been shown to outperform other NLG evaluation metrics, such as BLEU and ROUGE, on a variety of tasks, including machine translation, summarization, and question-answering.

#Output is always a number between 0 and (approximately) 1.
#This value indicates how similar the generated text is to the reference texts, with values closer to 1 representing more similar texts.

bleurt = evaluate.load("bleurt", module_type="metric")
results = bleurt.compute(predictions=predictions, references=references)

print(results)

# Output:
{
'scores': [0.6028875708580017]
}

METEOR (Metric for Evaluation of Translation with Explicit ORdering)

METEOR is an automatic evaluation metric for machine-translation output. It has features not found in other metrics, such as stemming and synonymy matching, alongside the standard exact word matching. The metric was designed to fix some of the problems encountered in the more popular BLEU metric and to produce a good correlation with human judgment at the sentence or segment level.

#predictions: a list of predictions to score. Each prediction should be a string with tokens separated by spaces.
#references: a list of references (in the case of one reference per prediction), or a list of lists of references (in the case of multiple references per prediction). Each reference should be a string with tokens separated by spaces.
#alpha: Parameter for controlling relative weights of precision and recall. The default value is 0.9.
#beta: Parameter for controlling shape of penalty as a function of fragmentation. The default value is 3.
#gamma: The relative weight assigned to fragmentation penalty. The default is 0.5.

#Outputs a score between 0 and 1; .317 is an acceptable score
meteor = evaluate.load('meteor')
results = meteor.compute(predictions=predictions, references=references)

print(results)

# Output:
{
'meteor': 0.19316493313521543
}

I was promised something new!

While I have your attention, though, I want to introduce a new idea. While those four algorithms will give you a quantifiable score that allows your QA team to quickly determine whether an answer/summary is similar, there are some shortcomings.

First, the reference and the result may be similar enough to answer the user’s question, yet the result can still receive a poor score. It is essential to run a known set of questions and answers to establish a good baseline and compare future answers against this baseline.

Second, it doesn’t tell you why the score suffers. Is it because there is a penalty for repeating words? Is it because some words are missing? Did the summary altogether leave out an essential piece of the answer? There isn’t a way to tell.

Lastly, just because a response receives a low score doesn’t necessarily mean a human would view the answer as insufficient or incorrect. The baseline can be helpful here to establish what acceptable scores may look like, but it’s important to have some skepticism when using these for judging RAG answers.
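One way to put the baseline idea to work is to flag only the answers that score well below what your known-good set typically gets. The sketch below is an assumption about how you might do that, not an established recipe: the two-standard-deviation cutoff, the `flag_regressions` name, and the example scores are all made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def flag_regressions(
    baseline_scores: List[float],
    new_scores: List[float],
    num_stdevs: float = 2.0,
) -> List[Tuple[int, float]]:
    """Return (index, score) for new scores far below the baseline average."""
    threshold = mean(baseline_scores) - num_stdevs * stdev(baseline_scores)
    return [
        (index, score)
        for index, score in enumerate(new_scores)
        if score < threshold
    ]


# Hypothetical metric scores from a known set of good question/answer pairs.
baseline = [0.30, 0.32, 0.28, 0.30]

# Hypothetical scores for fresh answers from the RAG system.
flagged = flag_regressions(baseline, [0.31, 0.20, 0.29])
print(flagged)
```

Only the second answer falls below the cutoff here, so that is the one a human should read. The absolute value of the metric matters much less than how far an answer drifts from what "normal" looks like for your data.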

LLMs grading LLMs

BLEURT has introduced us to the idea that we can use LLMs in some way to gauge answers from a RAG system. What if we leverage this directly ourselves? We instruct an LLM to give a qualitative score for our answer and provide both bulleted reasons and a narrative explanation of the score it assigned. This gives us the best of both worlds. We can extract a numerical score to report to users and QA in a report; we can also supply more detail about why an answer scored badly.

Here is a sample prompt template that can be used for Claude v2. We pass instructions on how we want the model to score our answer, pass in the reference data, and pass in the answer we received from our RAG system.

prompt_data = """
Human:
You are to rate a summarization on the scale of 0-10, with 0 being completely incorrect and 10 being a perfect summarization of the given text.
Explain why you give your score.
Give a bullet point list of major differences between the reference and the summary.
I will supply a reference text by denoting REF: and the summarization to compare against with SUMMARY:.

REF:
The five basic chromatic dragons (red, blue, green, black, and white)
and metallic dragons (copper, brass, silver, gold, and bronze)
appeared in the fifth edition Monster Manual (2014) in wyrmling, young, adult, and ancient.
Gem dragons and other new-to-fifth-edition dragons appeared in Fizban's Treasury of Dragons (2021)

SUMMARY:
In Dungeons & Dragons, the metallic dragons come in brass, bronze, copper, gold, and silver varieties.
Each has scales in hues matching their name – brass dragons have brassy scales, bronze have bronze scales, etc.
The metallic dragons are generally more benign than the chromatic dragons in D&D lore.

A:"""

Score: 5/10

Explanation:

The summary captures some accurate high-level information about metallic dragons in D&D, such as listing the main types and noting their more benign nature compared to chromatic dragons. However, it leaves out a lot of key details from the reference text:

– Does not mention the chromatic dragons (red, blue, green, black, white) that were described.
– Does not mention the different age categories (wyrmling, young, adult, ancient).
– Does not mention the Monster Manual or Fizban’s Treasury of Dragons sourcebooks.
– Only focuses on metallic dragons, while the reference text had information about both metallic and chromatic dragons.

Major differences:

– No mention of chromatic dragons
– No mention of dragon age categories
– No mention of Monster Manual or Fizban’s Treasury of Dragons
– Only discusses metallic dragons, not full scope of reference text
– Does not convey the introduction timeline of dragons across 5th edition books

There we are. If teams can provide expected answers, we can feed the RAG answers back into an LLM to have them graded. One benefit is that we don’t have to rely on the LLM’s a priori knowledge, since we are still piping in the relevant data. Another is that we can use a different LLM than the one used in the RAG system, meaning we can even ask multiple models to grade our output to ensure we have a balanced assessment.
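If you do ask several models to grade the same answer, you need a rule for combining their scores. Here is one simple possibility, offered as an assumption and nothing more: average the scores and send the answer to a human when the judges disagree by more than a few points. The judge names, the scores, and the `max_spread` cutoff are all hypothetical.

```python
from statistics import mean
from typing import Dict, Union


def aggregate_judges(
    scores: Dict[str, float],
    max_spread: float = 3.0,
) -> Dict[str, Union[float, bool]]:
    """Average the judges' scores and flag large disagreements for review."""
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "needs_human_review": spread > max_spread,
    }


# Hypothetical 0-10 scores from three different judging models.
result = aggregate_judges({"judge_a": 5, "judge_b": 6, "judge_c": 9})
print(result)
```

Here the judges are four points apart, so the answer gets flagged even though the average looks fine. Disagreement between judges is often a better signal that something odd is going on than any single score.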

This method also gives us an excellent explanation of what was wrong. In this example, I had a question about what kinds of dragons exist in the D&D universe. The judging LLM correctly identified that the answer didn’t mention chromatic dragons. However, it also dinged the answer for not including the dragon age categories, the D&D Monster Manual, or the expansion sourcebook. Those omissions weren’t important to the question I asked, but the explanation allows QA teams to decide that for themselves.

Where do we go now?

RAG-based systems and the frameworks used to create them are advancing every day. New ways and mechanisms for grading them will continue to advance as well. There are even tools from giants like LangChain that can aid in this task, such as LangSmith.

While we wait for more advancements, using a combination of some manual validation data and either the HuggingFace metrics library or LLMs themselves gives us a great way to begin trusting these systems.

Remember, once you have confidence and are ready to deploy your new RAG into production, the evaluation of the answers doesn’t stop! As part of routine monitoring and auditing efforts, you must continue storing questions and answers and plan for a human-in-the-loop effort to grade and flag answers supplied to end users. That, however, is a topic for another day.

How to Measure the Success of Your RAG-based LLM System was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
