We’ll take an in-depth look at the challenges of detecting AI-generated text, and the effectiveness of the techniques used in practice.
Photo by Houcine Ncib on Unsplash
Co-authored with Naresh Singh.
Table of contents
Introduction
Building an intuition for text source detection
What is the perplexity of a language model?
Computing the perplexity of a language model's prediction
Detecting AI-generated text
Misinformation
What's next?
Conclusion
Introduction
AI-assisted technologies for writing articles or posts are everywhere now! ChatGPT has unlocked numerous applications of language-based AI, and the use of AI for any kind of content generation has gone through the roof.
In school assignments such as creative writing, however, students are expected to create their own content. Due to the popularity and effectiveness of AI for such tasks, they might be tempted to use it. In such cases, it is important for teachers to have access to reliable and trustworthy tools to detect AI-generated content.
This article aims to provide both an intuition for and a technical specification of building such a tool. It is intended for readers who want to understand intuitively how AI detection works, as well as for the technical audience that wants to build such a tool.
Let’s jump straight in!
Building an intuition for text source detection
At a high level, we’re trying to answer the question, “How likely is it that an AI language model such as GPT-3 has generated all or part of this text?”
If you step back, you will realize this is a typical daily scenario. For example, how likely would it be for your mother to say the following sentence to you?
Dear, please go to bed before 8:00 pm.
Dear, please go to bed after 11:00 pm.
We would guess that the former is much more likely than the latter because you have built an understanding of the world around you and have a sense of which events are more likely to occur.
This is exactly how a language model works. Language models learn something about the world around them, specifically language. They learn to predict the next token or word given an incomplete sentence.
In the example above, if you’re told that your mother is saying something, and what has been said so far is “Dear, please go to bed”, then the most likely continuation of the sentence is going to be “before 8:00 pm”, and not “after 11:00 pm”. In technical terms, we say that you’d be more perplexed to hear the 2nd sentence as opposed to the 1st one.
Let’s dig deeper into what perplexity means in the context of a language model.
What is the perplexity of a language model?
According to dictionary.com, perplexity is defined as
the state of being perplexed; confusion; uncertainty.
In the real world, if you encounter a situation that you don’t expect, you will be more perplexed than if you encounter a situation that you expect. For example, when driving on a road, if you see a traffic light, then you’re less likely to be perplexed as opposed to if you see a goat crossing the street.
Similarly, for a language model that is trying to predict the next word in a sentence, we say that the model perplexes us if it completes the sentence with a word we don’t expect, as opposed to a word we do expect. Here are some examples.
Sentences with a low perplexity would look like the following
It’s a sunny day outside.
I’m sorry I missed the flight and was unable to reach the national park in time.
Sentences with a high perplexity would look like the following
It’s a bread day outside.
I’m convenient missed light outside and could not reach national park.
Next, let’s look at how we can compute the perplexity of a prediction made by the language model.
Computing the perplexity of a language model’s prediction
The perplexity of a language model is related to the probability that the model can unsurprisingly predict the next token (word) of a sentence.
Suppose we train a language model with a vocabulary of 6600 tokens and run a single prediction step to have the model predict the next token in a sentence. Let’s assume that the probability the model assigns to this (ground-truth) token is 5/6600 (i.e., this token was not very likely). Its perplexity is the inverse of that probability, 6600/5 = 1320, which suggests that we are highly perplexed by this prediction. If the probability assigned to this token were 6000/6600 instead, then the perplexity would be 6600/6000 = 1.1, which suggests that we are only slightly perplexed by this prediction.
Hence, the perplexity of our model on a more likely prediction is lower than the perplexity of our model on a less likely prediction.
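As a quick sanity check, here is the arithmetic above in a couple of lines of Python (the probabilities 5/6600 and 6000/6600 are just the examples from the previous paragraph):

# Perplexity of a single prediction is the inverse of the probability
# assigned to the ground-truth token.
unlikely_token_prob = 5 / 6600
likely_token_prob = 6000 / 6600
print(1 / unlikely_token_prob)  # 1320.0 -> highly perplexed
print(1 / likely_token_prob)    # 1.1    -> only slightly perplexed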
The perplexity of predicting all N tokens of a sentence x is formally defined as the Nth root of the inverse of the product of the token probabilities:

PPL(x) = \left( \prod_{i=1}^{N} p(x_i \mid x_{<i}) \right)^{-1/N}

However, to ensure numerical stability, we can define it in terms of the log function:

PPL(x) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)

which is e (2.71828…) raised to the power of the average negative log-likelihood of the predicted token being the ground-truth token.
Training and validation perplexity
The model’s training and validation perplexity can be computed directly from the batch or epoch loss.
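For example, assuming the reported loss is the mean per-token cross-entropy (the standard setup for language-model training), the conversion is a one-liner:

import math

# Hypothetical value: the average per-token cross-entropy loss over a batch or epoch.
mean_cross_entropy_loss = 3.2
train_perplexity = math.exp(mean_cross_entropy_loss)
print(f"Training perplexity: {train_perplexity:.2f}")  # ~24.53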
Prediction perplexity
One can’t compute a prediction perplexity at generation time, since that would require a set of ground-truth labels for each prediction. However, when we are given an existing piece of text to evaluate, the text itself supplies the ground-truth tokens, and that is what we exploit below.
PyTorch code to compute perplexity
Let’s assume that the variable probs is a torch.Tensor of shape (sequence_length,), which contains the probability of the ground-truth token being predicted by the language model at that position in the sequence.
The per-token perplexity can be computed using this code.
token_perplexity = (probs.log() * -1.0).exp()
print(f"Token Perplexity: {token_perplexity}")
The sentence perplexity can be computed using this code.
# The perplexity is e^(average NLL).
sentence_perplexity = (probs.log() * -1.0).mean().exp().item()
print(f"Sentence Perplexity: {sentence_perplexity:.2f}")
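To make the two snippets above runnable end-to-end, here is a toy probs tensor (the values are made up purely for illustration):

import torch

# Probability assigned to the ground-truth token at each position.
probs = torch.tensor([0.8, 0.6, 0.05, 0.9])

token_perplexity = (probs.log() * -1.0).exp()
print(f"Token Perplexity: {token_perplexity}")  # tensor([ 1.2500,  1.6667, 20.0000,  1.1111])

sentence_perplexity = (probs.log() * -1.0).mean().exp().item()
print(f"Sentence Perplexity: {sentence_perplexity:.2f}")  # ~2.61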
Next, let’s look at the code that computes this per-token probability given a sentence.
from typing import List

import torch
import torch.nn as nn

# This is a method of a class that holds the trained model (self.model)
# and the device it runs on (self.device).
def get_pointwise_loss(self, inputs: List[str], tok):
    self.model.eval()
    all_probs = []
    with torch.inference_mode():
        for input in inputs:
            ids_list: List[int] = tok.encode(input).ids
            # ids has shape (1, len(ids_list))
            ids: torch.Tensor = torch.tensor(ids_list, device=self.device).unsqueeze(0)
            # y holds the model's next-token logits at each position. Note that
            # this code assumes the model returns logits of shape
            # (batch, vocab_size, seq_len), which is what nn.CrossEntropyLoss
            # expects for 3-d inputs.
            y = self.model(ids)
            criterion = nn.CrossEntropyLoss(reduction='none', ignore_index=0)
            # Compute the loss starting from the 2nd token in the model's output.
            loss = criterion(y[:, :, :-1], ids[:, 1:])
            # To compute the probability of each token, we need to compute the
            # negative of the log loss and exponentiate it.
            loss = loss * -1.0
            # Set the positions we are not interested in (ignored padding
            # positions, where the loss is 0.0) to -inf so that exponentiating
            # maps them to a probability of 0.0.
            loss[loss == 0.0] = float("-inf")
            # probs holds the probability of each token's prediction starting
            # from the 2nd token, since we don't want to include the probability
            # of the model predicting the beginning of a sentence given no
            # existing sentence context.
            #
            # To compute perplexity, we should probably ignore the first handful
            # of predictions of the model since there's insufficient context.
            # We don't do that here, though.
            probs = loss.exp()
            all_probs.append(probs)
    return all_probs
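As a hedged usage sketch (the detector object, the tokenizer tok, and the input sentence are hypothetical stand-ins for however you wrap your trained model):

# 'detector' is assumed to be an instance of the class that owns
# get_pointwise_loss, holding self.model and self.device.
sentences = ["It's a sunny day outside."]
all_probs = detector.get_pointwise_loss(sentences, tok)
for probs in all_probs:
    sentence_perplexity = (probs.log() * -1.0).mean().exp().item()
    print(f"Sentence Perplexity: {sentence_perplexity:.2f}")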
Now that we know something about how language models function, and how we can compute the per-token as well as per-sentence perplexity, let’s try to tie it all together and take a look at how one can leverage this information to build a tool that can detect if some text was AI-generated.
Detecting AI-generated text
We have all the ingredients we need to check if a piece of text is AI-generated. Here’s everything we need:
1. The text (sentence or paragraph) we wish to check.
2. The tokenized version of this text, tokenized using the tokenizer that was used to tokenize the training dataset for this model.
3. The trained language model.
Using 1, 2, and 3 above, we can compute the following:
1. Per-token probability as predicted by the model.
2. Per-token perplexity using the per-token probability.
3. Total perplexity for the entire sentence.
4. The perplexity of the model on the training dataset.
To check if a text is AI-generated, we compare the sentence perplexity with the model’s training perplexity scaled by a fudge factor, alpha: if ppx(x) > alpha * ppx_training, then the text is probably human-written (i.e., not AI-generated); otherwise, it is possibly AI-generated. Here, ppx(x) means the perplexity of the input x, and ppx_training is the model’s perplexity on its training dataset.
The reasoning is that we expect the model to not be perplexed by text it would generate itself, so if it encounters text that it itself would not generate, there’s reason to believe that the text isn’t AI-generated. If the perplexity of the sentence is less than or equal to the scaled training perplexity, then it’s likely that the text was generated using this language model, but we can’t be very sure. A human could have written that text, and it just happens to be something the model could also have generated. After all, the model was trained on a lot of human-written text, so in some sense, it represents an “average human’s writing”.
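As a minimal sketch, the decision rule above can be written as follows (alpha is the fudge factor; a value slightly above 1.0, such as the 1.1 used in the plotting code later, is one reasonable choice):

def is_probably_human_written(text_ppx: float, training_ppx: float, alpha: float = 1.1) -> bool:
    # Text that perplexes the model more than its (scaled) training perplexity
    # is probably human-written; otherwise it is possibly AI-generated.
    return text_ppx > alpha * training_ppx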
Next, let’s take a look at examples of human-written v/s AI-generated text.
Examples of AI-generated v/s human written text
We’ve written some Python code that colours each token in a sentence based on its perplexity relative to the model’s perplexity. The first token is always coloured black, since we don’t consider its perplexity. Tokens whose perplexity is less than or equal to the model’s perplexity with scaling are coloured red, indicating that they may be AI-generated, whereas tokens with higher perplexity are coloured green, indicating that they were definitely not AI-generated.
The numbers in the square brackets before the sentence indicate the perplexity of the sentence as computed using the language model. Note that some words are part red and part green. This is because we used a subword tokenizer.
Here’s the code that generates the HTML above.
def get_html_for_token_perplexity(tok, sentence, tok_ppx, model_ppx):
    tokens = tok.encode(sentence).tokens
    cleaned_tokens = []
    for word in tokens:
        # Replace the byte-level tokenizer's space-marker character
        # (code point 288, 'Ġ') with a regular space so the token renders
        # correctly in HTML.
        m = list(map(ord, word))
        m = list(map(lambda x: x if x != 288 else ord(' '), m))
        m = list(map(chr, m))
        m = ''.join(m)
        cleaned_tokens.append(m)

    # The first token is always rendered in black since we have no
    # perplexity value for it.
    html = [
        f"<span>{cleaned_tokens[0]}</span>",
    ]
    for ct, ppx in zip(cleaned_tokens[1:], tok_ppx):
        color = "black"
        if ppx.item() >= 0:
            if ppx.item() <= model_ppx * 1.1:
                # Low perplexity relative to the model: possibly AI-generated.
                color = "red"
            else:
                # High perplexity relative to the model: likely human-written.
                color = "green"
        html.append(f"<span style='color:{color};'>{ct}</span>")
    return "".join(html)
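As a hedged tie-together (probs, tok, sentence, and model_ppx are stand-ins for values produced with the earlier snippets), one way to render a sentence might be:

# 'probs' is one element of the list returned by get_pointwise_loss; it carries
# a leading batch dimension, so we squeeze it before computing per-token perplexity.
tok_ppx = (probs.squeeze(0).log() * -1.0).exp()
html = get_html_for_token_perplexity(tok, sentence, tok_ppx, model_ppx)
print(html)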
As we can see from the examples above, if a model detects some text as human-generated, it’s definitely human-generated, but if it detects the text as AI-generated, there’s a chance that it’s not AI-generated. So why does this happen? Let’s take a look next!
False positives
Our language model is trained on a LOT of text written by humans, and it’s generally hard to detect if something was written (digitally) by a specific person. The model’s training inputs comprise many, many different styles of writing, likely produced by a large number of people. This causes the model to learn many different writing styles and kinds of content. It’s very likely that your writing style closely matches the writing style of some text the model was trained on. This is the cause of false positives and is why the model can’t be sure that some text is AI-generated. However, the model can be sure that some text was human-generated.
OpenAI: OpenAI recently announced that it would discontinue its tools for detecting AI-generated text, citing a low accuracy rate (Source: Hindustan Times).
The original version of the AI classifier tool had certain limitations and inaccuracies from the outset. Users were required to input at least 1,000 characters of text manually, which OpenAI then analyzed to classify as either AI or human-written. Unfortunately, the tool’s performance fell short, as it properly identified only 26 percent of AI-generated content and mistakenly labeled human-written text as AI about 9 percent of the time.
Here’s the blog post from OpenAI. It seems like they used a different approach compared to the one mentioned in this article.
Our classifier is a language model fine-tuned on a dataset of pairs of human-written text and AI-written text on the same topic. We collected this dataset from a variety of sources that we believe to be written by humans, such as the pretraining data and human demonstrations on prompts submitted to InstructGPT. We divided each text into a prompt and a response. On these prompts, we generated responses from a variety of different language models trained by us and other organizations. For our web app, we adjust the confidence threshold to keep the false positive rate low; in other words, we only mark text as likely AI-written if the classifier is very confident.
GPTZero: Another popular AI-generated text detection tool is GPTZero. It seems like GPTZero uses perplexity and burstiness to detect AI-generated text. “Burstiness refers to the phenomenon where certain words or phrases appear in bursts within a text. In other words if a word appears once in a text, it’s likely to appear again in close proximity” (source).
GPTZero claims to have a very high success rate. According to the GPTZero FAQ, “At a threshold of 0.88, 85% of AI documents are classified as AI, and 99% of human documents are classified as human.”
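GPTZero’s exact burstiness computation isn’t public, so as an illustrative sketch only, here is one simple proxy consistent with the description quoted above: measure how closely repeated words cluster together. The function name and the averaging choice are our own assumptions, not GPTZero’s method.

def mean_repeat_gap(text: str) -> float:
    """Average distance (in words) between successive occurrences of repeated words.
    Smaller values mean repeated words cluster together, i.e. the text is 'burstier'."""
    words = text.lower().split()
    last_seen = {}
    gaps = []
    for i, w in enumerate(words):
        if w in last_seen:
            gaps.append(i - last_seen[w])
        last_seen[w] = i
    return sum(gaps) / len(gaps) if gaps else float("inf")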
The generality of this approach
The approach mentioned in this article doesn’t generalize well. What we mean by this is that if you have 3 language models, for example, GPT-3, GPT-3.5, and GPT-4, then you must run the input text through all 3 models and check the perplexity on all of them to see if the text was generated by any one of them. This is because each model generates text slightly differently, so each model needs to independently evaluate the text to see whether it may have generated it.
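A minimal sketch of this per-model check (the detectors structure and function names are our own assumptions; each perplexity_fn would wrap one model in the way shown earlier):

def models_that_may_have_generated(text, detectors, alpha=1.1):
    """detectors: list of (model_name, perplexity_fn, training_ppx) triples,
    where perplexity_fn(text) returns that model's perplexity on the text."""
    flagged = []
    for name, perplexity_fn, training_ppx in detectors:
        # A low perplexity relative to a model's own training perplexity means
        # the text is something that model could plausibly have generated.
        if perplexity_fn(text) <= alpha * training_ppx:
            flagged.append(name)
    return flagged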
With the proliferation of large language models as of August 2023, it seems infeasible to check a piece of text against every language model in the world to determine which, if any, generated it.
In fact, new models are being trained every day, and trying to keep up with this rapid progress seems hard at best.
The example below shows the result of asking our model to predict if the sentences generated by ChatGPT are AI-generated or not. As you can see, the results are mixed.
The sentences in the purple box are correctly identified as AI-generated by our model, whereas the rest are incorrectly identified as human written.
There are many reasons why this may happen.
Train corpus size: Our model is trained on very little text, whereas ChatGPT was trained on terabytes of text.
Data distribution: Our model is trained on a different data distribution as compared to ChatGPT.
Fine-tuning: Our model is just a GPT model, whereas ChatGPT was fine-tuned for chat-like responses, making it generate text in a slightly different tone. If you had a model that generates legal text or medical advice, then our model would perform poorly on text generated by those models as well.
Model size: Our model is very small (less than 100M parameters compared to > 200B parameters for ChatGPT-like models).
It’s clear that we need a better approach if we hope to provide a reasonably high-quality result to check if any text is AI-generated.
Next, let’s take a look at some misinformation about this topic circulating around the internet.
Misinformation
Some articles interpret perplexity incorrectly. For example, if you search google.com for “does human written content have high or low perplexity?”, the top result (at the time of writing) claims that human-written content has lower perplexity than AI-generated content.
This is incorrect: human-written content typically has higher perplexity than AI-generated content.
Let’s take a look at techniques that researchers in this area are exploring to do better than where we are.
What’s next?
We’ve established that detecting AI-generated text is a hard problem, and that current success rates are low enough that they aren’t much better than guessing. Let’s look at the state-of-the-art techniques that researchers in this area are exploring to get a better handle on things.
Watermarking: OpenAI and Google have promised to watermark AI generated text so that it’s possible to identify it programmatically.
The technical details of how this watermark might work are unclear and neither company has disclosed any details related to it.
Even if OpenAI and Google adopt a watermarking technique, we can’t be certain that every language model deployed out there will include a watermark. It would still be possible for people to deploy their own models to generate text and put it out in the wild. Even if companies decide to watermark generated text, it’s not clear whether this will be a standard, or whether every company will have its own proprietary strategy and a potentially paid tool to check whether any text was generated using its AI-based text-generation tools.

If it’s an open standard, there’s a chance that people will be able to work around it unless it’s something like a cryptographic cipher that requires a ton of computation to undo. If it’s not an open standard, then people will be at the mercy of these companies to provide open and free access to the tools and APIs needed to perform these checks. There’s also the question of how effective watermarks will be in the long run, since it may even be possible to train models that ingest AI-generated text with a watermark and return AI-generated text without one.
This article discusses a possible technique for adding a watermark for AI-generated text, and mentions the significant challenges with the approach.
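One family of approaches discussed in the research literature biases generation toward a pseudorandomly chosen subset (a “green list”) of the vocabulary, so that detection amounts to counting how often tokens land in that subset. As an illustrative sketch only (the fixed green list, the seed, and the function name are our own simplifications, not any published or deployed scheme):

import random

def greenlist_fraction(token_ids, vocab_size, green_fraction=0.5, seed=42):
    """Fraction of tokens that fall in a fixed pseudorandom 'green list'.
    Text generated with a matching watermark would show a fraction well above
    green_fraction; unwatermarked text would hover around it."""
    rng = random.Random(seed)
    green = set(rng.sample(range(vocab_size), int(green_fraction * vocab_size)))
    hits = sum(1 for t in token_ids if t in green)
    return hits / max(1, len(token_ids))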
Personalization: In our opinion, this problem of detecting AI-generated text is going to remain challenging in the near term. We believe that strategies will need to get more intrusive and personalized to be more effective. For example, instead of asking if some text is AI-generated it may be more reasonable to ask if this text was written by a specific person. However, this would require the system to have access to large amounts of text written by that specific person. Additionally, the problem becomes more complex if something was written by more than one person, such as this article.
Let’s look at the impact such a personalized system for detecting human-written text would have on educators and students.
If such a solution existed, educators would be more inclined to hand out individual assignments to students instead of group assignments. This would also require every student to first provide a large amount of text that they themselves have written. This could mean spending several hours in-person at a university before enrolling for classes. Surely, this would have a negative effect on the ability to teach students the importance of working together as a team to accomplish a common goal.
On the other hand, having access to AI-based text generation could free students in some cases to focus on the actual problem at hand such as performing the research or literature study instead of spending time writing out their learnings in a polished manner. One can imagine that students will end up spending more of their time learning concepts and techniques in math or science class as opposed to writing about it. That part can be taken care of by the AI.
Conclusion
In this article, we built intuition around how one can detect AI-generated text. The main metric we can use is the perplexity of the generated text. We saw some PyTorch code to check if a given text may be AI-generated using the perplexity of that text. We also saw some of the drawbacks of this approach, including the possibility of false positives. We hope this helps you understand and appreciate the nitty-gritty details behind detecting AI-generated text.
This is a constantly evolving space and researchers are trying hard to figure out a way to detect AI-generated text with higher accuracy. The impact of this technology to our lives promises to be significant, and in many ways unknown.
While we discussed the techniques for detecting AI-generated text, we assumed that the entire text is either human-written or AI-generated. In practice, text tends to be partially human-written and partially AI-generated, and this complicates things considerably.
If you want to read about additional approaches to detect AI-generated text, such as the use of the burstiness metric, you can read about them here.
All the images in this article (except for the first one) were created by the author(s).