OpenAI Shuts Down AI-Written Text Detector. Why Should We Believe Other AI Detection Software Works?
False positives are common and problematic
Can machines really tell whether content is written by a generative AI tool? The answer seems to be no, despite the extremely high success rates suggested by companies that sell these verification tools. In January, OpenAI introduced an AI classifier to identify AI-written text (see the second excerpt below for what happened next). The original blog post said:
We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe good classifiers can inform mitigations for false claims that AI-generated text was written by a human: for example, running automated misinformation campaigns, using AI tools for academic dishonesty, and positioning an AI chatbot as a human.
Our classifier is not fully reliable [emphasis OpenAI]. In our evaluations on a “challenge set” of English texts, our classifier correctly identifies 26% of AI-written text (true positives) as “likely AI-written,” while incorrectly labeling human-written text as AI-written 9% of the time (false positives). Our classifier’s reliability typically improves as the length of the input text increases. Compared to our previously released classifier, this new classifier is significantly more reliable on text from more recent AI systems.
So OpenAI, a company that arguably knows more about how large language models work than anyone else, said its classifier was correct only 26% of the time in detecting AI-written content, and it wrongly assigned an AI label to entirely human-written content about one in eleven times. That was in January 2023. Surely, six months later, it had improved. Well, no.
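Those two numbers together are worse than either looks alone. A minimal sketch of what they imply for anyone reading a "likely AI-written" flag (the 40% AI / 60% human mix is an assumed illustration, not OpenAI data):

```python
# OpenAI's reported rates for its withdrawn classifier.
tpr = 0.26   # true positive rate: AI-written texts correctly flagged
fpr = 0.09   # false positive rate: human-written texts wrongly flagged

# Assumed mix of submissions -- purely illustrative.
ai_share = 0.40
human_share = 1 - ai_share

flagged_ai = ai_share * tpr        # AI texts that get flagged
flagged_human = human_share * fpr  # human texts that get flagged

# Of everything the classifier flags, how much is actually AI-written?
precision = flagged_ai / (flagged_ai + flagged_human)
print(f"Share of 'likely AI-written' flags that are correct: {precision:.0%}")
```

Even under this assumed mix, roughly a third of the texts flagged as AI-written would have been written by humans.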
Last week, OpenAI quietly added a note to the original blog post introducing the AI Classifier, indicating its withdrawal.
As of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy. We are working to incorporate feedback and are currently researching more effective provenance techniques for text, and have made a commitment to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated.
OpenAI’s Caution is Uncommon
Companies that have created standalone solutions for identifying AI-written text have not expressed similar caution. Consider this statement in March from the chief product officer at Turnitin:
Let’s understand what a false positive in AI writing detection means. A false positive refers to incorrectly identifying fully human-written text as AI-generated.
It’s first important to emphasize that Turnitin’s AI writing detection focuses on accuracy—if we say there’s AI writing, we’re very sure there is. Our efforts have primarily been on ensuring a high accuracy rate accompanied by a less than 1% false positive rate, to ensure that students are not falsely accused of any misconduct.
That is a pretty strong claim about a low false positive rate. I received an early copy of the press release about the new solution and posed questions to the PR representative that were never answered. Here are some of the specific questions.
Can you provide data that validates the 98% [accuracy] claim? OpenAI has said its own tools are only about 25% accurate, and they know the models better than anyone. Given such a broad disparity, I'd like to see the data behind the 98% confidence score.
Is the 98% applicable across all domains or is it specific to some topics or categories?
What are the common characteristics of the items that fall into the 2% failure rate?
Based on the statement, I believe the 98% score is related to correctly identifying AI-generated work. What is the rate of false positives on human generated work that the system marks as AI-generated? These numbers will be different or you have a very unlikely coincidence.
I noticed in the image that the scoring says 75% of the submission is "generated by AI." So, does the 98% confidence score mean your system is 98% confident that some portion of the piece was generated by AI, with the 75% estimate as a separate calculation? Or is the system 98% confident that 75% of the article was generated by AI? Or maybe something else?
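The distinction behind the fourth question — accuracy on AI-written work versus the false positive rate on human work — is easy to see with a toy confusion matrix (all counts here are hypothetical, chosen only to show that the two metrics diverge):

```python
# Hypothetical evaluation counts -- not Turnitin data.
ai_papers, human_papers = 100, 900
ai_flagged = 98       # AI-written papers correctly flagged
human_flagged = 45    # human-written papers wrongly flagged

# Two different rates that a single "98% accurate" claim can blur.
detection_accuracy = ai_flagged / ai_papers          # rate on AI work
false_positive_rate = human_flagged / human_papers   # rate on human work

print(f"Accuracy on AI-written papers: {detection_accuracy:.0%}")
print(f"False positive rate on human papers: {false_positive_rate:.0%}")
```

A vendor could truthfully advertise the first number while the second one is what determines how many innocent students get flagged.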
The PR representative never answered these questions, though I am certain she saw the request, nor did she reply to my follow-up. Why avoid such basic questions? Big claims demand evidence, particularly when the creators of the core technology are expressing caution and the impact on falsely accused students could be severe.
Should potential customers be wary of the claims if the evidence is not forthcoming? Is "trust us" an appropriate response?
WaPo Contests the Claims
Maybe one reason the PR rep was initially hesitant to respond is that she was not versed in interpreting statistics in this domain. I fully expected her to ask the team for clarification and the data. However, my initial request was on March 29th and the follow-up was in early April, so there has been ample time to respond. This strikes me as a choice to avoid transparency.
Another reason may be The Washington Post’s April 3rd story calling Turnitin’s claims into question.
High school senior Lucy Goetz got the highest possible grade on an original essay she wrote about socialism. So imagine her surprise when I told her that a new kind of educational software I’ve been testing claimed she got help from artificial intelligence.
A new AI-writing detector from Turnitin — whose software is already used by 2.1 million teachers to spot plagiarism — flagged the end of her essay as likely being generated by ChatGPT.
I asked Turnitin for early access to its software. Five high school students, including Goetz, volunteered to help me test it by creating 16 samples of real, AI-fabricated and mixed-source essays to run past Turnitin’s detector.
The result? It got over half of them at least partly wrong. Turnitin accurately identified six of the 16 — but failed on three, including a flag on 8 percent of Goetz’s original essay. And I’d give it only partial credit on the remaining seven, where it was directionally correct but misidentified some portion of ChatGPT-generated or mixed-source writing.
Turnitin claims its detector is 98 percent accurate overall. And it says situations such as what happened with Goetz’s essay, known as a false positive, happen less than 1 percent of the time, according to its own tests.
Turnitin’s detector faces other important technical limitations, too. In the six samples it got completely right, they were all clearly 100 percent student work or produced by ChatGPT. But when I tested it with essays from mixed AI and human sources, it often misidentified the individual sentences or missed the human part entirely. And it couldn’t spot the ChatGPT in papers we ran through Quillbot, a paraphrasing program that remixes sentences.
“I am worried they’re marketing it as a precision product, but they’re using dodgy language about how it shouldn’t be used to make decisions,” [said Ian Linkletter, emerging technology and open-education librarian at the British Columbia Institute of Technology]. “They’re working at an accelerated pace not because there is any desperation to get the product out but because they’re terrified their existing product is becoming obsolete.”
So, should we believe claims by a company with an economic interest that its solution works nearly flawlessly, or should we consider data accumulated through direct use?
The good news for Turnitin is that its solution may not be as bad as others. In February, TechCrunch reported an analysis of seven AI-writing detection tools: the OpenAI classifier, AI Writing Check, GPTZero, Copyleaks, GPT Radar, CatchGPT, and Originality.ai.
After all that testing, what conclusions can we draw? Generally speaking, AI-text detectors do a poor job of … well, detecting. GPTZero was the only consistent performer, classifying AI-generated text correctly five out of seven times. As for the rest … not so much. CatchGPT was second best in terms of accuracy with four out of seven correct classifications, while the OpenAI classifier came in distant third with one out of seven.
Most of the detection solutions are wrong more than half of the time. That was also true of The Washington Post’s ad hoc test of Turnitin. OpenAI, the developer of the most widely used large language model (LLM) that generates AI-written text, says its classifier was only correct about 26% of the time.
If you are correct one in four times, the results are inferior to a coin flip. At 50% accuracy, you merely match one. Let that sink in. Unless a solution is meaningfully above 50% accuracy, you are no better off than flipping a coin.
Let’s say Turnitin has a false positive rate of 2%. If you write 50 papers that are assessed for AI content, you are more likely than not to be falsely accused of submitting an AI-written paper at least once. And the 2% figure is probably optimistic based on the evidence we have seen. Imagine the negative impact if the false positive rate is closer to 25% or 50%. That would call into question both the usefulness and the ethics of using a tool like this.
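The 50-paper arithmetic works out like this (a sketch using the hypothetical 2% rate from above and assuming each check is independent):

```python
# Probability of at least one false AI accusation across n papers,
# assuming independent checks at false positive rate p.
p = 0.02   # hypothetical false positive rate from the text
n = 50     # papers assessed for AI content

at_least_one = 1 - (1 - p) ** n
print(f"Chance of at least one false positive in {n} papers: {at_least_one:.0%}")
```

Under these assumptions the chance is roughly 64% — better than even odds of a false accusation over a student's 50 papers.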
Where Does This Matter?
AI fails are often driven by enthusiasm for test results that seem to support a claim of efficacy. I’d like to think that most of these claims that turn out to be untrue are simply mistakes, as opposed to a willful attempt to mislead. However, I also recognize that few companies are as comfortable as OpenAI in admitting failure. Many people will be incentivized to “fake it until they [hopefully] make it,” or until they land a new job.
Another question no one seems to be asking: how important is it to know whether content was fully or partially created by AI? I suspect this concern will wane over time because AI will contribute at least partially to most content. The more important question will be transparency about the content publisher. Who stands behind the content matters more than whether they received assistance creating it.
How many articles “written” by CEOs of large companies are fully composed by them? Very few. In this case, they are receiving assistance from other writers. However, the CEOs are accountable because they put their names on the articles. How is using ChatGPT different than this practice? Again, what’s important is the message and who is standing behind it … most of the time.
However, there are exceptions, and education is certainly one of them. In that case, the provenance of the content is as important as the person taking responsibility for the information behind it. Otherwise, educators cannot assess learning and achievement.
The Russell Group of universities in the UK announced AI principles that encourage using and mastering generative AI tools, but doing so ethically. They want students to know how and when to employ generative AI, but not to do so when it is improper or prohibited.
Many people are approaching generative AI as an entirely new phenomenon. Remember that students could turn in papers written by another student or professional writer. They can now use generative AI. These issues are not novel.
Some educators have concluded that they may need to change their assessment techniques to account for the potential for improper use of generative AI tools. Oral exams have been suggested as an option to measure knowledge mastery. This is also a method that educational institutions once embraced but have largely abandoned.
So, when you approach questions about detecting AI-generated content, avoid the knee-jerk reaction of accepting claims that it introduces new problems or creates existential risks for education assessment or other activities. Neither is likely to be true. Assess how important the need is for each use case and what alternative approaches exist.
Also, demand evidence to support claims that software can detect AI-written content with high accuracy. The track record suggests skepticism should be your default position.