ChatGPT outperforms college students in entry-level courses, but falls short later



“Since the rise of large language models such as ChatGPT, there have been many anecdotal reports of students submitting AI-generated work as exam assignments and achieving good grades. So we tested our university’s exam system against AI cheating in a controlled experiment,” says Peter Scarfe, a researcher at the University of Reading’s School of Psychology and Clinical Language Sciences.

His team created more than 30 fake psychology student accounts and used them to submit ChatGPT-4-produced answers to exam questions. The anecdotal reports were true: the use of AI went virtually unnoticed, and on average, ChatGPT scored higher than human students.

Rules of Engagement

The Scarfe team submitted AI-generated work in five university modules, covering classes required across the three years of study for a bachelor’s degree in psychology. The submissions consisted of 200-word answers to short questions or more elaborate essays of roughly 1,500 words. “The exam markers did not know about the experiment. In a sense, the participants didn’t know they were taking part in the study either, but we had the necessary permissions to go ahead,” says Scarfe.

Shorter submissions were prepared by simply copying and pasting the exam questions into ChatGPT-4, along with a prompt to keep the answer under 160 words. Essays were requested the same way, but with the word limit raised to 2,000. Setting the limits this way allowed the Scarfe team to get ChatGPT-4 to produce content close enough to the required length. “The idea was to submit those answers without any editing, apart from the essays, where we applied minimal formatting,” says Scarfe.
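For illustration, here is a minimal Python sketch of how such a word-limited prompt could be issued programmatically through OpenAI’s chat API. This is an assumption on our part: the study pasted questions into the ChatGPT interface rather than publishing API code, so the model name, prompt wording, and example question below are hypothetical.

```python
# Minimal sketch, assuming the OpenAI Python SDK (v1+); the study itself used
# the ChatGPT interface, so the prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_answer(question: str, word_limit: int) -> str:
    """Ask the model for an exam answer capped at a given word count."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the "ChatGPT-4" referred to in the article
        messages=[{
            "role": "user",
            "content": f"Answer the following exam question in no more than "
                       f"{word_limit} words:\n\n{question}",
        }],
    )
    return response.choices[0].message.content


# Short-answer questions were capped at 160 words; essays at 2,000.
print(draft_answer("Outline the main limitations of working memory.", 160))
```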

In total, Scarfe and his colleagues fed 63 AI-generated submissions into the exam system. Even without editing or any effort to hide the use of AI, 94 percent of them went undetected, and nearly 84 percent scored better (by about half a grade) than a randomly selected group of students who took the same exams.
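In absolute terms, a rough back-of-the-envelope conversion (the article reports only percentages, so the counts below are rounded estimates) looks like this:

```python
# Rough conversion of the reported percentages into counts; the article gives
# only percentages, so these are rounded estimates, not official figures.
submissions = 63
undetected = round(0.94 * submissions)        # ~59 submissions slipped through
outscored_humans = round(0.84 * submissions)  # ~53 beat the human comparison group
print(undetected, outscored_humans)           # 59 53
```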

“We did a series of debriefings with the people who were marking those exams, and they were pretty surprised,” says Scarfe. Part of the reason was that most of the AI submissions that were detected weren’t flagged for being too repetitive or robotic; they were flagged for being too good.

Which raises the question: What do we do about it?

AI Detection Software

“During this study, we did a lot of research into techniques for detecting AI-generated content,” says Scarfe. One such tool is GPTZero; others include OpenAI’s own AI text classifier and AI writing detection systems such as the one created by Turnitin, a company specialising in plagiarism-detection tools.

“The problem with these tools is that they typically work well in a lab, but their performance drops significantly in the real world,” Scarfe explained. OpenAI estimated that its classifier could flag AI-generated text as “likely AI-written” only 26 percent of the time, with a rather worrying false positive rate of 9 percent. Turnitin’s system, on the other hand, was advertised as detecting 97 percent of texts written with ChatGPT and GPT-3 in the lab, with only one false positive per hundred attempts. But, according to Scarfe’s team, the beta version of the system that was actually released performed significantly worse.
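To see why those rates matter in practice, here is a small Python sketch that turns the advertised detection and false-positive rates into expected outcomes for a grading scenario. The cohort size and the share of AI-written scripts are made-up assumptions; only the 26/9 percent (OpenAI) and 97/1 percent (Turnitin, as advertised) figures come from the numbers above.

```python
# Sketch: expected outcomes under the advertised rates. The cohort size and
# AI share are hypothetical; only the detection and false-positive rates are
# taken from the figures quoted above.
def expected_flags(n_scripts, ai_share, detection_rate, false_positive_rate):
    """Return (AI scripts caught, human scripts wrongly flagged)."""
    ai_scripts = n_scripts * ai_share
    human_scripts = n_scripts - ai_scripts
    return ai_scripts * detection_rate, human_scripts * false_positive_rate


# Hypothetical cohort: 200 exam scripts, 10 percent of them AI-generated.
for name, tpr, fpr in [("OpenAI classifier", 0.26, 0.09),
                       ("Turnitin (lab figures)", 0.97, 0.01)]:
    caught, false_alarms = expected_flags(200, 0.10, tpr, fpr)
    print(f"{name}: ~{caught:.0f} of 20 AI scripts caught, "
          f"~{false_alarms:.0f} honest students wrongly flagged")
```

Under these assumed numbers, a 26 percent detection rate with a 9 percent false positive rate would flag far more honest students than cheats, which is exactly the real-world gap Scarfe describes.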


