A few months ago, I wrote about some experiments with prime numbers. I generated a 16-digit non-prime number by multiplying two 8-digit prime numbers, and asked ChatGPT (using GPT-3.5) whether the 16-digit number was prime. It answered correctly that the number was non-prime, but when it told me the number’s prime factors, it was clearly wrong. It also generated a short program that implemented the widely used Miller-Rabin primality test. After fixing some obvious errors, I ran the program–and while it told me (correctly) that my number was non-prime, comparing the code against a known-good implementation of Miller-Rabin showed that it made many mistakes. When it became available, GPT-4 gave me similar results. And the result itself–well, that could have been a good guess. There’s roughly a 97% chance that a randomly chosen 16-digit number will be non-prime.
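For reference, a correct Miller-Rabin takes only a dozen or so lines of Python. The sketch below is mine, not the code ChatGPT generated, and the two 8-digit primes are examples I chose (10000019 and 99999989, the smallest and largest 8-digit primes), not the ones from my original experiment. The 97% figure, incidentally, checks out: by the prime number theorem, the density of primes near 10^16 is about 1/ln(10^16), or roughly 2.7%.

```python
def is_probable_prime(n: int) -> bool:
    """Miller-Rabin primality test. With the fixed witness bases below,
    the test is deterministic for all n < 3.3 * 10**24, which easily
    covers 16-digit numbers."""
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small_primes:  # the witness bases
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # base a proves that n is composite
    return True

# A 16-digit composite built from two 8-digit primes:
p, q = 10000019, 99999989
print(is_probable_prime(p), is_probable_prime(q))  # True True
print(is_probable_prime(p * q))                    # False
```

The heavy lifting is done by Python’s built-in three-argument `pow`, which performs modular exponentiation efficiently; that one detail is where naive implementations usually go wrong.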
OpenAI recently opened their long-awaited Plugins feature to users of ChatGPT Plus (the paid version) using the GPT-4 model. One of the first plugins was from Wolfram, the makers of Mathematica and Wolfram Alpha. I had to try this! Specifically, I was compelled to re-try my prime test. And everything worked: ChatGPT sent the problem to Wolfram, which determined that the number was not prime and gave me the correct prime factors. It didn’t generate any code, but provided a link to the Wolfram Alpha result page that described how to test for primality. Going through ChatGPT to Wolfram and back was painfully slow, much slower than using Wolfram Alpha directly or writing a few lines of Python. But it worked and, for fans of prime numbers, that’s a plus.
I was still uncomfortable. How does ChatGPT decide what to offload to Wolfram Alpha, and what to handle on its own? I tried a few questions from calculus; unsurprisingly, they went to Wolfram. Then I got really simple: “How much is 3 + 5?” No Wolfram, and I wasn’t surprised when ChatGPT told me the answer was 8. But that raised another question: what about more complex arithmetic? So I asked “How much is 123456789 + 98776543321?”, a problem that any elementary school student who has learned how to carry could solve. Again, no Wolfram, but this time the answer was incorrect.
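The correct answer is easy to verify; Python’s arbitrary-precision integers get it right by construction, which is exactly why token-by-token text prediction is the wrong tool for this job:

```python
# The sum ChatGPT attempted on its own instead of calling Wolfram:
total = 123456789 + 98776543321
print(total)        # 98900000110
print(f"{total:,}") # 98,900,000,110
```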
We’ve long known that ChatGPT was poor at arithmetic, in addition to being poor at more advanced math. The Wolfram plugin solves the math problem with ease. However, ChatGPT is still poor at arithmetic, and still attempts to do arithmetic on its own. The important question that I can’t answer is “when does a problem become complex enough to send to the plugin?” The plugin is a big win, but not an unqualified one.
ChatGPT’s tendency to make up citations is another well-known problem. A few weeks ago, a story circulated about a lawyer who used ChatGPT to write a brief. ChatGPT cited a lot of case law, but made up all the citations. When a judge asked him to produce the actual case law, the lawyer went back to ChatGPT–which obediently made up the cases themselves. The judge was not pleased. But now there’s a plugin for that: the ScholarAI plugin searches academic databases for citations, and returns links. That wouldn’t have helped this lawyer (I don’t yet see plugins from Westlaw or LexisNexis), but it’s worth asking: does a plugin fix the citation problem?
I first tried asking a medical question. I’m not a doctor, so the question was simple: what’s the latest research on antibiotic-resistant bacteria? ChatGPT sent the question to ScholarAI, and I got back a long list of relevant citations. (The plugin appeared to get into a loop, so I eventually terminated the output.) While I’m not competent to evaluate the quality or relevance of the papers, all the links were valid: the papers were real, and the author names were correct. No hallucinations here.
I followed up with some questions about English literature (I have a PhD in the field, so I can make up real questions). I didn’t get as many citations in return, possibly because literary scholars don’t have preprint servers like ArXiv, and have done little to protest journals’ proprietary lock on scholarship. However, the citations I got were valid: real books and articles, with the authors listed correctly.
That raised another question, though. A list of articles is certainly useful, but you still have to read them all to write the paper. Could ChatGPT write an essay for me? I asked it to write about colonialism in the work of Salman Rushdie, and got a passable short essay. There were citations, and they were real; ChatGPT didn’t link to the publications cited, but Google made it easy to find them. The resulting essay didn’t demonstrate any familiarity with the articles beyond the abstract–fair enough, since for most of the sources, the abstract was all that was publicly available. More to the point, the essay didn’t really make any connections to Rushdie’s fiction. There were many sentences like this: “Hamish Dalley discusses the role of the historical novel in postcolonial writing, a genre to which many of Rushdie’s works belong.” True, but that doesn’t say much about either Rushdie’s work or Dalley’s. As I said, the essay was passable, but if I had to grade it, the student who turned it in wouldn’t have been happy. Still, ChatGPT and ScholarAI get credit for doing a decent literature search that could be the basis for an excellent paper. And if a student took this initial prompt, read the academic articles along with Rushdie’s novels, and used that to write a more detailed prompt telling ChatGPT exactly what points he wanted to make, with relevant quotations, the result could have been excellent. An essay isn’t an exercise in producing N*1000 words; it’s the outcome of a thought process that involves engaging with the subject matter. If ChatGPT and ScholarAI facilitate that engagement, I wouldn’t object. But let’s be clear: regardless of who generates the words, ChatGPT’s users still have to do the reading and thinking.
As with the Wolfram plugin, it’s helpful to understand when ChatGPT is using ScholarAI, and when it isn’t. I asked ChatGPT to find articles by me; when using the plugin, it couldn’t find any, although it apologetically gave me a list of articles whose authors had the first name Michael. The sad list of Michael-authored articles notwithstanding, I’ll count that response as “correct.” I haven’t published any academic papers, though I have published a lot on O’Reilly Radar–material that any web search can find, without the need for AI or the risk of hallucination.
If you dig a bit deeper, the results are puzzling. If you use ChatGPT with plugins enabled and write a prompt that tells it not to use the plugin, it comes up empty, but suggests that you research online databases like Google Scholar. If you start a new conversation and do not enable plugins (plugins can only be enabled or disabled at the start of a conversation), you still get nothing–but ChatGPT does tell you that Michael Loukides is a well-known author who has frequently written for O’Reilly, and to check on the O’Reilly website for articles. (It isn’t clear whether these different responses have to do with the state of the plugin, or the way ChatGPT randomizes its output.) Flattery will get you somewhere, I suppose, but not very far. My publication history with O’Reilly goes back to the 1990s, and is all public; it’s not clear why ChatGPT is unaware of it. Starting a new conversation with Bing searches enabled got me a list of valid links to articles that I’ve written–but I shouldn’t have had to try three times, the process was much slower than searching with Bing (or Google) directly, and it wasn’t clear why some articles were included and some weren’t. And you really do have to try multiple times: you can’t use both Bing searches and plugins in the same conversation.
As with the Wolfram plugin, ScholarAI is a big improvement–but again, not an unqualified one. You still have to know whether the content you’re looking for is in an academic journal, on the web, or somewhere else. While ChatGPT tells you when it is using a plugin, and which plugin it is using, you can’t always predict what it will do in advance–and when it doesn’t use a plugin, ChatGPT is vulnerable to the same errors we’ve come to expect. You still have to experiment, and you still have to check the results.
As another test, I used the Kayak plugin to check out flights for some trips I might take. The plugin does a good job with major airports (including smaller ones), though it seemed to be hit-or-miss with very small airports, like New Haven (HVN). That’s a limitation of Kayak, rather than the plugin itself or ChatGPT. You currently have to enable the plugins you’re going to use at the start of each conversation, and ChatGPT doesn’t allow you to enable competing plugins. You can install both Kayak and Expedia, but you can only use one in any chat. I wouldn’t be surprised if this behavior changes as plugins mature.
Finally: all the plugins I installed were free of charge. However, I don’t think it’s called the “plugin store” for nothing. It wouldn’t surprise me to see charges for some plugins, or to see plugins that eventually require a subscription to a paid account. A number of the plugins access subscription-based services; I expect that subscriptions will be required once we’re out of the beta period.
I’m excited that plugins have finally arrived. Plugins are still in beta, so their behavior will almost certainly change; the behaviors I’ve described may have changed by the time you read this. Several changed while I was writing this article. Plugins certainly don’t eliminate the need to be careful about hallucinations and other kinds of errors, nor do they replace the need for thinking. But it’s hard to overstate how important it is that ChatGPT can now reach out and access current data. When ChatGPT was limited to data before November 2021, it was an intriguing toy. It’s looking more and more like a tool.