A couple of days ago, I was thinking about what you needed to know to use ChatGPT (or Bing/Sydney, or any similar service). It’s easy to ask it questions, but we all know that these large language models frequently generate false answers. Which raises the question: If I ask ChatGPT something, how much do I need to know to determine whether the answer is correct?
So I did a quick experiment. As a short programming project, a number of years ago I made a list of all the prime numbers less than 100 million. I used this list to create a 16-digit number that was the product of two 8-digit primes (99999787 times 99999821 is 9999960800038127). I then asked ChatGPT whether this number was prime, and how it determined whether the number was prime.
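For readers who want to reproduce the setup, here is a minimal sketch of one way to build such a list: a sieve of Eratosthenes. This is an assumption about the approach, not the original program, and the function name `primes_below` is hypothetical. The sketch also checks the multiplication quoted above.

```python
from math import isqrt


def primes_below(limit):
    """Return every prime p < limit, using a sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for i in range(2, isqrt(limit) + 1):
        if is_prime[i]:
            # Cross out multiples of i, starting at i*i.
            is_prime[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i, flag in enumerate(is_prime) if flag]


# Small demonstration; the same function works for larger limits,
# at the cost of roughly one byte of memory per candidate.
small_primes = primes_below(30)

# The product quoted in the text.
product = 99999787 * 99999821
```

Calling `primes_below(100_000_000)` would need about 100 MB for the sieve plus the memory for the resulting list, so the demonstration sticks to a small limit.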
ChatGPT correctly answered that this number was not prime. This is somewhat surprising because, if you’ve read much about ChatGPT, you know that math isn’t one of its strong points. (There’s probably a big list of prime numbers somewhere in its training set.) However, its reasoning was incorrect, and that’s a lot more interesting. ChatGPT gave me a bunch of Python code that implemented the Miller-Rabin primality test, and said that my number was divisible by 29. The code as given had a couple of basic syntactic errors, but that wasn’t the only problem. First, 9999960800038127 isn’t divisible by 29 (I’ll let you prove this to yourself). Second, after I fixed the obvious errors, the Python code looked like a correct implementation of Miller-Rabin, but the number that Miller-Rabin outputs isn’t a factor; it’s a “witness” that attests to the fact that the number you’re testing isn’t prime. The number it outputs also isn’t 29. So ChatGPT didn’t actually run the program. That’s not surprising; many commentators have noted that ChatGPT doesn’t run the code that it writes. But it also misunderstood what the algorithm does and what its output means, and that’s a more serious error.
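To make the difference between a witness and a factor concrete, here is a minimal sketch of Miller-Rabin. It is not the code ChatGPT produced; the name `find_witness` and the choice of fixed bases are assumptions made for illustration. Using the first twelve primes as bases makes the test deterministic for numbers far larger than the 16-digit one here.

```python
# First twelve primes; testing these bases is known to be sufficient
# for every n below roughly 3.3 * 10**24.
BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def find_witness(n, bases=BASES):
    """Return a base that proves odd n > 2 composite, or None if no base does.

    The returned value is a Miller-Rabin witness, not a factor of n.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("expects an odd integer greater than 2")
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        if a % n == 0:
            continue  # base is a multiple of n; it tells us nothing
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue  # this base is consistent with n being prime
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return a  # a proves n composite
    return None


N = 9999960800038127
witness = find_witness(N)
divisible_by_29 = (N % 29 == 0)
```

For this number, `find_witness(N)` returns a small base and `N % witness` is nonzero, so the witness proves compositeness without dividing the number. `divisible_by_29` comes out `False`.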
I then asked it to reconsider the rationale for its previous answer, and got a very polite apology for being incorrect, together with a different Python program. This program was correct from the start. It was a brute-force primality test that tried each integer (both odd and even!) up to the square root of the number under test. Neither elegant nor performant, but correct. But again, because ChatGPT doesn’t actually run the program, it gave me a new list of “prime factors,” none of which were correct. Interestingly, it included its expected (and incorrect) output in the code:
n = 9999960800038127
factors = factorize(n)
print(factors) # prints [193, 518401, 3215031751]
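You don't need to run the brute-force search to see that the claimed output is wrong: multiplying the three "factors" back together is enough. Below is a minimal sketch of a trial-division `factorize` in the spirit described above. It is an illustrative reconstruction, not ChatGPT's actual program, and it ends with a quick check of the claimed output.

```python
from math import prod


def factorize(n):
    """Return the prime factors of n (with repetition) by trial division."""
    factors = []
    d = 2
    # Try every integer, odd and even, up to the square root of what remains.
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever is left over is itself prime
    return factors


n = 9999960800038127
claimed = [193, 518401, 3215031751]

# A correct factorization must multiply back to n.
claimed_is_consistent = (prod(claimed) == n)
actual_is_consistent = (99999787 * 99999821 == n)
```

Here `claimed_is_consistent` is `False` and `actual_is_consistent` is `True`. Calling `factorize(n)` on the 16-digit number works too, but the loop runs close to 100 million iterations before it finds the first factor, so expect it to take a while in pure Python.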
I’m not claiming that ChatGPT is useless; far from it. It’s good at suggesting ways to solve a problem, and can lead you to the right solution, whether or not it gives you a correct answer. Miller-Rabin is interesting; I knew it existed, but wouldn’t have bothered to look it up if I hadn’t been prompted. (That’s a nice irony: I was effectively prompted by ChatGPT.)
Getting back to the original question: ChatGPT is good at providing “answers” to questions, but if you need to know that an answer is correct, you must be capable either of solving the problem yourself or of doing the research you’d need to solve it. That’s probably a win, but you have to be wary. Don’t put ChatGPT in situations where correctness is an issue unless you’re willing and able to do the hard work yourself.