O’Reilly Media – Getting the Right Answer from ChatGPT

A couple of days ago, I was thinking about what you needed to know to use ChatGPT (or Bing/Sydney, or any similar service). It’s easy to ask it questions, but we all know that these large language models frequently generate false answers. Which raises the question: If I ask ChatGPT something, how much do I need to know to determine whether the answer is correct?

So I did a quick experiment. As a short programming project, a number of years ago I made a list of all the prime numbers less than 100 million. I used this list to create a 16-digit number that was the product of two 8-digit primes (99999787 times 99999821 is 9999960800038127). I then asked ChatGPT whether this number was prime, and how it determined whether the number was prime.
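The arithmetic in that parenthetical is easy to confirm. Here is a minimal check in Python; the variable names `p`, `q`, and `n` are just labels chosen for this sketch.

```python
# The two 8-digit primes quoted above and their product.
p = 99999787
q = 99999821
n = p * q

print(n)                      # 9999960800038127
print(len(str(n)))            # 16 digits
```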

ChatGPT correctly answered that this number was not prime. This is somewhat surprising because, if you’ve read much about ChatGPT, you know that math isn’t one of its strong points. (There’s probably a big list of prime numbers somewhere in its training set.) However, its reasoning was incorrect, and that’s a lot more interesting. ChatGPT gave me a bunch of Python code that implemented the Miller-Rabin primality test, and said that my number was divisible by 29. The code as given had a couple of basic syntactic errors, but that wasn’t the only problem. First, 9999960800038127 isn’t divisible by 29 (I’ll let you prove this to yourself). Second, after I fixed the obvious errors, the Python code looked like a correct implementation of Miller-Rabin, but the number that Miller-Rabin outputs isn’t a factor; it’s a “witness” that attests to the fact that the number you’re testing isn’t prime. The number it outputs also isn’t 29. So ChatGPT didn’t actually run the program. That’s not surprising: many commentators have noted that ChatGPT doesn’t run the code that it writes. It also misunderstood what the algorithm does and what its output means, and that’s a more serious error.
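To make the difference between a witness and a factor concrete, here is a minimal sketch of Miller-Rabin in Python. It is not the code ChatGPT produced (that isn't reproduced here); the function name `miller_rabin_witness` and the fixed list of small prime bases are assumptions made for this illustration. The function returns a base that proves the number composite, or `None` if none of the bases does.

```python
def miller_rabin_witness(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Return a base that proves n composite, or None if no tested base does.

    The returned value is a *witness* to compositeness, not a factor of n.
    """
    if n < 2:
        raise ValueError("n must be at least 2")

    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1

    for a in bases:
        if a % n == 0:
            continue  # this base tells us nothing about n
        x = pow(a, d, n)  # modular exponentiation: a**d mod n
        if x == 1 or x == n - 1:
            continue  # base a is consistent with n being prime
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break  # still consistent with n being prime
        else:
            return a  # a proves n is composite
    return None


n = 99999787 * 99999821
witness = miller_rabin_witness(n)

print(witness is not None)   # True: some base proves n composite
print(n % witness == 0)      # False: the witness does not divide n
print(n % 29)                # nonzero, so 29 is not a factor either
```

The test says "composite" without ever finding a divisor, which is why reading its output as a factor is a misunderstanding of the algorithm and not just a slip.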

I then asked it to reconsider the rationale for its previous answer, and got a very polite apology for being incorrect, together with a different Python program. This program was correct from the start. It was a brute-force primality test that tried each integer (both odd and even!) smaller than the square root of the number under test. Neither elegant nor performant, but correct. But again, because ChatGPT doesn’t actually run the program, it gave me a new list of “prime factors”–none of which were correct. Interestingly, it included its expected (and incorrect) output in the code:

      n = 9999960800038127
      factors = factorize(n)
      print(factors) # prints [193, 518401, 3215031751]
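Checking a claimed factorization takes one multiplication, which is far less work than finding the factors. The sketch below multiplies the three numbers from that comment and compares the result with `n`. It also includes a brute-force `factorize` in the spirit of the one described above; this is my own hypothetical version (it skips even candidates, unlike ChatGPT's), and it is demonstrated on a smaller product of two primes because the 16-digit number needs tens of millions of loop iterations in pure Python.

```python
import math


def factorize(n):
    """Factor n by trial division; slow for large n, but simple and correct."""
    factors = []
    while n % 2 == 0:          # strip out factors of 2 first
        factors.append(2)
        n //= 2
    d = 3
    while d * d <= n:          # only odd candidates up to sqrt(n)
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 2
    if n > 1:                  # whatever remains is itself prime
        factors.append(n)
    return factors


n = 9999960800038127
claimed = [193, 518401, 3215031751]   # the "factors" in ChatGPT's comment

print(math.prod(claimed) == n)        # False: the claimed factors don't multiply to n
print(factorize(10007 * 10009))       # [10007, 10009]
```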

I’m not claiming that ChatGPT is useless–far from it. It’s good at suggesting ways to solve a problem, and can lead you to the right solution, whether or not it gives you a correct answer. Miller-Rabin is interesting; I knew it existed, but wouldn’t have bothered to look it up if I hadn’t been prompted. (That’s a nice irony: I was effectively prompted by ChatGPT.)

Getting back to the original question: ChatGPT is good at providing “answers” to questions, but if you need to know that an answer is correct, you must be capable either of solving the problem yourself or of doing the research you’d need to solve it. Getting a head start from ChatGPT is probably still a win, but you have to be wary. Don’t put ChatGPT in situations where correctness is an issue unless you’re willing and able to do the hard work yourself.
