Kevlin Henney and I recently discussed whether automated code generation, using some future version of GitHub Copilot or the like, could ever replace higher-level languages. Specifically, could ChatGPT N (for large N) quit the game of generating code in a high-level language like Python, and produce executable machine code directly, like compilers do today?
It’s not really an academic question. As coding assistants become more accurate, it seems reasonable to assume that they will eventually stop being “assistants” and take over the job of writing code. That will be a big change for professional programmers—though writing code is a small part of what programmers actually do. To some extent, it’s happening now: ChatGPT 4’s “Advanced Data Analysis” can generate code in Python, run it in a sandbox, collect error messages, and try to debug it. Google’s Bard has similar capabilities. Python is an interpreted language, so there’s no machine code, but there’s no reason this loop couldn’t incorporate a C or C++ compiler.
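The generate-run-collect-errors-retry loop described above can be sketched in a few lines of Python. This is purely illustrative: the `drafts` list stands in for successive model responses (a real assistant would call a model API and feed the captured errors back into the next prompt), and the hypothetical `run_candidate` helper simply executes each candidate in a subprocess sandbox.

```python
import subprocess
import sys

# Hypothetical stand-in for a model: successive "drafts" of generated code.
# A real assistant would call an LLM API here; the first draft has a bug,
# the second fixes it, mimicking the generate-test-debug cycle.
drafts = [
    "print(1 / 0)",    # draft 1: raises ZeroDivisionError
    "print(42 // 2)",  # draft 2: runs cleanly
]

def run_candidate(code: str) -> tuple[bool, str]:
    """Run generated code in a subprocess sandbox; return (ok, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=10
    )
    return result.returncode == 0, result.stderr

def generate_until_it_runs(drafts):
    for attempt, code in enumerate(drafts, start=1):
        ok, errors = run_candidate(code)
        if ok:
            return attempt, code
        # In a real loop, `errors` would be appended to the next prompt.
    raise RuntimeError("no draft ran successfully")

attempt, code = generate_until_it_runs(drafts)
print(f"succeeded on attempt {attempt}")
```

Note that “running cleanly” is a much weaker bar than “correct”—which is exactly why humans still have to test and debug the result.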
This kind of change has happened before: in the early days of computing, programmers “wrote” programs by plugging in wires, then by toggling in binary numbers, then by writing assembly language code, and finally (in the late 1950s) using early programming languages like COBOL (1959) and FORTRAN (1957). To people who programmed using circuit diagrams and switches, these early languages looked as radical as programming with generative AI looks today. COBOL was—literally—an early attempt to make programming as simple as writing English.
Kevlin made the point that higher-level languages are a “repository of determinism” that we can’t do without—at least, not yet. While a “repository of determinism” sounds a bit evil (feel free to come up with your own name), it’s important to understand why it is needed. At almost every stage of programming history, there has been a repository of determinism. When programmers wrote in assembly language, they had to look at the binary 1s and 0s to see exactly what the computer was doing. When programmers wrote in FORTRAN (or, for that matter, C), the repository of determinism moved higher: the source code expressed what programmers wanted and it was up to the compiler to deliver the correct machine instructions. However, the status of this repository was still shaky. Early compilers were not as reliable as we’ve come to expect. They had bugs, particularly if they were optimizing your code (were optimizing compilers a forerunner of AI?). Portability was problematic at best: every vendor had its own compiler, with its own quirks and its own extensions. Assembly was still the “court of last resort” for determining why your program didn’t work. The repository of determinism was only effective for a single vendor, computer, and operating system.1 The need to make higher-level languages deterministic across computing platforms drove the development of language standards and specifications.
These days, very few people need to know assembler. You need to know assembler for a few tricky situations when writing device drivers, or to work with some dark corners of the operating system kernel, and that’s about it. But while the way we program has changed, the structure of programming hasn’t. Especially with tools like ChatGPT and Bard, we still need a repository of determinism, but that repository is no longer assembly language. With C or Python, you can read a program and understand exactly what it does. If the program behaves in unexpected ways, it’s much more likely that you’ve misunderstood some corner of the language’s specification than that the C compiler or Python interpreter got it wrong. And that’s important: that’s what allows us to debug successfully. The source code tells us exactly what the computer is doing, at a reasonable layer of abstraction. If it’s not doing what we want, we can analyze the code and correct it. That may require rereading Kernighan and Ritchie, but it’s a tractable, well-understood problem. We no longer have to look at the machine language—and that’s a very good thing, because with instruction reordering, speculative execution, and long pipelines, understanding a program at the machine level is a lot more difficult than it was in the 1960s and 1970s. We need that layer of abstraction. But that abstraction layer must also be deterministic. It must be completely predictable. It must behave the same way every time you compile and run the program.
Why do we need the abstraction layer to be deterministic? Because we need a reliable statement of exactly what the software does. All of computing, including AI, rests on the ability of computers to do something reliably and repeatedly, millions, billions, or even trillions of times. If you don’t know exactly what the software does—or if it might do something different the next time you compile it—you can’t build a business around it. You certainly can’t maintain it, extend it, or add new features if it changes whenever you touch it, nor can you debug it.
Automated code generation doesn’t yet have the kind of reliability we expect from traditional programming; Simon Willison calls this “vibes-based development.” We still rely on humans to test and fix the errors. More to the point: you’re likely to generate code many times en route to a solution; you’re not likely to take the results of your first prompt and jump directly into debugging any more than you’re likely to write a complex program in Python and get it right the first time. Writing prompts for any significant software system isn’t trivial; the prompts can be very lengthy, and it takes several tries to get them right. With the current models, every time you generate code, you’re likely to get something different. (Bard even gives you several alternatives to choose from.) The process isn’t repeatable. How do you understand what the program is doing if it’s a different program each time you generate and test it? How do you know whether you’re progressing towards a solution if the next version of the program may be completely different from the previous one?
It’s tempting to think that this variation is controllable by setting a parameter like GPT-4’s “temperature” to 0; “temperature” controls the amount of variation (or originality, or unpredictability) between responses. But that doesn’t solve the problem. Temperature only works within limits, and one of those limits is that the prompt must remain constant. Change the prompt to help the AI generate correct or well-designed code, and you’re outside of those limits. Another limit is that the model itself can’t change—but models change all the time, and those changes aren’t under the programmer’s control. All models are eventually updated, and there’s no guarantee that the code produced will stay the same across updates to the model. An updated model is likely to produce completely different source code. That source code will need to be understood (and debugged) on its own terms.
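The limits of temperature-0 determinism can be made concrete with a toy “model”: a deterministic function of both the prompt and the model version. The `toy_model` below is purely illustrative, not a real LLM—it just shows that even with all sampling randomness removed, changing either the prompt or the model changes the output.

```python
import hashlib

def toy_model(prompt: str, model_version: str) -> str:
    """A stand-in for temperature-0 generation: fully deterministic,
    but a pure function of both the prompt and the model itself."""
    digest = hashlib.sha256(f"{model_version}:{prompt}".encode()).hexdigest()
    return digest[:12]  # pretend this is the generated code

prompt = "sort a list of integers"

# Determinism holds only while prompt and model are both frozen:
assert toy_model(prompt, "v1") == toy_model(prompt, "v1")

# Refining the prompt produces different output...
assert toy_model(prompt + " in place", "v1") != toy_model(prompt, "v1")

# ...and so does a model update, even with an identical prompt.
assert toy_model(prompt, "v2") != toy_model(prompt, "v1")
```

In other words, temperature 0 pins down the sampling, not the system: the moment you iterate on the prompt, or the vendor ships an update, the “same” request yields different code.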
So the natural language prompt can’t be the repository of determinism. This doesn’t mean that AI-generated code isn’t useful; it can provide a good starting point to work from. But at some point, programmers need to be able to reproduce and reason about bugs: that’s the point at which you need repeatability, and can’t tolerate surprises. Also at that point, programmers will have to refrain from regenerating the high-level code from the natural language prompt. The AI is effectively creating a first draft, and that may (or may not) save you effort, compared to starting from a blank screen. Adding features to go from version 1.0 to 2.0 raises a similar problem. Even the largest context windows can’t hold an entire software system, so it’s necessary to work one source file at a time—exactly the way we work now, but again, with the source code as the repository of determinism. Furthermore, it’s difficult to tell a language model what it’s allowed to change, and what should remain untouched: “modify this loop only, but not the rest of the file” may or may not work.
This argument doesn’t apply to coding assistants like GitHub Copilot. Copilot is aptly named: it’s an assistant to the pilot, not the pilot. You can tell it precisely what you want done, and where. When you use ChatGPT or Bard to write code, you’re not the pilot or the copilot; you’re the passenger. You can tell a pilot to fly you to New York, but from then on, the pilot is in control.
Will generative AI ever be good enough to skip the high-level languages and generate machine code? Can a prompt replace code in a high-level language? After all, we’re already seeing a tools ecosystem that has prompt repositories, no doubt with version control. It’s possible that generative AI will eventually be able to replace programming languages for day-to-day scripting (“Generate a graph from two columns of this spreadsheet”). But for larger programming projects, keep in mind that part of human language’s value is its ambiguity, and a programming language is valuable precisely because it isn’t ambiguous. As generative AI penetrates further into programming, we will undoubtedly see stylized dialects of human languages that have less ambiguous semantics; those dialects may even become standardized and documented. But “stylized dialects with less ambiguous semantics” is really just a fancy name for prompt engineering, and if you want precise control over the results, prompt engineering isn’t as simple as it seems. We still need a repository of determinism, a layer in the programming stack where there are no surprises, a layer that provides the definitive word on what the computer will do when the code executes. Generative AI isn’t up to that task. At least, not yet.
1. If you were in the computing industry in the 1980s, you may remember the need to “reproduce the behavior of VAX/VMS FORTRAN bug for bug.”