Much like the introduction of the personal computer, the internet, and the iPhone into the public sphere, recent developments in the AI space, from generative AI to agentic AI, have fundamentally changed the way people live and work. Since its release in late 2022, ChatGPT has reached 700 million users per week, approximately 10% of the global adult population. And according to a 2025 report by Capgemini, agentic AI adoption is expected to grow by 48% by the end of the year. It’s quite clear that this latest iteration of AI technology has transformed virtually every industry and profession, and data engineering is no exception.

As Naveen Sharma, SVP and global practice head at Cognizant, observes, “What makes data engineering uniquely pivotal is that it forms the foundation of modern AI systems, it’s where these models originate and what enables their intelligence.” Thus, it’s unsurprising that the latest advances in AI would have a sizable impact on the discipline, perhaps even an existential one. With the increased adoption of AI coding tools leading to the reduction of many entry-level IT positions, should data engineers be wary about a similar outcome for their own profession? Khushbu Shah, associate director at ProjectPro, poses this very question, noting that “we’ve entered a new phase of data engineering, one where AI tools don’t just support a data engineer’s work; they start doing it for you. . . . Where does that leave the data engineer? Will AI replace data engineers?”

Despite the growing tide of GenAI and agentic AI, data engineers won’t be replaced anytime soon. While the latest AI tools can help automate and complete rote tasks, data engineers are still very much needed to maintain and implement the infrastructure that houses data required for model training, build data pipelines that ensure accurate and accessible data, and monitor and enable model deployment. And as Shah points out, “Prompt-driven tools are great at writing code but they can’t reason about business logic, trade-offs in system design, or the subtle cost of a slow query in a production dashboard.” So while their customary daily tasks might shift with the increasing adoption of the latest AI tools, data engineers still have an important role to play in this technological revolution.

The Role of Data Engineers in the New AI Era

In order to adapt to this new era of AI, the most important thing data engineers can do involves a fairly self-evident shift in mindset. Simply put, data engineers need to understand AI and how data is used in AI systems. As Mike Loukides, VP of content strategy at O’Reilly, put it to me in a recent conversation, “Data engineering isn’t going away, but you won’t be able to do data engineering for AI if you don’t understand the AI part of the equation. And I think that’s where people will get stuck. They’ll think, ‘Same old same old,’ and it isn’t. A data pipeline is still a data pipeline, but you have to know what that pipeline is feeding.”


So how exactly is data used? Since all models require huge amounts of data for initial training, the first stage involves collecting raw data from various sources, be they databases, public datasets, or APIs. And since raw data is often unorganized or incomplete, preprocessing the data is necessary to prepare it for training, which involves cleaning, transforming, and organizing the data to make it suitable for the AI model. The next stage concerns training the model, where the preprocessed data is fed into the AI model to learn patterns, relationships, or features. After that there’s posttraining, where the model is fine-tuned with data important to the organization that’s building the model, a stage that also requires a significant amount of data. Related to this stage is the concept of retrieval-augmented generation (RAG), a technique that provides real-time, contextually relevant information to a model in order to improve the accuracy of responses.
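The preprocessing stage described above can be sketched in a few lines. This is a minimal illustration in plain Python, with hypothetical record fields (`text`, `label`) standing in for real source data rather than any particular framework’s API:

```python
def preprocess(records):
    """Clean raw records for training: drop incomplete rows,
    normalize text, and de-duplicate."""
    seen = set()
    cleaned = []
    for rec in records:
        # Drop records missing required fields.
        if not rec.get("text") or rec.get("label") is None:
            continue
        # Normalize: collapse whitespace, lowercase.
        text = " ".join(rec["text"].split()).lower()
        # De-duplicate on the normalized text.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append({"text": text, "label": rec["label"]})
    return cleaned

raw = [
    {"text": "  Invoice  overdue ", "label": 1},
    {"text": "invoice overdue", "label": 1},   # duplicate after normalization
    {"text": "", "label": 0},                  # incomplete record
    {"text": "Payment received", "label": 0},
]
print(preprocess(raw))
```

Real pipelines would layer schema enforcement, type coercion, and outlier handling on top, but the shape of the work, filtering, normalizing, and de-duplicating before a model ever sees the data, is the same.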

Other important ways that data engineers can adapt to this new environment and help support current AI initiatives are by improving and maintaining high data quality, designing robust pipelines and operational systems, and ensuring that privacy and security measures are met.

In his testimony to a US House of Representatives committee on the topic of AI innovation, Gecko Robotics cofounder Troy Demmer affirmed a golden axiom of the industry: “AI applications are only as good as the data they are trained on. Trustworthy AI requires trustworthy data inputs.” Poor data quality is a leading reason why roughly 85% of all AI projects fail, and many AI professionals flag it as a major source of concern: without high-quality data, even the most sophisticated models and AI agents can go awry. Since most GenAI models depend upon large datasets to function, data engineers are needed to process and structure this data so that it’s clean, labeled, and relevant, ensuring reliable AI outputs.
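In practice, maintaining data quality usually means automated checks that run before data reaches a model. The following is a simplified sketch of such a check; the field names, required columns, and range thresholds are assumptions for illustration:

```python
def quality_report(rows, required=("id", "amount")):
    """Return counts of rows passing or failing basic quality rules."""
    report = {"missing_field": 0, "bad_range": 0, "ok": 0}
    for row in rows:
        # Rule 1: every required field must be present and non-null.
        if any(row.get(f) is None for f in required):
            report["missing_field"] += 1
        # Rule 2: amounts must fall within a plausible range.
        elif not (0 <= row["amount"] <= 1_000_000):
            report["bad_range"] += 1
        else:
            report["ok"] += 1
    return report

rows = [
    {"id": 1, "amount": 250.0},
    {"id": 2, "amount": None},    # missing value
    {"id": 3, "amount": -40.0},   # out of range
]
print(quality_report(rows))
```

Dedicated frameworks such as Great Expectations formalize this pattern, but the underlying idea, codified expectations checked continuously against incoming data, is what keeps downstream models trustworthy.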

Just as importantly, data engineers need to design and build newer, more robust pipelines and infrastructure that can scale with GenAI requirements. As Adi Polak, director of AI and data streaming at Confluent, notes, “the next generation of AI systems requires real-time context and responsive pipelines that support autonomous decisions across distributed systems,” well beyond traditional data pipelines that can only support batch-trained models or power reports. Instead, data engineers are now tasked with creating nimbler pipelines that can process and support real-time streaming data for inference, historical data for model fine-tuning, versioning, and lineage tracking. They also must have a firm grasp of streaming patterns and concepts, from event-driven architecture to retrieval and feedback loops, in order to build high-throughput pipelines that can support AI agents.
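To make the event-driven pattern concrete, here is a toy sketch in which an in-memory queue stands in for a real broker such as Kafka; the event shape, handler, and context store are all illustrative assumptions, not a production design:

```python
from queue import Queue

def handle_event(event, context_store):
    """Enrich an incoming event with stored context, as a pipeline
    might before handing it to a model for inference."""
    user = event["user_id"]
    history = context_store.get(user, [])
    enriched = {**event, "history": list(history)}
    # Feedback loop: each event becomes context for later events.
    context_store.setdefault(user, []).append(event["action"])
    return enriched

# Simulate a stream of user events arriving on a queue.
events = Queue()
for action in ("viewed_pricing", "opened_ticket"):
    events.put({"user_id": "u1", "action": action})

store, out = {}, []
while not events.empty():
    out.append(handle_event(events.get(), store))
print(out[-1]["history"])
```

The key property the sketch illustrates is that each event is processed as it arrives, with context accumulated incrementally, rather than waiting for a nightly batch job to assemble it.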


While GenAI’s utility is indisputable at this point, the technology is saddled with notable drawbacks. Hallucinations are most likely to occur when a model doesn’t have the proper data it needs to answer a given question. Like many systems that rely on vast streams of information, the latest AI systems are not immune to private data exposure, biased outputs, and intellectual property misuse. Thus, it’s up to data engineers to ensure that the data used by these systems is properly governed and secured, and that the systems themselves comply with relevant data and AI regulations. As data engineer Axel Schwanke astutely notes, these measures may include “limiting the use of large models to specific data sets, users and applications, documenting hallucinations and their triggers, and ensuring that GenAI applications disclose their data sources and provenance when they generate responses,” as well as sanitizing and validating all GenAI inputs and outputs. One system that addresses the latter measures is O’Reilly Answers, one of the first models to provide citations for the content it quotes.
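The sanitization and validation measures Schwanke describes might look something like the following sketch. The PII pattern, redaction policy, and citation check are simplified assumptions for illustration; real systems would cover far more patterns and enforce provenance more rigorously:

```python
import re

# Illustrative PII pattern: email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize_input(prompt):
    """Redact obvious PII before the prompt reaches a model."""
    return EMAIL.sub("[REDACTED_EMAIL]", prompt)

def validate_output(response, allowed_sources):
    """Require that a generated answer cites an approved source."""
    return any(src in response for src in allowed_sources)

prompt = "Summarize the complaint from jane.doe@example.com about billing."
clean = sanitize_input(prompt)
print(clean)

ok = validate_output(
    "Billing terms changed [source: docs/billing-policy]",
    ["docs/billing-policy"],
)
print(ok)
```

Placing checks like these on both sides of the model, inputs scrubbed before inference and outputs verified before delivery, is one practical way data engineers can enforce governance around systems they don’t fully control.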

The Road Ahead

Data engineers should remain gainfully employed as the next generation of AI continues on its upward trajectory, but that doesn’t mean there aren’t significant challenges around the corner. As autonomous agents continue to evolve, questions regarding the best infrastructure and tools to support them have arisen. As Ben Lorica ponders, “What does this mean for our data infrastructure? We are designing intelligent, autonomous systems on top of databases built for predictable, human-driven interactions. What happens when software that writes software also provisions and manages its own data? This is an architectural mismatch waiting to happen, and one that demands a new generation of tools.” One such potential tool has already arisen in the form of AgentDB, a database designed specifically to work effectively with AI agents.


In a similar vein, a recent research paper, “Supporting Our AI Overlords,” opines that data systems must be redesigned to be agent-first. Building upon this argument, Ananth Packkildurai observes that “it’s tempting to believe that the Model Context Protocol (MCP) and tool integration layers solve the agent-data mismatch problem. . . . However, these improvements don’t address the fundamental architectural mismatch. . . . The core issue remains: MCP still primarily exposes existing APIs—precise, single-purpose endpoints designed for human or application use—to agents that operate fundamentally differently.” Whatever the outcome of this debate may be, data engineers will likely help shape the future underlying infrastructure used to support autonomous agents.

Another challenge for data engineers will be successfully navigating the ever-shifting landscape of data privacy and AI regulations, particularly in the US. With the One Big Beautiful Bill Act leaving AI regulation under the aegis of individual state laws, data engineers need to keep abreast of any local legislation that might impact their company’s data use for AI initiatives, such as the recently signed SB 53 in California, and adjust their data governance strategies accordingly. Furthermore, what data is used and how it’s sourced should always be top of mind, with Anthropic’s recent settlement of a copyright infringement lawsuit serving as a stark reminder of that imperative.

Lastly, the rapid momentum of the latest AI has led to an explosion of new tools and platforms. While data engineers are responsible for keeping up with these innovations, that can be easier said than done, given steep learning curves and the gap between the time required to truly upskill in a tool and AI’s relentless pace of change. It’s a precarious balancing act, and one that data engineers must master quickly in order to stay relevant.

Despite these challenges, however, the future outlook of the profession isn’t doom and gloom. While the field will undergo massive changes in the near future due to AI innovation, it will still be recognizably data engineering, as even technology like GenAI requires clean, governed data and the underlying infrastructure to support it. Rather than being replaced, data engineers are more likely to emerge as key players in the grand design of an AI-forward future.
