O’Reilly Media – Structural Evolutions in Data

I am wired to constantly ask “what’s next?” Sometimes, the answer is: “more of the same.”

That came to mind when a friend raised a point about emerging technology’s fractal nature. Across one story arc, they said, we often see several structural evolutions—smaller-scale versions of that wider phenomenon.

Cloud computing? It progressed from “raw compute and storage” to “reimplementing key services in push-button fashion” to “becoming the backbone of AI work”—all under the umbrella of “renting time and storage on someone else’s computers.” Web3 has similarly progressed through “basic blockchain and cryptocurrency tokens” to “decentralized finance” to “NFTs as loyalty cards.” Each step has been a twist on “what if we could write code to interact with a tamper-resistant ledger in real-time?”

Most recently, I’ve been thinking about this in terms of the space we currently call “AI.” I’ve called out the data field’s rebranding efforts before; but even then, I acknowledged that these weren’t just new coats of paint. Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.”

Consider the structural evolutions of that theme:

Stage 1: Hadoop and Big Data

By 2008, many companies found themselves at the intersection of “a steep increase in online activity” and “a sharp decline in costs for storage and computing.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize. All they needed was a tool that could handle the massive workload. And Hadoop rolled in.

In short order, it was tough to get a data job if you didn’t have some Hadoop behind your name. And harder to sell a data-related product unless it spoke to Hadoop. The elephant was unstoppable.

Until it wasn’t.

Hadoop’s value—being able to crunch large datasets—often paled in comparison to its costs. A basic, production-ready cluster priced out to the low-six-figures. A company then needed to train up their ops team to manage the cluster, and their analysts to express their ideas in MapReduce. Plus there was all of the infrastructure to push data into the cluster in the first place.

If you weren’t in the terabytes-a-day club, you really had to take a step back and ask what this was all for. Doubly so as hardware improved, eating away at the lower end of Hadoop-worthy work.

And then there was the other problem: for all the fanfare, Hadoop was really large-scale business intelligence (BI).

(Enough time has passed; I think we can now be honest with ourselves. We built an entire industry by … repackaging an existing industry. This is the power of marketing.)

Don’t get me wrong. BI is useful. I’ve sung its praises time and again. But the grouping and summarizing just wasn’t exciting enough for the data addicts. They’d grown tired of learning what is; now they wanted to know what’s next.

Stage 2: Machine learning models

Hadoop could kind of do ML, thanks to third-party tools. But in its early form of a Hadoop-based ML library, Mahout still required data scientists to write in Java. And it (wisely) stuck to implementations of industry-standard algorithms. If you wanted ML beyond what Mahout provided, you had to frame your problem in MapReduce terms. Mental contortions led to code contortions led to frustration. And, often, to giving up.

(After coauthoring Parallel R I gave a number of talks on using Hadoop. A common audience question was “can Hadoop run [my arbitrary analysis job or home-grown algorithm]?” And my answer was a qualified yes: “Hadoop could theoretically scale your job. But only if you or someone else will take the time to implement that approach in MapReduce.” That didn’t go over well.)

Goodbye, Hadoop. Hello, R and scikit-learn. A typical data job interview now skipped MapReduce in favor of white-boarding k-means clustering or random forests.

And it was good. For a few years, even. But then we hit another hurdle.

While data scientists were no longer handling Hadoop-sized workloads, they were trying to build predictive models on a different kind of “large” dataset: so-called “unstructured data.” (I prefer to call that “soft numbers,” but that’s another story.) A single document may represent thousands of features. An image? Millions.

Similar to the dawn of Hadoop, we were back to problems that existing tools could not solve.

The solution led us to the next structural evolution. And that brings our story to the present day:

Stage 3: Neural networks

High-end video games required high-end video cards. And since the cards couldn’t tell the difference between “matrix algebra for on-screen display” and “matrix algebra for machine learning,” neural networks became computationally feasible and commercially viable. It felt like, almost overnight, all of machine learning took on some kind of neural backend. Those algorithms packaged with scikit-learn? They were unceremoniously relabeled “classical machine learning.”

There’s as much Keras, TensorFlow, and Torch today as there was Hadoop back in 2010-2012. The data scientist—sorry, “machine learning engineer” or “AI specialist”—job interview now involves one of those toolkits, or one of the higher-level abstractions such as HuggingFace Transformers.

And just as we started to complain that the crypto miners were snapping up all of the affordable GPU cards, cloud providers stepped up to offer access on-demand. Between Google (Vertex AI and Colab) and Amazon (SageMaker), you can now get all of the GPU power your credit card can handle. Google goes a step further in offering compute instances with its specialized TPU hardware.

Not that you’ll even need GPU access all that often. A number of groups, from small research teams to tech behemoths, have used their own GPUs to train on large, interesting datasets and they give those models away for free on sites like TensorFlow Hub and Hugging Face Hub. You can download these models to use out of the box, or employ minimal compute resources to fine-tune them for your particular task.

You see the extreme version of this pretrained model phenomenon in the large language models (LLMs) that drive tools like Midjourney or ChatGPT. The overall idea of generative AI is to get a model to create content that could have reasonably fit into its training data. For a sufficiently large training dataset—say, “billions of online images” or “the entirety of Wikipedia”—a model can pick up on the kinds of patterns that make its outputs seem eerily lifelike.

Since we’re covered as far as compute power, tools, and even prebuilt models, what are the frictions of GPU-enabled ML? What will drive us to the next structural iteration of Analyzing Data for Fun and Profit?

Stage 4? Simulation

Given the progression thus far, I think the next structural evolution of Analyzing Data for Fun and Profit will involve a new appreciation for randomness. Specifically, through simulation.

You can see a simulation as a temporary, synthetic environment in which to test an idea. We do this all the time, when we ask “what if?” and play it out in our minds. “What if we leave an hour earlier?” (We’ll miss rush hour traffic.) “What if I bring my duffel bag instead of the roll-aboard?” (It will be easier to fit in the overhead storage.) That works just fine when there are only a few possible outcomes, across a small set of parameters.

Once we’re able to quantify a situation, we can let a computer run “what if?” scenarios at industrial scale. Millions of tests, across as many parameters as will fit on the hardware. It’ll even summarize the results if we ask nicely. That opens the door to a number of possibilities, three of which I’ll highlight here:

Moving beyond from point estimates

Let’s say an ML model tells us that this house should sell for $744,568.92. Great! We’ve gotten a machine to make a prediction for us. What more could we possibly want?

Context, for one. The model’s output is just a single number, a point estimate of the most likely price. What we really want is the spread—the range of likely values for that price. Does the model think the correct price falls between $743k-$746k? Or is it more like $600k-$900k? You want the former case if you’re trying to buy or sell that property.

Bayesian data analysis, and other techniques that rely on simulation behind the scenes, offer additional insight here. These approaches vary some parameters, run the process a few million times, and give us a nice curve that shows how often the answer is (or, “is not”) close to that $744k.

Similarly, Monte Carlo simulations can help us spot trends and outliers in potential outcomes of a process. “Here’s our risk model. Let’s assume these ten parameters can vary, then try the model with several million variations on those parameter sets. What can we learn about the potential outcomes?” Such a simulation could reveal that, under certain specific circumstances, we get a case of total ruin. Isn’t it nice to uncover that in a simulated environment, where we can map out our risk mitigation strategies with calm, level heads?

Moving beyond point estimates is very close to present-day AI challenges. That’s why it’s a likely next step in Analyzing Data for Fun and Profit. In turn, that could open the door to other techniques:

New ways of exploring the solution space

If you’re not familiar with evolutionary algorithms, they’re a twist on the traditional Monte Carlo approach. In fact, they’re like several small Monte Carlo simulations run in sequence. After each iteration, the process compares the results to its fitness function, then mixes the attributes of the top performers. Hence the term “evolutionary”—combining the winners is akin to parents passing a mix of their attributes on to progeny. Repeat this enough times and you may just find the best set of parameters for your problem.

(People familiar with optimization algorithms will recognize this as a twist on simulated annealing: start with random parameters and attributes, and narrow that scope over time.)

A number of scholars have tested this shuffle-and-recombine-till-we-find-a-winner approach on timetable scheduling. Their research has applied evolutionary algorithms to groups that need efficient ways to manage finite, time-based resources such as classrooms and factory equipment. Other groups have tested evolutionary algorithms in drug discovery. Both situations benefit from a technique that optimizes the search through a large and daunting solution space.

The NASA ST5 antenna is another example. Its bent, twisted wire stands in stark contrast to the straight aerials with which we are familiar. There’s no chance that a human would ever have come up with it. But the evolutionary approach could, in part because it was not limited by human sense of aesthetic or any preconceived notions of what an “antenna” could be. It just kept shuffling the designs that satisfied its fitness function until the process finally converged.

Taming complexity

Complex adaptive systems are hardly a new concept, though most people got a harsh introduction at the start of the Covid-19 pandemic. Cities closed down, supply chains snarled, and people—independent actors, behaving in their own best interests—made it worse by hoarding supplies because they thought distribution and manufacturing would never recover. Today, reports of idle cargo ships and overloaded seaside ports remind us that we shifted from under- to over-supply. The mess is far from over.

What makes a complex system troublesome isn’t the sheer number of connections. It’s not even that many of those connections are invisible because a person can’t see the entire system at once. The problem is that those hidden connections only become visible during a malfunction: a failure in Component B affects not only neighboring Components A and C, but also triggers disruptions in T and R. R’s issue is small on its own, but it has just led to an outsized impact in Φ and Σ.

(And if you just asked “wait, how did Greek letters get mixed up in this?” then … you get the point.)

Our current crop of AI tools is powerful, yet ill-equipped to provide insight into complex systems. We can’t surface these hidden connections using a collection of independently-derived point estimates; we need something that can simulate the entangled system of independent actors moving all at once.

This is where agent-based modeling (ABM) comes into play. This technique simulates interactions in a complex system. Similar to the way a Monte Carlo simulation can surface outliers, an ABM can catch unexpected or unfavorable interactions in a safe, synthetic environment.

Financial markets and other economic situations are prime candidates for ABM. These are spaces where a large number of actors behave according to their rational self-interest, and their actions feed into the system and affect others’ behavior. According to practitioners of complexity economics (a study that owes its origins to the Sante Fe Institute), traditional economic modeling treats these systems as though they run in an equilibrium state and therefore fails to identify certain kinds of disruptions. ABM captures a more realistic picture because it simulates a system that feeds back into itself.

Smoothing the on-ramp

Interestingly enough, I haven’t mentioned anything new or ground-breaking. Bayesian data analysis and Monte Carlo simulations are common in finance and insurance. I was first introduced to evolutionary algorithms and agent-based modeling more than fifteen years ago. (If memory serves, this was shortly before I shifted my career to what we now call AI.) And even then I was late to the party.

So why hasn’t this next phase of Analyzing Data for Fun and Profit taken off?

For one, this structural evolution needs a name. Something to distinguish it from “AI.” Something to market. I’ve been using the term “synthetics,” so I’ll offer that up. (Bonus: this umbrella term neatly includes generative AI’s ability to create text, images, and other realistic-yet-heretofore-unseen data points. So we can ride that wave of publicity.)

Next up is compute power. Simulations are CPU-heavy, and sometimes memory-bound. Cloud computing providers make that easier to handle, though, so long as you don’t mind the credit card bill. Eventually we’ll get simulation-specific hardware—what will be the GPU or TPU of simulation?—but I think synthetics can gain traction on existing gear.

The third and largest hurdle is the lack of simulation-specific frameworks. As we surface more use cases—as we apply these techniques to real business problems or even academic challenges—we’ll improve the tools because we’ll want to make that work easier. As the tools improve, that reduces the costs of trying the techniques on other use cases. This kicks off another iteration of the value loop. Use cases tend to magically appear as techniques get easier to use.

If you think I’m overstating the power of tools to spread an idea, imagine trying to solve a problem with a new toolset while also creating that toolset at the same time. It’s tough to balance those competing concerns. If someone else offers to build the tool while you use it and road-test it, you’re probably going to accept. This is why these days we use TensorFlow or Torch instead of hand-writing our backpropagation loops.

Today’s landscape of simulation tooling is uneven. People doing Bayesian data analysis have their choice of two robust, authoritative offerings in Stan and PyMC3, plus a variety of books to understand the mechanics of the process. Things fall off after that. Most of the Monte Carlo simulations I’ve seen are of the hand-rolled variety. And a quick survey of agent-based modeling and evolutionary algorithms turns up a mix of proprietary apps and nascent open-source projects, some of which are geared for a particular problem domain.

As we develop the authoritative toolkits for simulations—the TensorFlow of agent-based modeling and the Hadoop of evolutionary algorithms, if you will—expect adoption to grow. Doubly so, as commercial entities build services around those toolkits and rev up their own marketing (and publishing, and certification) machines.

Time will tell

My expectations of what to come are, admittedly, shaped by my experience and clouded by my interests. Time will tell whether any of this hits the mark.

A change in business or consumer appetite could also send the field down a different road. The next hot device, app, or service will get an outsized vote in what companies and consumers expect of technology.

Still, I see value in looking for this field’s structural evolutions. The wider story arc changes with each iteration to address changes in appetite. Practitioners and entrepreneurs, take note.

Job-seekers should do the same. Remember that you once needed Hadoop on your résumé to merit a second look; nowadays it’s a liability. Building models is a desired skill for now, but it’s slowly giving way to robots. So do you really think it’s too late to join the data field? I think not.

Keep an eye out for that next wave. That’ll be your time to jump in.