O’Reilly Media – Automating the Automators: Shift Change in the Robot Factory

What would you say is the job of a software developer? A layperson, an entry-level developer, or even someone who hires developers will tell you that job is to … well … write software. Pretty simple.

An experienced practitioner will tell you something very different. They’d say that the job involves writing some software, sure. But deep down it’s about the purpose of software. Figuring out what kinds of problems are amenable to automation through code. Knowing what to build, and sometimes what not to build because it won’t provide value.

They may even summarize it as: “my job is to spot for() loops and if/then statements in the wild.”

I, thankfully, learned this early in my career, at a time when I could still refer to myself as a software developer. Companies build or buy software to automate human labor, allowing them to eliminate existing jobs or help teams to accomplish more. So it behooves a software developer to spot what portions of human activity can be properly automated away through code, and then build that.

This mindset has followed me into my work in ML/AI. Because if companies use code to automate business rules, they use ML/AI to automate decisions.

Given that, what would you say is the job of a data scientist (or ML engineer, or any other such title)?

I’ll share my answer in a bit. But first, let’s talk about the typical ML workflow.

Building Models

A common task for a data scientist is to build a predictive model. You know the drill: pull some data, carve it up into features, feed it into one of scikit-learn’s various algorithms. The first go-round never produces a great result, though. (If it does, you suspect that the variable you’re trying to predict has mixed in with the variables used to predict it. This is what’s known as a “feature leak.”) So now you tweak the classifier’s parameters and try again, in search of improved performance. You’ll try this with a few other algorithms, and their respective tuning parameters–maybe even break out TensorFlow to build a custom neural net along the way–and the winning model will be the one that heads to production.
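A suspiciously good first result is easy to reproduce on toy data. The sketch below is only an illustration, not a recipe from any particular project: the dataset is synthetic, and `leaked_column` is a hypothetical feature deliberately built from the target so that the leak shows up in the cross-validation scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy classification problem (an assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3,
    flip_y=0.2, class_sep=0.5, random_state=0,
)

# Hypothetical leak: a column that is just the target plus a little noise.
rng = np.random.default_rng(0)
leaked_column = y + rng.normal(scale=0.01, size=y.shape)
X_leaky = np.column_stack([X, leaked_column])

model = DecisionTreeClassifier(max_depth=3, random_state=0)
clean_score = cross_val_score(model, X, y, cv=5).mean()
leaky_score = cross_val_score(model, X_leaky, y, cv=5).mean()

print(f"without the leaked column: {clean_score:.3f}")
print(f"with the leaked column:    {leaky_score:.3f}")
```

When a first pass scores like the second line, it is worth auditing the feature list before celebrating.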

You might say that the outcome of this exercise is a performant predictive model. That’s sort of true. But like the question about the role of the software developer, there’s more to see here.

First, your attempts collectively teach you about your data and its relation to the problem you’re trying to solve. Think about what the model results tell you: “Maybe a random forest isn’t the best tool to split this data, but XLNet is.” If none of your models performed well, that tells you that your dataset–your choice of raw data, feature selection, and feature engineering–is not amenable to machine learning. Perhaps you need a different raw dataset from which to start. Or the necessary features simply aren’t available in any data you’ve collected, because this problem requires the kind of nuance that comes with a long career history in this problem domain. I’ve found this learning to be a valuable, though often understated and underappreciated, aspect of developing ML models.

Second, this exercise in model-building was … rather tedious? I’d file it under “dull, repetitive, and predictable,” which are my three cues that it’s time to automate a task.

Dull: You’re not here for the model itself; you’re after the results. How well did it perform? What does that teach me about my data?

Repetitive: You’re trying several algorithms, but doing roughly the same thing each time.

Predictable: The scikit-learn classifiers share a similar interface, so you can invoke the same fit() call on each one while passing in the same training dataset.

Yes, this calls for a for() loop. And data scientists who came from a software development background have written similar loops over the years. Eventually they stumble across GridSearchCV, which accepts an estimator and a grid of parameter combinations to try. The path is the same either way: set up, start the job, walk away. Get your results in a few hours.
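For readers who have not written that loop themselves, here is a minimal sketch of what it tends to look like. Everything specific is an assumption made for illustration: the data is synthetic, and the shortlist of algorithms and parameter grids in `candidates` is hypothetical. The scikit-learn calls themselves (GridSearchCV, fit(), score()) are the standard ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for "pull some data, carve it up into features".
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical shortlist: each estimator is paired with the parameter grid to search.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, param_grid) in candidates.items():
    # Same interface every time: build the search, call fit(), read the scores.
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
    }

# Pick the winner by cross-validated accuracy; the held-out score is a sanity check.
best_name = max(results, key=lambda n: results[n]["cv_accuracy"])
for name, summary in results.items():
    print(f"{name}: cv={summary['cv_accuracy']:.3f} "
          f"test={summary['test_accuracy']:.3f} params={summary['best_params']}")
print("winner by cross-validated accuracy:", best_name)
```

The loop body never changes from one algorithm to the next, which is exactly what makes it a candidate for automation.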

Building a Better for() loop for ML

All of this leads us to automated machine learning, or autoML. There are various implementations–from the industrial-grade AWS SageMaker Autopilot and Google Cloud Vertex AI, to offerings from smaller players–but, in a nutshell, some developers spotted that same for() loop and built a slick UI on top. Upload your data, click through a workflow, walk away. Get your results in a few hours.

If you’re a professional data scientist, you already have the knowledge and skills to test these models. Why would you want autoML to build models for you?

It buys time and breathing room. An autoML solution may produce a “good enough” model in just a few hours. At best, you’ll get a model you can put in production right now (short time-to-market), buying your team the time to custom-tune something else (to get better performance). At worst, the model’s performance is terrible, but it only took a few mouse clicks to determine that this problem is hairier than you’d anticipated. Or that, just maybe, your training data is no good for the challenge at hand.

It’s convenient. Damn convenient. Especially when you consider how Certain Big Cloud Providers treat autoML as an on-ramp to model hosting. It takes a few clicks to build the model, then another few clicks to expose it as an endpoint for use in production. (Is autoML the bait for long-term model hosting? Could be. But that’s a story for another day.) Related to the previous point, a company could go from “raw data” to “it’s serving predictions on live data” in a single work day.

You have other work to do. You’re not just building those models for the sake of building them. You need to coordinate with stakeholders and product managers to suss out what kinds of models you need and how to embed them into the company’s processes. And hopefully they’re not specifically asking you for a model, but asking you to use the company’s data to address a challenge. You need to spend some quality time understanding all of that data through the lens of the company’s business model. That will lead to additional data cleaning, feature selection, and feature engineering. Those require the kind of context and nuance that the autoML tools don’t (and can’t) have.

Software Is Hungry, May as Well Feed It

Remember the old Marc Andreessen line that software is eating the world?

More and more major businesses and industries are being run on software and delivered as online services — from movies to agriculture to national defense. Many of the winners are Silicon Valley-style entrepreneurial technology companies that are invading and overturning established industry structures. Over the next 10 years, I expect many more industries to be disrupted by software, with new world-beating Silicon Valley companies doing the disruption in more cases than not.

This was the early days of developers spotting those for() loops and if/then constructs in the wild. If your business relied on a hard-and-fast rule, or a predictable sequence of events, someone was bound to write code to do the work and throw that on a few dozen servers to scale it out.

And it made sense. People didn’t like performing the drudge work. Getting software to take the not-so-fun parts separated duties according to ability: tireless repetition to the computers, context and special attention to detail to the humans.

Andreessen wrote that piece more than a decade ago, but it still holds. Software continues to eat the world’s dull, repetitive, predictable tasks. Which is why software is eating AI.

(Don’t feel bad. AI is also eating software, as with GitHub’s Copilot. Not to mention, some forms of creative expression. Stable Diffusion, anyone? The larger lesson here is that automation is a hungry beast. As we develop new tools for automation, we will bring more tasks within automation’s reach.)

Given that, let’s say that you’re a data scientist in a company that’s adopted an autoML tool. Fast-forward a few months. What’s changed?

Your Team Looks Different

Introducing autoML into your workflows has highlighted three roles on your data team. The first is the data scientist who came from a software development background, someone who’d probably be called a “machine learning engineer” in many companies. This person is comfortable talking to databases to pull data, then calling Pandas to transform it. In the past they understood the APIs of TensorFlow and Torch to build models by hand; today they are fluent in the autoML vendor’s APIs to train models, and they understand how to review the metrics.

The second is the experienced ML professional who really knows how to build and tune models. That model from the autoML service is usually good, but not great, so the company still needs someone who can roll up their sleeves and squeeze out the last few percentage points of performance. Tool vendors make their money by scaling a solution across the most common challenges, right? That leaves plenty of niches the popular autoML solutions can’t or won’t handle. If a problem calls for a shiny new technique, or a large, branching neural network, someone on your team needs to handle that.

Closely related is the third role, someone with a strong research background. When the well-known, well-supported algorithms no longer cut the mustard, you’ll need to either invent something out of whole cloth or translate ideas out of a research paper. Your autoML vendor won’t offer that solution for another couple of years, so it’s your problem to solve if you need it today.

Notice that a sufficiently experienced person may fulfill multiple roles here. It’s also worth mentioning that a large shop probably needed people in all three roles even before autoML was a thing.

(If we twist that around: aside from the FAANGs and hedge funds, few companies have both the need and the capital to fund an ongoing ML research function. This kind of department provides very lumpy returns–the occasional big win that punctuates long stretches of “we’re looking into it.”)

That takes us to a conspicuous omission from that list of roles: the data scientists who focused on building basic models. AutoML tools are doing most of that work now, in the same way that the basic dashboards or visualizations are now the domain of self-service tools like AWS QuickSight, Google Data Studio, or Tableau. Companies will still need advanced ML modeling and data viz, sure. But that work goes to the advanced practitioners.

In fact, just about all of the data work is best suited for the advanced folks. AutoML really took a bite out of your entry-level hires. There’s just not much for them to do. Only the larger shops have the bandwidth to really bring someone up to speed.

That said, even though the team structure has changed, you still have a data team when using an autoML solution. A company that is serious about doing ML/AI needs data scientists, machine learning engineers, and the like.

You Have Refined Your Notion of “IP”

The code written to create most ML models was already a commodity. We’re all calling into the same Pandas, scikit-learn, TensorFlow, and Torch libraries, and we’re doing the same “convert data into tabular format, then feed to the algorithm” dance. The code we write looks very similar across companies and even industries, since so much of it is based on those open-source tools’ call semantics.

If you see your ML models as the sum total of algorithms, glue code, and training data, then the harsh reality is that your data was the only unique intellectual property in the mix anyway. (And that’s only if you were building on proprietary data.) In machine learning, your competitive edge lies in business know-how and ability to execute. It does not exist in the code.

AutoML drives this point home. Instead of invoking the open-source scikit-learn or Keras calls to build models, your team now goes from Pandas data transforms straight to … the API calls for AWS SageMaker Autopilot or GCP Vertex AI. The for() loop that actually builds and evaluates the models now lives on someone else’s systems. And it’s available to everyone.

Your Job Has Changed

Building models is still part of the job, in the same way that developers still write a lot of code. While you called it “training an ML model,” developers saw “a for() loop that you’re executing by hand.” It’s time to let code handle that first pass at building models and let your role shift accordingly.

What does that mean, then? I’ll finally deliver on the promise I made in the introduction. As far as I’m concerned, the role of the data scientist (and ML engineer, and so on) is built on three pillars:

Translating to numbers and back: ML models only see numbers, so machine learning is a numbers-in, numbers-out game. Companies need people who can translate real-world concepts into numbers (to properly train the models) and then translate the models’ numeric outputs back into a real-world context (to make business decisions). Your model says “the price of this house should be $542,424.86”? Great. Now it’s time to explain to stakeholders how the model came to that conclusion, and how much faith they should put in the model’s answer.

Understanding where and why the models break down: Closely related to the previous point is that models are, by definition, imperfect representations of real-world phenomena. When looking through the lens of your company’s business model, what is the impact of this model being incorrect? (That is: what model risk does the company face?)

My friend Roger Magoulas reminded me of the old George Box quote that “all models are wrong, but some are useful.” Roger emphasized that we must consider the full quote, which is:

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.

Spotting ML opportunities in the wild: Machine learning does four things well: prediction (continuous outputs), classification (discrete outputs), grouping things (“what’s similar?”), and catching outliers (“where’s the weird stuff?”). In the same way that a developer can spot for() loops in the wild, experienced data scientists are adept at spotting those four use cases. They can tell when a predictive model is a suitable fit to augment or replace human activity, and more importantly, when it’s not.

Sometimes this is as straightforward as seeing where a model could guide people. Say you overhear the sales team describing how they lose so much time chasing down leads that don’t work. The wasted time means they miss leads that probably would have panned out. “You know … Do you have a list of past leads and how they went? And are you able to describe them based on a handful of attributes? I could build a model to label a deal as a go/no-go. You could use the probabilities emitted alongside those labels to prioritize your calls to prospects.”
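A rough sketch of that conversation in code might look like the following. All of the data is made up: the three lead attributes, the past outcomes, and the names in `open_leads` are hypothetical placeholders, and logistic regression is just one reasonable choice of classifier that exposes probabilities through predict_proba().

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical past leads: [company size (hundreds of staff), prior contacts, inbound? (1/0)]
past_leads = np.array([
    [1, 0, 0], [2, 1, 0], [3, 0, 0], [4, 1, 0], [2, 0, 1], [5, 2, 0],
    [8, 3, 1], [9, 4, 1], [7, 2, 1], [10, 5, 1], [6, 4, 0], [12, 3, 1],
])
# 1 = the deal closed, 0 = it did not (invented outcomes for illustration).
outcomes = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(past_leads, outcomes)

# Hypothetical open leads the sales team has not called yet.
open_leads = {
    "lead_a": [2, 0, 0],
    "lead_b": [9, 3, 1],
    "lead_c": [5, 2, 1],
    "lead_d": [11, 1, 0],
}
names = list(open_leads)
features = np.array([open_leads[name] for name in names])

labels = model.predict(features)                       # go (1) / no-go (0)
win_probability = model.predict_proba(features)[:, 1]  # probability of class 1

# Call the most promising leads first.
call_order = sorted(zip(names, labels, win_probability), key=lambda row: row[2], reverse=True)
for name, label, prob in call_order:
    print(f"{name}: {'go' if label == 1 else 'no-go'} (p={prob:.2f})")
```

The label answers "should we bother?", while the probability gives the team an ordering for the calls they do make.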

Other times it’s about freeing people from mind-numbing work, like watching security cameras. “What if we build a model to detect motion in the video feed? If we wire that into an alerts system, our staff could focus on other work while the model kept a watchful eye on the factory perimeter.”

And then, in rare cases, you sort out new ways to express ML’s functionality. “So … when we invoke a model to classify a document, we’re really asking for a single label based on how it’s broken down the words and sequences in that block of text. What if we go the other way? Could we feed a model tons of text, and get it to produce text on demand? And what if that could apply to, say, code?”

It Always Has Been

From a high level, then, the role of the data scientist is to understand data analysis and predictive modeling, in the context of the company’s use cases and needs. It always has been. Building models was just on your plate because you were the only one around who knew how to do it. By offloading some of the model-building work to machines, autoML tools remove some of that distraction, allowing you to focus more on the data itself.

The data is certainly the most important part of all this. You can consider the off-the-shelf ML algorithms (available as robust, open-source implementations) and unlimited compute power (provided by cloud services) as constants. The only variable in your machine learning work–the only thing you can influence in your path to success–is the data itself. Andrew Ng emphasizes this point in his drive for data-centric AI, and I wholeheartedly agree.

Making the most of that data will require that you understand where it came from, assess its quality, and engineer it into features that the algorithms can use. This is the hard part. And it’s the part we can’t yet hand off to a machine. But once you’re ready, you can hand those features off to an autoML tool–your trusty assistant that handles the grunt work–to diligently use them to train and compare various models.

Software has once again eaten dull, repetitive, predictable tasks. And it has drawn a dividing line, separating work based on ability.

Where to Next?

Some data scientists might claim that autoML is taking their job away. (We will, for the moment, skip past the irony of someone in tech complaining that a robot is taking their job.) Is that true, though? If you feel that building models is your job, then, yes.

For the more experienced readers, autoML tools are a slick replacement for their trusty-but-rusty homegrown for() loops. A more polished solution for doing a first pass at building models. They see autoML tools not as a threat but as a force multiplier that will test a variety of algorithms and tuning parameters while they tackle the important work that actually requires human nuance and experience. Pay close attention to this group, because they have the right idea.

The data practitioners who embrace autoML tools will use their newfound free time to forge stronger connections to the company’s business model. They’ll look for novel ways to apply data analysis and ML models to products and business challenges, and try to find those pockets of opportunity that autoML tools can’t handle.

If you have entrepreneurship in your blood, you can build on that last point and create an upstart autoML company. You may hit on something the big autoML vendors don’t currently support, and they’ll acquire you. (I currently see an opening for clustering-as-a-service, in case you’re looking for ideas.) Or if you focus on a niche that the big players deem too narrow, you may get acquired by a company in that industry vertical.

Software is hungry. Find ways to feed it.