BlackBerry Blog

Beyond a Top-Notch Data Science Team

/ 10.19.17 / Hailey Buckingham

Supporting your data science team takes a lot more than finding top-notch data scientists, data science project managers, and directors. Those things are crucial, but they are only the beginning of the story.

When it comes to world-class data science operations, you've got to be thinking outside of the data science box just as much as inside. For the sake of this article, we're talking about the adjacent teams: all those people and products that depend on data science and, just as critically, the people and products on which data science depends.

The powers of the magical, mathematical “black box” of a high-end data science team can only be properly leveraged if it's plugged into an equally high-performing ecosystem of adjacent teams and tech.

Data Pipeline

Data science magic can't happen without data. This usually goes without saying. But in the world of data acquisition and delivery, there's a lot to think about, and a lot of ways that the data stream can run dry, leaving your data science team with a lot of unused cycles, or worse, frustration.

In your own context, who is responsible for things like finding new data, processing incoming data points, and building software for data extraction? Are all these tasks being tackled piecemeal by your data science team? If so, why not have a dedicated team (or sub-team) for it?

By spinning up a committed group of sharp devs to handle your incoming data, you gain some powerful advantages. Chief among these is reliable, consistent, and replicable data sourcing. Rapid experimentation by your data science team isn't possible if they can't get truly consistent, replicable data sets to experiment on.
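To make "consistent, replicable data sets" a little more concrete, here is a minimal Python sketch of two habits a pipeline team might adopt: fingerprinting a data set so two people can confirm they are experimenting on the same records, and assigning train/test splits deterministically from a record ID. The function names, the JSON-serializable record format, and the 20% test fraction are all assumptions made for illustration, not anything prescribed by this article.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest of the records, independent of their order.

    Assumes each record is a JSON-serializable dict (a hypothetical format).
    """
    # Canonical serialization (sorted keys) so equal records hash equally,
    # then sort the serialized lines so record order does not matter.
    lines = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assign_split(record_id, test_fraction=0.2):
    """Deterministically assign a record to 'train' or 'test' from its ID."""
    h = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(h[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical usage: two copies of the same data, delivered in different order.
copy_a = [{"id": "a1", "value": 3}, {"id": "b2", "value": 7}]
copy_b = [{"value": 7, "id": "b2"}, {"value": 3, "id": "a1"}]
same_data = dataset_fingerprint(copy_a) == dataset_fingerprint(copy_b)
splits = {r["id"]: assign_split(r["id"]) for r in copy_a}
```

Because the split depends only on the record ID, a record lands in the same partition every time the pipeline runs, even as new data arrives around it.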

A second, related benefit is that as your creative folks realize new possibilities for data extraction or input processing, they'll have a dedicated team to lean on. Instead of having the experimenters switch gears to build new infrastructure, your data pipeline team, who already know their tools inside out (they built them!), can do it in parallel, speeding the process and giving your whole operation faster turnaround on new research.

Another benefit that may escape first notice is that in contexts where your downstream data products ALSO need data pre-processing, a dedicated software team upstream of data science may be better suited to hand those tools off to your product teams than your data scientists are.

Those same benefits of reliability and replicability will be immediately at hand for your product teams to leverage as well. And adding to this, you can get support for those data processing components from this team, without having that duty fall on your researchers.

A pitfall to avoid, as with any interacting teams, is siloing. Once you spin up your new data pipeline team, it'll be mandatory (not just recommended, trust us) that they have full, open, regular communication with the rest of the data science team. Rapid iteration won't happen if the teams can't click together like my favorite kid toys (And nothing's more frustrating than two pieces that are SUPPOSED to fit together, but won't. Don't go down this path!).

Invest in relationships across these teams, and provide the resources they need for seamless communication and interaction. You'll see the ROI soon enough.

Data Infrastructure

Data science takes a lot of compute time. Compute time comes at a cost of human time. Someone, somewhere, has to spin up the infrastructure needed by your data science team. In a lot of shops, it seems like this bit is, unfortunately, done by the data science team themselves (or so the blogosphere would have us believe).

To be sure, cloud computing, parallel computing, and the rest have all gotten a lot easier in recent years. Still, if you delve a little deeper into all the claims of "Seamless data infrastructure at your fingertips!", you'll discover the gray world of headaches that, while less traumatizing than a decade ago, can slow down even the most intrepid data scientist.

But don't despair! It turns out that there are folks out there who not only have the know-how to stand up your infrastructure, but actually, secretly enjoy it. Find some of these folks, and attach them to your data science team. "That's devops' job," you say? You're right, of course, but consider how many people devops is already supporting (and where would we all be without them?).

If the data magic stalls and sputters because of infrastructure that refuses all sensible configuration files, you might wish you had a couple of dedicated devops witches brewing up their solutions directly for data science. But the benefits don't stop here. What you may find is that after installing these folks directly into your data science team, you'll start seeing OTHER gains as well.

After the obvious infrastructure woes are solved, this same team will still be able to take over the day-to-day headaches, freeing your researchers to spend more of their cycles doing what they (and you) love most: math magic.

And remember, just because you've got some high-performers who CAN do all the things, it doesn't necessarily follow that they do their best work that way. There's a reason divide-and-conquer strategies work so well!

Data Products

Data science solely for the sake of sweet, sweet models is fun, but it is the realm of weekend projects and data competitions. Data science for the sake of data PRODUCTS is what we're really here for though, isn't it? Who's building those products? Who's supporting those products? When your data science team develops a cutting-edge model, bound to seriously up your corporate game, are the downstream teams ready for the handoff? Can their existing code base even HANDLE the new model and its esoteric functions? Whose job is it to make sure this process goes smoothly?

Data products are the reason most of us are here, so forgetting to invest in top-notch inter-team operations is a certain way to find pain for yourself, your teams, and your stakeholders. If this sounds a lot like what I was saying about Data Pipeline teams, that's because it IS the same. A high-powered data science engine won't take you anywhere without a fuel delivery system (data input) and a racing transmission (data products) to turn all that power into industry-disrupting acceleration. Just as with data pipeline, real investment in cross-team relationships, teamwork, and planning is paramount.

The racecar analogy works well here: if you invest in the engine and powertrain, but keep the commuter car transmission, you've spent a lot of time and money on a lot of burnt metal. You'll upgrade in the end (if your business is still in the race at that point), so why wait till your inter-team gears are melting? You'll be happiest when your data science and data products teams work together seamlessly.

Data Products, Take Two

There's another place that those data products are important. Surely, data science companies concentrate heavily on downstream consumers of their data products. Again, that's really why we're here, right? But what about the UPSTREAM users of the data products? Upstream? Yep. That's right. The primary upstream teams are Data Pipeline, Data Infrastructure, and Data Science proper, and guess what, they need those packaged data products just as much as anyone.

Don't treat your data products like fire-and-forget missiles. Your data science team will use them for everything: benchmarking new models, researching past machine learning bugs for future improvements, comparing the sweet spots of multiple past models to produce better new models, and, if you missed the first one, benchmarking.

As magical as a data science team might seem to some, we can't produce an ever-upward series of models without a well-honed testing environment. Using those older models is crucial in the process of vetting new ones. Are the older models readily available? Is data science using the in-production models, or their own, pre-production code? Do you know if there are any differences in the outputs from your data science code and data product code?
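One way to start answering that last question is a simple parity check: run the same inputs through the research code and the shipped code, and list every place they disagree. The Python sketch below is only an illustration under stated assumptions. `compare_outputs`, `research_score`, and `production_score` are hypothetical names, and the rounding in the "production" function is an invented stand-in for whatever a real product code path might do differently.

```python
import math


def compare_outputs(research_fn, production_fn, inputs, tol=1e-9):
    """Return (input, research_output, production_output) for each disagreement."""
    mismatches = []
    for x in inputs:
        a, b = research_fn(x), production_fn(x)
        # Treat outputs as equal when they agree within the given tolerance.
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            mismatches.append((x, a, b))
    return mismatches


# Hypothetical research model: a plain logistic score.
def research_score(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical production wrapper: same model, but rounded for storage.
def production_score(x):
    return round(research_score(x), 2)


sample_inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
differences = compare_outputs(research_score, production_score, sample_inputs)
```

Run on a schedule against a fixed set of inputs, a check like this turns "do you know if there are any differences?" into a question someone can actually answer, and a non-empty result is a prompt for the two teams to talk.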

These don't have to be hard questions, but if left unasked for too long, they can seriously come back to bite you. So, if you're not tired of hearing it yet, invest in inter-team communication and process! Insert old-timey cliché about "ounces" and "planning" and such, here.

In Conclusion

Your data science team doesn't exist in a vacuum. To get the best results out of your larger data science roadmap, make sure you pay close, careful attention to your data science-adjacent teams, and how those teams are (or aren't) interacting with each other.

Invest in inter-team coordination, teamwork and camaraderie, and keep a constant eye out for signs of newly developed silos. Without tending and care, it's normal in some tech cultures for communication and collaboration to slowly dwindle over time. But with proper TLC, it's not hard to tune those systems back up, and in the end, you'll be very glad you did!

About Hailey Buckingham

Data Scientist at Cylance

Hailey Buckingham is a Data Scientist at Cylance where she and her team build the AI machines at the heart of Cylance’s anti-malware products. She’s used mathematics, statistics, machine learning and programming throughout her career, across many disciplines, including agriculture, ecology, forestry, and now cybersecurity.