Gergely Schmidt | April 26, 2022
5 Minutes Read
AI is more than just hype. Early adopters gain a substantial advantage over their competitors. According to Gartner, “By 2025, the 10% of enterprises that establish AI engineering best practices will generate at least three times more value from their AI efforts than the 90% of enterprises that do not.” Every organization will have to adapt, but the path is less obvious than the goal.
AI models rely on data. Put simply, an organization needs enough data of the right quality before it can build anything. If the organization has built a data warehouse or data lake in recent years, that is a great place to start. If not, things are a little more complicated. Regardless of the foundations, it is best to start with simple rule-based processes built on the data that is already available. This helps reveal where the biggest gaps are in the data and the process. As Google’s best practices put it, “You can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.” The bottom line is that it’s not just okay but strongly recommended to start with a heuristic approach rather than jumping straight to ML.
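To make the heuristic-first advice concrete, here is a minimal sketch of a rule-based baseline in Python. The churn use case, the field names, and the thresholds are all hypothetical and chosen only for illustration. The idea is that a few transparent rules on data you already have give you a benchmark that any later ML model has to beat.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields; use whatever your warehouse already holds.
    days_since_last_order: int
    orders_last_year: int


def likely_to_churn(customer: Customer) -> bool:
    """Rule-based baseline: flag customers who have gone quiet."""
    # Thresholds are illustrative assumptions, not tuned values.
    if customer.days_since_last_order > 90:
        return True
    return customer.orders_last_year < 2


def accuracy(customers: list[Customer], churned: list[bool]) -> float:
    """Share of customers where the heuristic matches what actually happened."""
    hits = sum(likely_to_churn(c) == y for c, y in zip(customers, churned))
    return hits / len(customers)


# Made-up history: four customers and whether each one actually churned.
history = [Customer(120, 5), Customer(10, 8), Customer(30, 1), Customer(45, 6)]
outcomes = [True, False, False, False]
print(f"Heuristic accuracy: {accuracy(history, outcomes):.2f}")
```

Measuring the heuristic against real outcomes also shows where the data is too thin or too messy to support anything more sophisticated.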
It is important to have a clear understanding of how AI models will be used and how they relate to the company’s overall business. If a model is a strategic and unique differentiator, then building it in-house is the obvious choice: you keep full control and the option to customize it.
For example, Netflix could lose $1bn or more every year from subscribers quitting its service if it weren’t for its personalized recommendation engine. In the beginning, the company tried to outsource the recommendation model by running a competition with a million-dollar prize. The winner was announced, but the model was never actually used in production. It was clear that this model couldn’t be built by outsiders. Recommendation models are Netflix’s strategic advantage over its competitors, and they require multiple engineering teams to analyze the habits of its 200+ million subscribers.
By contrast, an airline might use ML models to segment its users during the booking process. Based on those segments, it can recommend more legroom or better in-flight food instead of just sandwiches, which increases revenue. These add-ons are not the airline’s core business. The model is not a strategic and unique differentiator for the company; it is simply an optimization of the process.
Let’s say you decide to embark on building AI models from scratch. What risks and challenges will you face?
Shortage of talented data scientists
First of all, you need to hire talented engineers and data scientists, which is a major up-front investment. Harvard Business Review named “data scientist” the sexiest job of the twenty-first century in 2012 for good reason. Being a data scientist means wearing many different hats, not all of which are relevant to a specific job: from putting together complex spreadsheets to creating deep neural networks and deploying them into production. Finding data scientists who are up to date with the latest technologies and also have hands-on experience is still hard. On the one hand, the fast-paced advancement of AI-related technologies in the past few years has made it hard for individuals to keep up. On the other hand, the increasing demand for these professionals has created an extremely competitive hiring landscape, making staff retention even more challenging.
Lack of good data
According to Gartner, just 53% of ML prototypes are eventually deployed to production. Among the various reasons for this, two major challenges stand out: insufficient data quality and insufficient data volume. Data in general should be heterogeneous and free from discrepancies. It should also be plentiful enough to analyze statistically. ML models are only as good as the underlying data. As the saying goes, “garbage in, garbage out.”
Data quality and quantity may turn out not to be good enough to build an ML model that delivers the expected outcome. The same can happen if you choose to buy a model instead of building it, but in that case you will find out much sooner and at a lower cost, possibly free of charge during the trial period.
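A quick data audit can surface these problems before any modeling starts. The sketch below is one possible check, not a standard tool: the function name, the field names, and the `min_rows` threshold are assumptions made for illustration. It counts rows, exact duplicates, and missing values in the fields a model would depend on.

```python
def data_quality_report(rows, required_fields, min_rows=1000):
    """Summarize volume, duplicates, and missing values for a list of dict records."""
    total = len(rows)
    # Treat None and empty strings as missing values.
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required_fields
    }
    # Two rows count as duplicates when every field matches exactly.
    unique_rows = {tuple(sorted(row.items())) for row in rows}
    return {
        "rows": total,
        "enough_rows": total >= min_rows,  # min_rows is an illustrative threshold
        "duplicate_rows": total - len(unique_rows),
        "missing_share": {
            field: (count / total if total else 0.0)
            for field, count in missing.items()
        },
    }


# Made-up sample records.
rows = [
    {"user_id": 1, "country": "HU", "spend": 120.0},
    {"user_id": 2, "country": "", "spend": 80.0},
    {"user_id": 2, "country": "", "spend": 80.0},
    {"user_id": 3, "country": "DE", "spend": None},
]
report = data_quality_report(rows, ["country", "spend"], min_rows=3)
print(report)
```

If the share of missing values or duplicates is high, that is a signal to fix the data pipeline first, whichever route you take afterwards.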
Deploying a model to production is hard
Let's look at this stat again: almost half of all models never reach the production stage. This is critical, as it is one of the main reasons why a lot of businesses still struggle to get a grip on using ML. From a management perspective, it means that about half of the effort is pretty much wasted. Understanding why this happens will help you decide whether building or buying is the right direction for your organization.
A well-known diagram in Google’s paper “Hidden Technical Debt in Machine Learning Systems” paints a very good picture of how many additional areas have to be managed beyond simply building a model. In fact, coding is often the easiest part. An MLOps solution can usually manage this complex process, but these solutions are often expensive and complicated to use when they are implemented for custom models.
To make things even more complicated, in most cases additional expertise is required to put models into production. Data scientists build the models, but once those are ready, an ML engineer (someone with a strong engineering skill set and statistical knowledge) handles everything else around them. The bottom line is that productionizing an ML model is a lot more complex than most businesses foresee when they embark on the journey.
Operational costs are high
“The moment you put a model in production, it starts degrading.” This sums up the problem nicely. Models have to be constantly monitored and automatically retrained, which increases the costs after the model starts running in production. In most cases this is not calculated in the original Total Cost of Ownership (TCO) of building a model in-house; if the solution is bought from the market, this is usually taken care of by the service provider. So, when making a comparison between the two approaches it is important to understand the end-to-end costs for both.
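As a rough illustration of what that monitoring involves, the sketch below tracks accuracy over a rolling window of recent predictions and flags the model for retraining once it falls too far below the accuracy measured at launch. The class name, the window size, and the tolerance are assumptions for this example. Production monitoring usually also tracks input drift, latency, and other signals.

```python
from collections import deque


class AccuracyMonitor:
    """Flags a model for retraining when recent accuracy drops below baseline."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=500):
        self.baseline = baseline_accuracy  # accuracy measured at launch
        self.tolerance = tolerance         # acceptable drop, an illustrative value
        self.results = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, predicted, actual):
        """Store whether a prediction matched the outcome observed later."""
        self.results.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.results:
            return None  # nothing observed yet
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        current = self.rolling_accuracy()
        return current is not None and current < self.baseline - self.tolerance


# Made-up feedback: 8 correct and 2 wrong predictions against a 0.90 baseline.
monitor = AccuracyMonitor(baseline_accuracy=0.90, tolerance=0.05, window=10)
for predicted, actual in [(1, 1)] * 8 + [(1, 0)] * 2:
    monitor.record(predicted, actual)
print(monitor.rolling_accuracy(), monitor.needs_retraining())
```

Someone has to build, run, and respond to checks like this for as long as the model is live, which is exactly the cost that tends to be left out of the original estimate.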
Business value comes slower
Hiring talent, analyzing data, building models, and then deploying and maintaining them require a lot of time and effort. This means that business value might come much later than originally expected, if it comes at all. Budgets and timelines might overrun, or run out, and it will be hard to explain the sunk cost. Remember, almost half of all models never reach production.
Lack of customizability
Out-of-the-box Software as a Service solutions are rarely customizable for specific client needs, so they need to be thoroughly evaluated beforehand. These models serve a general purpose and work well with standard input data such as Google Analytics 360 (GA360), which shortens the integration period and delivers business value faster.
Lack of transparency: black boxes
ML models are often considered black boxes because their decision logic is generated by a computer rather than written by a person: you put data in and you get useful predictions out. But how do you know it is not a Mechanical Turk? The intellectual property of companies providing Models as a Service lies in the models and the operations around them; they are not going to share their most valuable asset with anyone.
You might want to know how a recommendation system segments your users before it recommends anything to them, so that you can uncover new aspects of your user base. With a closed system, that insight is not available to you.
Honorable mention for both build and buy: information security concerns
By uploading data, especially personally identifiable information, to someone else’s servers or cloud and letting them process it, you’re putting all your trust in the provider’s security protocols. It takes a huge effort to anonymize or encrypt data before uploading it to another service and then decrypt it when it comes back. Anonymizing data is also a pain because it makes it really hard to analyze suspicious activity or dig into data issues that occur during regular business processes. This is another aspect that has to be taken into account when deciding on the approach.
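One common middle ground is to pseudonymize identifiers with a keyed hash before the data leaves your systems. The sketch below uses Python’s standard `hmac` and `hashlib` modules; the field names, the helper names, and the placeholder key are assumptions for illustration. Note that this is pseudonymization rather than full anonymization: whoever holds the key can recompute the tokens, so the key must stay in-house.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 token (HMAC)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_upload(record: dict, pii_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the listed PII fields tokenized."""
    return {
        key: pseudonymize(str(value), secret_key) if key in pii_fields else value
        for key, value in record.items()
    }


# Placeholder key for the example; load a real key from a secret store.
secret_key = b"replace-with-a-key-from-your-secret-store"
record = {"email": "jane@example.com", "basket_value": 42}
safe_record = prepare_for_upload(record, {"email"}, secret_key)
print(safe_record)
```

Because the same input always yields the same token under the same key, records can still be joined and a suspicious token can be traced back internally, which softens the investigation problem described above without exposing raw identifiers to the provider.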
Deciding to build or buy is not straightforward. When we work with our clients to decide which path they should embark on, we regularly look at the risks and benefits and create a scorecard that helps to decide the right direction.
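For readers who want to try the scorecard idea themselves, here is a minimal sketch of a weighted scorecard. It is not the scorecard we use with clients: the criteria, the weights, and the threshold are hypothetical and only loosely based on the risks discussed in this post. Each criterion is rated from 1 to 5, where a higher rating favors building in-house.

```python
# Hypothetical criteria and weights (they sum to 1.0); adjust to your context.
CRITERIA = {
    "strategic_differentiator": 0.30,
    "data_readiness": 0.20,
    "in_house_talent": 0.20,
    "need_for_customization": 0.15,
    "tolerance_for_slow_payback": 0.15,
}


def build_score(ratings: dict) -> float:
    """Weighted average of 1-5 ratings; higher values favor building in-house."""
    missing = set(CRITERIA) - set(ratings)
    if missing:
        raise ValueError(f"Missing ratings for: {sorted(missing)}")
    return sum(weight * ratings[name] for name, weight in CRITERIA.items())


def recommend(ratings: dict, build_threshold: float = 3.5) -> str:
    """Suggest a direction; the threshold is an illustrative assumption."""
    return "build" if build_score(ratings) >= build_threshold else "buy"


# Made-up ratings for an airline that only wants to optimize upsell offers.
airline = {
    "strategic_differentiator": 1,
    "data_readiness": 3,
    "in_house_talent": 2,
    "need_for_customization": 2,
    "tolerance_for_slow_payback": 2,
}
print(round(build_score(airline), 2), recommend(airline))
```

A single number will never settle the decision on its own, but writing the ratings down forces the conversation about which risks actually matter to the business.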
In many cases the solution becomes a hybrid. Organizations start with a relatively low-risk Model as a Service and understand the business benefits that ML can drive for them. Later, if a clear need arises for more customization, it is easier to scale up than scale down after making all the painful and expensive upfront investments we’ve highlighted.
We love to talk about all things AI. Let us know where your organization is on your ML journey, and we can create a personalized scorecard for free. This will give you a better idea of whether you're spending your budget in the right place.