Edgar Prunk-Eger | January 29, 2020
4 Minutes Read
What is AutoML and what is it used for?
AutoML is a recent artificial-intelligence-based solution that has started to gain popularity because it is easy to apply and covers a great variety of use cases. The core idea, as the name Automated Machine Learning indicates, is that you can apply ML to real-life problems even if you're not highly skilled in the field. Thanks to the high degree of automation involved, it is useful if you don't want to become an ML expert but still want to apply ML to a problem you're facing. You take your data and ingest it into a piece of software or a service that analyzes it automatically. The system then gives you a fully implemented machine learning model. You can use this model for anything an ML model can be used for: predicting target values from features, basic classification or regression problems, object detection in visual data, and so on.
AutoML and Google Cloud
Now let’s dive into Google Cloud AutoML specifically. It offers a wide range of services, and the use cases below show where they can fit into a project.
Proof of concept
Let’s assume our company has some data. We need to decide whether that data can be used to predict some important target variable. AutoML can help us make that decision: based on the dataset and the desired target value, it gives a strong indication of whether an ML project is worth starting at all. Let’s look at a more concrete example. Assume a webshop needs to decide, based on historical purchase data, whether it is possible to predict how many days will pass until a specific customer returns to the shop and buys something else. With AutoML it is possible to get a fairly reliable indication of how accurate this kind of forecast can be.
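To make the webshop example more tangible, here is a minimal Python sketch of how the target variable (days until a customer's next purchase) could be derived from historical purchase data before handing the table to AutoML. The function name, the (customer_id, purchase_date) record layout and the sample rows are assumptions made for illustration; they are not part of any Google API.

```python
from collections import defaultdict
from datetime import date


def days_until_next_purchase(purchases):
    """Return (customer_id, purchase_date, days_to_next) rows.

    `purchases` is an iterable of (customer_id, purchase_date) tuples.
    A customer's last known purchase has no successor, so it is skipped.
    """
    by_customer = defaultdict(list)
    for customer_id, purchase_date in purchases:
        by_customer[customer_id].append(purchase_date)

    rows = []
    for customer_id in sorted(by_customer):
        dates = sorted(by_customer[customer_id])
        # Pair every purchase with the one that follows it.
        for current, following in zip(dates, dates[1:]):
            rows.append((customer_id, current, (following - current).days))
    return rows


# Hypothetical sample data, deliberately out of order.
history = [
    ("c1", date(2019, 1, 1)),
    ("c1", date(2019, 1, 15)),
    ("c2", date(2019, 3, 1)),
    ("c1", date(2019, 2, 1)),
]
labeled_rows = days_until_next_purchase(history)
```

Each resulting row carries the label that the model would be asked to predict; customers with a single purchase (like "c2" here) contribute no training rows in this simplified version.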
In a second use case, the company already has an ongoing ML project or is just starting one. Here AutoML can bring a baseline model to the table. For the Data Science team, this provides guidance on what is achievable with the data. Depending on the specific data, the predictive performance of the AutoML model is sometimes strong and hard for Data Scientists to reproduce manually. At other times it is weaker and easier to match. Either way, the AutoML model shows what level of predictive performance the data can support. I have to mention one caveat: some of the algorithms used by Google AI Platform’s AutoML incorporate pretrained models. This means the model may be built not just from our data but also from information from other sources. In those cases it can be practically impossible for an in-house Data Science team to reproduce that predictive performance.
Deploy to production
Of course, it is also possible to use AutoML as an end-to-end solution. With this goal in mind, you can upload the dataset, train the model, then deploy it and use it in production. This means that anyone can have a very powerful image detection system, for example, with close to zero development time. The Google Cloud Platform provides a REST API for the trained model, which any other backend system or application can call.
Let’s see AutoML in action!
One cool feature of Google’s AutoML is classifying images into distinct categories using a model built from our sample data. We can take the well-known Kaggle Flower recognition dataset prepared by Alexander Mamaev and build an AutoML solution on it. This dataset contains five kinds of flowers with a total of 4242 individual images.
The first thing we have to do is upload all the images to Google Cloud Storage. With the gsutil tool provided by Google, this step is really straightforward. Then we have to prepare a list of all the images, each labeled with the kind of flower it contains. This file should be saved in CSV format and can also be stored in Cloud Storage, next to the images. The last part is to import the dataset, then train and evaluate the model. These steps require just a few clicks in the user interface, but depending on the dataset they can take a while to complete, typically a few hours. When the model training is finished, the Cloud Console shows an evaluation summary for the model.
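The labeling step can be scripted. Below is a minimal Python sketch that builds the CSV content from a list of image paths, assuming the images sit in one folder per flower kind and that the import expects one `gs://` URI and one label per row (this is the layout as I understand it from the AutoML Vision documentation; please check the current docs). The bucket name `my-flower-bucket`, the `flowers` prefix and the helper names are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def label_from_folder(relative_path):
    """Use the top-level folder name (e.g. 'daisy') as the label."""
    return PurePosixPath(relative_path).parts[0]


def build_import_csv(labeled_files, bucket, prefix="flowers"):
    """Return CSV text with one `gs://...,label` row per image.

    `labeled_files` is an iterable of (relative_path, label) pairs,
    for example ("daisy/001.jpg", "daisy").
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for relative_path, label in labeled_files:
        object_path = PurePosixPath(prefix) / relative_path
        writer.writerow([f"gs://{bucket}/{object_path}", label])
    return buffer.getvalue()


# Hypothetical file names; in practice these would come from a directory walk.
files = ["daisy/001.jpg", "rose/002.jpg"]
csv_text = build_import_csv(
    ((path, label_from_folder(path)) for path in files),
    bucket="my-flower-bucket",
)
```

The images themselves could be copied with something like `gsutil -m cp -r ./flowers gs://my-flower-bucket/`, and the generated CSV can be uploaded to the same bucket in the same way.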
In our case, the evaluation showed that the trained model is about 94% to 96% confident in distinguishing between the images. With all this set up, the model can be deployed, which means Google will provide a REST API service to make predictions on new images. This service can be useful, for example, on an e-commerce website to automatically classify images uploaded by its users.
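As a rough illustration of what calling the deployed model's REST API might involve, the Python sketch below only assembles the request URL and JSON body; it does not send anything. The endpoint shape and the `payload.image.imageBytes` field follow the AutoML v1 REST API as I understand it, so treat them as assumptions and verify them against the current documentation. The project ID, model ID and image bytes are placeholders.

```python
import base64
import json


def build_predict_request(project_id, model_id, image_bytes, location="us-central1"):
    """Return (url, json_body) for an image prediction call.

    Assumed endpoint layout; a real call would also need an OAuth
    bearer token in the Authorization header.
    """
    url = (
        f"https://automl.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/models/{model_id}:predict"
    )
    # The image is sent inline as base64-encoded text.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = {"payload": {"image": {"imageBytes": encoded}}}
    return url, json.dumps(body)


# Placeholder identifiers and content, for illustration only.
url, body = build_predict_request("my-project", "ICN0000000000", b"fake image bytes")
```

Any HTTP client in the backend can then POST the body to the URL; the response is expected to list the predicted labels with their scores.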
AutoML is attracting more and more attention because it can serve a machine learning project at several levels: as a proof of concept, as a baseline, or as a production system. Every project can make use of it to some extent. It is also really easy to explore the possibilities and get results quickly. Considering all this, it should be no surprise that Google keeps adding new features, like the recently added AutoML Tables for tabular data.