Here’s Part II. of my series about Machine Learning and personalization – read the first article here.
Recommendations are not as vanilla as you may think
When building a recommendation engine, the very first question to address is the What: what to recommend? What products would fit our customers’ tastes or needs?
A lot of different Machine Learning models have been developed to address the What question.
On one hand, the multitude of models gives us room to maneuver. On the other hand, selecting the right approaches from this multitude may not be easy for inexperienced practitioners.
The most resourceful domain remains collaborative filtering: from simple matrix factorizations (e.g. the very vanilla Scikit-Learn TruncatedSVD) to more sophisticated approaches (e.g. TensorFlow Neural Collaborative Filtering).
These easy-to-use models experienced a resurgence with the Netflix Prize competition. They've been extensively enhanced by the academic community and are still exploited by large streaming content companies (Netflix, Spotify, etc.).
The main upside is that only one type of input data is needed: customers’ past behaviors (purchases, consumption, etc.). In other words, there’s no need for user profile data or product description/content data.
However, this upside can also be a downside, especially for companies launching brand-new product segments: collaborative filtering techniques cannot be used with too little customer data, which is the well-known cold start issue.
The other downside is the model’s lack of interpretability: the embeddings of both customers and products into a lower-dimensional latent factor space cannot be interpreted in terms of qualitative (business or actionable) insights.
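As a minimal sketch of the matrix factorization idea, assuming a toy implicit-feedback matrix (customers in rows, products in columns), Scikit-Learn’s TruncatedSVD can project interactions into a latent factor space and score products the customer has not seen yet:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy interaction matrix: rows = customers, columns = products,
# values = implicit feedback (e.g. number of purchases).
interactions = csr_matrix(np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 0, 4],
    [0, 1, 3, 0],
], dtype=float))

# Factorize into a low-dimensional latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(interactions)  # shape: (n_users, k)
item_factors = svd.components_.T                # shape: (n_items, k)

# Reconstructed scores: higher = more likely to interest the user.
scores = user_factors @ item_factors.T

# Recommend the best-scoring product that user 0 has not interacted with.
seen = interactions[0].toarray().ravel() > 0
scores_user0 = np.where(seen, -np.inf, scores[0])
top_item = int(np.argmax(scores_user0))
```

Note that the columns of `user_factors` and `item_factors` are exactly the latent factors mentioned above: useful for scoring, but with no direct business meaning.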
Depending on the type of product being recommended, content-based filtering techniques can be used.
This is especially the case for media content like articles, where simple Latent Dirichlet Allocation maps article content to a latent topic space. User preferences can then also be described in terms of an average topic mix of articles that each user has read in the past.
Content-based filtering approaches are usually used hand-in-hand with collaborative filtering approaches.
One great upside of content-based filtering is the absence of any cold start issue regarding new content: new content can be easily recommended to existing customers. However, the cold start issue regarding customers still exists: a new customer cannot be taken into account without a prior history.
Interpretation is also easier. Content-based models like Latent Dirichlet Allocation can be viewed as a soft clustering approach, with content belonging to multiple groups. This contrasts with K-Means, for instance, where each piece of content belongs to exactly one group.
Furthermore, customer preference vectors can be fed into a hard clustering approach (e.g. K-Means, spectral clustering, with cosine similarity) in order to find customers with similar preferences.
Qualitative insights can be extracted either from the soft clustering of content or from the hard clustering of user preferences.
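The whole pipeline can be sketched in a few lines, assuming a toy article corpus and a hypothetical reading history per user (K-Means here uses Euclidean distance; swapping in cosine similarity, as mentioned above, would require a spectral or normalized variant):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

articles = [
    "stocks markets trading earnings",
    "elections parliament vote policy",
    "goals league match championship",
    "markets inflation central bank",
    "match players season transfer",
]

# Soft clustering of content: each article gets a topic mixture.
counts = CountVectorizer().fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
article_topics = lda.fit_transform(counts)  # each row sums to 1

# A user's preference vector = average topic mix of the articles they read.
reading_history = {"alice": [0, 3], "bob": [2, 4]}  # hypothetical histories
user_prefs = np.array([article_topics[idx].mean(axis=0)
                       for idx in reading_history.values()])

# Hard clustering of users with similar preferences.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_prefs)
```

New articles only need a pass through `lda.transform` to become recommendable, which is why the content-side cold start issue disappears.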
Good old classifications can also work to make recommendations.
To do so, we can train classifiers to predict how likely it is that a customer intends to purchase or consume a product or service.
Even if multilabel classification is doable, we generally stick to one model per product to leverage a broader class of available models.
However, this observation also points to the main downside of propensity-to-buy modeling: we generally need to train as many models as there are products to recommend, which makes the task very tedious when thousands of products can be recommended.
One potential solution is to rework the task by predicting first how likely it is that a customer intends to purchase or consume a category of products or services, instead of a specific product or service.
The greatest upside of propensity-to-buy modeling is its ease of interpretation. Since this modeling is merely classification, we have at our disposal a huge panel of model-explainability techniques.
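A minimal sketch of the one-model-per-product setup, on synthetic data with hypothetical feature and product names; the logistic regression coefficients illustrate the interpretability argument:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical (already scaled) customer features.
X = rng.normal(size=(500, 3))
feature_names = ["age", "past_spend", "n_visits"]

# One binary label per product: did the customer buy it?
products = ["insurance", "lounge_access"]
y = {
    "insurance": (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int),
    "lounge_access": (X[:, 2] + rng.normal(0, 0.5, 500) > 0).astype(int),
}

# One propensity-to-buy model per product.
models = {p: LogisticRegression().fit(X, y[p]) for p in products}

# Interpretability for free: coefficients show which features drive purchase.
for p in products:
    top = feature_names[int(np.argmax(np.abs(models[p].coef_[0])))]
    print(f"{p}: most influential feature = {top}")

# Recommend the product with the highest predicted propensity.
customer = X[:1]
best = max(products, key=lambda p: models[p].predict_proba(customer)[0, 1])
```

Predicting at the category level, as suggested above, keeps this exact structure while shrinking the number of models from thousands of products to a handful of categories.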
Learning to rank v. ranking
The Learning to Rank (LTR) paradigm originates from the field of information retrieval.
It has been under extensive development since 2010, especially to enhance query searches.
As its name suggests, this paradigm is intended to predict an order of products reflecting their relevance.
To do so, it also needs historical relevance orders to train on, which may be tricky for practitioners to create. This is the first drawback of LTR approaches.
Usually, very strong domain knowledge of the products and customer behavior is required to create a relevant order of products.
However, there exists an even greater downside of the LTR paradigm: it only works under very specific conditions.
Inexperienced practitioners and Machine Learning evangelists are usually misled by its name and wrongly extend its field of applicability.
Basically, LTR approaches leverage the similarity between a customer’s need/intent and the products to recommend. You may find this odd: how do you measure such similarity? Well, this is the essence of LTR but also its greatest pitfall.
Let me illustrate this statement with an example.
As already mentioned, LTR is used today for query searches, with the ambition of one day completely replacing the very standard Elasticsearch scoring.
In query searches, the customer’s need is summarized by the query they type while the products to recommend are documents like web pages. Both the query and the document are text data. These can be embedded in traditional or more sophisticated text representations: from TF-IDF to BERT embeddings.
Furthermore, the similarity of the query-document pair is multi-dimensional and usually comprises at least 50 metrics (e.g. PageRank, BM25). These metrics are computed by so-called functions, and the resulting similarity vector is called a feature vector.
We need as many feature vectors as the number of customers times the number of products to recommend.
The feature vectors are basically the instances/points that are fed into customized statistical learning models, from Support Vector Machines (RankSVM) to ensembles of decision trees, the latter resulting in the well-known LambdaMART.
The key takeaway is that the LTR paradigm needs, as instances, not a representation of the customer’s need/intent or a representation of the product to recommend but a single representation of both the customer’s need/intent and the product combined.
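To make this concrete, here is a deliberately tiny sketch of a query-document feature vector (the corpus, query, and three features are all illustrative; a real system would use dozens of features such as BM25 and PageRank):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "cheap flights to paris with priority boarding",
    "how to train a neural network for ranking",
    "paris travel guide best hotels and flights",
]
query = "flights paris"

# Fit TF-IDF on the corpus; embed query and documents in the SAME space.
vec = TfidfVectorizer().fit(documents)
doc_vecs = vec.transform(documents)
q_vec = vec.transform([query])

def feature_vector(q, q_v, doc, doc_v):
    """A (tiny) LTR feature vector for one query-document PAIR."""
    overlap = len(set(q.split()) & set(doc.split()))
    return np.array([
        cosine_similarity(q_v, doc_v)[0, 0],  # TF-IDF similarity (BM25 stand-in)
        overlap,                              # raw term overlap
        len(doc.split()),                     # document length
    ])

features = np.vstack([
    feature_vector(query, q_vec, d, doc_vecs[i]) for i, d in enumerate(documents)
])
```

Each row of `features` describes a query-document pair, not the query or the document alone, and it is these pair-level vectors that a model like LambdaMART is trained on. When no such common representation exists, as with ancillary airline products, there is nothing to build the rows from.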
In a nutshell, if embedding the customer’s intent and the product into a common representation space is NOT straightforward, then applying LTR techniques is far-fetched and is unlikely to succeed.
A good example of a non-applicable field is recommending ancillary products (e.g. a priority boarding pass) to airline customers.
Even if LTR approaches cannot be applied, it does not mean that ranking is a no-go. Of course, to some extent, ranking products in a relevance order is doable.
To do so, we have at our disposal more or less sophisticated methods, beginning with leveraging calibrated likelihoods extracted from simple propensity-to-buy models.
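A sketch of that simplest option, on synthetic data with hypothetical product names: calibrating each propensity model makes the predicted probabilities comparable across products, so sorting them yields a ranking.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))

# Hypothetical products; one purchase label per product.
products = ["seat_upgrade", "extra_bag", "meal"]
y = {p: (X[:, i] + rng.normal(0, 0.7, 400) > 0).astype(int)
     for i, p in enumerate(products)}

# Calibrate each propensity model so probabilities are comparable.
models = {
    p: CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=50, random_state=0), cv=3
    ).fit(X, y[p])
    for p in products
}

# Rank products for one customer by calibrated purchase probability.
customer = X[:1]
ranking = sorted(products,
                 key=lambda p: models[p].predict_proba(customer)[0, 1],
                 reverse=True)
```

Without calibration, raw scores from different per-product models would not live on the same scale, and sorting across them would be meaningless.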
Both LTR techniques and simpler ranking approaches are evaluated with the same metrics, from Normalized Discounted Cumulative Gain to Mean Average Precision or Kendall’s Tau.
These ranking metrics are usually harder for inexperienced practitioners to understand than more usual classification metrics (like precision or recall at a given rank).
All the approaches mentioned are valid as long as we have enough historical data.
Now, imagine a scenario where the number of historical purchases for a given product is very scarce, but we still need to recommend it.
It is commonly admitted that beyond a class imbalance of roughly 200:1, any classification task becomes tedious enough to enter the field of anomaly detection.
Traditional techniques for anomaly detection can then be used: resampling the dataset (undersampling the majority class, oversampling the minority class, or synthetic sampling via SMOTE or autoencoders), accounting for label imbalance during optimization (e.g. via class weights) if we still want to train supervised models, or trying unsupervised approaches like One-Class SVMs or Isolation Forests.
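Two of these options are one-liners in Scikit-Learn; the sketch below uses synthetic data with roughly one buyer per couple hundred customers (the threshold and features are arbitrary):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Highly imbalanced data: buyers of the rare product are very scarce.
X = rng.normal(size=(4000, 3))
y = (X[:, 0] > 2.5).astype(int)  # rare positive class (~1 in 160)

# Option 1: keep a supervised model but reweight the minority class.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: unsupervised anomaly detection; buyers treated as "anomalies".
iso = IsolationForest(contamination=float(y.mean()), random_state=0).fit(X)
flags = iso.predict(X)  # -1 = anomaly (candidate buyer), 1 = normal
```

SMOTE-style oversampling lives in the separate `imbalanced-learn` package and would slot into the same pipeline before the supervised fit.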
Why recommendations cannot be properly handled by automated Machine Learning
The landscape of techniques to solve the recommendation task is very broad. I have described some of these techniques, but the list is in no way exhaustive. Automated Machine Learning has been gaining a firmer footing in trying to solve the problem in recent years.
However, the task is so complex that the off-the-shelf solutions currently offered (e.g. Google Recommendations AI) are still in development: their beta versions are still pending, and the absence of any SLA makes them unusable for production in the industry.
Furthermore, recommendations are usually solved with a blend of multiple techniques.
This blend can only be initiated by reworking the problem and engineering different features from the beginning, which calls for domain knowledge and some experience.
Unfortunately, Automated Machine Learning may not be able to grasp such domain knowledge and specific feature engineering in the long run, even under extensive development.
Stay tuned for Part III., or subscribe to our newsletter so you don’t miss it!