Deconstruction of fake #AI Benchmarks - Recommender Systems Case Study
We have recently spent a lot of time creating and delivering top-notch AI-driven solutions and products at Synerise. In the last year alone we published 6 papers, which shows how seriously our team takes scientific work. We also won, or were awarded in, prestigious competitions organized by amazing companies such as Booking.com and Rakuten, or supported by specialists from Amazon, Apple, Nvidia, Alibaba, Adobe, Zalando, eBay and more. All of these events were very professional, both in terms of methodology and measurement techniques. We took part in a dozen conferences presenting our scientific papers and invested in the AI Schools program for the youngest people, because we believe that we can support the democratization of AI only by educating society. We also published our open-source framework cleora.AI, a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data. From the data processing perspective, we have built our proprietary DB engine, Terrarium, from scratch. Day by day, we transparently share our ideas, the ambition of the projects we have delivered and our vision of what we want to achieve in the future.
Recently, the biggest enemies of AI propagation have been fake research, poor-quality solutions sold by immature companies, understatements, intellectual bias, over-hyped results and overpromised ROI. Each month we observe new self-made “benchmarks” published by many companies, prepared to convince clients to choose a specific solution because it is “state of the art”. Reading such studies can be really funny: the number of ridiculous things you can find in such ebooks and offers is pretty high.
In that context, I want to explain what is really important when you audit and validate a vendor you want to work with, by showing real examples of how to deconstruct false advertising claims in the AI field of recommender systems.
A self-proclaimed #AI company has recently published a document purporting to benchmark its recommender systems. Due to its numerous false claims, misrepresentations and erroneous methodology, we fact-check the document to prevent the spread of false and potentially harmful misinformation, and to turn it into a guide for companies that need to validate AI vendors and their results. Let's deconstruct the falsehoods and convert them into a nice guide.
Using work from 6 YEARS AGO as a baseline:
RED FLAG! MISLEADING 🤣
While comparing to prior methods, it is important to pick the most recent and most relevant prior publications. 6 years in Machine Learning is a very long time, and results from such a long time ago are often terrible compared to what is state-of-the-art now. A single GPU today can do the work of a Google-sized cluster back then.
While the paper is sound and has been published at a prestigious conference, it is cited as:
This citation format is inconsistent with any accepted citation standard, and we have to assume that it intentionally omits the 2015 date of publication. By current standards, the approaches in the publication are extremely weak baselines.
Using the WRONG METRICS: ROOKIE MISTAKE
R-Precision as a metric has not been used to evaluate recommender systems since 2015!
The correct metric that replaced it is Precision@K, which reflects the fact that the user sees only the top K most relevant recommendations. Most publications evaluate recommender systems for a few values of K, e.g. Precision@10, Precision@20, Precision@100. There are other important metrics for recommender systems as well: Recall@K, Mean Reciprocal Rank and Normalized Discounted Cumulative Gain. Every one of them matters when you want a well-balanced recommender system. These are all missing from the vendor's presentation. But even knowing these numbers would be useless when comparing systems on different, undisclosed datasets.
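To make the metrics named above concrete, here is a minimal sketch of how they are commonly computed for a single user with binary relevance. The function names and the toy item IDs are our own illustrative assumptions, not taken from the vendor document or any particular library. Mean Reciprocal Rank is simply the mean of `reciprocal_rank` over all evaluated users.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant (k > 0)."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Normalized Discounted Cumulative Gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # Ideal ranking: all relevant items (up to k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical data): 5 recommendations, 3 truly relevant items.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}

print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 2 of 3 relevant items retrieved
print(reciprocal_rank(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, relevant, 5))
```

Each metric answers a different question (how clean the top of the list is, how much of the relevant set is found, how early the first hit appears, how well the whole ordering is ranked), which is why a serious evaluation reports several of them at several values of K.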
Comparing to prior work based on UNDISCLOSED DATASETS: ABSOLUTE BULL "YOU-KNOW-WHAT"
When comparing ML algorithms and models, it is absolutely necessary to hold the dataset constant. The same algorithm will always give different results on different datasets. If a publication uses undisclosed datasets, it is impossible to compare any method against it once it has been published!
The referenced publication does not identify the datasets, making any subsequent replications or comparisons impossible:
Neither does the vendor identify their datasets, but one thing is certain: they are using different datasets than the ones used in the paper! (They must be, as the datasets used in the paper have not been disclosed.)
Comparing a method on some mysterious dataset of your own to different methods on some equally mysterious datasets from an ancient paper proves only one thing: the authors definitely have no idea what they’re doing. The methodology is simply incorrect.
A dataset can be prepared to give ANY results. When the dataset is secret, and your method is the only one tested - the results can be anything and have exactly zero meaning.
There are multiple established and widely accepted public datasets for benchmarking recommender systems, e.g. Netflix Prize Dataset, MovieLens datasets, Diginetica, Yoochoose.
Claim that “last visited product” recommendation is 2x better than the state-of-the-art: INSANE BULL "YOU-KNOW-WHAT"
This is just insanely ridiculous. Think about it.
If showing users their most recently seen products were more than 2x better than anything on the market, why would Amazon, YouTube, Alibaba, Netflix and many others invest billions of dollars in recommender research? The big guys are fighting for 0.1% gains; 200% gains would surely blow their minds! This again proves that the authors not only have no idea what they’re doing, but may also be harmful! But how can such an insane claim arise? See the next point:
Comparing Top-K recommenders to Session-based recommenders: ABSOLUTE BULL-"YOU-KNOW-WHAT"
The referenced whitepaper uses a setting called “top-k” recommendation: recommending items based on item sets, without any sequential or temporal information. It is an entirely different problem and research setting from “session-based” recommendation, which explicitly uses sequences and timestamps. “Last seen products” is a baseline approach in session-based recommendation, but it cannot be applied to top-k. The authors confuse the two approaches and compare one to the other, resulting in the ridiculous claim that “last seen products” outperforms state-of-the-art recommenders.
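The difference between the two settings is easiest to see in the shape of the input data. The sketch below is purely illustrative: the `Event` type, the `last_seen_baseline` function and the toy item names are our own assumptions, not anything from the whitepaper or the vendor document. It shows that a “last seen products” baseline needs timestamps, which a top-k item set simply does not contain.

```python
from typing import FrozenSet, List, NamedTuple

class Event(NamedTuple):
    item_id: str
    timestamp: int  # e.g. seconds since the start of the session

def last_seen_baseline(session: List[Event], k: int) -> List[str]:
    """Session-based baseline: the k most recently viewed distinct items."""
    ordered = sorted(session, key=lambda e: e.timestamp, reverse=True)
    seen, recs = set(), []
    for event in ordered:
        if len(recs) >= k:
            break
        if event.item_id not in seen:
            seen.add(event.item_id)
            recs.append(event.item_id)
    return recs

# Session-based input: an ordered, timestamped sequence of interactions.
session = [
    Event("shoes", 100),
    Event("socks", 130),
    Event("hat", 160),
    Event("shoes", 200),
]
print(last_seen_baseline(session, 2))  # ['shoes', 'hat']

# Top-k input: an unordered set of items per user. There is no notion of
# "most recent" here, so the baseline above is not even defined on it.
top_k_input: FrozenSet[str] = frozenset({"shoes", "socks", "hat"})
```

Any number produced by running the baseline in one setting and the comparison methods in the other is therefore a comparison between two different problems.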
Suggestion that a 2015 paper evaluated somebody else’s 2021 method: MISLEADING
The screenshot of a table from a 2015 whitepaper, combined with the phrasing:
“Fortunately, across the world there are many researches published by scientists that evaluate well known recommendation algorithms. Here’s what they found:”
seems to misleadingly suggest that some scientists looked at the claims of VENDOR and published them independently. In reality, the numbers have been generated by the company, have not been peer-reviewed, and due to the aforementioned ridiculous claims (comparing apples to oranges, comparing the same numbers on different datasets, using the wrong metrics) would not pass careful peer review.
Some methods presented have significantly limited applicability: MISLEADING
Methods such as “Abandoned cart” can only be applied to a limited number of cases (only users who have some items in an abandoned shopping cart). Comparing the performance of such methods working on a small subsample of the user population with methods able to operate on all users is highly misleading.
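A short worked example shows why the subsample matters. The numbers below are made up purely for illustration (they are not from the vendor document), and `hit_rates` is a hypothetical helper. The sketch assumes that users who are not eligible for a method receive no recommendation from it, and so contribute no hits.

```python
def hit_rates(hits, eligible_users, total_users):
    """Return (coverage, hit rate on eligible users, hit rate on all users)."""
    if eligible_users <= 0 or total_users <= 0 or eligible_users > total_users:
        raise ValueError("need 0 < eligible_users <= total_users")
    coverage = eligible_users / total_users
    rate_on_eligible = hits / eligible_users
    rate_on_population = hits / total_users
    return coverage, rate_on_eligible, rate_on_population

# Hypothetical numbers: "abandoned cart" only applies to users with a cart.
cart = hit_rates(hits=1_500, eligible_users=5_000, total_users=100_000)
# Hypothetical numbers: a general method that can serve every user.
general = hit_rates(hits=8_000, eligible_users=100_000, total_users=100_000)

for name, (cov, eligible, population) in [("abandoned cart", cart),
                                          ("general method", general)]:
    print(f"{name}: coverage={cov:.1%}, "
          f"on eligible={eligible:.1%}, on all users={population:.1%}")
```

In this toy setup the narrow method looks far better on its own subsample and far worse across the whole user base, so reporting only the subsample figure next to full-population figures tells the reader nothing useful.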
Best performing methods are not ML/AI powered & do not consider client history: FALSE CLAIM
According to the presentation, “Frequently bought after visit” is the best performing method. This method does not take the client’s actual interaction history into consideration during inference. As described on slide 19, “Frequently bought after visit” recommends products “bought by other users who displayed the product being viewed”. It is a product-to-product recommendation, not a client-to-product recommendation, which makes comparison to other methods such as Collaborative Filtering very misleading.
The fact that a simple baseline such as “frequently bought after visit” outperforms real ML-powered approaches, which consider the user’s interaction history, is inconsistent with state-of-the-art research and applied engineering practice across the world. It should be considered evidence of inadequate exploration of widely available ML-powered approaches.
The slides confirm the ineffectiveness of the vendor's AI methods, clearly showing that simple heuristic methods outperform their AI-based methods! Based on the wide body of published research and accepted industry practices, one could reasonably expect AI-based methods to outperform simple heuristics.
Proper recommender evaluations:
WHAT IT SHOULD LOOK LIKE
A proper evaluation of a recommender system should exhibit the following key characteristics:
- Clear and detailed definitions of the datasets used. If public datasets are available (such as Netflix Prize, MovieLens, Retail Rocket for recommendations), then they should be used.
- If the paper is using unidentified, proprietary datasets, it is not enough to measure performance of a single model on this dataset. Other, prior work has to be replicated and compared on the introduced proprietary dataset.
- The recommendation problem should be clearly stated (e.g. top-k recommendation or session-based recommendation), and methods solving the problem should be compared to other methods solving the same problem. Comparing anything on different problems is meaningless.
- When evaluating methods in a practical, applied setting, one cannot compare against synthetic published benchmarks. Many factors can impact the performance of live recommender systems – the design & layout of a website, the way in which items are recommended, placement of the recommendation section, the size of images etc. The only proper way to evaluate production systems is via an A/B/X testing approach.
- References, benchmarks and comparisons should be done against recent publications. Compared solutions don’t have to be exclusively state-of-the-art, but should be up-to-date with the current state of research knowledge.
- Proper evaluation metrics should be chosen according to best practices, or a convincing argument for a different choice has to be made.
- A detailed description of the evaluation protocol, including any data pre-processing, filtering & tuning, must be provided. The definitions of the metrics used must also be made clear.
- Preferably a publication containing the above characteristics should be peer-reviewed by a respected body, such as a leading conference or a journal.
- Preferably, source code that allows others to replicate the experiments and confirm the claims should be made available.
- Don't make your clients & partners look like idiots
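For the A/B testing point in the list above, a minimal sketch of how two recommender variants could be compared on a conversion metric is a two-proportion z-test. This is one common choice among several, not a description of any specific production setup. The function name and the traffic numbers are illustrative assumptions.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    conv_* are conversion counts, n_* are the numbers of users exposed.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or 100%")
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical traffic split: variant A (current recommender) vs. variant B.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The test only means something if users are randomly assigned to variants, both variants run over the same period, and the sample size and success metric are fixed before the experiment starts.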
Thanks for your time. Sharing and optimizing knowledge is the best thing you can do to move AI disciplines forward.
Links to our scientific papers below:
- A Strong Baseline for Fashion Retrieval with Person Re-Identification models
- Multi-modal Embedding Fusion-based Recommender
- An efficient manifold density estimator for all recommendation systems
- Synerise at SIGIR Rakuten Data Challenge 2020: Efficient Manifold Density Estimator for Cross-Modal Retrieval
- Cleora: A Simple, Strong and Scalable Graph Embedding Scheme