How to evaluate AI solutions and not get tricked

12 min read

AI is a discipline whose development has recently gained unprecedented momentum. Unfortunately, alongside the genuine potential of this technology and the vast resources allocated to its development, negative market phenomena have appeared in its wake. Artificial intelligence has become so on-trend that some actors have decided to capitalize on it in a rather debatable, if not deceitful, fashion. How can you avoid being tricked into paying for pseudo-artificial intelligence?

  • Artificial intelligence has become a trend that many companies are eager to sign up for.
  • It is worth paying attention to whether the company has a policy of testing and comparing AI models and whether it can justify its choice of evaluation metrics and test data.
  • Before using a company's solutions, it is worth asking questions about the limitations of the models on offer.

Artificial intelligence as a scientific discipline 

Artificial intelligence is an extremely rapidly developing discipline. The fast pace of progress makes research developments outdated quickly. Analysis shows that since 2015, the average interval between the publication of two high-impact articles (belonging to the top 20% most cited) has been less than 10 hours, and this interval keeps shrinking. Of course, this does not mean that there is a breakthrough in AI every 10 hours, but it does give some idea of the speed at which the discipline is developing. The open archive of scientific publications - arXiv.org - added as many as 11,524 articles in the AI category in 2021 alone.

There may be a substantial gap between the most recent models and those from five years ago. The fact that a solution was considered the best five years ago does not necessarily mean that it still outperforms the competition.

Evaluation errors - what do you need to know when choosing an artificial intelligence solution? 

Another important aspect of an informed choice is the testing methodology used by companies that build AI solutions. One needs to be aware of basic errors in evaluation, as well as practices that do not comply with widely accepted evaluation methodology but are meant to make the company look more credible. It is worth paying attention to whether the company has its own policy for testing and comparing AI models and whether it can credibly justify its selection of evaluation metrics and test data.


The evaluation metrics 

Depending on the type of task, different metrics can be specified. One of the most popular metrics is accuracy. It determines how many of the test examples were solved correctly. However, this won’t be the best way to evaluate a solution in many cases. 

Let's assume that an AI model is tasked with assessing whether users are suffering from a particular disease based on the symptoms they type in. Let's take a test sample of 100 users in whom we have diagnosed or ruled out the disease under study. We can evaluate the model by comparing it to a generally accepted method of diagnosis. If the model makes the correct diagnosis for 97 patients, we are talking about 97% accuracy. This result can be considered very good. However, consider a situation where only 3 of the patients studied were sick in the first place. Furthermore, let's assume that our model simply labels ALL patients as healthy, without the use of any AI whatsoever. The result would be 97% here, too, but we would hardly say that the model did its job well. Metrics such as precision and recall are meant to address this problem. In the patient example above, precision is the fraction of patients that the model classified as sick who were actually sick. Recall, in turn, is the fraction of all sick patients that the model managed to classify as sick.
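A minimal sketch of the example above, using scikit-learn as an assumed tooling choice: the trivial "everyone is healthy" model reaches 97% accuracy, yet its precision and recall for the sick class are both zero.

```python
# 100 patients, 3 of whom are actually sick (label 1 = sick, 0 = healthy).
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 3 + [0] * 97        # ground truth: 3 sick, 97 healthy
y_pred = [0] * 100                 # a "model" that labels every patient as healthy

print(accuracy_score(y_true, y_pred))                    # 0.97 - looks impressive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  - undefined (nobody flagged), reported as 0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0  - every sick patient is missed
```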

The metrics mentioned above will only work well for some tasks, such as classification. However, there are many solutions for which these metrics will be insufficient. When evaluating search engines or recommender systems, we often care about the ranked list of the most appropriate search results or recommendations. We expect relevant items to appear near the top of the result list, and the metrics that evaluate these solutions are meant to reflect this. The most popular ranking metrics are NDCG, Recall@N, Precision@N, MRR, and MAP. It is worth noting that what matters most is what appears at the top of the ranking (e.g., Recall@10 is computed on the first ten recommendations or search results and is usually a more meaningful indicator than Recall@100).
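A minimal sketch of Recall@K for a single user, with made-up item identifiers, shows how the cut-off changes the score:

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    top_k = set(ranked_items[:k])
    return len(top_k & set(relevant_items)) / len(relevant_items)

ranked = ["item_7", "item_2", "item_9", "item_4", "item_1"]  # model's ranked recommendations
relevant = {"item_2", "item_4", "item_8"}                    # items the user actually interacted with

print(recall_at_k(ranked, relevant, k=3))  # 0.33 - one of three relevant items in the top 3
print(recall_at_k(ranked, relevant, k=5))  # 0.67 - two of three relevant items in the top 5
```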

In the domain of natural language processing (NLP), we have a vast number of distinct metrics, such as BLEU or METEOR, that measure how much a generated text differs from the expected one, or how an automatically generated translation compares with one created by a human translator. We see, therefore, that different types of solutions and problem domains require different metrics, some of which are recognized by consensus as standard. Each measure exposes a different side of the issue. The introduction of proprietary evaluation metrics should arouse vigilance every time: they may be formulas purposefully constructed to hide the flaws of a particular algorithm. The most reliable evaluation metrics are the tried and tested ones known from the literature.
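As an illustration, a minimal sentence-level BLEU sketch using NLTK (an assumed tooling choice; the sentences are made up):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "on", "the", "mat"]    # tokenized human translation
candidate = ["the", "cat", "sits", "on", "the", "mat"]  # tokenized machine translation

# Smoothing avoids a zero score when some n-gram order has no overlap at all.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference translation
```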

Promises versus the reality - how to evaluate? 

Knowledge of evaluation metrics allows us to assess whether a company is promising things that are realistic or simply trying to pull the wool over our eyes. It is no surprise that a promise of a substantial increase in productivity or traffic, and ultimately in savings or profits, is tempting. However, only an understanding of your business lets you assess whether, say, a 50% increase in click-through rate (CTR) is realistic. Such a promise is certainly catchy PR-wise, but we must be aware that in most cases it is difficult to define a universal percentage of profit that a given AI method will deliver, as the areas in which the solutions are implemented differ greatly.

Using the example of recommender systems: in the scientific literature, the differences between the best models are usually no more than 1-3 points of a given measure (e.g., an increase of Recall@N from 36.2 to 37.5). Significantly higher gains may suggest errors in evaluation, such as comparison against weak or outdated baselines.

Evaluation datasets. Is the model evaluation reliable? 

Apart from the evaluation metrics themselves, the selection of test data is also extremely important. The solutions we want to compare should always be tested on the same dataset; only then can we realistically compare the performance of two models. It is important to remember that this result does not necessarily correspond to how the solution will perform in the final application, for example on complete production data that changes from day to day. You have to be careful about evaluations on unpublished, internal, hidden, or private datasets. Why? Internal datasets can be designed to produce completely arbitrary results, and results obtained in this way are impossible to verify. For many tasks there exist so-called benchmarks - specially selected datasets and evaluation metrics used to compare models. They are widely accepted by the scientific community and clearly illustrate how models perform on specific tasks. Popular benchmarks include, for recommendation tasks, the Netflix Prize dataset, the MovieLens datasets, Diginetica, and Yoochoose; for NLP tasks, GLUE and SuperGLUE; for image recognition, one of the accepted benchmarks is ImageNet.
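A minimal sketch of the principle that competing models must be scored on the exact same held-out split (scikit-learn and a fixed random seed are assumed choices here, not the only valid setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# The fixed random_state freezes the split, so both models are scored on identical test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy {accuracy:.3f}")
```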

Last but not least, one needs to bear in mind that even a small manipulation of a dataset can cause significant differences in the final results, so care must be taken to ensure that datasets are not automatically filtered or manually altered. 

The right team 

Creating artificial intelligence solutions requires expertise, so it can be important to check whether a company has an adequate staffing structure to develop innovative AI products. An AI team needs a diverse range of skills. Some roles focus on research work, which involves monitoring developments and creating new mathematical models and algorithms; others are related to implementing a given solution in a specific product. A good team benefits from people with a scientific background in computer science or mathematics, as well as AI developers with experience integrating solutions into particular applications. The role of MLOps engineer, focused on implementing good practices that increase the efficiency and reliability of the product lifecycle, is becoming more and more popular. A solid track record of papers published in international journals or conference proceedings is also an advantage, as it provides evidence that the team's research ideas have undergone a scientific review process.

Is AI at the heart of the product? Is AI in the product at all? 

It's also worth knowing that a company that boasts about using AI technology does not necessarily use it in its product. One practice is that some simple, ready-made AI tool is used for a specific sub-task, while the main task is solved with methods other than AI. This kind of practice often makes sense, unless the company calls its final solution AI-driven. Imagine a company that provides software for product categorization and uses a simple keyword search for the category name in the product description. This method has little to do with artificial intelligence. However, the company may use an off-the-shelf AI-based tool for lemmatization (reducing words to their base form) to make it easier to find the category name in the description. Thus, the company does not create an AI-based solution to categorize products; it only uses a ready-made tool in one part of the process.
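A hypothetical sketch of such a pipeline (spaCy is assumed here purely as an example of an off-the-shelf lemmatizer; the category names and product text are made up):

```python
# Keyword-based product categorization: the only "AI" is the off-the-shelf lemmatizer.
import spacy

# Requires the pre-trained pipeline: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
categories = ["shoe", "jacket", "watch"]

def categorize(description: str) -> str | None:
    # Reduce every word of the description to its base form...
    lemmas = {token.lemma_.lower() for token in nlp(description)}
    # ...then fall back to a plain keyword lookup - no model makes the final decision.
    for category in categories:
        if category in lemmas:
            return category
    return None

print(categorize("Lightweight running shoes with breathable mesh"))  # "shoe"
```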

Such a solution is likely to perform poorly, as the core method is very simple and does not rely on sophisticated AI. Moreover, the possibilities of further development will be limited and dependent on the producer of the off-the-shelf tool. Knowing how the final solution uses AI allows us to realistically assess the opportunities and potential risks associated with using this type of technology. Therefore, it is worth investigating which element of the overall solution uses AI and whether any research work related to developing that technology is being carried out.

Research by the venture capital fund MMC, published in its State of AI 2019 report, shows that such claims deserve limited trust: out of 2,830 start-ups purporting to create AI solutions, only 1,580 could be confirmed to actually use these technologies.

Human work under the hood of AI 

Two cases are also worth distinguishing: when human work is hidden under the hood of AI, and when AI openly supports human work. The Wizard of Oz experiment involves creating a situation in which a person interacts with a computer program believed to be autonomous, when in fact it is partially controlled by humans. There are known cases of AI companies that, unable to deliver the expected results with AI technology, quietly include humans in part of the process. Unless the company does this openly and clearly communicates the limits of its solution, such a practice meets the conditions of the Wizard of Oz experiment.

However, there are also cases, in specific industries, where solutions require human intervention because, for example, a model error could have fatal consequences, so a human verifies the program's responses. Intentionally including a person in the process is called "human-in-the-loop" and is usually good practice. It is a frequent approach, for example, in the case of programs that assist in medical diagnostics.

It is worth asking questions. 

Before using a particular company's solutions, it is worth asking questions about the limitations of the models on offer, in particular:

1. whether they use all the types of data we have, or only some of them;

2. what output values they are able to return (e.g., rankings, classifications, generated text), and whether these match all the business cases we want to handle;

3. whether they return only the predicted outcome (0/1, buy/no-buy) or also the probability of that outcome (see the sketch after this list);

4. whether the models are interpretable - once we run them, will we be able to understand why they made a particular decision and which factors played a decisive role;

5. how much data is needed for the model to work satisfactorily;

6. whether the data needs to be prepared and cleaned up in a certain way, and if so, who will do it;

7. whether the AI system runs on the vendor's infrastructure, on the customer's, or in the cloud - a question relevant to data privacy.
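Regarding point 3, a minimal sketch of the difference between a bare prediction and a probability (scikit-learn is an assumed choice and the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic customer features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic buy / no-buy labels

model = LogisticRegression().fit(X, y)
customer = X[:1]

print(model.predict(customer))        # e.g. [1]            - the bare buy/no-buy decision
print(model.predict_proba(customer))  # e.g. [[0.18 0.82]]  - how confident the model is in each class
```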

The golden rule: it is always worth verifying 

As with financial audits and other ways of checking the integrity of companies, there is no foolproof method for determining whether a company is offering us good AI, or whether it is offering AI at all. However, deepening our knowledge in this area will help us ask the right questions. Artificial intelligence has already found its place in our world and affects our daily lives. Basic knowledge in this area will not only help us make informed choices of products and services that support our business, but should be considered common knowledge in a world where modern technology shapes our day-to-day life. If we want to understand the reality around us, we need to understand artificial intelligence.

The article was written by Maria Janicka and Barbara Rychalska.

--------------- 

The original text was published in Business Insider Polska; the article is part of the “UNDERSTANDING AI” series created together with Synerise.