March 4, 2024, 11:42

The Daily

Read the World Today

The AI Act should use humans to monitor AI only when effective

The EU’s proposed Artificial Intelligence Act (AI Act) wants to use humans to oversee AI-generated decisions, but recent evidence suggests that such oversight should be carefully tested for effectiveness before it is relied upon.

Nijeer Parks was arrested for a crime he didn’t commit. He was arrested because of an incorrect match produced by a facial recognition algorithm. His is already the third known case in the US of a Black man being wrongfully arrested in this way, and these false arrests illustrate the lower accuracy of facial recognition algorithms for Black faces.

But what failed Mr Parks was not just technology. Had police officers double-checked the matching images, they would have noticed that the suspect and Mr Parks did not look alike. The suspect in the photo even wore earrings, whereas Mr Parks had no piercings. The human police officers should have overseen the algorithmically generated arrest recommendation but failed to do so.

The EU’s proposed AI Act relies heavily on humans overseeing AI decision support systems to prevent harmful outcomes in high-risk applications. Machine learning algorithms increasingly support human decision-making in settings that are critical for our society: in healthcare, when algorithms recommend which patients should undergo expensive treatments; in hiring, when they suggest which applicant to invite for an interview; and in lending, when they inform financial loan decisions. In all of these cases, a human has the final say and could adjust or even overrule an algorithmically derived recommendation.

Human oversight of AI systems can work well. One study in a child welfare decision-making context in the United Kingdom found that humans were indeed able to identify poor algorithmic advice. For another illustration, consider Large Language Models like ChatGPT. Presumably, many of the readers of this opinion piece have been trying out such models themselves in the last couple of weeks and probably soon encountered situations in which the chatbots produced obviously nonsensical answers. With these examples in mind, it is easy to imagine applications where human oversight has its place.

Yet all too often, human oversight fails, as in the case of Mr Parks. The evidence that humans often cannot be good supervisors for AI has mounted in recent years. In a new experiment, we show that people cannot accurately assess the quality of an algorithm's advice.

Participants in the experiment had to solve a simple task and received advice from a supporting algorithm. Unbeknownst to them, we had tampered with the algorithm, resulting in poor recommendations. Despite this, our participants remained steadfast in their reliance on the algorithm, failing to recognise the magnitude of its inaccuracy even after multiple rounds of playing the game. This result from the lab is supported by various field studies: judges, physicians, and the police have all been found to be poor algorithm monitors.

The reasons for this finding are as intriguing as they are plentiful. Situations where humans make decisions with the help of algorithms are teeming with psychological effects. For example, to a certain degree, people feel that relying on AI absolves them of their responsibility. The AI's recommendation also sets a seemingly objective default from which it is difficult to deviate.

Often, when the task is perceived as more abstract and mathematical in nature, humans have been found to rely too much on algorithms, a phenomenon termed “algorithm appreciation”. In other cases, where the setting is perceived as more subjective, one finds “algorithm aversion”: people wrongly disregard superior algorithmic advice.

The insight that humans are not infallible overseers of AI has yet to make its way into the draft for the AI Act.

Based on our research, we therefore make three recommendations. First, the AI Act should recognise that human oversight is no panacea and can fail. As described above, whether human oversight works depends on the task and the context.

Second, tests assessing the feasibility and efficacy of human oversight in preventing harm should therefore be mandatory for high-risk AI applications. In their most elementary form, such tests would compare actual outcomes under human oversight with the hypothetical outcomes that would have resulted without human intervention. A more advanced test could use context-dependent additional information that might help humans to be effective AI supervisors. An example of such additional information could be to provide timely feedback about past decision outcomes, an approach which has shown promising results in our own research.

Finally, if human oversight is found to fail for a given AI application, one should refrain from using the AI system in its current form.
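For readers who want to see what the elementary test in the second recommendation might look like in practice, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the procedure used in our experiment: the `Case` record, its field names and the `oversight_report` function are hypothetical, decisions are assumed to be binary, and the correct decision is assumed to be known after the fact for every case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One decision (hypothetical record layout, binary decisions assumed)."""
    ai_recommendation: bool  # what the algorithm advised
    final_decision: bool     # what the human decided after seeing the advice
    correct_decision: bool   # what turned out to be right, known afterwards


def oversight_report(cases):
    """Compare outcomes under human oversight with the AI-only counterfactual."""
    if not cases:
        raise ValueError("need at least one case")
    n = len(cases)
    # Errors if every AI recommendation had been followed without intervention.
    ai_errors = sum(c.ai_recommendation != c.correct_decision for c in cases)
    # Errors in the decisions actually taken under human oversight.
    final_errors = sum(c.final_decision != c.correct_decision for c in cases)
    # Cases in which the human departed from the AI's advice.
    overrides = [c for c in cases if c.final_decision != c.ai_recommendation]
    # With binary decisions, an override either fixes an AI error or creates one.
    helpful = sum(c.final_decision == c.correct_decision for c in overrides)
    return {
        "ai_only_error_rate": ai_errors / n,
        "overseen_error_rate": final_errors / n,
        "helpful_overrides": helpful,
        "harmful_overrides": len(overrides) - helpful,
    }


# Made-up illustrative data, not results from any study.
cases = [
    Case(True, True, True),    # AI right, human follows
    Case(True, True, False),   # AI wrong, human follows (oversight fails)
    Case(True, False, False),  # AI wrong, human corrects it
    Case(False, True, False),  # AI right, human overrides wrongly
]
report = oversight_report(cases)
print(report)
```

If the error rate under oversight is not clearly lower than the AI-only error rate, or if harmful overrides are as frequent as helpful ones, the human in the loop is not preventing harm in that application. A real test would also need enough cases to support a statistical comparison.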

Nijeer Parks was released after ten days in jail and spent around $5,000 on his defence. More carefully implemented human oversight has the potential to prevent similar injustices in the future.