AI and the Everything in the Whole Wide World Benchmark

This is my personal point of view on the paper AI and the Everything in the Whole Wide World Benchmark by Inioluwa Deborah Raji et al. The paper can be downloaded at AI and the Everything in the Whole Wide World Benchmark.

I think the paper was accepted for publication at NeurIPS 2021 because it highlights the risk of machine learning's over-reliance on benchmarking. There is growing enthusiasm for "north star" datasets, which position themselves (or are hyped by the ML community) as benchmarks for "general-purpose" visual object recognition or language understanding; in many ways, the authors believe, this creates undesirable consequences. Moreover, the paper critically supports its arguments with historical background and conveys the message that the machine learning community should not confuse "models of reality" with reality itself.

The paper mainly focuses on two "general" dataset-based benchmarks, namely ImageNet and GLUE. In the case of ImageNet, for example, the project's stated "attempt to map the entire world of objects" mirrors the claim of Grover's museum, in the children's story, to showcase "everything in the whole wide world". The paper then examines the validity and limitations of a "general" benchmark, such as limited task design and de-contextualized data and performance reporting. The authors emphasize that no dataset is neutral: each carries inherent subjective biases, for example those specific to certain cultural views or to how the photos were taken during the creation of the dataset. According to the authors, all of this leads to inappropriate community use.

Strengths and Weaknesses: The paper brings an important philosophical perspective to "designed benchmarks": biases introduced, intentionally or unintentionally, can directly threaten construct validity. The authors rigorously develop their arguments and question the validity and imperfect nature of "general" dataset-based benchmarks. By incorporating historical arguments from previous work, the authors highlight that state-of-the-art chasing, or metric chasing, is an ethical issue that leads to manipulation and can cause long-term damage to the field of general artificial intelligence.

The paper elucidates machine learning's over-reliance on benchmarking very well; nevertheless, the proposed solutions could have appeared in the main body of the paper rather than in the appendix. Moreover, attaining neutrality in a benchmark and capturing "general capabilities" may be far more difficult, or even impossible, in the current context, but could it become feasible in the future? Is it all a question of how large and diverse Grover's museum space is? Furthermore, defining what general intelligence is remains a hard, open-ended problem in epistemology. In that sense, claiming that the tasks and subtasks we develop for machine learning "general" benchmarks are not good abstractions is itself questionable.