Neural network (NN) models have brought spectacular progress in computer vision and visual computational neuroscience over the past decade, but until recently their performance was quite brittle, breaking down when images are compromised by occlusions, lack of focus, distortions, and noise: sources of nuisance variation that human vision is robust to. The robustness of recognition has substantially improved in recent models with extensive training and data augmentation.
Extensive visual experience also drives the development of human visual abilities. Humans, too, experience a vast number of visual impressions, many of them compromised and all of them embedded in a context of previous visual impressions and information from other sensory modalities, including audition, that can constrain the interpretation of the scene and drive visual learning. Do state-of-the-art robust NN models provide a good model of robust recognition in humans, then?
A new paper by Huber et al. (pp2022) suggests that a training-based account of the robustness of human vision, along the lines of the recent advances in getting NN models to be more robust through extensive training, is uncompelling. Current NN models, they argue, lack some essential computational mechanisms that enable the human brain to achieve robustness with less visual experience.
The authors measured recognition abilities in 146 children and adolescents, aged 4-15, and found that even the 4-6 year-olds outperformed current NN models at recognizing images robustly under substantial local distortions (so-called eidolon distortions). They argue that back-of-the-envelope estimates of the amount of visual experience suggest that humans achieve greater robustness with less training data. The human visual system must have some additional mechanism in place that current NN models lack.
One possibility is that human vision has mechanisms to perceive the global shape of objects more robustly than current NN models. Using ingenious shape-texture-cue-conflict stimuli, which they introduced in earlier work, the authors show that the well-known human bias in favor of classifying objects by their shape is already present in the 4-6 year-olds. Testing the models with the shape-texture-cue-conflict stimuli showed, by contrast, that even the most extensively trained and robust NN models rely much more strongly on texture than on shape.
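To make the shape-versus-texture comparison concrete, here is a minimal sketch of how a shape bias could be computed from cue-conflict responses. It assumes the common definition (the fraction of shape-consistent decisions among all decisions that match either the shape or the texture category); the function name, the data layout, and the example labels are hypothetical and not taken from the paper's code.

```python
def shape_bias(responses):
    """Fraction of shape decisions among decisions matching shape or texture.

    `responses` is a list of (predicted, shape_label, texture_label) tuples
    for cue-conflict stimuli, where shape_label != texture_label.
    Returns None if no response matched either cue.
    """
    shape_hits = sum(1 for pred, shape, _ in responses if pred == shape)
    texture_hits = sum(1 for pred, _, texture in responses if pred == texture)
    total = shape_hits + texture_hits
    if total == 0:
        return None  # bias is undefined without any cue-consistent response
    return shape_hits / total


# Hypothetical toy responses: (predicted, shape category, texture category)
toy_responses = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "clock"),          # shape decision
    ("dog", "cat", "clock"),          # neither cue, ignored
]
toy_bias = shape_bias(toy_responses)
print(f"shape bias: {toy_bias:.2f}")
```

A value near 1 indicates reliance on shape, a value near 0 reliance on texture. Responses matching neither cue are excluded from the ratio under this assumed definition.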
To compare the amount of visual experience between humans and models, the authors offer a back-of-the-envelope calculation (their appropriate term), in which they quantify human visual experience at a given age in the currency of NN models: number of images. They use estimates of the number of waking hours across childhood and of the number of fixations per second. One fixation is assumed to be roughly equivalent to one training image. According to such estimates, the best model (SWAG) requires about an order of magnitude more data to reach human-level robustness.
This calculation and the corresponding figure are interesting because they provide a starting point for an important discussion. However, the estimate suggesting an order of magnitude difference in the amount of data required could easily be off by more than an order of magnitude.
More importantly, the estimate (though it is an interesting starting point) is fundamentally flawed and should be accompanied by more critical arguments. Human visual experience is temporally continuous and dependent and therefore cannot meaningfully be quantified in terms of a number of training images or exposures (counting multiple exposures to augmented versions of the same image across epochs).
It is also unclear why fixations should be equated to images. We see a dynamic world evolve at a rate much faster than the rate of fixations. Moreover, fixations are actively chosen, so their information content may be greater than that of a similar number of i.i.d. samples. (This could count as one of the qualitative differences between primate vision and current NN models: primate visual recognition is active perception, and visual learning is active learning. The animal makes its own curriculum, and this could contribute to its learning more from less data.)
A simpler calculation (and the one I couldn’t resist typing into my calculator before getting to the authors’) would equate frames (perhaps 10 per second?) to training images. Of course, frames are not a well-defined concept, either, in the context of human visual experience and, at 10 frames per second, successive frames are highly dependent. However, temporal dependency may be a critical feature, helping rather than hurting visual learning. At 10 frames per second, the calculation yields an estimate surprisingly close to the “amount of visual experience” of the state-of-the-art models.
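The sensitivity of such estimates to the chosen sampling rate can be illustrated with a short sketch. The age, the number of waking hours per day, and the fixation rate below are hypothetical placeholders, not the values used by Huber et al.; only the rate of 10 frames per second comes from the text above.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365


def visual_samples(age_years, waking_hours_per_day, samples_per_second):
    """Count 'training-image equivalents' accumulated by a given age."""
    waking_seconds = (
        age_years * DAYS_PER_YEAR * waking_hours_per_day * SECONDS_PER_HOUR
    )
    return waking_seconds * samples_per_second


# Hypothetical inputs for illustration only.
AGE_YEARS = 5
WAKING_HOURS_PER_DAY = 12
FIXATIONS_PER_SECOND = 3   # assumed fixation rate
FRAMES_PER_SECOND = 10     # frame rate suggested in the text

fixation_count = visual_samples(AGE_YEARS, WAKING_HOURS_PER_DAY, FIXATIONS_PER_SECOND)
frame_count = visual_samples(AGE_YEARS, WAKING_HOURS_PER_DAY, FRAMES_PER_SECOND)

print(f"fixation-based estimate: {fixation_count:.2e} samples")
print(f"frame-based estimate:    {frame_count:.2e} samples")
```

Under these assumed inputs, the frame-based count comes to roughly 0.8 billion samples, the same order of magnitude as the largest training set considered (1B images), whereas the fixation-based count is lower by the ratio of the two rates. Other plausible choices for age and waking hours shift both numbers, which is the point: the conclusion depends heavily on what is counted as a sample.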
Another reason why comparing visual experience between models and humans is inherently difficult concerns the quality, rather than the quantity, of the visual input. The out-of-distribution generalization challenge is not (and cannot readily be) matched between humans and models. Human visual experience may include more distorted inputs due to physical processes in the world such as rain and glass obscuring the scene as well as due to optical imperfections of our eyes. As a result, human visual experience may provide better training for generalizing to the eidolon distortions than the training sets used for the most extensively trained models (SWAG and SWSL).
The claims relating to the comparison of the “amount of visual experience” between humans and models should be tempered in revision and more critically discussed with a view to directions for future studies. It would also be good to add statistical inference to demonstrate that the reported effects generalize across stimuli and subjects. The error consistency analysis is important. However, I find the boxplots hard to interpret. It would be great to see inferential comparisons between different DNNs, where currently DNNs are lumped together despite the fact that there appears to be little inter-DNN error consistency.
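For readers unfamiliar with error consistency, the following is a minimal sketch that assumes the measure is Cohen's kappa computed on trial-by-trial correctness of two observers on the same stimuli, as is common in this literature. The function name and toy data are hypothetical, and the paper's own implementation may differ.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa on trial-by-trial correctness of two observers.

    `correct_a` and `correct_b` are equal-length sequences of booleans,
    one entry per stimulus, True where the observer responded correctly.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # Observed agreement: both correct or both wrong on the same trial.
    observed = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected from the two accuracies alone (independent errors).
    expected = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: four trials per observer.
same_errors = error_consistency([True, True, False, False],
                                [True, True, False, False])
unrelated_errors = error_consistency([True, True, False, False],
                                     [True, False, True, False])
print(same_errors, unrelated_errors)
```

A kappa of 0 means two observers agree no more than their accuracies alone predict, and a kappa of 1 means they err on exactly the same trials. Inferential comparisons between individual DNNs would then amount to asking whether such pairwise values differ reliably across stimuli.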
The authors are almost certainly correct that current NN models lack essential computational mechanisms. However, I’m not sure if the estimates of the amount of visual experience in the current version of the paper provide strong evidence for the greater data efficiency of human vision.
Overall this paper describes an important, carefully designed and executed study and offers a unique open-science human developmental cross-sectional data set on object-recognition robustness for further systematic analyses. The use of state-of-the-art models and the careful discussion of the state of the field make this a great contribution.
- Important comprehensive novel behavioral data set
- Challenge of experimenting with kids of different ages met with carefully designed and executed experiment
- All code and data available via github
- Comparison to four NN models that represent the state of the art at out-of-distribution robust recognition and span four orders of magnitude of training-set size (1M, 10M, 100M, 1B images)
- Interesting discussion highlighting the difficulty of quantifying and comparing “the amount of visual experience” between models and humans
- The “back-of-the-envelope” calculation on the amount of visual experience is not just a very rough approximation, but conceptually flawed: Human visual experience is temporally continuous and dependent, and thus cannot be approximately quantified in terms of a number of i.i.d. images.
- The out-of-distribution generalization challenge is not (and cannot readily be) matched between humans and models. Human visual experience may provide better training for generalizing to the eidolon distortions than the training sets used for the most extensively trained models (SWAG and SWSL).
- Hypotheses are not evaluated by statistical inference to generalize to the populations of subjects and stimuli.
- Age may be confounded with the ability to attend to the task and with factors related to participant recruitment. (However, this reflects inherent difficulties of the research, not shortcomings of this particular study.)
- Model architecture is not varied systematically and independently of training regime. (However, this is very hard to achieve given the scale of the models and training sets, and the key conclusions appear compelling despite this shortcoming.)
“not only subjective effortless but objectively often impressive” typo: should be “subjectively”. Also: impressiveness is inherently subjective.
knive -> knife
Fig. 4: Panel labels (a), (b) should be bigger, bold and above, not below the panels. The top should be a.
Fig. 4a: The logarithmic horizontal axis tick labels are inconsistent between the panels.
Fig. 5 (left): accuracy delta should be described as “4-6 year-olds minus adults”, not vice versa.