Majaj & Pelli (pp2017) give a brief overview of classification models in vision science, leading from linear discriminants and the perceptron to deep neural networks. They discuss some of the perks and perils of using machine learning, and deep learning in particular, in the study of biological vision.
This is a brief and light-footed review that will be of interest to vision scientists wondering whether and why to engage machine learning and deep learning in their own work. I enjoyed some of the thoughtful notes on the history of classification models and the sketch of the progression toward modern deep learning.
The present draft lists some common arguments for and against deep learning models, but falls short of presenting a coherent perspective on whether and why deep learning is important for vision science, and on which aspects are substantial and which are hype. It also doesn’t really explain deep learning or how it relates to the computational challenge of vision.
The overall conclusion is that machine learning and deep learning are useful modern tools for the vision scientist. In particular, the authors argue that deep neural networks provide a “benchmark” to compare human performance to, replacing the optimal linear filter and signal detection theory as the normative benchmark for vision. This misses what I would argue is the bigger point: deep neural networks provide an entry point for modeling brain information processing and engaging the real problem of vision, rather than a toy version of the problem that lacks all of vision’s essential challenges.
Suggestions for improvements
(1) Clearly distinguish deep learning within machine learning
The abstract doesn’t mention deep learning at all. As I was reading the introduction, I wondered whether “deep learning” had been added to the title of a paper about machine learning in vision science only at the very end. Deep learning is defined as “the latest version of machine learning”. This is incorrect. Rather than a software product that is updated in a sequence of versions, machine learning is a field that explores a wide variety of models and inference algorithms in parallel. The fact that deep learning (which refers to the learning of deep neural network models) is getting a lot of attention at the moment does not mean that other approaches, notably Bayesian nonparametric models, have lost appeal. How is deep learning different? Does it matter more for vision than other approaches? If so, why?
(2) Explain why depth matters
The multiple stages of nonlinear transformation that define deep learning models are essential for many real-world applications, including vision. I think this point should be central as it explains why vision science needs deep models.
(3) Clearly distinguish the use of machine learning models to (a) analyze data and to (b) model brain information processing
The current draft largely fails to distinguish two ways of using machine learning in vision science: to analyze data (e.g. decode neuronal population codes) and to model brain information processing. Both are important, but the latter more fundamentally advances the field.
(4) Relate classification to machine learning more broadly and to vision
The present draft presents a brief history of classification models. Classification is a small (though arguably key) problem within both machine learning and vision. Why is this particular problem the focus of such a large literature and of this review? How does it relate to other problems in machine learning and in vision?
(5) Separate the substance from the hype and present a coherent perspective
Arguments for and against deep learning are listed without evaluation or a coherent perspective. For example, is it true that deep learning models have “too many parameters”? Should we strive to model vision with a handful of parameters? Or do models need to be complex because vision requires complex domain knowledge? Do tests of generalization performance address the issue of overfitting? (No, no, yes, yes.) Note that the modern version of statistical modeling, often touted as more rigorous, is Bayesian nonparametrics, which is defined by placing no limits on the parametric complexity of a model.
(6) Consider addressing my particular comments below.
Particular comments
“Many perception scientists try to understand recognition by living organisms. To them, machine learning offers a reference of attainable performance based on learned stimuli.”
It’s not really a normative reference. There is an infinity of neural network models, and the performance of any particular one can never be claimed to be “ideal”. Deep learning is worse in this respect than the optimal linear filter (which provides a normative reference for a task, with the caveat that the task is not vision).
“Deep learning is the latest version of machine learning, distinguished by having more than three layers.”
It’s not the “latest version”; rather, it’s an old variant of machine learning that is currently very successful and popular. Also, a better definition of “deep” is that there is more than one hidden layer intervening between the input and output layers.
“It is ubiquitous in the internet.”
How is this relevant?
“Machine learning shifts the emphasis from how the cells encode to what they encode, i.e. from how they encode the stimulus to what that code tells us about the stimulus. Mapping a receptive field is the foundation of neuroscience (beginning with Weber’s 1834/1996 mapping of tactile “sensory circles”), but many young scientists are impatient with the limitations of single-cell recording: looking for minutes or hours at how one cell responds to each of perhaps a hundred different stimuli. New neuroscientists are the first generation for whom it is patently clear that characterization of a single neuron’s receptive field, which was invaluable in the retina and V1, fails to characterize how higher visual areas encode the stimulus. Statistical learning techniques reveal “how neuronal responses can best be used (combined) to inform perceptual decision-making” (Graf, Kohn, Jazayeri, & Movshon, 2010).”
This is an important passage. It’s true that single neurons in inferior temporal cortex, for example, might be (a) difficult to characterize singly with tuning functions, (b) idiosyncratic to a particular animal, and (c) so many in number and variety that characterizing them one by one seems hopeless. It therefore appears more productive to focus on understanding the population code. However, it is not only what is encoded in the population, but also how it is encoded. The format determines what inferences are easy given the code. For example, we can ask what information could be gleaned by a single downstream neuron computing a linear or radial-basis-function readout of the code.
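To make the readout question concrete, here is a minimal sketch with simulated data (my own illustration; the “population responses” are synthetic, not real recordings): train a linear readout and a radial-basis-function readout on a simulated population code and ask how much category information each can extract.

    import numpy as np
    from sklearn.svm import SVC, LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Simulate a population code: 400 trials, 100 neurons, 2 stimulus
    # categories, with category information carried by a noisy pattern
    # distributed across the population.
    n_trials, n_neurons = 400, 100
    labels = rng.integers(0, 2, n_trials)
    signal = rng.standard_normal(n_neurons)             # category-selective pattern
    responses = (labels[:, None] * signal[None, :]      # pattern present for category 1
                 + rng.standard_normal((n_trials, n_neurons)))  # trial-by-trial noise

    # What could a downstream neuron glean with a linear readout?
    linear_acc = cross_val_score(LinearSVC(), responses, labels, cv=5).mean()

    # And with a radial-basis-function readout?
    rbf_acc = cross_val_score(SVC(kernel="rbf"), responses, labels, cv=5).mean()

    print(f"linear readout accuracy: {linear_acc:.2f}")
    print(f"RBF readout accuracy:    {rbf_acc:.2f}")

Cross-validated decoding accuracy of this kind quantifies what a given readout mechanism could extract from a code; comparing readout families (linear vs. nonlinear) probes the format of the code, not just its content.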
“For psychophysics, Signal Detection Theory (SDT) proved that the optimal classifier for a signal in noise is a template matcher (Peterson, Birdsall, & Fox, 1954; Tanner & Birdsall, 1958).”
Detecting chihuahuas in complex scenes can be considered an example of detecting “signal in noise”, and it is an example of a visual task. A template matcher is certainly not optimal for this problem (in fact it will fail severely at this problem). It would help here to define signal and noise.
The problem of detecting a fixed pattern in Gaussian noise needs to be explained early in any course on vision, so as to inoculate students against the misconstrual of the problem of vision that it represents. On a more conciliatory note, one could argue that although detecting a fixed pattern in noise is a misleading oversimplification of vision, it captures a component of the problem, and its optimal solution, template matching, captures a component of the solution to vision. Deep feedforward neural networks could be described as hierarchical template matchers, and they do seem to capture some aspects of vision.
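For reference, the textbook SDT setting can be stated compactly (a standard derivation, not specific to the manuscript): the “signal” is a known, fixed pattern s and the “noise” is additive white Gaussian noise of variance \sigma^2. The log-likelihood ratio for an observation x is then

    \log \frac{p(x \mid \text{signal} + \text{noise})}{p(x \mid \text{noise})}
      = \frac{\lVert x \rVert^2 - \lVert x - s \rVert^2}{2\sigma^2}
      = \frac{s^{\top} x}{\sigma^2} - \frac{\lVert s \rVert^2}{2\sigma^2},

so the optimal decision depends on x only through the template match s^{\top} x. Template matching is optimal under exactly these assumptions, which the detection of objects in complex scenes violates.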
“SDT has been a very useful reference in interpreting human psychophysical performance (e.g. Geisler, 1989; Pelli et al., 2006). However, it provides no account of learning. Machine learning shows promise of guiding today’s investigations of human learning and may reveal the constraints imposed by the training set on learning.”
In addition to offering learning algorithms that might relate to how brains learn, machine learning makes it feasible to work with realistically complex models in the first place.
“It can be hard to tell whether behavioral performance is limited by the set of stimuli, or the neural representation, or the mismatch between the neural decision process and the stimulus and task. Implications for classification performance are not readily apparent from direct inspection of families of stimuli and their neural responses.”
Intriguing, but cryptic. Please clarify.
“Some biologists complain that neural nets do not match what we know about neurons (Crick, 1989; Rubinov, 2015).”
It is unclear how the ideal “match” should even be defined. All models abstract, and that is their purpose. Stating a feature of biology that is absent in the model does not amount to a valid criticism. But there is a more detailed case to be made for incorporating more biologically realistic dynamic components, so please elaborate.
“In particular, it is not clear, given what we know about neurons and neural plasticity, whether a backpropagation network can be implemented using biologically plausible circuits (but see Mazzoni et al., 1991, and Bengio et al., 2015).”
Neural net models can be good models of perception without being good models of learning. There has also been a recent resurgence in work exploring how backpropagation, or a closely related form of credit assignment, might be implemented in brains. Please discuss the work along these lines by Senn, Richards, Bogacz, and Bengio.
“Some biological modelers complain that neural nets have alarmingly many parameters. Deep neural networks continue to be opaque”
Why are many parameters “alarming” from the more traditional perspective on modeling? Do you feel that the alarm is justified? My view is that the history of AI has shown that intelligence requires rich domain knowledge. Simple models therefore will not be able to explain brain information processing. Machine learning has taught us how to learn complex models and avoid their pitfalls (overfitting).
“Some statisticians worry that rigorous statistical tools are being displaced by machine learning, which lacks rigor (Friedman, 1998; Matloff, 2014, but see Breiman, 2001; Efron & Hastie, 2016).”
The classical simple models can’t cut it, so their rigor doesn’t help us. Machine learning has boldly engaged the complex models that AI and brain science require. To be able to do this, it initially took a pragmatic computational approach rather than a formal probabilistic one. However, machine learning and statistics have since grown together in many ways, providing a very general perspective on probabilistic inference that combines complexity and rigor.
“It didn’” (p. 9) Fragment.
“Unproven convexity. A problem is convex if there are no local minima other than the global minimum.”
I think this is not true. Here’s my current understanding: if a problem is convex, then any local minimum is the global minimum. This is convenient for optimization and is provably not the case for neural networks. However, the reverse implication does not hold: if every local minimum is a global minimum, the function is not necessarily convex. There is a category of cost functions that are not convex, but in which every local minimum is a global minimum. Neural networks appear to fall into this category (at least under certain conditions that tend to hold in practice).
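To make the distinction explicit (my own illustration, not taken from the manuscript): a function f is convex if

    f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)
      \quad \text{for all } x, y \text{ and } \lambda \in [0, 1],

which implies that every local minimum is global. The converse fails: f(x) = (x^2 - 1)^2 is not convex (its second derivative is negative at 0), yet its only local minima, x = \pm 1, both attain the global minimum value of 0.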
Note that there can be multiple global minima. In fact, the error function of a neural network over the weight domain typically has many symmetries, with any given set of weights having many computationally equivalent twins (i.e. the model computes the same overall function for different parameter settings; see the sketch below). The high dimensionality, however, is not a curse, but a blessing for gradient descent: in a very high-dimensional weight space, it is unlikely that we find ourselves trapped with the error surface rising in all directions, because there are so many directions in which to escape. Several papers have argued that local minima are not an issue for deep learning. In particular, it has been argued that every local minimum is a global minimum, that every other critical point is a saddle point, and that saddle points are the real challenge. Moreover, deep nets with sufficient parameters can fit the training data perfectly (interpolating) while generalizing well (which, surprisingly, some people find surprising). There is also evidence that stochastic gradient descent finds flat minima corresponding to robust solutions.
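A minimal NumPy sketch of the “computationally equivalent twins” point (my own illustration; the toy network and all numbers are arbitrary): permuting the hidden units of a small network, together with the corresponding rows and columns of its weight matrices, yields a different point in weight space that computes exactly the same input–output function.

    import numpy as np

    rng = np.random.default_rng(0)

    # A tiny two-layer network: y = W2 @ relu(W1 @ x + b1) + b2
    n_in, n_hidden, n_out = 4, 8, 3
    W1 = rng.standard_normal((n_hidden, n_in))
    b1 = rng.standard_normal(n_hidden)
    W2 = rng.standard_normal((n_out, n_hidden))
    b2 = rng.standard_normal(n_out)

    def net(x, W1, b1, W2, b2):
        h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
        return W2 @ h + b2

    # Permute the hidden units: a different point in weight space...
    perm = rng.permutation(n_hidden)
    W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

    # ...that computes exactly the same function.
    x = rng.standard_normal(n_in)
    assert np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2))

With n_hidden hidden units there are n_hidden! such permutation twins for any weight setting, before even counting other symmetries (e.g. sign-flip symmetries for tanh units).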

“This [convexity] guarantees that gradient-descent will converge to the global minimum. As far as we know, classifiers that give inconsistent results are not useful.”
That doesn’t follow. A complex learner, such as an animal or a neural net model, with idiosyncratic and stochastic initialization and experience, may converge to an idiosyncratic solution that is still “useful” – for example, one that classifies with high accuracy and makes a small proportion of idiosyncratic errors.
“Conservation of a solution across seeds and algorithms is evidence for convexity.”
No, but it may be evidence for a minimum with a large basin of attraction. One would also need to define what counts as conservation of a solution: (1) identical weights, or (2) computationally equivalent weights (the same input–output mapping). Definition 2 seems more helpful and relevant.
““Adversarial” examples have been presented as a major flaw in deep neural networks. These slightly doctored images of objects are misclassified by a trained network, even though the doctoring has little effect on human observers. The same doctored images are similarly misclassified by several different networks trained with the same stimuli (Szegedy, et al., 2013). Humans too have adversarial examples. Illusions are robust classification errors. […] The existence of adversarial examples is intrinsic to classifiers trained with finite data, whether biological or not.”
I agree. We will know whether humans, too, are susceptible to the type of adversarial example described in the cited paper, as soon as we manage to backpropagate through the human visual system so as to construct comparable adversarial examples for humans.
“SDT solved detection and classification mathematically, as maximum likelihood. It was the classification math of the sixties. Machine learning is the classification math of today. Both enable deeper insight into how biological systems classify. In the old days we used to compare human and ideal classification performance. Today, we can also compare human and machine learning.”
“…the performance of current machine learning algorithms is a useful benchmark”
SDT is classification math for linear models; machine learning is classification math for more complex models. These models enable us to tackle the real problem of vision. Rather than comparing human performance to a normative ideal of performance on a toy task, we can use deep neural networks to model the brain information processing underlying visual recognition. We can evaluate the models by comparing their internal representations to brain representations and their behavior to human behavior, including not only the ways they shine, but also the ways they stumble and fail.
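One common way to make such comparisons concrete is representational similarity analysis; the following is a generic sketch with synthetic data (my own illustration, not the manuscript’s method): characterize model and brain by the pairwise dissimilarities among their responses to the same stimuli, then correlate the two dissimilarity structures.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    # Synthetic responses to the same 50 stimuli: a model layer (50 x 512 units)
    # and a brain region (50 x 200 recording channels). A real analysis would use
    # actual network activations and neural or fMRI measurements.
    model_acts = rng.standard_normal((50, 512))
    brain_acts = rng.standard_normal((50, 200))

    # Representational dissimilarity matrices (condensed vector form): pairwise
    # distances between the response patterns evoked by each pair of stimuli.
    model_rdm = pdist(model_acts, metric="correlation")
    brain_rdm = pdist(brain_acts, metric="correlation")

    # Compare the two representational geometries with a rank correlation.
    rho, _ = spearmanr(model_rdm, brain_rdm)
    print(f"model-brain RDM correlation: {rho:.2f}")

On the behavioral side, the analogous move is to compare not just overall accuracy but the full patterns of errors and confusions made by models and human observers.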