Montobbio, Bonnasse-Gahot, Citti, & Sarti (pp2019) present an interesting model of lateral connectivity and its computational function in early visual areas. Lateral connections emanating from each unit drive other units to the degree that they are similar in their receptive profiles. Two units are symmetrically laterally connected if they respond to stimuli in the same region of the visual field with similar selectivity.
More precisely, lateral connectivity in this model implements a diffusion process in a space defined by the similarity of bottom-up filter templates. The similarity of the filters is measured by the inner product of the filter weights. Two filters that do not spatially overlap, thus, are not similar. Two filters are similar to the extent that they don’t merely overlap, but have correlated weight templates. Connecting units in proportion to their filter similarity results in a connectivity matrix that defines the paths of diffusion. The diffusion amounts to a multiplication with a convolution matrix. It is the activations (after the ReLU nonlinearity) that form the basis of the linear diffusion process.
The idea is that the lateral connections implement a diffusive spreading of activation among units with similar filters during perceptual inference. The intuitive motivation is that the spreading activation fills in missing information or regularizes the representation. This might make the representation of an image compromised by noise or distortion more like the representation of its uncompromised counterpart.
Instead of performing n iterations of the lateral diffusion at inference, we can equivalently take the convolutional matrix to the n-th power. The recurrent convolutional model is thus equivalent to a feedforward model with the diffusion matrix multiplication inserted after each layer.
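The equivalence between n diffusion iterations and a single multiplication with the n-th matrix power can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it ignores the spatial (convolutional) structure, and the filter bank, the spectral-norm scaling, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bank of 8 units, each with a 25-dimensional filter template.
filters = rng.standard_normal((8, 25))

# Lateral connectivity: inner products between filter templates.
K = filters @ filters.T

# Assumption: scale by the largest singular value so repeated diffusion stays bounded.
K = K / np.linalg.norm(K, 2)

# Post-ReLU activations of the 8 units for one input.
a = np.maximum(rng.standard_normal(8), 0.0)

# n iterations of lateral diffusion ...
n = 3
iterated = a.copy()
for _ in range(n):
    iterated = K @ iterated

# ... give the same result as one multiplication with the n-th matrix power.
one_shot = np.linalg.matrix_power(K, n) @ a
```

Because the diffusion is linear, the matrix power can be precomputed once and applied as a single feedforward step after the nonlinearity.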
In the context of Gabor-like orientation-selective filters, the proposed formula for connectivity results in an anisotropic kernel of lateral connectivity that looks plausible in that it connects approximately collinear edge filters. This is broadly consistent with anatomical studies showing that V1 neurons selective for oriented edges form long-range (>0.5 mm in tree shrew cortex) horizontal connections that preferentially target neurons selective for collinear oriented edges.
Since the similarity between filters is defined in terms of the bottom-up filter templates, it can be computed for arbitrary filters, e.g. filters learned through task training. The lateral connectivity kernel for each filter, thus, does not have to be learned through experience. Adding this type of recurrent lateral connectivity to a convolutional neural network (CNN), thus, does not increase the parameter count.
The authors argue that the proposed connectivity makes CNNs more robust to local perturbations of the image. They tested 2-layer CNNs on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and CIFAR-10. They present evidence that the local anisotropic diffusion of activity improves robustness to noise, occlusions, and adversarial perturbations.
Overall, the authors took inspiration from visual psychophysics (Field et al. 1992; Geisler et al. 2001) and neurobiology (Bosking et al. 1997), abstracted a parsimonious mathematical model of lateral connectivity, and assessed the computational benefits of the model in the context of CNNs that perform visual recognition tasks. The proposed diffusive lateral activation might not be the whole story of lateral and recurrent connectivity in the brain, but it might be part of the story. The idea deserves careful consideration.
The paper is well written and engaging. I’m left with many questions as detailed below. In case the authors choose to revise the paper, it would be great to see some of the questions addressed, a deeper exploration of the functional mechanism underlying the benefits, and some more challenging tests of performance.
Questions and thoughts
1 Can the increase in robustness be attributed to trivial forms of contextual integration?
If the filters were isotropic Gaussian blobs, then the diffusion process would simply blur the image. Blurring can help reduce noise and might reduce susceptibility to adversarial perturbations (especially if the adversary is not enabled to take this into account). Image blurring could be considered the layer-0 version of the proposed model. What is its effect on performance?
Consider another simplified scenario: If the network were linear, then the lateral connectivity would modify the effective filters, but each filter would still be a linear combination of the input. The model with lateral connectivity could thus be replaced by an equivalent feedforward model with larger kernels. Larger kernels might yield responses that are more robust to noise. Here the activation function is nonlinear, but the benefits might work similarly. It would be good to assess whether larger kernels in a feedforward network bring similar benefits to generalization performance.
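The linear case can be made concrete with a one-dimensional, single-channel toy example: filtering followed by lateral diffusion equals a single filtering step with a larger composed kernel. The signal, the 5-tap bottom-up filter, and the 3-tap lateral kernel are all hypothetical; the model under review mixes feature channels as well, which this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)        # 1-D stand-in for an image
bottom_up = rng.standard_normal(5)      # hypothetical bottom-up filter
lateral = np.array([0.25, 0.5, 0.25])   # hypothetical lateral diffusion kernel

# Two-stage linear model: bottom-up filtering, then lateral diffusion.
two_stage = np.convolve(np.convolve(signal, bottom_up), lateral)

# Equivalent feedforward model with one larger kernel (5 + 3 - 1 = 7 taps).
effective = np.convolve(bottom_up, lateral)
one_stage = np.convolve(signal, effective)
```

With a ReLU between the two stages the identity no longer holds exactly, which is why a feedforward control with larger kernels has to be tested empirically.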
2 Were the adversarial perturbations targeted at the tested model?
Robustness to adversarial attack should be tested using adversarial examples targeting each particular model with a given combination of numbers of iterations of lateral diffusion in layers 1 and 2. Was this the case?
3 Is the lateral diffusion process invertible?
The lateral diffusion is a linear transform that maps to a space of equal dimension (like Gaussian blurring of an image).
If the transform were invertible, then it would constitute the simplest possible change (linear, information preserving) to the representational geometry (as characterized by the Euclidean representational distance matrix for a set of stimuli). To better understand why this transform helps, then, it would be interesting to investigate how it changes the representational geometry for a suitable set of stimuli.
If lateral diffusion were not invertible, then it is perhaps best thought of as an intelligent type of pooling (despite the output dimension being equal to the input dimension).
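One practical way to probe invertibility is the condition number of the diffusion matrix. The sketch below uses the Gaussian-blur analogy from above on a ring of 32 units; the ring layout, the kernel widths, and the function name `blur_matrix` are assumptions for illustration and say nothing about the conditioning of the authors' actual kernels.

```python
import numpy as np

n = 32
offsets = np.arange(n)
# Circular distance between positions on a ring of n units.
dist = np.minimum(offsets, n - offsets)

def blur_matrix(sigma):
    """Circulant matrix applying a normalized Gaussian blur of width sigma."""
    kernel = np.exp(-0.5 * (dist / sigma) ** 2)
    kernel /= kernel.sum()
    # Row i is the kernel shifted by i positions.
    return np.stack([np.roll(kernel, i) for i in range(n)])

narrow = blur_matrix(0.5)
wide = blur_matrix(4.0)

# Ratio of largest to smallest singular value of each transform.
cond_narrow = np.linalg.cond(narrow)
cond_wide = np.linalg.cond(wide)
```

A condition number close to 1 indicates a transform that can be undone stably. A very large one means that inverting the transform would amplify noise enormously, so in practice it behaves more like pooling even when it is formally invertible.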
4 Do the lateral connections make representations of corrupted images more similar to representations of uncorrupted versions of the same images?
The authors offer an intuitive explanation of the benefits to performance: Lateral diffusion restores the missing parts or repairs what has been corrupted (presumably using accurate prior information about the distribution of natural images). One could directly assess whether this is the case by assessing whether lateral diffusion moves the representation of a corrupted image closer to the representation of its uncorrupted variant.
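A possible form of this test is sketched below. Because any contracting linear map shrinks all distances, the distance between a corrupted pattern and its clean counterpart is expressed relative to the mean distance between different clean patterns. The data are synthetic stand-ins (smooth bumps plus white noise), and the local-averaging matrix `D` is a hypothetical placeholder for the proposed diffusion.

```python
import numpy as np

def relative_corruption_distance(clean, corrupted):
    """Mean distance of corrupted patterns to their clean counterparts,
    relative to the mean distance between different clean patterns."""
    paired = np.linalg.norm(corrupted - clean, axis=1).mean()
    diffs = clean[:, None, :] - clean[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)
    n = clean.shape[0]
    between = pairwise.sum() / (n * (n - 1))  # mean over off-diagonal pairs
    return paired / between

rng = np.random.default_rng(0)
n_images, n_units = 20, 50

# Toy data: smooth clean patterns, with additive white noise as the corruption.
positions = np.arange(n_units)
centers = rng.uniform(0, n_units, size=n_images)
clean = np.exp(-0.5 * ((positions[None, :] - centers[:, None]) / 5.0) ** 2)
corrupted = clean + 0.3 * rng.standard_normal(clean.shape)

# Hypothetical diffusion matrix: average over each unit and its two neighbors.
D = np.zeros((n_units, n_units))
for i in range(n_units):
    for j in (i - 1, i, i + 1):
        D[i, j % n_units] += 1.0 / 3.0

before = relative_corruption_distance(clean, corrupted)
after = relative_corruption_distance(clean @ D.T, corrupted @ D.T)
```

In this toy setting the relative distance drops after diffusion. Whether the same holds for the trained networks and the perturbations used in the paper is the empirical question.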
5 Do correlated filter templates imply correlated filter responses under natural stimulation?
Learned filters reflect features that occur in the training images. If each image is composed of a mosaic of overlapping features, it is intuitive that filters whose templates overlap and are correlated will tend to co-occur and hence yield correlated responses across natural images. The authors seem to assume that this is true. But is there a way to prove that the correlations between filter templates really imply correlation of the filter outputs under natural stimulation? For independent noise images, filters with correlated templates will surely produce correlated outputs. However, it’s easy to imagine stimuli for which filters with correlated templates yield uncorrelated or anticorrelated outputs.
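Both halves of this argument can be checked numerically for linear responses (before the ReLU). For white noise, the response correlation matches the normalized inner product of the templates; for stimuli confined to the direction w1 - w2, the same two filters give perfectly anticorrelated responses. The templates and stimulus sets are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 49  # e.g. a flattened 7x7 patch

# Two hypothetical unit-norm templates that share structure.
shared = rng.standard_normal(dim)
w1 = shared + 0.5 * rng.standard_normal(dim)
w2 = shared + 0.5 * rng.standard_normal(dim)
w1 /= np.linalg.norm(w1)
w2 /= np.linalg.norm(w2)
template_sim = float(w1 @ w2)

# Independent (white) Gaussian noise images: output correlation tracks template similarity.
noise = rng.standard_normal((50_000, dim))
white_corr = np.corrcoef(noise @ w1, noise @ w2)[0, 1]

# Stimuli along w1 - w2: the response of filter 1 is c * (1 - template_sim) and
# that of filter 2 is c * (template_sim - 1), so they are anticorrelated.
direction = w1 - w2
amplitudes = rng.standard_normal(1_000)
stimuli = amplitudes[:, None] * direction[None, :]
anti_corr = np.corrcoef(stimuli @ w1, stimuli @ w2)[0, 1]
```

Natural images lie somewhere between these two extremes, so the relation between template similarity and response correlation under natural stimulation has to be measured rather than assumed.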
6 Does lateral connectivity reflecting the correlational structure of filter responses under natural stimulation work even better than the proposed approach?
Would the performance gains be larger or smaller if lateral connectivity were determined by filter-output correlation under natural stimulation, rather than by filter-template similarity?
Is filter-template similarity just a useful approximation to filter-output correlation under natural stimulation, or is there a more fundamental computational motivation for using it?
7 How does the proposed lateral connectivity compare to learned lateral connectivity when the number of connections (instead of the number of parameters) is matched?
It would be good to compare CNNs with lateral diffusive connectivity to recurrent convolutional neural networks (RCNNs) for matched sizes of bottom-up and lateral filters (and matched numbers of connections, not parameters). In addition, it would then be interesting to initialize the RCNNs with diffusive lateral connectivity according to the proposed model (after initial training without lateral connections). Lateral connections could precede (as in typical RCNNs) or follow (as in KerCNNs) the nonlinear activation function.
8 Does the proposed mechanism have a motivation in terms of a normative model of visual inference?
Can the intuition that lateral connections implement shrinkage to a prior about natural image statistics be more explicitly justified?
If the filters serve to infer features of a linear generative model of the image, then features with correlated templates are anti-correlated given the image (competing to explain the same variance). This suggests that inhibitory connections are needed to implement the dynamics for inference. Cortex does rely on local inhibition. How does local inhibitory connectivity fit into the picture?
Can associative filling in and competitive explaining away be reconciled and combined?
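The explaining-away effect can be illustrated with a two-feature linear Gaussian model, x = W z + noise. The standard-normal prior on z, the noise level, and the template inner product of 0.8 are assumptions chosen for the example; the review does not commit to a particular prior.

```python
import numpy as np

rho = 0.8      # inner product of two unit-norm feature templates
sigma = 0.5    # assumed pixel-noise standard deviation

# Columns of W are two unit-norm templates with inner product rho.
W = np.array([[1.0, rho],
              [0.0, np.sqrt(1.0 - rho ** 2)]])

# Posterior covariance of z given x, for z ~ N(0, I) and noise ~ N(0, sigma^2 I).
posterior_cov = np.linalg.inv(np.eye(2) + W.T @ W / sigma ** 2)
posterior_corr = posterior_cov[0, 1] / np.sqrt(
    posterior_cov[0, 0] * posterior_cov[1, 1]
)
```

With these values the posterior correlation between the two features is -0.64: given the image, the more one feature explains, the less is left for the other. The sign is the opposite of the excitatory coupling between similar filters in the proposed model.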
A mathematical model of lateral connectivity, motivated by human visual contour integration and studies on V1 long-range lateral connectivity, is tested in terms of the computational benefits it brings in the context of CNNs that recognize images.
The model is intuitive, elegant, and parsimonious in that it does not require learning of additional parameters.
The paper presents initial evidence for improved generalization performance in the context of deep convolutional neural networks.
The computational benefits of the proposed lateral connectivity are tested only in the context of toy tasks and two-layer neural networks.
Some trivial explanations for the performance benefits have not been ruled out yet.
It’s unclear how to choose the number of iterations of lateral diffusion for each of the two layers, and choosing the best combination might positively bias the estimate of the gain in accuracy.
“associated to” -> “associated with” (in several places)
Nakai and Nishimoto (pp2019) had each of six subjects perform 103 naturalistic cognitive tasks during functional magnetic resonance imaging (fMRI) of their brain activity. This type of data could eventually enable us to more compellingly characterize the localization of cognitive task components across the human brain.
What is unique about this paper is the fact that it explores the space of cognitive tasks more systematically and comprehensively than any previous fMRI study I am aware of. It’s important to have data from many tasks in the same subjects to more quantitatively model how cognitive components, implemented in different parts of the brain, contribute in combination to different tasks.
The authors describe the space of tasks using a binary task-type model (with indicators for task components) and a continuous cognitive-factor model (with prior information from the literature incorporated via Neurosynth). They perform encoding and decoding analyses and investigate the clustering of task-related brain activity patterns. The model-based analyses are interesting, but also a bit hard to interpret, because they reveal the data only indirectly: through the lens of the models – and the models are very complex. It would be good to see some more basic “data-driven” analyses, as the title suggests.
However, the more important point is that this is a visionary contribution from an experimental point of view. The study pushes the envelope of cognitive fMRI. The biggest novel contributions are:
the task set (with its descriptive models)
the data (in six subjects)
Should the authors choose to continue to work on this, my main suggestions are (1) to add some more interpretable data-driven analyses, and (2) to strengthen the open science component of the study (by sharing the data, task and analysis code, and models), so that it can form a seed for much future work that builds on these tasks, expanding the models, the data, and the analyses beyond what can be achieved by a single lab.
This rich set of tasks and human fMRI responses deserves to be analyzed with a wider array of models and methods in future studies. For example, it would be great in the future to test a wide variety of task-descriptive models. Eventually it might also be possible to build neural network models that can perform the entire set of tasks. Explaining the measured brain-activity with such brain-computational models would get us closer to understanding the underlying information processing. In addition, the experiment deserves to be expanded to more subjects (perhaps 100). This could produce a canonical basis for revisiting human cognitive fMRI at a greater level of rigor. These directions may not be realistic for a single study or a single lab. However, this paper could be seminal to the pursuit of these directions as an open science endeavor across labs.
Improvements to consider if the authors choose to revise the paper
(1) Reconsider the phrase “data-driven models” (title)
The phrase “data-driven models” suggests that the analysis is both data-driven and model-based. This suggests the conceptualization of data-driven and model-based as two independent dimensions.
In this conceptualization, an analysis could be low on both dimensions, restricting the data to a small set (e.g. a single brain region) and failing to bring theory into the analysis through a model of some complexity (e.g. instead computing overall activation in the brain region for each experimental condition). Being high on both dimensions, then, appears desirable. It would mean that the assumptions (though perhaps strong) are explicit in the model (and ideally justified), and that the data still richly inform the results.
Arguably this is the case here. The models the authors used have many parameters and so the data richly inform the results. However, the models also strongly constrain the results (and indeed changing the model might substantially alter the results – more on that below).
But an alternative conceptualization, which seems to me more consistent with popular usage of these terms, is that there is a tradeoff between data-driven and model-based. In this conceptualization the overall richness of the results (how many independent quantities are reported) is considered a separate dimension. Any analysis combines data and assumptions (with the latter ideally made explicit in a model). If the model assumptions are weak (compared to the typical study in the same field), an analysis is referred to as data-driven. If the model assumptions are strong, then an analysis is referred to as model-based. In this conceptualization, “data-driven model” is an oxymoron.
(2) Perform a data-driven (and model-independent) analysis of how tasks are related in terms of the brain regions they involve
“A sparse task-type encoding model revealed a hierarchical organization of cognitive tasks, their representation in cognitive space, and their mapping onto the cortex.” (abstract)
I am struggling to understand (1) what exact claims are made here, (2) how they are justified by the results, and (3) how they would constrain brain theory if true. The phrases “organization of cognitive tasks” and “representation in cognitive space” are vague.
The term hierarchical (together with the fact that a hierarchical cluster analysis was performed) suggests that (a) the activity patterns fall in clusters rather than spreading over a continuum and (b) the main clusters contain nested subclusters.
However, the analysis does not assess the degree to which the task-related brain activity patterns cluster. Instead a complex task-type model (whose details and influence on the results the reader cannot assess) is interposed. The model filters the data (for example preventing unmodeled task components from influencing the clustering). The outcome of clustering will also be affected by the prior over model weights.
A simpler, more data-driven, and interpretable analysis would be to estimate a brain activity pattern for each task and investigate the representational geometry of those patterns directly. It would be good to see the representational dissimilarity matrix and/or a visualization (MDS or t-SNE) of these patterns.
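Such an analysis could look roughly as follows. The sketch uses correlation distance and classical (Torgerson) MDS, which are only one of several reasonable choices, and it runs on random placeholder data (103 tasks by 500 voxels) rather than on the authors' measurements. The function names are hypothetical.

```python
import numpy as np

def correlation_rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson r between task patterns."""
    return 1.0 - np.corrcoef(patterns)

def classical_mds(rdm, n_dims=2):
    """Classical (Torgerson) MDS embedding of a dissimilarity matrix."""
    n = rdm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (rdm ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip small negative eigenvalues that arise for non-Euclidean dissimilarities.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

rng = np.random.default_rng(0)
# Placeholder data standing in for one activity-pattern estimate per task.
patterns = rng.standard_normal((103, 500))

rdm = correlation_rdm(patterns)   # 103 x 103 task-by-task dissimilarities
embedding = classical_mds(rdm)    # 103 x 2 coordinates for plotting
```

Plotting `embedding` with task labels would show whether the tasks form visible groups before any task-type model is imposed.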
To formally address whether the patterns fall into clusters (and hierarchical clusters), it would be ideal to inferentially compare cluster (and hierarchical cluster) models to continuous models. For example, one could fit each model to a training set and assess whether the models’ predictive performance differs on an independent test set. (This is in contrast to hierarchical cluster analysis, which assumes a hierarchical cluster structure rather than inferring the presence of such a structure from the data.)
(3) Perform a simple pairwise task decoding analysis
It’s great that the decoding analysis generalizes to new tasks. But this requires model-based generalization. It would be useful, additionally, to use decoding to assess the discriminability of the task-related activity patterns in a less model-dependent way.
One could fit a linear discriminant for each pair of tasks and test on independent data from the same subject performing the same two tasks again. (If the accuracy were replaced by the linear discriminant t value or crossnobis estimator, then this could also form the basis for point (2) above.)
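For reference, a minimal sketch of a crossnobis estimate for one pair of tasks is given below. It assumes one pattern estimate per task and run, averages the cross-products of difference vectors over all pairs of distinct runs, and divides by the number of voxels. The identity default for the noise precision matrix and the simulated data are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

def crossnobis(patterns_a, patterns_b, precision=None):
    """Cross-validated Mahalanobis distance estimate between two conditions.

    patterns_a, patterns_b: (n_runs, n_voxels) arrays, one estimate per run.
    precision: inverse noise covariance; identity (white noise) if omitted.
    """
    diffs = patterns_a - patterns_b             # per-run difference vectors
    n_runs, n_voxels = diffs.shape
    if precision is None:
        precision = np.eye(n_voxels)            # assumption: white noise
    # Products across different runs only, so independent noise averages out.
    estimates = [diffs[i] @ precision @ diffs[j]
                 for i, j in combinations(range(n_runs), 2)]
    return float(np.mean(estimates)) / n_voxels

rng = np.random.default_rng(0)
n_runs, n_voxels = 8, 100

# Simulated case 1: the two tasks differ by 1.0 in every voxel.
task_a = 1.0 + rng.standard_normal((n_runs, n_voxels))
task_b = rng.standard_normal((n_runs, n_voxels))
d_signal = crossnobis(task_a, task_b)

# Simulated case 2: no true difference between the tasks.
null_a = rng.standard_normal((n_runs, n_voxels))
null_b = rng.standard_normal((n_runs, n_voxels))
d_null = crossnobis(null_a, null_b)
```

Unlike decoding accuracy, this estimate is continuous and is centered on zero when two tasks do not differ, so the full matrix of pairwise estimates can serve directly as a representational dissimilarity matrix.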
“A cognitive factor encoding model utilizing continuous intermediate features by using metadata-based inferences predicted brain activation patterns for more than 80 % of the cerebral cortex and decoded more than 95 % of tasks, even under novel task conditions.” (abstract)
The numbers 80% and 95% are not meaningful in the absence of additional information (more than 80% of the voxel responses predicted significantly above chance level, and more than 95% of the tasks were significantly distinct from at least some other tasks). You could either add the information needed to interpret these numbers to the abstract or remove the numbers from the abstract. (The abstract should be interpretable in isolation.)
New behavioral monitoring and neural-net modeling techniques are revolutionizing animal neuroscience. How can we use the new tools to understand how brains implement cognitive processes? Musall, Urai, Sussillo and Churchland (pp2019) argue that these tools enable a less reductive approach to experimentation, where the tasks are more complex and natural, and brain and behavior are more comprehensively measured and modeled. (The picture above is Figure 1 of the paper.)
There have recently been amazing advances in measurement, modeling, and manipulation of complex brain and behavioral dynamics in rodents and other animals. These advances point toward the ultimate goal of total experimental control, where the environment as well as the animal’s brain and behavior are comprehensively measured and where both environment and brain activity can be arbitrarily manipulated. The review paper by Musall et al. focuses on the role that monitoring and modeling complex behaviors can play in the context of modern neuroscientific animal experimentation. In particular, the authors consider the following elements:
Rich task environments: Rodents and other animals can be placed in virtual-reality experiments where they experience complex visual and other sensory stimuli. Researchers can richly and flexibly control the virtual environment, combining naturalistic and unnaturalistic elements to optimize the experiment for the question of interest.
Comprehensive measurement of behavior: The animal’s complex behavior can be captured in detail (e.g. running on a track ball and being videoed to measure running velocity and turns as well as subtle task-unrelated limb movements). The combination of video and novel neural-net-model-based computer vision enables researchers to track the trajectories of multiple limbs simultaneously with great precision. Instead of focusing on binary choices and reaction times, some researchers now use comprehensive and detailed quantitative measurements of behavioral dynamics.
Data-driven modeling of behavioral dynamics: The richer quantitative measurements of behavioral dynamics enable the data-driven discovery of the dynamical components of behavior. These components can be continuous or categorical. Behavioral motifs (categories of similar behavioral patterns) are an example of categorical components. Such motifs used to be inferred subjectively by researchers observing the animals. Today they can be inferred more objectively, using probabilistic models and machine learning. These methods can learn the repertoire of motifs, and, given new data, infer the motifs and the parameters of each instantiation of a motif.
Cognitive models of task performance: Cognitive models of task performance provide the latent variables that the animal’s brain must represent to be able to perform the task. The latent variables connect stimuli to behavioral responses and enable us to take a normative, top-down perspective: What information processing should the animal perform to succeed at the task?
Comprehensive measurement of neural activity: Techniques for measuring neural activity, including multi-electrode recording devices (e.g. Neuropixels) and optical imaging techniques (e.g. calcium imaging), have advanced to enable the simultaneous measurement of many thousands of neurons with cellular precision.
Modeling of neural dynamics: Neural-network models provide task-performing models of brain-information processing. These models abstract sufficiently from neurobiology to be efficiently simulated and trained, but are neurobiologically plausible in that they could be implemented with biological components. (One might say that these models leave out biological complexity at the cellular scale so as to be able to better capture the dynamic complexity at a larger scale, which might help us understand how the brain implements control of behavior.)
The paper provides a great concise introduction to these exciting developments and describes how the new techniques can be used in concert to help us understand how brains implement cognition. The authors focus on the role of monitoring and modeling behavior. They stress the need to capture uninstructed movements, i.e. movements that are not required for task performance, but nevertheless occur and often explain large amounts of variance in neural activity. They also emphasize the importance of behavioral variation across trials, brain states, and individuals. Detailed quantitative descriptions of behavioral dynamics enable researchers to model nuisance variation and also to understand the variation of performance across trials, which can reflect variation related to the brain state (e.g. arousal, fear), cognitive strategy (different algorithms for performing the task), and the individual studied (after all, every mouse is unique; see figure above, which is Figure 1 in the paper).
Improvements to consider in case the paper is revised
The paper is well-written and useful already. In case the authors were to prepare a revision, they could consider improving it further by addressing some of the following points.
(1) Add a figure illustrating the envisaged style of experimentation and modeling.
It might be helpful for the reader to have another figure, illustrating how the different innovations fit together. Such a figure could be based on an existing study, or it could illustrate an ideal for future experimentation, amalgamating elements from different studies.
(2) Clarify what is meant by “understanding circuits” and the role of NNs as “tools” and “model organisms”.
The paper uses the term “circuit” in the title and throughout as the explanandum. The term “circuit” evokes a particular level of description: above the single neuron and below “systems”. The term is associated with small subsets of interacting neurons (sometimes identified neurons), whose dynamics can be understood in detail.
This is somewhat in tension with the approach of neural-network modeling, where there isn’t necessarily a one-to-one mapping between units in the model and neurons in the brain. The neural-network modeling would appear to settle for a somewhat looser relationship between the model and the brain. There is a case to be made that this is necessary to enable us to engage higher-level cognitive processes.
The authors hint at their view of this issue by referring to the neural-network models as “artificial model organisms”. This suggests a feeling that these models are more like other biological species (e.g. the mouse “model”) than like data-analytical models. However, models are never identical to the phenomena they capture and the relationship between model and empirical phenomenon (i.e. what aspects of the data the model is supposed to predict) must be separately defined anyway. So why not consider the neural-network models more simply as models of brain information processing?
(3) Explain how the insights apply across animal species.
The basic argument of the paper in favor of comprehensive monitoring and modeling of behavior appears to hold equally for C. elegans, zebrafish, flies, rodents, tree shrews, marmosets, macaques, and humans. However, the paper appears to focus on rodents. Does the rationale change across species? If so, how and why? Should human researchers not consider the same comprehensive measurement of behavior for the very same reasons?
(4) Clarify the relation to similar recent arguments.
Several authors have recently argued that behavioral modeling must play a key role if we are to understand how the brain implements cognitive processes (Krakauer et al. 2017, Neuron [cited already]; Yamins & DiCarlo 2016, Nature Neuroscience; Kriegeskorte & Douglas 2018, Nature Neuroscience). It would be interesting to hear how the authors see the relationship between these arguments and the one they are making.
Rajesh Rao (pp2019) gives a concise review of the current state of the art in bidirectional brain-computer interfaces (BCIs) and offers an inspiring glimpse of a vision for future BCIs, conceptualized as neural co-processors.
A BCI, as the name suggests, connects a computer to a brain, either by reading out brain signals or by writing in brain signals. BCIs that both read from and write to the nervous system are called bidirectional BCIs. The reading may employ recordings from electrodes implanted in the brain or located on the scalp, and the writing must rely on some form of stimulation (e.g., again, through electrodes).
An organism in interaction with its environment forms a massively parallel perception-to-action cycle. The causal routes through the nervous system range in complexity from reflexes to higher cognition and memories at the temporal scale of the life span. The causal routes through the world, similarly, range from direct effects of our movements feeding back into our senses, to distal effects of our actions years down the line.
Any BCI must insert itself somewhere in this cycle – to supplement, or complement, some function. Typically a BCI, just like a brain, will take some input and produce some output. The input can come from the organism’s nervous system or body, or from the environment. The output, likewise, can go into the organism’s nervous system or body, or into the environment.
This immediately suggests a range of medical applications (Figs. 1, 2):
replacing lost perceptual function: The BCI’s input comes from the world (e.g. visual or auditory signals) and the output goes to the nervous system.
replacing lost motor function: The BCI’s input comes from the nervous system (e.g. recordings of motor cortical activity) and the output is a prosthetic device that can manipulate the world (Fig. 1).
bridging lost connectivity or replacing lost nervous processing: The BCI’s input comes from the nervous system and the output is fed back into the nervous system (Fig. 2).
Fig. 1 | Uni- and bidirectional prosthetic-control BCIs. (a) A unidirectional BCI (red) for control of a prosthetic hand that reads out neural signals from motor cortex. The patient controls the hand using visual feedback (blue arrow). (b) A bidirectional BCI (red) for control of a prosthetic hand that reads out neural signals from motor cortex and feeds back tactile sensory signals acquired through artificial sensors to somatosensory cortex.
Beyond restoring lost function, BCIs have inspired visions of brain augmentation that would enable us to transcend normal function. For example, BCIs might enable us to perceive, communicate, or act at higher bandwidth. While interesting to consider, current BCIs are far from achieving the bandwidth (bits per second) of our evolved input and output interfaces, such as our eyes and ears, our arms and legs. It’s fun to think that we might write a text in an instant with a BCI. However, what limits me in writing this open review is not my hands or the keyboard (I could use dictation instead), but the speed of my thoughts. My typing may be slower than the flight of my thoughts, but my thoughts are too slow to generate an acceptable text at the pace I can comfortably type.
But what if we could augment thought itself with a BCI? This would require the BCI to listen in to our brain activity as well as help shape and direct our thoughts. In other words, the BCI would have to be bidirectional and act as a neural co-processor (Fig. 3). The idea of such a system helping me think is science fiction for the moment, but bidirectional BCIs are a reality.
I might consider my laptop a very functional co-processor for my brain. However, it doesn’t include a BCI, because it neither reads from nor writes to my nervous system directly. It instead senses my keystrokes and sends out patterns of light, co-opting my evolved biological mechanisms for interfacing with the world: my hands and eyes, which provide a bandwidth of communication that is out of reach of current BCIs.
Fig. 2 | Bidirectional motor and sensory BCIs. (a) A bidirectional motor BCI (red) that bridges a spinal cord injury, reading signals from motor cortex and writing into efferent nerves beyond the point of injury or directly contacting the muscles. (b) A bidirectional sensory BCI that bridges a lesion along the sensory signalling pathway.
Rao reviews the exciting range of proof-of-principle demonstrations of bidirectional BCIs in the literature:
Closed-loop prosthetic control: A bidirectional BCI may read out motor cortex to control a prosthetic arm that has sensors whose signals are written back into somatosensory cortex, replacing proprioceptive signals. (Note that even a unidirectional BCI that only records activity to steer the prosthetic device will be operated in a closed loop when the patient controls it while visually observing its movement. However, a bidirectional BCI can simultaneously supplement both the output and the input, promising additional benefits.)
Reanimating paralyzed limbs: A bidirectional BCI may bridge a spinal cord injury, e.g. reading from motor cortex and writing to the efferent nerves beyond the point of injury in the spinal cord or directly to the muscles.
Restoring motor and cognitive functions: A bidirectional BCI might detect a particular brain state and then trigger stimulation in a particular region. For example, a BCI may detect the impending onset of an epileptic seizure in a human and then stimulate the focus region to prevent the seizure.
Augmenting normal brain function: A study in monkeys demonstrated that performance on a delayed-matching-to-sample task can be enhanced by reading out the CA3 representation and writing to the CA1 representation in the hippocampus (after training a machine learning model on the patterns during normal task performance). BCIs reading from and writing to brains have also been used as (currently still very inefficient) brain-to-brain communication devices among rats and humans.
Inducing plasticity and rewiring the brain: It has been demonstrated that sequential stimulation of two neural sites A and B can induce Hebbian plasticity such that the connections from A to B are strengthened. This might eventually be useful for restoration of lost connectivity.
Most BCIs use linear decoders to read out neural activity. The latent variables to be decoded might be the positions and velocities capturing the state of a prosthetic hand, for example. The neural measurements are noisy and incomplete, so it is desirable to combine the evidence over time. The system should use not only the current neural activity pattern to decode the latent variables, but also the recent history. Moreover, it should use any prior knowledge we might have about the dynamics of the latent variables. For example, the components of a prosthetic arm are inert masses. Forces upon them cause acceleration, i.e. a change of velocity, which in turn changes the positions. The physics, thus, entails smooth positional trajectories.
When the neuronal activity patterns linearly encode the latent variables, the dynamics of the latent variables are also linear, and the noise is Gaussian, the optimal way of inferring the latent variables is the Kalman filter. The state vector for the Kalman filter may contain the kinematic quantities whose brain representation is to be estimated (e.g. the position, velocity, and acceleration of a prosthetic hand). A dynamics model that respects the laws of physics can help constrain the inference so as to obtain more reliable estimates of the latent variables.
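To make the benefit of the dynamics prior concrete, here is a minimal simulation sketch (not taken from Rao's paper). It assumes a one-dimensional constant-velocity model of hand kinematics and a hypothetical random linear tuning matrix H for 20 simulated neurons; all parameter values are illustrative assumptions. It compares a Kalman filter to a static least-squares decoder that uses only the current activity pattern.

```python
import numpy as np

def kalman_step(x, P, y, A, Q, H, R):
    """One predict/update cycle of a Kalman filter."""
    # Predict: push the previous estimate through the dynamics model.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction using the new neural measurement y.
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

def simulate_decoding(n_steps=200, n_neurons=20, seed=0):
    """Return mean squared state error of (Kalman filter, static decoder)."""
    rng = np.random.default_rng(seed)
    dt = 0.05
    A = np.array([[1.0, dt], [0.0, 1.0]])     # position integrates velocity
    Q = np.diag([1e-6, 1e-3])                 # assumed process noise
    H = rng.normal(size=(n_neurons, 2))       # hypothetical linear tuning
    R = 4.0 * np.eye(n_neurons)               # measurement noise (sd = 2)
    H_pinv = np.linalg.pinv(H)                # static least-squares decoder
    x_true = np.zeros(2)
    x_est, P = np.zeros(2), np.eye(2)
    err_kalman, err_static = [], []
    for _ in range(n_steps):
        x_true = A @ x_true + rng.normal(size=2) * np.sqrt(np.diag(Q))
        y = H @ x_true + rng.normal(scale=2.0, size=n_neurons)
        x_est, P = kalman_step(x_est, P, y, A, Q, H, R)
        x_static = H_pinv @ y                 # ignores history and dynamics
        err_kalman.append(np.sum((x_est - x_true) ** 2))
        err_static.append(np.sum((x_static - x_true) ** 2))
    return float(np.mean(err_kalman)), float(np.mean(err_static))

mse_kalman, mse_static = simulate_decoding()
print(f"Kalman MSE: {mse_kalman:.4f}, static decoder MSE: {mse_static:.4f}")
```

Under these assumptions the filter combines evidence over time, so its error should fall below that of the static decoder; with real neural data the linear-Gaussian assumptions hold only approximately.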
For a perceptual BCI, similarly, the signals from the artificial sensors might be noisy and we might have prior knowledge about the latent variables to be encoded. Encoders, as well as decoders, thus, can benefit from using models that capture relevant information about the recent input history in their internal state and use optimal inference algorithms that exploit prior knowledge about the latent dynamics. Bidirectional BCIs, as we have seen, combine neural decoders and encoders. They form the basis for a more general concept that Rao introduces: the concept of a neural co-processor.
Fig. 3 | Devices augmenting our thoughts. (a) A laptop computer (black) that interfaces with our brains through our hands and eyes (not a BCI). (b) A neural co-processor that reads out neural signals from one region of the brain and writes in signals into another region of the brain (bidirectional BCI).
The term neural co-processor shifts the focus from the interface (where brain activity is read out and/or written in) to the augmentation of information processing that the device provides. The concept further emphasizes that the device processes information along with the brain, with the goal to supplement or complement what the brain does.
The framework for neural co-processors that Rao outlines generalizes bidirectional BCI technology in several respects:
The device and the user’s brain jointly optimize a behavioral cost function:
BCIs from the earliest days have involved animals or humans learning to control some aspect of brain activity (e.g. the activity of a single neuron). Conversely, BCIs standardly employ machine learning to pick up on the patterns of brain activity that carry a particular meaning. The machine learning of patterns associated, say, with particular actions or movements is often followed by the patient learning to operate the BCI. In this sense mutual co-adaptation is already standard practice. However, the machine learning is usually limited to an initial phase. We might expect continual mutual co-adaptation (as observed in human interaction and other complex forms of communication between animals and even machines) to be ultimately required for optimal performance.
Decoding and encoding models are integrated: The decoder (which processes the neural data the device reads as its input) and encoder (which prepares the output for writing into the brain) are implemented in a single integrated model.
Recurrent neural network models replace Kalman filters: While a Kalman filter is optimal for linear systems with Gaussian noise, recurrent neural networks provide a general modeling framework for nonlinear decoding and encoding, and nonlinear dynamics.
Stochastic gradient descent is used to adjust the co-processor so as to optimize behavioral accuracy: In order to train a deep neural network model as a neural co-processor, we would like to be able to apply stochastic gradient descent. This poses two challenges: (1) We need a behavioral error signal that measures how far off the mark the combined brain-co-processor system is during behavior. (2) We need to be able to backpropagate the error derivatives. This requires that we have a mathematically specified model not only for the co-processor, but also for any further processing performed by the brain to produce the behavior whose error is to drive the learning. The brain-information processing from co-processor output to behavioral response is modeled by an emulator model. This enables us to backpropagate the error derivatives from the behavioral error measurements to the co-processor and through the co-processor. Although backpropagation proceeds through the emulator first, only the co-processor learns (as the emulator is not involved in the interaction and only serves to enable backpropagation). The emulator needs to be trained to emulate the part of the perception-to-action cycle it is meant to capture as well as possible.
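The training scheme in the last point can be illustrated with a toy sketch. This is my own simplification, not Rao's implementation: both the co-processor and the emulator are linear maps rather than deep networks, the emulator E is assumed to be already trained and is kept frozen, and the target mapping T is a hypothetical stand-in for the desired behavior. The behavioral error is backpropagated through the emulator, but only the co-processor weights W are updated.

```python
import numpy as np

def train_coprocessor(n_steps=1000, lr=0.01, seed=0):
    """Return the mean behavioral loss before and after training."""
    rng = np.random.default_rng(seed)
    # Frozen emulator (assumed already trained): stimulation (4-d) -> behavior (2-d).
    E = np.array([[1.0, 0.0, 0.5, 0.0],
                  [0.0, 1.0, 0.0, 0.5]])
    # Hypothetical target: desired behavior as a function of recorded activity (5-d).
    T = rng.normal(size=(2, 5))
    # Co-processor weights: recorded activity -> stimulation pattern.
    W = np.zeros((4, 5))

    def mean_loss(weights, S):
        err = E @ weights @ S.T - T @ S.T
        return float(np.mean(np.sum(err ** 2, axis=0)))

    S_eval = rng.normal(size=(500, 5))        # held-out recorded activity
    loss_before = mean_loss(W, S_eval)
    for _ in range(n_steps):
        s = rng.normal(size=5)                # one sample of recorded activity
        err = E @ (W @ s) - T @ s             # behavioral error signal
        grad_stim = E.T @ err                 # backpropagate through the emulator
        W -= lr * 2.0 * np.outer(grad_stim, s)  # update the co-processor only
    return loss_before, mean_loss(W, S_eval)

loss_before, loss_after = train_coprocessor()
print(f"behavioral loss before: {loss_before:.3f}, after: {loss_after:.6f}")
```

In this toy setting the emulator is exact, so the loss can be driven close to zero. In practice, the quality of the learned co-processor would depend on how well the emulator captures the brain's processing from stimulation to behavior.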
The idea of neural co-processors provides an attractive unifying framework for developing devices that augment brain function in some way, based on artificial neural networks and deep learning.
Intriguingly, Rao argues that neural co-processors might also be able to restore or extend the brain’s own processing capabilities. As mentioned above, it has been demonstrated that Hebbian plasticity can be induced via stimulation. A neural co-processor might initially complement processing by performing some representational transformation for the brain. The brain might then gradually learn to predict the stimulation patterns contributed by the co-processor. The co-processor would scaffold the processing until the brain has acquired and can take over the representational transformation by itself. Whether this would actually work remains to be seen.
The framework of neural co-processors might also be relevant for basic science, where the goal is to build models of normal brain-information processing. In a basic-science context, the goal is to drive the model parameters to best predict brain activity and behavior. The error derivatives of the brain or behavioral predictions might be continuously backpropagated through a model during interactive behavior, so as to optimize the model.
Overall, this paper gives an exciting concise view of the state of the literature on bidirectional BCIs, and the concept of neural co-processors provides an inspiring way to think about the bigger picture and future directions for this technology.
The paper is well-written and gives a brief, but precise overview of the current state of the art in bidirectional BCI technology.
The paper offers an inspiring unifying framework for understanding bidirectional BCIs as neural co-processors that suggests exciting future developments.
The neural co-processor idea is not explained as intuitively and comprehensively as it could be.
The paper could give readers from other fields a better sense of quantitative benchmarks for BCIs.
Improvements to consider in revision
The text is already at a high level of quality. These are just ideas for further improvements or future extensions.
The figure about neural co-processors could be improved. In particular, the author could consider whether it might help to
clarify the direction of information flow in the brain and the two neural networks (clearly discernible arrows everywhere)
illustrate the parallelism between the preserved healthy output information flow (e.g. M1->spinal cord->muscle->hand movement) and the emulator network
illustrate the function intuitively using plausible choices of brain regions to read from (PFC? PPC?) and write to (M1? – flipping the brain?)
illustrate an intuitive example, e.g. a lesion in the brain, with function supplemented by the neural co-processor
add an external actuator to illustrate that the co-processor might directly interact with the world via motors as well as sensors
clarify the source of the error signal
The text on neural co-processors is very clear, but could be expanded by considering another example application in an additional paragraph to better illustrate the points made conceptually about the merits and generality of the approach.
The expected challenges on the path to making neural co-processors work could be discussed in more detail.
It would be good to clarify how the behavioral error signals to be backpropagated would be obtained in practice, for example, in the context of motor control.
Should we expect that it might be tractable to learn the emulator and co-processor models under realistic conditions? If so, what applied and basic science scenarios might be most promising to try first?
If the neural co-processor approach were applied to closed-loop prosthetic arm control, there would have to be two separate co-processors (motor cortex -> artificial actuators, artificial sensors -> sensory cortex) and so the emulator would need to model the brain dynamics intervening between perception and action.
It would be great to include some quantitative benchmarks (in case they exist) on the performance of current state-of-the-art BCIs (e.g. bit rate) and a bit of text that realistically assesses where we are on the continuum between proof of concept and widely useful application for some key applications. For example, I’m left wondering: What’s the current maximum bit rate of BCI motor control? How does this compare to natural motor control signals, such as eye blinks? Does a bidirectional BCI with sensory feedback improve the bit rate (despite the fact that there is already also visual feedback)?
It would be helpful to include a table of the most notable BCIs built so far, comparing them in terms of inputs, outputs, notable achievements and limitations, bit rate, and encoding and decoding models employed.
The current draft lacks a conclusion that draws the elements together into an overall view.
Bobadilla-Suarez, Ahlheim, Mehrotra, Panos, & Love (pp2018) set out to shed some light on the best choice of similarity measure for analyzing distributed brain representations. They take an empirical approach, starting with the assumption that a good measure of neural similarity should reflect the degree to which an optimal decoder confuses two stimuli.
Decoding indeed provides a useful perspective for thinking about representational dissimilarities. Defining decoders helps us consider explicitly how other brain regions might read out a representation, and to base our analyses of brain activity on reasonable assumptions.
Using two different data sets, the authors report that Euclidean and Mahalanobis distances, respectively, are most highly correlated (Spearman correlation across pairs of stimuli) with decoding accuracy. They conclude that Euclidean and Mahalanobis distances are preferable to the popular Pearson correlation distance as a choice of representational dissimilarity measure.
Decoding analyses provide an attractive approach to the assessment of representational dissimilarity for two reasons:
Decoders can help us test whether particular information is present in a format that could be directly read out by a downstream neuron. This requires the decoder to be plausibly implementable by a single neuron, which holds for linear readout (if we assume that the readout neuron can see a sufficient portion of the code). While this provides a good motivation for linear decoding analyses, we need to be mindful of a few caveats: Single neurons might also be capable of various forms of nonlinear readout. Moreover, neurons might have access to a different portion of the neuronal information than is used in a particular decoding analysis. For example, readout neurons might have access to more information about the neuronal responses than we were able to measure (e.g. with fMRI, where each voxel indirectly reflects the activity of tens or hundreds of thousands of neurons; or with cell recordings, where we can often sample only tens or hundreds of neurons from a population of millions). Conversely, our decoder might have access to a larger neuronal population than any single readout neuron (e.g. to all of V1 or some other large region of interest).
Decoding accuracy can be assessed with an independent test set. This removes overfitting bias of the estimate of discriminability and enables us to assess whether two activity patterns really differ without relying on assumptions (such as Gaussian noise) for the validity of this inference.
This suggests using decoding directly to measure representational dissimilarity. For example, we could use decoding accuracy as a measure of dissimilarity (e.g. Carlson et al. 2013, Cichy et al. 2015). The paper’s rationale to evaluate different dissimilarity measures by comparison to decoding accuracy therefore does not make sense to me. If decoding accuracy is to be considered the gold standard, then why not use that gold standard itself, rather than a distinct dissimilarity measure that serves as a stand in?
In fact the motivation for using Pearson correlation distance for comparing brain-activity patterns is not to emulate decoding accuracy, but to describe to what extent two experimental conditions push the baseline activity pattern in different directions in multivariate response space: The correlation distance is 1 minus the cosine of the angle the two patterns span (after the regional-mean activation has been subtracted out from each).
Interestingly, the correlation distance is proportional to the squared Euclidean distance between the normalized patterns (where each pattern has been separately normalized by first subtracting the mean from each value and then scaling the norm to 1; see Fig. 1, below and Walther et al. 2016). So in comparing the Euclidean distance to correlation distance, the question becomes whether those normalizations (and the squaring) are desirable.
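The proportionality can be checked numerically. For mean-centered, unit-norm patterns a and b, the squared Euclidean distance is 2 - 2 a·b = 2(1 - r), i.e. twice the correlation distance. The following sketch verifies this on two hypothetical random patterns (the patterns and function names are illustrative assumptions).

```python
import numpy as np

def normalize_pattern(p):
    """Subtract the mean across voxels, then scale the pattern to unit norm."""
    centered = p - p.mean()
    return centered / np.linalg.norm(centered)

def correlation_distance(a, b):
    """1 minus the Pearson correlation between two patterns."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

def squared_euclidean_after_normalization(a, b):
    """Squared Euclidean distance between the separately normalized patterns."""
    diff = normalize_pattern(a) - normalize_pattern(b)
    return float(diff @ diff)

rng = np.random.default_rng(0)
a = rng.normal(size=100)                # hypothetical response pattern 1
b = 0.5 * a + rng.normal(size=100) + 3  # hypothetical pattern 2, shifted mean
d_corr = correlation_distance(a, b)
d_eucl2 = squared_euclidean_after_normalization(a, b)
print(d_corr, d_eucl2 / 2.0)            # the two values should coincide
```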
One motivation for removing the mean is to make the pattern analysis more complementary to the regional-mean activation analysis, which many researchers standardly also perform. Note that this motivation is at odds with the desire to best emulate decoding results because most decoders, by default, will exploit regional-mean activation differences as well as fine-grained pattern differences.
The finding that Euclidean and Mahalanobis distances better predicted decoding accuracies here than correlation distance could have either or both of the following causes:
Correlation distance normalizes out the regional-mean component. On the one hand, regional-mean effects are large and will often contribute to successful decoding. On the other hand, removing the regional mean is a very ineffective way to remove overall-activation effects (especially when different voxels respond with different gains). Removing the regional mean, therefore, may hardly affect the accuracy of a linear decoder (as shown for a particular data set in Misaki et al. 2010).
Correlation distance normalizes out the pattern variance across voxels. The divisive normalization of the variance around the mean has an undesirable effect: Two experimental conditions that do not drive a response and therefore have uncorrelated patterns (noise only, r ≈ 0) appear very dissimilar (1 – r ≈ 1). If we used a decoder, we would find that the two conditions that don’t drive responses are indistinguishable, despite their substantial correlation distance. This has been explained and illustrated by Walther et al. (2016; Fig. 2, below). Note that the two stimuli would be indistinguishable, even if the decoder was based on correlation distance (e.g. Haxby et al. 2001). It is the independent test set used in decoding that makes the difference here.
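A small simulation illustrates this second point. It is a sketch under strong simplifying assumptions (independent Gaussian noise, no true response in either condition, hypothetical numbers of runs and voxels). The correlation distance between the two noise-only pattern estimates comes out close to 1, whereas a leave-one-run-out minimum-correlation-distance classifier, in the spirit of Haxby et al. (2001), performs at chance.

```python
import numpy as np

def noise_only_simulation(n_datasets=200, n_runs=8, n_vox=200, seed=0):
    """Return (mean correlation distance, crossvalidated decoding accuracy)."""
    rng = np.random.default_rng(seed)
    corr_dists, n_correct, n_total = [], 0, 0
    for _ in range(n_datasets):
        # Neither condition drives a response: all pattern estimates are noise.
        data = rng.normal(size=(n_runs, 2, n_vox))
        mean_patterns = data.mean(axis=0)
        corr_dists.append(1.0 - np.corrcoef(mean_patterns)[0, 1])
        # Leave-one-run-out classification by maximum correlation.
        for test_run in range(n_runs):
            train = np.delete(data, test_run, axis=0).mean(axis=0)
            for true_label in (0, 1):
                test_pattern = data[test_run, true_label]
                r = [np.corrcoef(test_pattern, train[k])[0, 1] for k in (0, 1)]
                n_correct += int(np.argmax(r) == true_label)
                n_total += 1
    return float(np.mean(corr_dists)), n_correct / n_total

mean_dist, accuracy = noise_only_simulation()
print(f"mean correlation distance: {mean_dist:.3f}, accuracy: {accuracy:.3f}")
```

The large distance here reflects only noise. It is the independent test set, not the choice of correlation as the similarity measure, that protects the decoder from this artefact.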
Normalizing each pattern (by subtracting the regional mean and/or dividing by the standard deviation across voxels) is a defensible choice – despite the fact that it might make dissimilarities less correlated with linear decoding accuracies (when the latter are based on different normalization choices). However, it is desirable to use crossvalidation (as is typically used in decoding) to remove bias.
The dichotomy of decoding versus dissimilarity is misleading, because any decoder is based on some notion of dissimilarity. The minimum-correlation-distance decoder (Haxby et al. 2001) is one case in point. The Fisher linear discriminant can similarly be interpreted as a minimum-Mahalanobis-distance classifier. Decoders imply dissimilarities, requiring the same fundamental choices, so the dichotomy appears unhelpful.
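The equivalence for the Fisher linear discriminant can be verified directly. The sketch below assumes two classes with equal priors and a shared noise covariance; the means, covariance, and test points are hypothetical random values. Thresholding the projection onto w = Σ⁻¹(μ1 - μ2) at the midpoint between the class means gives the same labels as assigning each point to the class mean with the smaller Mahalanobis distance.

```python
import numpy as np

def fisher_predict(X, mu1, mu2, cov):
    """Label 1 or 2 by thresholding the Fisher discriminant at the midpoint."""
    w = np.linalg.solve(cov, mu1 - mu2)
    return np.where((X - (mu1 + mu2) / 2.0) @ w > 0, 1, 2)

def min_mahalanobis_predict(X, mu1, mu2, cov):
    """Label 1 or 2 by the nearer class mean in Mahalanobis distance."""
    precision = np.linalg.inv(cov)

    def squared_distance(mu):
        D = X - mu
        return np.einsum("ij,jk,ik->i", D, precision, D)

    return np.where(squared_distance(mu1) < squared_distance(mu2), 1, 2)

def demo_agreement(n_points=1000, n_dim=10, seed=0):
    """Check that both classifiers agree on random test points."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(n_dim, n_dim))
    cov = B @ B.T + np.eye(n_dim)        # hypothetical noise covariance
    mu1, mu2 = rng.normal(size=n_dim), rng.normal(size=n_dim)
    X = rng.normal(size=(n_points, n_dim)) * 3.0
    return bool(np.array_equal(fisher_predict(X, mu1, mu2, cov),
                               min_mahalanobis_predict(X, mu1, mu2, cov)))

print(demo_agreement())
```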
To get around the issue of choosing a decoder, the authors argue that the relevant decoder is the optimal decoder. However, this doesn’t solve the problem. Imagine we applied the optimal decoder to representations of object images in the retina and in inferior temporal (IT) cortex. As the amount of data we use grows, every image will become discriminable from every other image with 100% accuracy in both the retina and IT cortex (for a typical set of natural photographs). If we attempted to decode categories, every category would eventually become discernible in the retinal patterns.
Given enough data and flexibility with our decoder, we end up characterizing the encoded information, but not the format in which it is encoded. The encoded information would be useful to know (e.g. IT might carry less information about the stimulus than the retina). However, we are usually also (and often more) interested in the “explicit” information, i.e. in the information accessible to a simple, biologically plausible decoder (e.g. the category information, which is explicit in IT, but not in the retina).
The motivation for measuring representational dissimilarities is typically to characterize the representational geometry, which tells us not just the encoded information (in conjunction with a noise model), but also the format (up to an affine transform). The representational geometry defines how well any decoder capable of an affine transform can perform.
In sum, in selecting our measure of representational dissimilarity we (implicitly or explicitly) make a number of choices:
Should the patterns be normalized and, if so, how?
This will make us insensitive to certain dimensions of the response space, such as the overall mean, which may be desirable despite reducing the similarity of our results to those obtained with optimal decoders.
Should the measure reflect the representational geometry? Euclidean and Mahalanobis distance characterize the geometry (before or after whitening the noise, respectively). By contrast, saturating functions of these distances such as decoding accuracy or mutual information (for decoding stimulus pairs) do not optimally reflect the geometry. See Figs. 3, 4 below for the monotonic relationships among distance (measured along the Fisher linear discriminant), decoding accuracy, and mutual information between stimulus and response.
Should we use independent data to remove the positive bias of the dissimilarity estimate? Independent data (as in crossvalidation) can be used to remove the positive bias not only of the training-set accuracy of a decoder, but also of an estimate of a distance on the basis of noisy data (Kriegeskorte et al. 2007, Nili et al. 2014, Walther et al. 2016).
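The positive bias and its removal by crossvalidation can be shown in a short simulation. This is a simplified sketch: it assumes isotropic Gaussian noise, so no noise whitening is applied as in the full crossnobis estimator, and the numbers of runs and voxels are hypothetical. When the true pattern difference is zero, the naive squared Euclidean distance is systematically positive, whereas the crossvalidated estimate is centered on zero.

```python
import numpy as np

def simulate_distance_estimates(true_diff, n_runs=8, n_sims=2000,
                                noise_sd=1.0, seed=0):
    """Return the average (naive, crossvalidated) squared-distance estimates."""
    rng = np.random.default_rng(seed)
    true_diff = np.asarray(true_diff, dtype=float)
    # Per-run estimates of the pattern difference between two conditions.
    # Each condition has independent noise, so the difference has sd * sqrt(2).
    noise = rng.normal(size=(n_sims, n_runs, true_diff.size))
    diffs = true_diff + noise_sd * np.sqrt(2.0) * noise
    # Naive estimate: squared norm of the run-averaged difference pattern.
    naive = np.sum(diffs.mean(axis=1) ** 2, axis=1)
    # Crossvalidated estimate: inner product between the difference estimated
    # from the training runs and the difference in the held-out run,
    # averaged over leave-one-run-out folds.
    train_mean = (diffs.sum(axis=1, keepdims=True) - diffs) / (n_runs - 1)
    crossval = np.mean(np.sum(train_mean * diffs, axis=2), axis=1)
    return float(naive.mean()), float(crossval.mean())

naive_avg, crossval_avg = simulate_distance_estimates(np.zeros(50))
print(f"true distance 0: naive = {naive_avg:.2f}, crossvalidated = {crossval_avg:.2f}")
```

Because the noise in the training and held-out runs is independent, the noise terms cancel in expectation in the crossvalidated inner product, which gives the estimate an interpretable zero point.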
Linear decodability is widely used as a measure of representational distinctness, because decoding results are more relevant to neural computation when the decoder is biologically plausible for a single neuron. The advantages of linear decoding (interpretability, bias removal by crossvalidation) can be combined with the advantages of distances (non-quantization, non-saturation, characterization of representational geometry) and this is standardly done in representational similarity analysis by using the linear discriminant t (LD-t) value (Kriegeskorte et al. 2007, Nili et al. 2014) or the crossnobis estimator (Walther et al. 2016, Diedrichsen et al. 2016, Kriegeskorte & Diedrichsen 2016, Diedrichsen & Kriegeskorte 2017, Carlin et al. 2017). These measures of representational dissimilarity combine the advantages of decoding accuracies and continuous dissimilarity measures:
Biological plausibility: Like linear decoders, they reflect what can plausibly be directly read out.
Bias removal: As in linear decoding analyses, crossvalidation (1) removes the positive bias (which similarly affects training-set accuracies and distance functions applied to noisy data) and (2) provides robust frequentist tests of discriminability. For example, the crossnobis estimator provides an unbiased estimate of the Mahalanobis distance (Walther et al. 2016) with an interpretable 0 point.
Non-quantization: Unlike decoding accuracies, crossnobis and LD-t estimates are continuous estimates, uncompromised by quantization. Decoding accuracies, in contrast, are quantized by thresholding (based on often small counts of correct and incorrect predictions), which can reduce statistical efficiency (Walther et al. 2016).
Non-saturation: Unlike decoding accuracies, crossnobis and LD-t estimates do not saturate. Decoding accuracies suffer from a ceiling effect when two patterns that are already well-discriminable are moved further apart. Crossnobis and LD-t estimates proportionally reflect the true distances in the representational space.
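The ceiling effect is easy to quantify in an idealized case. Assuming two equiprobable classes with Gaussian noise of unit standard deviation along the discriminant (an assumption made here for illustration), the optimal pairwise decoding accuracy is Φ(d/2), where d is the distance between the class means in noise standard deviations and Φ is the standard normal cumulative distribution function.

```python
import math

def pairwise_decoding_accuracy(distance):
    """Optimal accuracy for two equiprobable unit-variance Gaussian classes
    whose means lie `distance` noise standard deviations apart: Phi(d / 2)."""
    return 0.5 * (1.0 + math.erf(distance / (2.0 * math.sqrt(2.0))))

for d in (0.0, 1.0, 2.0, 4.0, 8.0):
    print(f"distance = {d:4.1f}  accuracy = {pairwise_decoding_accuracy(d):.5f}")
```

Doubling the distance from 4 to 8 changes the accuracy by less than 3 percentage points in this setting, whereas a distance estimate would reflect the change proportionally.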
The paper considers a wide range of dissimilarity measures (though these are not fully defined or explained).
The paper uses two fMRI data sets to compare many dissimilarity measures across many locations in the brain.
The premise of the paper that optimal decoders are the gold standard does not make sense.
Even if decoding accuracy (e.g. linear) were taken as the standard to aspire to, why not use it directly, instead of a stand-in dissimilarity measure?
The paper lags behind the state of the literature, where researchers routinely use dissimilarity measures that are either based on decoding or that combine the advantages of decoding accuracies and continuous distances.
The premise that the optimal decoder should be the gold standard by which to choose a similarity measure does not make sense, because the optimal decoder reveals only the encoded information, but nothing about its format and what information is directly accessible to readout neurons.
If linear decoding accuracy (or the accuracy of some other simple decoder) is to be considered the gold standard measure of representational dissimilarity, then why not use the gold standard itself instead of a different dissimilarity measure?
In fact, representational similarity analyses using decoder accuracies and linear discriminability measures (LD-t, crossnobis) are widely used in the literature (Kriegeskorte et al. 2007, Nili et al. 2014, Cichy et al. 2014, Carlin et al. 2017 to name just a few).
One motivation for using the Pearson correlation distance to measure representational dissimilarity is to reduce the degree to which regional-mean activation differences affect the analyses. Researchers generally understand that Pearson correlation is not ideal from a decoding perspective, but prefer to choose a measure more complementary to regional-mean activation analyses. This motivation is inconsistent with the premise that decoder confusability should be the gold standard.
A better argument against using the Pearson correlation distance is that it has the undesirable property that it renders indistinguishable the case when two stimuli elicit very distinct response patterns and the case when neither stimulus drives the region strongly (and the pattern estimates are therefore noise and uncorrelated).
An elegant new study by Bracci, Kalfas & Op de Beeck (pp2018) suggests that the prominent division between animate and inanimate things in the human ventral stream’s representational space is based on a superficial analysis of visual appearance, rather than on a deeper analysis of whether the thing before us is a living thing or a lifeless object.
Bracci et al. assembled a beautiful set of stimuli divided into 9 equivalent triads (Figure 1). Each triad consists of an animal, a manmade object, and a kind of hybrid of the two: an artefact of the same category and function as the object, designed to resemble the animal in the triad.
Bracci et al. measured response patterns to each of the 27 stimuli (stimulus duration: 1.5 s) using functional magnetic resonance imaging (fMRI) with blood-oxygen-level-dependent (BOLD) contrast and voxels of 3-mm width in each dimension. Sixteen subjects viewed the images in the scanner while performing each of two tasks: categorizing the images as depicting something that looks like an animal or not (task 1) and categorizing the images as depicting a real living animal or a lifeless artefact (task 2).
The authors performed representational similarity analysis, computing representational dissimilarity matrices (RDMs) using the correlation distance (1 – Pearson correlation between spatial response patterns). They averaged representational dissimilarities of the same kind (e.g. between the animal and the corresponding hybrid) across the 9 triads. To compare different kinds of representational distance, they used ANOVAs and t tests to perform inference (treating the subject variable as a random effect). They also studied the representations of the stimuli in the last fully connected layers of two deep neural networks (DNNs; VGG-19, GoogLeNet) trained to classify objects, and in human similarity judgments. For the DNNs and human judgements, they used stimulus bootstrapping (treating the stimulus variable as a random effect) to perform inference.
Results of a series of well-motivated analyses are summarized in Figure 2 below (not in the paper). The most striking finding is that while human judgments and DNN last-layer representations are dominated by the living/nonliving distinction, human ventral temporal cortex (VTC) appears to care more about appearance: the hybrid animal-lookalike objects, despite being lifeless artefacts, fall closer to the animals than to the objects. In addition, the authors find:
Clusters of animals, hybrids, and objects: In VTC, animals, hybrids, and objects form significantly distinct clusters (average within-cluster dissimilarity < average between-cluster dissimilarity for all three pairs of categories). In DNNs and behavioral judgments, by contrast, the hybrids and the objects do not form significantly distinct clusters (but animals form a separate cluster from hybrids and from objects).
Matching of animals to corresponding hybrids: In VTC, the distance between a hybrid animal-lookalike and the corresponding animal is significantly smaller than that between a hybrid animal-lookalike and a non-matching animal. This indicates that VTC discriminates the animals and animal-lookalikes and (at least to some extent) matches the lookalikes to the correct animals. This effect was also present in the similarity judgments and DNNs. However, the latter two similarly matched the hybrids up with their corresponding objects, which was not a significant effect in VTC.
The effect of the categorization task on the VTC representation was subtle or absent, consistent with other recent studies (cf. Nastase et al. 2017, open review). The representation appears to be mostly stimulus driven.
The results of Bracci et al. are consistent with the idea that the ventral stream transforms images into a semantic representation by computing features that are grounded in visual appearance, but correlated with categories (Jozwik et al. 2015). VTC might be 5-10 nonlinear transformations removed from the image. While it may emphasize visual features that help with categorization, it might not be the stage where all the evidence is put together for our final assessment of what we’re looking at. VTC, thus, is fooled by these fun artefacts, and that might be what makes them so charming.
Although this interpretation is plausible enough and straightforward, I am left with some lingering thoughts to the contrary.
What if things were the other way round? Instead of DNNs judging correctly where VTC is fooled, what if VTC had a special ability that the DNNs lack: to see the analogy between the cow and the cow-mug, to map the mug onto the cow? The “visual appearance” interpretation is based on the deceptively obvious assumption that the cow-mug (for example) “looks like” a cow. One might, equally compellingly, argue that it looks like a mug: it’s glossy, it’s conical, it has a handle. VTC, then, does not fail to see the difference between the fake animal and the real animal (in fact these categories do cluster in VTC). Rather it succeeds at making the analogy, at mapping that handle onto the tail of a cow, which is perhaps an example of a cognitive feat beyond current AI.
Bracci et al.’s results are thought-provoking and the study looks set to inspire computational and empirical follow-up research that links vision to cognition and brain representations to deep neural network models.
addresses an important question
elegant design with beautiful stimulus set
well-motivated and comprehensive analyses
interesting and thought-provoking results
two categorization tasks, promoting either the living/nonliving or the animal-appearance/non-animal appearance division
behavioral similarity judgment data
information-based searchlight mapping, providing a broader view of the effects
new data set to be shared with the community
representational geometry analyses, though reasonable, are suboptimal
no detailed analyses of DNN representations (only the last fully connected layers shown, which are not expected to best model the ventral stream) or the degree to which they can explain the VTC representation
only three ROIs (V1, posterior VTC, anterior VTC)
correlation distance used to measure representational distances (making it difficult to assess which individual representational distances are significantly different from zero, which appears important here)
Suggestions for improvement
The analyses are effective and support most of the claims made. However, to push this study from good to excellent, I suggest the following improvements.
Improved representational-geometry analysis
The key representational dissimilarities needed to address the questions of this study are labeled a-g in Figure 2. It would be great to see these seven quantities estimated, tested for deviation from 0, and all 7 choose 2 = 21 pairwise comparisons tested. This would address which distinctions are significant and enable addressing all the questions with a consistent approach, rather than combining many qualitatively different statistics (including clustering index, identity index, and model RDM correlation).
With the correlation distance, this would require a split-data RDM approach, consistent with the present approach, but using the repeated response measurements to the same stimulus to estimate and remove the positive bias of the correlation-distance estimates. However, a better approach would be to use a crossvalidated distance estimator (more details below).
Multidimensional scaling (MDS) to visualize representational geometries
This study has 27 unique stimuli, a number well suited for visualization of the representational geometries by MDS. To appreciate the differences between the triads (each of which has unique features), it would be great to see an MDS of all 27 objects and perhaps also MDS arrangements of subsets, e.g. each triad or pairs of triads (so as to reduce distortions due to dimensionality reduction).
Most importantly, the key representational dissimilarities a-g can be visualized in a single MDS as shown in Figure 2 above, using two triads to illustrate the triad-averaged representational geometry (showing average within- and between-triad distances among the three types of object). The MDS could use 2 or 3 dimensions, depending on which variant better visually conveys the actual dissimilarity estimates.
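As a rough illustration of the kind of visualization meant here, the following sketch implements classical (Torgerson) MDS with NumPy. This is one simple MDS variant chosen for brevity, not necessarily the one the authors would use, and the 27 "stimuli" are random points standing in for a real representational dissimilarity matrix.

```python
import numpy as np

def classical_mds(rdm, n_dims=2):
    """Embed items in n_dims dimensions from a matrix of pairwise distances."""
    n = rdm.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (rdm ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

def pairwise_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Stand-in data: 27 items whose true geometry is two-dimensional.
rng = np.random.default_rng(0)
true_points = rng.normal(size=(27, 2))
rdm = pairwise_distances(true_points)

embedding = classical_mds(rdm, n_dims=2)
recovered = pairwise_distances(embedding)
print("max distortion:", np.abs(rdm - recovered).max())
```

With real data the embedding will distort some distances, which is why showing subsets (single triads or pairs of triads) can help.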
Crossvalidated distance estimators
The correlation distance is not an ideal dissimilarity measure because a large correlation distance does not indicate that two stimuli are distinctly represented. If a region does not respond to either stimulus, for example, the correlation of the two patterns (due to noise) will be close to 0 and the correlation distance will be close to 1, a high value that can be mistaken as indicating a decodable stimulus pair.
Crossvalidated distances such as the linear-discriminant t value (LD-t; Kriegeskorte et al. 2007, Nili et al. 2014) or the crossnobis distance (also known as the linear discriminant contrast, LDC; Walther et al. 2016) would be preferable. Like decoding accuracy, they use crossvalidation to remove bias (due to overfitting) and indicate that the two stimuli are distinctly encoded. Unlike decoding accuracy, they are continuous and nonsaturating, which makes them more sensitive and a better way to characterize representational geometries.
Since the LD-t and the crossnobis distance estimators are symmetrically distributed about 0 under the null hypothesis (H0: response patterns drawn from the same distribution), it would be straightforward to test these distances (and averages over sets of them) for deviation from 0, treating subjects and/or stimuli as random effects, and using t tests, ANOVAs, or nonparametric alternatives. Comparing different dissimilarities or set-average dissimilarities is similarly straightforward.
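The contrast between the two kinds of estimator can be sketched in a few lines. The code below is a simplified crossnobis estimator under stated assumptions (simulated data, independent unit-variance noise so that the noise precision matrix is the identity, and eight runs); it is not the authors' pipeline or any particular toolbox's implementation.

```python
import numpy as np

def crossnobis(patterns_a, patterns_b, noise_precision):
    """Crossvalidated Mahalanobis distance between two conditions.

    patterns_a, patterns_b: (n_runs, n_voxels) independent pattern estimates.
    noise_precision: (n_voxels, n_voxels) inverse of the noise covariance.
    """
    diffs = patterns_a - patterns_b
    n_runs, n_voxels = diffs.shape
    products = diffs @ noise_precision @ diffs.T   # run-by-run inner products
    cross_run = products[~np.eye(n_runs, dtype=bool)]  # drop same-run terms
    return cross_run.mean() / n_voxels

def correlation_distance(x, y):
    return 1.0 - np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(0)
n_runs, n_voxels = 8, 200
precision = np.eye(n_voxels)  # assumption: independent unit-variance noise

# A region that responds to neither stimulus: both conditions are pure noise.
null_a = rng.normal(size=(n_runs, n_voxels))
null_b = rng.normal(size=(n_runs, n_voxels))

# A region that distinguishes the stimuli: a fixed pattern difference plus noise.
signal = rng.normal(size=n_voxels)
resp_a = signal + rng.normal(size=(n_runs, n_voxels))
resp_b = rng.normal(size=(n_runs, n_voxels))

d_null = crossnobis(null_a, null_b, precision)
d_signal = crossnobis(resp_a, resp_b, precision)
corr_null = correlation_distance(null_a.mean(axis=0), null_b.mean(axis=0))
print(f"crossnobis (null): {d_null:.3f}, crossnobis (signal): {d_signal:.3f}")
print(f"correlation distance (null): {corr_null:.3f}")
```

In the null region the crossnobis estimate scatters around 0, whereas the correlation distance sits near 1, the value that could be mistaken for a distinct representation.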
Linear crossdecoding with generalization across triads
An additional analysis that would give complementary information is linear decoding of categorical divisions with generalization across stimuli. A good approach would be leave-one-triad-out linear classification of:
living versus nonliving
things that look like animals versus other things
animal-lookalikes versus other things
animals versus animal-lookalikes
animals versus objects
animal-lookalikes versus objects
This might work for divisions that do not show clustering (within dissimilarity < between dissimilarity), which would indicate linear separability in the absence of compact clusters.
For the living/nonliving distinction, for example, the linear discriminant would select responses that are not confounded by animal-like appearance (as most VTC responses seem to be), responses that distinguish living things from animal-lookalike objects. This analysis would provide a good test of the existence of such responses in VTC.
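A minimal sketch of leave-one-triad-out crossdecoding follows, assuming simulated response patterns and a deliberately simple linear classifier (the difference of class means with a midpoint threshold). The data-generating model, the voxel count, and the function name are illustrative assumptions; a regularized linear discriminant or SVM could be substituted.

```python
import numpy as np

def leave_one_triad_out_accuracy(patterns, labels, triads):
    """Train a class-mean linear classifier on all but one triad, test on it."""
    correct = []
    for held_out in np.unique(triads):
        train = triads != held_out
        test = ~train
        mean_pos = patterns[train & (labels == 1)].mean(axis=0)
        mean_neg = patterns[train & (labels == 0)].mean(axis=0)
        weights = mean_pos - mean_neg            # linear discriminant direction
        midpoint = 0.5 * (mean_pos + mean_neg)   # decision threshold
        scores = (patterns[test] - midpoint) @ weights
        predictions = (scores > 0).astype(int)
        correct.extend(predictions == labels[test])
    return float(np.mean(correct))

# Simulated example: one binary division (e.g. animals versus animal-lookalikes)
# with 9 triads, a shared category axis, and triad-specific pattern components.
rng = np.random.default_rng(0)
n_triads, n_voxels = 9, 50
category_axis = rng.normal(size=n_voxels)

patterns, labels, triads = [], [], []
for triad in range(n_triads):
    triad_pattern = rng.normal(size=n_voxels)
    for label in (0, 1):
        sign = 1.0 if label == 1 else -1.0
        noise = rng.normal(size=n_voxels)
        patterns.append(sign * category_axis + triad_pattern + noise)
        labels.append(label)
        triads.append(triad)

patterns = np.array(patterns)
labels = np.array(labels)
triads = np.array(triads)

accuracy = leave_one_triad_out_accuracy(patterns, labels, triads)
print(f"leave-one-triad-out accuracy: {accuracy:.2f}")
```

Because the test triad never contributes to training, above-chance accuracy indicates a category direction that generalizes across exemplars rather than memorized stimulus patterns.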
More layers of the two DNNs
To assess the hypothesis that VTC computes features that are more visual than semantic with DNNs, it would be useful to include an analysis of all the layers of each of the two DNNs, and to test whether weighted combinations of layers can explain the VTC representational geometry (cf. Khaligh-Razavi & Kriegeskorte 2014).
How do these effects look in V2, V4, LOC, FFA, EBA, and PPA?
The use of the term “bias” in the abstract and main text is nonstandard and didn’t make sense to me. Bias only makes sense when we have some definition of what the absence of bias would mean. Similarly the use of “veridical” in the abstract doesn’t make sense. There is no norm against which to judge veridicality.
The polar plots are entirely unmotivated. There is no cyclic structure or even meaningful order to the 9 triads.
“DNNs are very good, and even better than than human visual cortex, at identifying a cow-mug as being a mug — not a cow.” This is not a defensible claim for several reasons, each of which by itself suffices to invalidate this.
fMRI does not reveal all the information in cortex.
VTC is not all of visual cortex.
VTC does cluster animals separately from animal-lookalikes and from objects.
Linear readout of animacy (cross-validated across triads) might further reveal that the distinction is present (even if it is not dominant in the representational geometry).
“how an object looks like” -> “how an object looks” or “what an object looks like”
The orientation of a visual grating can be decoded from fMRI response patterns in primary visual cortex (Kamitani & Tong 2005, Haynes & Rees 2005). This was surprising because fMRI voxels in these studies are 3 mm wide in each dimension and thus average over many columns of neurons that respond to different orientations. Since then, many studies have sought to clarify why fMRI orientation decoding works so well.
The first explanation given was that even though much of the contrast of the neuronal orientation signals might cancel out in the averaging within each voxel, any given voxel might retain a slight bias toward certain orientations if it didn’t sample all the columns exactly equally (Kamitani & Tong 2005, Boynton 2005). By integrating the evidence across many slightly biased voxels with a linear decoder, it should then be possible to guess, better than chance, the orientation of the stimulus.
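The logic of pooling weak voxel biases can be illustrated with a small simulation. The sketch below assumes 200 voxels whose orientation biases are small relative to trial noise, and a simple class-mean linear decoder; all numbers are arbitrary choices for illustration, not estimates from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_trials = 200, 100

# Each voxel has a slight, random preference for one of two orientations.
voxel_bias = 0.1 * rng.normal(size=n_voxels)

def simulate(n_trials_per_class):
    labels = np.repeat([0, 1], n_trials_per_class)
    signs = np.where(labels == 1, 1.0, -1.0)
    noise = rng.normal(size=(labels.size, n_voxels))  # unit-variance trial noise
    return signs[:, None] * voxel_bias + noise, labels

train_x, train_y = simulate(n_trials)
test_x, test_y = simulate(n_trials)

mean1 = train_x[train_y == 1].mean(axis=0)
mean0 = train_x[train_y == 0].mean(axis=0)
weights = mean1 - mean0
midpoint = 0.5 * (mean1 + mean0)

# Linear decoder pooling evidence across all voxels.
pooled_pred = ((test_x - midpoint) @ weights > 0).astype(int)
pooled_accuracy = float(np.mean(pooled_pred == test_y))

# For comparison: the single voxel with the largest training-set bias.
best = int(np.argmax(np.abs(weights)))
single_score = (test_x[:, best] - midpoint[best]) * np.sign(weights[best])
single_accuracy = float(np.mean((single_score > 0).astype(int) == test_y))

print(f"pooled: {pooled_accuracy:.2f}, best single voxel: {single_accuracy:.2f}")
```

The pooled decoder performs well above chance even though no individual voxel is very informative.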
Later work explored how random orientation biases might arise in the voxels. If each voxel directly sampled the orientation columns (computing an average within its cuboid boundaries), then decoding success should be very sensitively dependent on the alignment of the voxels between training and test sets. A shift of the voxel grid on the scale of the width of an orientation column would change the voxel biases and abolish decoding success. Several groups have argued that the biases might arise at the level of the vasculature (Gardner et al. 2009, Kriegeskorte et al. 2009). This would make the biases enabling orientation decoding less sensitive to slight shifts of the voxel grid. Moreover, if voxels reflected signals sampled through the fine-grained vasculature, then it would be the vasculature, not the voxel grid that determines to what extent different spatial frequencies of the underlying neuronal activity patterns are reflected in the fMRI patterns (Kriegeskorte et al. 2009).
Another account (Op de Beeck 2010, Freeman et al. 2011) proposed that decoding may rely exclusively on coarse-scale spatial patterns of activity. In particular, Freeman, Brouwer, Heeger, and Merriam (2011) argued that radial orientations (those aligned with a line that passes through the point of fixation) are over-represented in the neural population. If this were the case, then a grating would elicit a coarse-scale response pattern across its representation in V1, in which the neurons representing edges pointing (approximately) at fixation are more strongly active. There is indeed evidence from multiple studies for a nonuniform representation of orientations in V1 (Furmanski & Engel 2000, Sasaki et al. 2006, Serences et al. 2009, Mannion et al. 2010), perhaps reflecting the nonuniform probability distribution of orientation in natural visual experience. The over-representation of radial orientations might help explain the decodability of gratings. However, opposite-sense spirals (whose orientations are balanced about the radial orientation) are also decodable (Mannion et al. 2009, Alink et al. 2013). This might be due to a simultaneous over-representation of vertical orientations (Freeman et al. 2013, but see Alink et al. 2013).
There’s evidence in favor of a contribution to orientation decoding of both coarse-scale (Op de Beeck 2010, Freeman et al. 2011, Freeman et al. 2013) and fine-scale components of the fMRI patterns (e.g. Shmuel et al. 2010, Swisher et al. 2010, Alink et al. 2013, Pratte et al. 2016, Alink et al. 2017).
Note that both coarse-scale and fine-scale pattern accounts suggest that voxels have biases in favor of certain orientations. An entirely novel line of argument was introduced to the debate by Carlson (2014).
Carlson (2014) argued, on the basis of simulation results, that even if every voxel sampled a set of filters uniformly representing all orientations (i.e. without any bias), the resulting fMRI patterns could still reflect the orientation of a grating confined to a circular annulus (as standardly used in the literature). The reason lies in “the interaction between the stimulus region and the empty background” (Carlson 2014), an effect of the relative orientations of the grating and the edge of the aperture (the annulus within which the grating is visible). Carlson’s simulations showed that the average response of a uniform set of Gabor orientation filters is larger where the aperture edge is orthogonal to the grating. He also showed that the effect does not depend on whether the aperture edge is hard or soft (fading contrast). Because the voxels in this account have no biases in favor of particular orientations, Carlson aptly referred to his account as an “unbiased” perspective.
The aperture edge adds edge energy. The effect is strongest when the edge is orthogonal to the carrier grating orientation. We can understand this in terms of the Fourier spectrum. Whereas a sine grating has a concentrated representation in the 2D Fourier amplitude spectrum, the energy is more spread out when an aperture limits the extent of the grating, with the effect depending on the relative orientations of grating and edge.
For an intuition on how this kind of thing can happen, consider a particularly simple scenario, where a coarse rectangular grating is limited by a sharp aperture whose edge is orthogonal to the grating. V1 cells with small receptive fields will respond to the edge itself as well as to the grating. When edge and grating are orthogonal, the widest range of orientation-selective V1 cells is driven. However, the effect is present also for sinusoidal gratings and soft apertures, where contrast fades gradually, e.g. according to a raised half-cosine.
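The Fourier intuition can be checked with a toy computation. The sketch below is not the Gabor-filter model used by Carlson (2014) or Roth et al.; it assumes a sinusoidal grating with vertical stripes, a hard straight aperture edge, and measures the fraction of Fourier power that falls at orientations other than the carrier's.

```python
import numpy as np

n = 128
y, x = np.mgrid[0:n, 0:n]
grating = np.sin(2 * np.pi * 8 * x / n)  # vertical stripes, 8 cycles across

def off_orientation_power(image):
    """Fraction of Fourier power at orientations other than the carrier's.

    The carrier varies only along x, so all of its power lies in the
    fy = 0 row of the 2D spectrum (axis 0 of the array indexes fy).
    """
    power = np.abs(np.fft.fft2(image)) ** 2
    return 1.0 - power[0, :].sum() / power.sum()

edge_parallel = (x < n // 2).astype(float)    # edge parallel to the stripes
edge_orthogonal = (y < n // 2).astype(float)  # edge orthogonal to the stripes

full = off_orientation_power(grating)
parallel = off_orientation_power(grating * edge_parallel)
orthogonal = off_orientation_power(grating * edge_orthogonal)

print(f"no aperture: {full:.3f}")
print(f"edge parallel to grating: {parallel:.3f}")
print(f"edge orthogonal to grating: {orthogonal:.3f}")
```

An edge parallel to the stripes only spreads power across spatial frequencies at the carrier orientation, whereas an orthogonal edge moves a substantial share of the power to other orientations, consistent with a wider range of orientation-selective filters being driven.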
An elegant new study by Roth, Heeger, and Merriam (pp2018) now follows up on the idea of Carlson (2014) with fMRI at 3T and 7T. Roth et al. refer to the interaction between the edge and the content of the aperture as “vignetting” and used apertures composed of either multiple annuli or multiple radial rays. These finer-grained apertures spread the vignetting effect all throughout the stimulated portion of the visual field and so are well suited to demonstrate the effect on fMRI patterns.
Roth et al. present simulations (Figure 1), following Carlson (2014) and assuming that every voxel uniformly samples all orientations. They confirm Carlson’s account and show that the grating stimuli the group used earlier in Freeman et al. (2011) are expected to produce the stronger response to radial parts of the grating, where the aperture edge is orthogonal to the grating — even without any over-representation of radial orientations by the neurons.
Freeman et al. (2011) used a relatively narrow annulus (inner edge: 4.5º, outer edge: 9.5º eccentricity from fixation), where no part of the grating is far from the edge. This causes the vignetting effect to create the appearance of a radial bias that is strongest at the edges but present even in the central part of the annular aperture (Figure 1, bottom right). Roth et al.’s findings suggest that the group’s earlier result might reflect vignetting, rather than (or in addition to) a radial bias of the V1 neurons.
Roth et al. use simulations also to show that their new stimuli, in which the aperture consists of multiple annuli or multiple radial rays, predict coarse-scale patterns across V1. They then demonstrate in single subjects measured with fMRI at 3T and 7T that V1 responds with the globally modulated patterns predicted by the account of Carlson (2014).
The study is beautifully designed and expertly executed. Results compellingly demonstrate that, as proposed by Carlson (2014), vignetting can account for the coarse-scale biases reported in Freeman et al. (2011). The paper also contains a careful discussion that places the phenomenon in a broader context. Vignetting describes a family of effects related to aperture edges and their interaction with the contents of the aperture. The interaction could be as simple as the aperture edge adding edge energy of a different orientation and thus changing orientation-selective response. It could also involve extra-receptive-field effects such as non-isotropic surround suppression.
The study leaves me with two questions:
Is the radial orientation-preference map in V1, as described in Freeman et al. (2011), entirely an artefact of vignetting (or is there still also an over-representation of radial orientations in the neuronal population)?
Does vignetting also explain fMRI orientation signals in studies that use larger oriented gratings, where much of the grating is further from the edge of the aperture, as in Kamitani & Tong (2005)?
The original study by Kamitani and Tong (2005) used a wider annular aperture reaching further into the central region, where receptive fields are smaller (inner edge: 1.5°, outer edge: 10° eccentricity from fixation). The interior parts of the stimulus may therefore not be affected by vignetting. Importantly, Wardle, Ritchie, Seymour, and Carlson (2017) already investigated this issue and their results suggest that vignetting is not necessary for orientation decoding.
It would be useful to analyze the stimuli used by Kamitani & Tong (2005) with a Gabor model (with reasonable choices for the filter sizes). As a second step, it would be good to reanalyze the data from Kamitani & Tong (2005), or from a similar design. The analysis should focus on small contiguous ROIs in V1 of the left and right hemisphere that represent regions of the visual field far from the edge of the aperture.
Going forward, perhaps we can pursue the issue in the spirit of open science. We would acquire fMRI data with maximally large gratings, so that regions unaffected by vignetting can be analyzed (Figure 2). The experiments should include localizers for the aperture margins (transparent blue) and for ROIs perched on the horizontal meridian far from the aperture edges (transparent red). The minimal experiment would contain two grating orientations (45º and -45º as shown at the bottom), each presented with many different phases. Note that, for the ROIs shown in Figure 2, these two orientations minimize undesired voxel biases due to radial and vertical orientation preferences (both gratings have equal angle to the radial orientation and equal angle to the vertical orientation). Note also that these two orientations have equal angle to the aperture edge, thus also minimizing any residual long-range vignetting effect that acts across the safety margin.
The analysis of the ROIs should follow Alink et al. (2017): In each ROI (left hemisphere, right hemisphere), we use a training set of fMRI runs to define two sets of voxels: 45º-preferring and -45º-preferring voxels. We then use the test set of fMRI runs to check, independently for the two voxel sets, whether the preferences replicate. We could implement a sensitive test along these lines by training and testing a linear decoder on just the 45º-preferring voxels, and then another linear decoder on just the -45º-preferring voxels. If both of these decoders have significant accuracy on the test set, we have established that voxels of opposite selectivity intermingle within the same small ROI, indicating fine-grained pattern information.
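A stripped-down version of this split-data logic can be sketched as follows. It replaces the two linear decoders with a simpler check (whether the mean contrast of each voxel set keeps its sign in the test runs) and uses simulated voxel preferences; the voxel count, number of runs, and effect size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_runs = 300, 10

# True per-voxel contrast: response(45 deg) minus response(-45 deg).
true_preference = 0.5 * rng.normal(size=n_voxels)

def measured_contrast(n_runs):
    """Run-averaged contrast estimate with unit-variance noise per run."""
    noise = rng.normal(size=(n_runs, n_voxels))
    return (true_preference + noise).mean(axis=0)

train_contrast = measured_contrast(n_runs)  # used only to define voxel sets
test_contrast = measured_contrast(n_runs)   # independent runs for testing

prefers_plus = train_contrast > 0           # 45-deg-preferring voxels
prefers_minus = ~prefers_plus               # -45-deg-preferring voxels

replication_plus = float(test_contrast[prefers_plus].mean())
replication_minus = float(test_contrast[prefers_minus].mean())

print(f"45-preferring set, test contrast: {replication_plus:+.3f}")
print(f"-45-preferring set, test contrast: {replication_minus:+.3f}")
```

If both sets replicate their training-set preference in independent data, voxels of opposite selectivity coexist within the ROI; if preferences were pure noise, both test contrasts would scatter around zero.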
A more comprehensive experiment would contain perhaps 8 or 16 equally spaced orientations and a range of spatial frequencies balanced about the spatial frequency that maximally drives neurons at the eccentricity of the critical ROIs (Henriksson et al. 2008).
More generally, a standardized experiment along these lines would constitute an excellent benchmark for comparing fMRI acquisition schemes in terms of the information they yield about neuronal response patterns. Such a benchmark would lend itself to comparing different spatial resolutions (0.5 mm, 1 mm, 2 mm, 3 mm), different fMRI sequences, and different field strengths (3T, 7T) across different sites and scanner models. The tradeoffs involved (notably between functional contrast to noise and partial volume sampling) are difficult to estimate without directly testing each fMRI acquisition scheme for the information it yields (Formisano & Kriegeskorte 2012). A standard pattern-information benchmark for fMRI could therefore be really useful, especially if pursued as an open-science project (shared stimuli and presentation protocol, shared fMRI data, contributor coauthorships on the first three papers using someone’s openly shared components).
Glad we sorted this out. Who’s up for collaborating?
Time to go to bed.
Well-motivated and elegant experimental design and analysis
3T and 7T fMRI data from a total of 14 subjects
Compelling results demonstrating that vignetting can cause coarse-scale patterns that enable orientation decoding
The paper claims to introduce a novel idea that requires reinterpretation of a large literature. The claim of novelty is unjustified. Vignetting was discovered by Carlson (2014), and in Wardle et al. (2017), Carlson’s group showed that it may be one, but not the only, contributing factor enabling orientation decoding. Carlson and colleagues deserve clearer credit throughout.
The experiments show that vignetting compromised the stimuli of Freeman et al. (2011), but they don’t address whether the claim by Freeman et al. of an over-representation of radial orientations in the neuronal population holds regardless.
The paper doesn’t attempt to address whether decoding is still possible in the absence of vignetting effects, i.e. far from the aperture boundary.
Particular comments and suggestions
While the experiments and analyses are excellent and the paper well written, the current version is compromised by some exaggerated claims, suggesting greater novelty and consequence than is appropriate. This should be corrected.
“Here, we show that a large body of research that purported to measure orientation tuning may have in fact been inadvertently measuring sensitivity to second-order changes in luminance, a phenomenon we term ‘vignetting’.” (Abstract)
“Our results demonstrate that stimulus vignetting can wholly determine the orientation selectivity of responses in visual cortex measured at a macroscopic scale, and suggest a reinterpretation of a well-established literature on orientation processing in visual cortex.” (Abstract)
“Our results provide a framework for reinterpreting a wide-range of findings in the visual system.” (Introduction)
Too strong of a claim of novelty. The effect beautifully termed “vignetting” here was discovered by Carlson (2014), and that study deserves the credit for triggering a reevaluation of the literature, which began four years ago. The present study does place vignetting in a broader context, discussing a variety of mechanisms by which aperture edges might influence responses, but the basic idea, including that the key factor is the interaction between the edge and the grating orientation and that the edge need not be hard, are all introduced in Carlson (2014). The present study very elegantly demonstrates the phenomenon with fMRI, but the effect has also previously been studied with fMRI by Wardle et al. (2017), so the fMRI component doesn’t justify this claim, either. Finally, while results compellingly show that vignetting was a strong contributor in Freeman et al. (2011), they don’t show that it is the only contributing factor for orientation decoding. In particular, Wardle et al. (2017) suggests that vignetting in fact is not necessary for orientation decoding.
“We and others, using fMRI, discovered a coarse-scale orientation bias in human V1; each voxel exhibits an orientation preference that depends on the region of space that it represents (Furmanski and Engel, 2000; Sasaki et al., 2006; Mannion et al., 2010; Freeman et al., 2011; Freeman et al., 2013; Larsson et al., 2017). We observed a radial bias in the peripheral representation of V1: voxels that responded to peripheral locations near the vertical meridian tended to respond most strongly to vertical orientations; voxels along the peripheral horizontal meridian responded most strongly to horizontal orientations; likewise for oblique orientations. This phenomenon had gone mostly unnoticed previously. We discovered this striking phenomenon with fMRI because fMRI covers the entire retinotopic map in visual cortex, making it an ideal method for characterizing such coarse-scale representations.” (Introduction)
A bit too much chest thumping. The radial-bias phenomenon was discovered by Sasaki et al. (2006). Moreover, the present study negates the interpretation in Freeman et al. (2011). Freeman et al. (2011) interpreted their results as indicating an over-representation of radial orientations in cortical neurons. According to the present study, the results were in fact an artifact of vignetting and whether neuronal biases played any role is questionable. Freeman et al. used a narrower annulus than other studies (e.g. Kamitani & Tong, 2005), so may have been more susceptible to the vignetting artifact. The authors suggest that a large literature be reinterpreted, but apparently not their own study for which they specifically and compellingly show how vignetting probably affected it.
“A leading conjecture is that the orientation preferences in fMRI measurements arise primarily from random spatial irregularities in the fine-scale columnar architecture (Boynton, 2005; Haynes and Rees, 2005; Kamitani and Tong, 2005). […] On the other hand, we have argued that the coarse-scale orientation bias is the predominant orientation-selective signal measured with fMRI, and that multivariate decoding analysis methods are successful because of it (Freeman et al., 2011; Freeman et al., 2013). This conjecture remains controversial because the notion that fMRI is sensitive to fine-scale neural activity is highly attractive, even though it has been proven difficult to validate empirically (Alink et al., 2013; Pratte et al., 2016; Alink et al., 2017).” (Introduction)
This passage is a bit biased. First, the present results question the interpretation of Freeman et al. (2011). While the authors’ new interpretation (following Carlson, 2014) also suggests a coarse-scale contribution, it fundamentally changes the account. Moreover, the conjecture that coarse-scale effects play a role is not controversial. What is controversial is the claim that only coarse-scale effects contribute to fMRI orientation decoding. This extreme view is controversial not because it is attractive to think that fMRI can exploit fine-grained pattern information, but because the cited studies (Alink et al. 2013, Pratte et al. 2016, Alink et al. 2017, and additional studies, including Shmuel et al. 2010 and Swisher et al. 2010) present evidence in favor of a contribution from fine-grained patterns. The way the three studies are cited would suggest to an uninformed reader that they provide evidence against a contribution from fine-grained patterns. More evenhanded language is in order here.
“the model we use is highly simplified; for example, it does not take into account changes in spatial frequency tuning at greater eccentricities. Yet, despite the multiple sources of noise and the simplified assumptions of the model, the correspondence between the model’s prediction and the empirical measurements are highly statistically significant. From this, we conclude that stimulus vignetting is a primary source of the course[sic] scale bias.”
This argument is not compelling. A terrible model may explain a portion of the explainable variance that is minuscule, yet highly statistically significant. In the absence of inferential comparisons among multiple models and model checking (or a noise ceiling), better to avoid such claims.
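A quick numerical example shows how far significance and explained variance can diverge. The values of r and n below are arbitrary, the calculation refers to total rather than explainable variance, and the p value uses a normal approximation to the t distribution (reasonable only for large n).

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p value for a Pearson r based on n samples.

    Uses a normal approximation to the t distribution, adequate for large n.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r ** 2))
    return math.erfc(abs(t) / math.sqrt(2.0))

r, n = 0.05, 10000
variance_explained = r ** 2
p_value = correlation_p_value(r, n)

print(f"variance explained: {variance_explained:.4f}")
print(f"approximate p value: {p_value:.1e}")
```

A model explaining a quarter of a percent of the variance is highly significant here, which is why a noise ceiling or an inferential comparison among models is needed before calling vignetting a primary source of the bias.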
“One study (Alink et al., 2017) used inner and outer circular annuli, but added additional angular edges, the result of which should be a combination of radial and tangential biases. Indeed, this study reported that voxels had a mixed pattern of selectivity, with a considerable number of voxels reliably preferring tangential gratings, and other voxels reliably favoring radial orientations.” (Discussion)
It’s true that the additional edges between the patches (though subtle) complicate the interpretation of the results of Alink et al. (2017). It would be good to check the strength of the effect by simulation. Happy to share the stimuli if someone wanted to look into this.
Figure 4A, legend: Top and bottom panels mislabeled as showing angular and radial modulator results, respectively.
Deep convolutional neural networks can label images with object categories at superhuman levels of accuracy. Whether they are as robust to noise and distortions as human vision, however, is an open question.
Geirhos, Janssen, Schütt, Rauber, Bethge, and Wichmann (pp2017) compared humans and deep convolutional neural networks in terms of their ability to recognize 16 object categories under different levels of noise and distortion. They report that human vision is substantially more robust to these modifications.
Psychophysical experiments were performed in a controlled lab environment. Human observers fixated a central square at the start of each trial. Each image was presented for 200 ms (3×3 degrees of visual angle), followed by a pink noise mask (1/f spectrum) of 200-ms duration. This type of masking is thought to minimize recurrent computations in the visual system. The authors, thus, stripped human vision of the option to scrutinize the image and focused the comparison on what human vision achieves through the feedforward sweep of processing (although some local recurrent signal flow likely still contributed). Observers then clicked on one of 16 icons to indicate the category of the stimulus.
The figure below shows the levels of additive uniform noise (left) and local distortion (right) that were necessary to reduce the accuracy of each system to about 50% (classifying among 16 categories). Careful analyses across levels of noise and distortion show that the deep nets perform similarly to the human observers at low levels of noise or distortion. Both humans and deep nets approach chance level performance at very high levels of distortion. However, human performance degrades much more gracefully, beating deep nets when the image is compromised to an intermediate degree.
This is careful and important work that helps characterize how current models still fall short. The authors are making their substantial lab-acquired human behavioral data set openly available. This is great, because the data can be analyzed by other researchers in both brain science and computer science.
What the study does not quite deliver is an explanation of why the deep nets fall short. Is it something about the convolutional feedforward architecture that renders the models less robust? Does human vision employ normalization or adaptive filtering operations that enable it to “see through” the noise and distortion, e.g. by focusing on features less affected by the artefacts?
Humans have massive experience with noisy viewing conditions, such as those arising in bad weather. We also have much experience seeing things distorted, through water, or glass that is not perfectly plane. Moreover, peripheral vision may rely on summary-statistical descriptions that may be somewhat robust to the kinds of distortion used in this study.
To assess whether it is visual experience or something about the architecture that causes the networks to be less robust, I suggest that the networks be trained with noisy and/or distorted images. Data augmentation with noise and distortion may help deep nets learn more robust internal representations for vision.
Careful human psychophysical measurements of classification accuracy for 16 categories for a large set of stimuli (40K categorization trials).
Detailed comparisons between human performance and performance of three popular deep net architectures (AlexNet, GoogLeNet, VGG-16).
Substantial behavioral data set shared with the community.
Network architectures not trained with noise and distortion rendering ambiguous whether the deep nets’ lack of robustness is due to architecture or training.
Data are not used to evaluate the three models overall in terms of their ability to capture patterns of confusions.
Human-machine comparisons focus on overall accuracy under noise and distortion, and on category-level confusions, rather than the processing of particular images.
Suggestions for improvements
(1) Train deep nets with noise and distortion. Humans experience noise and distortions as part of their visual world. Would the networks perform better if they were trained with noisy and distorted images? The authors could train the networks (or at least VGG-16) with some image set (nonoverlapping with the images used in the psychophysics) and augment the training set with noisy and distorted variants. This would help clarify to what extent training can improve robustness and to what extent the architecture is the limiting factor.
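The augmentation step might look roughly like the sketch below, which adds pixelwise uniform noise at several widths to a batch of images. The noise widths, image size, and function names are hypothetical, the images are random stand-ins, and the actual training loop and the eidolon-style distortions are omitted.

```python
import numpy as np

def add_uniform_noise(images, width, rng):
    """Add pixelwise noise drawn from U(-width, width), then clip to [0, 1]."""
    noise = rng.uniform(-width, width, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)

def augment_batch(images, widths, rng):
    """Return the clean images followed by one noisy copy per noise width."""
    noisy = [add_uniform_noise(images, w, rng) for w in widths]
    return np.concatenate([images] + noisy, axis=0)

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32))  # stand-in grayscale images in [0, 1]
augmented = augment_batch(images, widths=[0.1, 0.35, 0.6], rng=rng)
print(augmented.shape)
```

Holding out some noise levels or distortion types from training would additionally show whether any gain in robustness generalizes beyond the augmentations seen.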
(2) Evaluate each model’s overall ability to predict human patterns of confusions. The confusion matrix analyses shed some light on the differences between humans and models. However, it would be good to assess which model’s confusions are most similar to the humans’ overall. To this end, one could consider only the offdiagonal elements of the confusion matrix (to render the analysis complementary to the analyses of overall accuracy) and statistically compare the models in terms of their ability to explain patterns of confusions. The offdiagonal entries could be compared by correlation (or 0-fixed correlation).
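The comparison of offdiagonal confusions could be sketched as below. The confusion matrices are simulated (one hypothetical model resembling the human matrix, one unrelated), and the sketch computes only the two similarity measures; the statistical comparison between models, e.g. by bootstrapping images or subjects, is left out.

```python
import numpy as np

def off_diagonal(matrix):
    """Return the offdiagonal entries of a square matrix as a vector."""
    mask = ~np.eye(matrix.shape[0], dtype=bool)
    return matrix[mask]

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def zero_fixed_correlation(x, y):
    """Correlation without mean subtraction (cosine similarity)."""
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

rng = np.random.default_rng(0)
human = rng.random((16, 16))                         # stand-in human confusions
model_a = human + 0.05 * rng.normal(size=(16, 16))   # similar to human
model_b = rng.random((16, 16))                       # unrelated to human

scores = {}
for name, model in {"model_a": model_a, "model_b": model_b}.items():
    h, m = off_diagonal(human), off_diagonal(model)
    scores[name] = (pearson(h, m), zero_fixed_correlation(h, m))
    print(name, f"r = {scores[name][0]:.2f}", f"0-fixed r = {scores[name][1]:.2f}")
```

Excluding the diagonal keeps this analysis complementary to overall accuracy, since correct responses no longer dominate the similarity.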
(1) “adversarial examples have cast some doubt on the idea of broad-ranging manlike DNN behavior. For any given image it is possible to perturb it minimally in a principled way such that DNNs mis-classify it as belonging to an arbitrary other category (Szegedy et al., 2014). This slightly modified image is then called an adversarial example, and the manipulation is imperceptible to human observers (Szegedy et al., 2014).”
This point is made frequently, although it is not compelling. Any learner uses an inductive bias to infer a model from data. In general, combining the prior (inductive bias) and the data will not yield perfect decision boundaries. An omniscient adversary can always place an example in the misrepresented region of the input space. Adversarial examples are therefore a completely expected phenomenon for any learning algorithm, whether biological or artificial. The misrepresented volume may have infinitesimal probability mass under natural conditions. A visual system could therefore perform perfectly in the real world — until confronted with an omniscient adversary that backpropagates through its brain to fool it. No one knows if adversarial examples can also be constructed for human brains. If so, they might similarly require only slight modifications imperceptible to other observers.
The bigger point that neural networks fall short of human vision in terms of their robustness is almost certainly true, of course. To make that point on the basis of adversarial examples, however, would require considering the literature on black-box attacks that do not rely on omniscient knowledge of the system to be fooled or its training set. It would also require applying these much less efficient methods symmetrically to human subjects.
(2) “One might argue that human observers, through experience and evolution, were exposed to some image distortions (e.g. fog or snow) and therefore have an advantage over current DNNs. However, an extensive exposure to eidolon-type distortions seems exceedingly unlikely. And yet, human observers were considerably better at recognising eidolon-distorted objects, largely unaffected by the different perceptual appearance for different eidolon parameter combinations (reach, coherence). This indicates that the representations learned by the human visual system go beyond being trained on certain distortions as they generalise towards previously unseen distortions. We believe that achieving such robust representations that generalise towards novel distortions are the key to achieve robust deep neural network performance, as the number of possible distortions is literally unlimited.”
This is not a very compelling argument because the space of “previously unseen distortions” hasn’t been richly explored here. Moreover, the eidolon distortions are in fact motivated by the idea that they retain information similar to that retained by peripheral vision. They thus discard information that the human visual system is well trained to do without in the periphery.
(3) On the calculation of DNNs’ accuracies for the 16 categories: “Since all investigated DNNs, when shown an image, output classification predictions for all 1,000 ImageNet categories, we disregarded all predictions for categories that were not mapped to any of the 16 entry-level categories. Amongst the remaining categories, the entry-level category corresponding to the ImageNet category with the highest probability (top-1) was selected as the network’s response.”
It would seem to make more sense to add up the probabilities of the ImageNet categories corresponding to each of the 16 entry-level categories and use the resulting 16 totals to pick the predicted basic-level category. Alternatively, one may train a new softmax layer with 16 outputs. Please clarify which method was used and how it relates to the other methods.
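The difference between the two readout rules can be made concrete with a small sketch. This is only an illustration under assumed inputs: the mapping array `class_to_entry`, the toy probabilities, and both function names are hypothetical and not taken from the paper.

```python
import numpy as np

def top1_mapped(probs, class_to_entry):
    """Entry-level category of the single most probable mapped fine-grained class."""
    mapped = class_to_entry >= 0  # -1 marks classes not mapped to any entry-level category
    best = np.argmax(np.where(mapped, probs, -np.inf))
    return int(class_to_entry[best])

def summed_probability(probs, class_to_entry, n_entry):
    """Entry-level category with the largest total probability over its classes."""
    mapped = class_to_entry >= 0
    totals = np.bincount(class_to_entry[mapped], weights=probs[mapped], minlength=n_entry)
    return int(np.argmax(totals))

# Toy case (assumed values): 6 fine-grained classes, 2 entry-level categories, 1 unmapped class.
class_to_entry = np.array([0, 1, 1, 1, -1, 0])
probs = np.array([0.30, 0.20, 0.20, 0.15, 0.10, 0.05])

choice_top1 = top1_mapped(probs, class_to_entry)                   # driven by the single 0.30 class
choice_sum = summed_probability(probs, class_to_entry, n_entry=2)  # category 1 totals 0.55
```

In this toy case the two rules disagree, which is why it matters which one was used: the top-1 rule can favor a category with one confident fine-grained class over a category whose probability is spread across many classes.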
Thanks to Tal Golan for sharing his comments on this paper with me.
Realistic models of the primate visual system have many millions of parameters. A vision model needs substantial capacity to store the required knowledge about what things look like. Brain-activity data are costly, so they typically do not suffice to set the parameters of these models. Recent progress has benefited from direct learning of the required knowledge from category-labeled image sets. Nevertheless, further fitting with brain-activity data is required to learn about the relative prevalence of the different computational features (and of linear combinations of the features) in each cortical area and to accurately predict representations of novel images (not used in setting model parameters).
Each individual brain is unique. A key challenge is to hold on to what we’ve learned by fitting a visual encoding model to one subject exposed to one set of images when we move on to new experiments. Traditionally, we make inferences about the computational mechanisms with a given data set and hold on to those abstract insights, e.g. that model ResNet beats model AlexNet at predicting ventral visual responses. Ideally, we would be able to hold on to more detailed parametric information learned on one data set as we move on to other data sets.
Wen, Shi, Chen & Liu (pp2017) develop a Bayesian approach to learning encoding models (linear combinations of the features of deep neural networks) incrementally across subjects and stimulus sets. The initial model is fitted with a 0-mean prior on the weights (L2 penalty). The resulting encoding model for each fMRI voxel has a Gaussian posterior over the weights for each feature of the deep net model. The Gaussian posterior is assumed to be isotropic, avoiding the need for a separate variance parameter for each feature (let alone a full covariance matrix).
The results are compelling. Using the posteriors inferred from previous subjects as priors for new subjects substantially increases a model’s prediction performance. This is consistent with the observation that models generalize quite well to new subjects, even without subject-specific fitting. Importantly, the transfer of the weight knowledge from one subject to the next works even when using different stimulus sets in different subjects.
This work takes a first step in the direction of the exciting possibility of incremental learning of complex models across hundreds or thousands of subjects and millions of stimuli (acquired in labs around the world).
It is interesting to consider the implementation of the inference procedure. Although Bayesian in motivation, the implementation uses L2 penalties for deviation of the weights w_v from the previous weights estimate w_v0 and from zero. The respective penalty factors α and λ are determined by crossvalidation so as to best predict the new data. This procedure makes a lot of sense. However, it is somewhat in tension with a pure Bayesian approach in two ways: (1) In a pure Bayesian approach, the previous data set should determine the width of the posterior, which becomes the prior for the next data set. Here the width of the prior is adjusted (via α) to optimize prediction performance. (2) In a pure Bayesian approach, the 0-mean prior would be absorbed into the first model’s posterior and would not enter into the inference again with every update of the posterior with new data.
The cost function for predicting the response profile vector r_v (# stimuli by 1) for fMRI voxel v from deep net feature responses F (# stimuli by # features) is:
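A form consistent with the description above would be the following. It is a reconstruction, not a quotation from the paper: the pairing of α with the deviation from the previous estimate w_v0 and of λ with the deviation from zero follows the order in which the two penalties are introduced, and the exact scaling or normalization of the terms is an assumption.

```latex
J(\mathbf{w}_v) = \lVert \mathbf{r}_v - \mathbf{F}\,\mathbf{w}_v \rVert_2^2
  + \alpha \,\lVert \mathbf{w}_v - \mathbf{w}_{v0} \rVert_2^2
  + \lambda \,\lVert \mathbf{w}_v \rVert_2^2
```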
While the crossvalidation procedure makes sense for optimizing prediction accuracy on the present data set, I wonder if it is optimal in the bigger picture of integrating the knowledge across many studies. The present data set will reflect only a small portion of stimulus space and one subject, so should not get to downweight a prior based on much more comprehensive data.
Strengths
Addresses an important challenge and suggests exciting potential for big-data learning of computational models across studies and labs.
Presents a straightforward and well-motivated method for incremental learning of encoding model weights across studies with different subjects and different stimuli.
Results are compelling: Using the prior information helps the performance of an encoding model a lot when the training data for the new subject is limited.
Weaknesses
The posterior over the weights vector is modeled as isotropic. It would be good to allow different degrees of certainty for different features and, better yet, to model the dependencies between the weights of different features. (However, such richer models might be challenging to estimate in practice.)
The prior knowledge transferred from previous studies consists only in the MAP estimate of the weight vector for each voxel.
The method assumes that a precise intersubject spatial-correspondence mapping is given. Such mappings might not exist and are costly to approximate with functional data.
Suggestions for improvement
(1) Explore and/or discuss if a prior with feature-specific variance might be feasible. Explore whether inferring a posterior distribution over weights using a mean weight vector and feature-specific variances brings even better results. I guess this is hard when there are millions of features.
(2) Consider dropping the assumption that a precise correspondence mapping is given and infer a multinormal posterior over local weight vectors. The model assumes that we have a precise intersubject spatial-correspondence mapping (from cortical alignment based either on anatomical or functional data). It seems more versatile and statistically preferable not to rely on a precise (i.e. voxel-to-voxel) correspondence mapping, but to simultaneously address the correspondence and incremental weight-learning problem. We could assume that an imprecise correspondence mapping is given. For corresponding brain locations in the previous and current subject (subjects 1 and 2), subject-1 encoding models within a small spherical region around the target location could be used to define a prior for fitting an encoding model to the target voxel for subject 2. Such a prior should be a probability distribution over weight vectors, which could be characterized by the second moment of the weight vector distribution. Regularization, such as optimal shrinkage to a diagonal target or (when there are too many features) simply the assumption that the second moment is diagonal could be used to make this approach feasible. In either case, the goal would be to pool the posterior distributions across voxels within the small sphere and summarize the resulting distribution (e.g. as a multinormal). I realize that this might be beyond the scope of the current study. It is not a requirement for this paper.
(3) Clarify the terminology used for the estimation procedures. What is referred to as “maximum likelihood estimation” uses an L2 penalty on the weights, amounting to Bayesian inference of the weights with a 0-mean Gaussian prior. This is not a maximum likelihood estimator. Please correct this (or explain in case I am mistaken).
(4) Consider how to ensure that the prior has an appropriate width (and the prior evidence thus appropriate weight). Should a more purely Bayesian approach be taken, where the width of the posterior is explicitly inferred and becomes the width of the prior? Should the crossvalidation setting of the hyperparameters use a very varied test set to prevent the current (possibly narrowly specialized) data set from being given too much weight? Should the amount of data contributing to the prior model and the amount of data in the present set (and optionally the noise level) be used to determine the relative weighting?
Wen, Shi, Chen, and Liu (pp2017) used a deep residual neural network (trained on visual object classification) as an encoding model to explain human cortical fMRI responses to movies. The deep net together with the encoding weights of the cortical voxels was then used to predict human cortical response patterns to 64K object images from 80 categories. This prediction serves, not to validate the model, but to investigate how cortical patterns (as predicted by the model) reflect the categorical hierarchy.
The authors report that the predicted category-average response patterns fall into three clusters corresponding to natural superordinate categories: biological things, nonbiological things, and scenes. They argue that these superordinate categories characterize the large-scale organization of human visual cortex.
For each of the three superordinate categories, the authors then thresholded the average predicted activity pattern and investigated the representational geometry within the supra-threshold volume. They find that biological things elicit patterns (within the subvolume responsive to biological things) that fall into four subclusters: humans, terrestrial animals, aquatic animals, and plants. Patterns in regions activated by scenes clustered into artificial and natural scenes. The patterns in regions activated by non-biological things did not reveal clear subdivisions.
The authors argue that this shows that superordinate categories are represented in global patterns across higher visual cortex, and finer-grained categorical distinctions are represented in finer-grained patterns within regions responding to superordinate categories.
This is an original, technically sophisticated, and inspiring paper. However, the title claim is not compellingly supported by the evidence. The fact that finer-grained distinctions become apparent in pattern correlation matrices after restricting the volume to voxels responsive to a given category is not evidence for an association between brain-spatial scales and conceptual scales. To understand this, consider the fact that the authors’ analyses do not take the spatial positions of the voxels (and thus the spatial structure) into account at all. The voxel coordinates could be randomly permuted and the analyses would give the same results.
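The permutation argument can be checked directly on simulated data. The sketch below uses random patterns with assumed sizes rather than the authors' data: it computes a pattern-correlation matrix, shuffles the voxel order identically for all conditions, and recomputes the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_conditions, n_voxels = 8, 500                        # assumed sizes, for illustration only
patterns = rng.normal(size=(n_conditions, n_voxels))   # simulated response patterns

# Pattern-correlation matrix across conditions (each row is one condition's pattern).
rsm_original = np.corrcoef(patterns)

# Randomly reassign voxel positions, applying the same permutation to every pattern.
shuffled = patterns[:, rng.permutation(n_voxels)]
rsm_shuffled = np.corrcoef(shuffled)

# True if the similarity structure is unaffected by where the voxels are located.
invariant = np.allclose(rsm_original, rsm_shuffled)
```

Any analysis built only on such a matrix (clustering, modularity, correlation with semantic similarity) inherits this invariance and so cannot speak to spatial scale.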
The original global representational dissimilarity (or similarity) matrices likely contain distinctions not only at the superordinate level, but also at finer-grained levels (as previously shown). When pattern correlation is used, these divisions might not be prominent in the matrices because the component shared among all exemplars within a superordinate category dominates. Recomputing the pattern correlation matrix after reducing the patterns to voxels responding strongly to a given superordinate category will render the subdivisions within the superordinate categories more prominent. This results from the mean removal implicit to the pattern correlation, which will decorrelate patterns that share high responses on many of the included voxels. Such a result does not indicate that the subdivisions were not present (e.g. significantly decodable from fMRI or even clustered) in the global patterns.
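A small simulation can illustrate this effect of mean removal. All sizes, amplitudes, and the threshold below are assumptions chosen for clarity, not estimates from the paper. The subcategory structure is built into every voxel with the same strength, so any difference between the global and the restricted analysis is due to the correlation measure alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_responsive, n_per_sub = 2000, 500, 4

# Component shared by all exemplars of the superordinate category:
# a strong response in a subset of voxels and none elsewhere.
shared = np.zeros(n_voxels)
shared[:n_responsive] = 3.0

# Two subcategories, each with its own weaker pattern present in ALL voxels.
sub_patterns = rng.normal(size=(2, n_voxels))
labels = np.repeat([0, 1], n_per_sub)
noise = rng.normal(size=(labels.size, n_voxels))
patterns = shared + 0.5 * sub_patterns[labels] + 0.5 * noise

def subdivision_contrast(p):
    """Mean within-subcategory minus mean between-subcategory pattern correlation."""
    r = np.corrcoef(p)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(labels.size, dtype=bool)
    return r[same & off_diag].mean() - r[~same].mean()

contrast_global = subdivision_contrast(patterns)

# Restrict to voxels whose average response exceeds a threshold (assumed value).
responsive = patterns.mean(axis=0) > 1.5
contrast_restricted = subdivision_contrast(patterns[:, responsive])
```

In this sketch the subdivision contrast is larger in the restricted analysis, although the subcategory information per voxel is identical inside and outside the responsive region.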
A simple way to take spatial structure into account would be to restrict the analysis to a single spatially contiguous cluster at a time, e.g. FFA. This is in fact the approach taken in a large number of previous studies that investigated the representations in category-selective regions (LOC, FFA, PPA, RSC, etc.). Another way would be to spatially filter the patterns and investigate whether finer semantic distinctions are associated with finer spatial scales. This approach has also been used in previous studies, but can be confounded by the presence of an unknown pattern of voxel gains (Freeman et al. 2013; Alink et al. 2017, Scientific Reports).
The approach of creating a deep net model that explains the data and then analyzing the model instead of the data is a very interesting idea, but also raises some questions. Clearly we need deep nets with millions of parameters to understand visual processing. If a deep net explains visual responses throughout the visual system and shares at least some architectural similarities with the visual hierarchy, then it is reasonable to assume that it might capture aspects of the computational mechanism of vision. In a sense, we have “uploaded” aspects of the mechanism of vision into the model, whose workings we can more efficiently study. This is always subject to consideration of alternative models whose architecture might better match what is known about the primate visual system and which might predict visual responses even better. Despite this caveat, I believe that developing deep net models that explain visual responses and studying their computational mechanisms is a promising approach in general.
In the present context, however, the goal is to relate conceptual levels of categories to spatial scales of cortical response patterns, which can be directly measured. Is the deep net really needed to address this? To study how categories map onto cortex, why not just directly study measured response patterns? This is in fact what the existing literature has done for years. The deep net functions as a fancy interpolator that imputes data where we have none (response patterns for 64K images). However, the 80 category-average response patterns could have been directly measured. Would this not be more compelling? It would not require us to believe that the deep net is an accurate model.
Although the authors have gotten off to a fresh start on the intriguing questions of the spatial organization of higher-level visual cortex, the present results do not yet go significantly beyond what is known. The novel and interesting methods introduced in the paper (perhaps the major contribution) raise a number of questions that should be addressed in a revision.
Strengths
Presents several novel and original ideas for the use of deep neural net models to understand the visual cortex.
Uses a 50-layer ResNet model as the encoding model and shows that this model performs better than the simpler AlexNet model.
Tests deep net models trained on movie data for generalization to other movie data and prediction of responses in category-selective-region localizer experiments.
Attempts to address the interesting hypothesis that larger scales of cortical organization serve to represent larger conceptual scales of categorical representation.
The analyses are implemented at a high level of technical sophistication.
Weaknesses
The central claim about spatial structure of cortical representations is not supported by evidence about the spatial structure. In fact, analyses are invariant to the spatial structure of the cortical response patterns.
Unclear what added value is provided by the deep net for addressing the central claim that larger spatial scales in the brain are associated with larger conceptual scales.
Uses a definition of “modularity” from network theory to analyze response pattern similarity structure, which will confuse cognitive scientists and cognitive neuroscientists to whom modularity is a computational and brain-spatial notion. Fails to resolve the ambiguities and confusions pervading the previous literature (“nested hierarchy”, “module”).
Follows the practice in cognitive neuroscience of averaging response patterns elicited by exemplars of each category, although the deep net predicts response patterns for individual images. This creates ambiguity in the interpretation of the results.
The central concepts of modularity and semantic similarity are not properly defined, either conceptually or in terms of the mathematical formulae used to measure them.
The BOLD fMRI measurements are low in resolution with isotropic voxels of 3.5 mm width.
Suggestions for improvement
(1) Analyze to what extent different spatial scales in cortex reflect information about different levels of categorization (or change the focus of the paper)
The ResNet encoding model is interesting from a number of perspectives, so the focus of the paper does not have to be on the association of spatial cortical and conceptual scales. If the paper is to make claims about this difficult, but important question, then analyses should explicitly target the spatial structure of cortical activity patterns.
The current analyses are invariant to where responses are located in cortex and thus fundamentally cannot address to what extent different categorical levels are represented at different spatial scales. While the ROIs (Figure 8a) show prominent spatial clustering, this doesn’t go beyond previous studies and doesn’t amount to showing a quantitative relationship.
The emergence of subdivisions within the regions driven by superordinate-category images could be entirely due to the normalization (mean removal) implicit to the pattern correlation. Similar subdivisions could exist in the complementary set of voxels unresponsive to the superordinate category, and/or in the global patterns.
Note that spatial filtering analyses might be interesting, but are also confounded by gain-field patterns across voxels. Previous studies have struggled to address this issue; see Alink et al. (2017, Scientific Reports) for a way to detect fine-grained pattern information not caused by a fine-grained voxel gain field.
(2) Analyze measured response patterns during movie or static-image presentation directly, or better motivate the use of the deep net for this purpose
The question of how spatial scales in cortex relate to conceptual scales of categories could be addressed directly by measuring activity patterns elicited by different images (or categories) with fMRI. It would be possible, for instance, to measure average response patterns to the 80 categories. In fact, previous studies have explored comparably large sets of images and categories.
Movie fMRI data could also be used to address the question of the spatial structure of visual response patterns (and how it relates to semantics), without the indirection of first training a deep net encoding model. For example, the frames of the movies could be labeled (by a human or a deep net) and measured response patterns could directly be analyzed in terms of their spatial structure.
This approach would circumvent the need to train a deep net model and would not require us to trust that the deep net correctly predicts response patterns to novel images. The authors do show that the deep net can predict patterns for novel images. However, these predictions are not perfect and they combine prior assumptions with measurements of response patterns. Why not drop the assumptions and base hypothesis tests directly on measured response patterns?
In case I am missing something and there is a compelling case for the approach of going through the deep net to address this question, please explain.
(3) Use clearer terminology
Module: The term module refers to a functional unit in cognitive science (Fodor) and to a spatially contiguous cortical region that corresponds to a functional unit in cognitive neuroscience (Kanwisher). In the present paper, the term is used in the sense of network theory. However, it is applied not to a set of cortical sites on the basis of their spatial proximity or connectivity (which would be more consistent with the meaning of module in cognitive neuroscience), but to a set of response patterns on the basis of their similarity. A better term for this is clustering of response patterns in the multivariate response space.
Nested hierarchy: I suspect that by “nested” the authors mean that there are representations within the subregions responding to each of the superordinate categories and that by “hierarchy” they refer to the levels of spatial inclusion. However, the categorical hierarchy also corresponds to clusters and subclusters in response-pattern space, which could similarly be considered a “nested hierarchy”. Finally, the visual system is often characterized as a hierarchy (referring to the sequence of stages of ventral-stream processing). The paper is not sufficiently clear about these distinctions. In addition, terms like “nested hierarchy” have a seductive plausibility that belies their lack of clear definition and the lack of empirical evidence in favor of any particular definition. Either clearly define what does and does not constitute a “nested hierarchy” and provide compelling evidence in favor of it, or drop the concept.
(4) Define indices measuring “modularity” (i.e. response-pattern clustering) and semantic similarity
You cite papers on the Q index of modularity and the LCH semantic similarity index. These indices are central to the interpretation of the results, so the reader should not have to consult the literature to determine how they are mathematically defined.
(5) Clarify results on semantic similarity
The correlation between LCH semantic similarity and cortical pattern correlation is amazing (r=0.93). Of course this has a lot to do with the fact that LCH takes a few discrete values and cortical similarity was first averaged within each LCH value.
What is the correlation between cortical pattern similarity and semantic similarity…
for each of the layers of ResNet before remixing to predict human fMRI responses?
after remixing to predict human fMRI responses for each of a number of ROIs (V1-3, LOC, FFA, PPA)?
for other, e.g. word-co-occurrence-based, semantic similarity measures (e.g. word2vec, latent semantic analysis)?
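The inflating effect of averaging within each discrete similarity value before correlating, noted under this point, can be illustrated with simulated numbers. The number of pairs, the five discrete levels, and the strength of the relationship are all assumptions and bear no relation to the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 3000                                          # hypothetical number of category pairs
semantic = rng.integers(1, 6, size=n_pairs)             # similarity index with 5 discrete values
cortical = 0.2 * semantic + rng.normal(size=n_pairs)    # weak relationship plus noise

# Correlation computed across individual category pairs.
r_pairs = np.corrcoef(semantic, cortical)[0, 1]

# Correlation computed after averaging cortical similarity within each discrete value.
levels = np.unique(semantic)
level_means = np.array([cortical[semantic == v].mean() for v in levels])
r_binned = np.corrcoef(levels, level_means)[0, 1]
```

With these assumed settings the pairwise correlation is modest while the correlation across the five level averages is much higher, so reporting both numbers would help readers judge the strength of the relationship.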
(6) Clarify the methods details
I didn’t understand all the methods details.
How were the layer-wise visual feature sets defined? Was each layer refitted as an encoding model? Or were the weights from the overall encoding model used, but other layers omitted?
I understand that the sub-divisions of the three superordinate categories were defined by k-means clustering and that the Q index (which is not defined in the paper) was used. How was the number k of clusters determined? Was k chosen to maximize the Q index?
How were the category-associated cortical regions defined, i.e. how was the threshold chosen?
(7) Cite additional previous studies
Consider discussing the work of Lorraine Tyler’s lab on semantic representations and Thomas Carlson’s paper on semantic models for explaining similarity structure in visual cortex (Carlson et al. 2013, Journal of Cognitive Neuroscience).