Humans recognize objects with greater robustness to noise and distortions than deep nets

[I7R8]

Deep convolutional neural networks can label images with object categories at superhuman levels of accuracy. Whether they are as robust to noise and distortions as human vision, however, is an open question.

Geirhos, Janssen, Schütt, Rauber, Bethge, and Wichmann (pp2017) compared humans and deep convolutional neural networks in terms of their ability to recognize 16 object categories under different levels of noise and distortion. They report that human vision is substantially more robust to these modifications.

Psychophysical experiments were performed in a controlled lab environment. Human observers fixated a central square at the start of each trial. Each image was presented for 200 ms (3×3 degrees of visual angle), followed by a pink noise mask (1/f spectrum) of 200-ms duration. This type of masking is thought to minimize recurrent computations in the visual system. The authors, thus, stripped human vision of the option to scrutinize the image and focused the comparison on what human vision achieves through the feedforward sweep of processing (although some local recurrent signal flow likely still contributed). Observers then clicked on one of 16 icons to indicate the category of the stimulus.

The figure below shows the levels of additive uniform noise (left) and local distortion (right) that were necessary to reduce the accuracy of each system to about 50% (classifying among 16 categories). Careful analyses across levels of noise and distortion show that the deep nets perform similarly to the human observers at low levels of noise or distortion. Both humans and deep nets approach chance level performance at very high levels of distortion. However, human performance degrades much more gracefully, beating deep nets when the image is compromised to an intermediate degree.

**Figure: At what level of noise and distortion does recognition break down in each system?** Additive noise (left) or Eidolon distortion (right) was ramped up, so as to reduce classification accuracy to 50% for a given system. To cause human performance to drop to 50% accuracy (for classification among 16 categories), substantially higher levels of noise or distortion were required (top row). Modified version of Fig. 4 of the paper.

This is careful and important work that helps characterize how current models still fall short. The authors are making their substantial lab-acquired human behavioral data set openly available. This is great, because the data can be analyzed by other researchers in both brain science and computer science.

What the study does not quite deliver is an explanation of why the deep nets fall short. Is it something about the convolutional feedforward architecture that renders the models less robust? Does human vision employ normalization or adaptive filtering operations that enable it to “see through” the noise and distortion, e.g. by focusing on features less affected by the artefacts?

Humans have massive experience with noisy viewing conditions, such as those arising in bad weather. We also have much experience seeing things distorted, through water, or glass that is not perfectly plane. Moreover, peripheral vision may rely on summary-statistical descriptions that may be somewhat robust to the kinds of distortion used in this study.

To assess whether it is visual experience or something about the architecture that causes the networks to be less robust, I suggest that the networks be trained with noisy and/or distorted images. Data augmentation with noise and distortion may help deep nets learn more robust internal representations for vision.

Strengths

Careful human psychophysical measurements of classification accuracy for 16 categories for a large set of stimuli (40K categorization trials).
Detailed comparisons between human performance and performance of three popular deep net architectures (AlexNet, GoogLeNet, VGG-16).
Substantial behavioral data set shared with the community.

Weaknesses

Network architectures were not trained with noise and distortion, leaving it ambiguous whether the deep nets’ lack of robustness is due to architecture or training.
Data are not used to evaluate the three models overall in terms of their ability to capture patterns of confusions.
Human-machine comparisons focus on overall accuracy under noise and distortion, and on category-level confusions, rather than the processing of particular images.

Suggestions for improvements

(1) Train deep nets with noise and distortion. Humans experience noise and distortions as part of their visual world. Would the networks perform better if they were trained with noisy and distorted images? The authors could train the networks (or at least VGG-16) with some image set (nonoverlapping with the images used in the psychophysics) and augment the training set with noisy and distorted variants. This would help clarify to what extent training can improve robustness and to what extent the architecture is the limiting factor.
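A minimal sketch of the kind of augmentation suggested in (1), assuming images are float arrays scaled to [0, 1]. The function names, the noise widths, and the clipping step are illustrative assumptions, not the authors' stimulus-generation code, and the Eidolon distortion is not covered here.

```python
import numpy as np

def add_uniform_noise(image, width, rng):
    """Add pixelwise uniform noise drawn from [-width, width], then clip to [0, 1]."""
    noise = rng.uniform(-width, width, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def augment_batch(images, widths, rng):
    """Apply a randomly chosen noise width to each image (hypothetical helper)."""
    chosen = rng.choice(widths, size=len(images))
    return np.stack([add_uniform_noise(img, w, rng) for img, w in zip(images, chosen)])

rng = np.random.default_rng(0)
# Stand-in for a batch of grayscale training images (synthetic values).
images = rng.uniform(0.3, 0.7, size=(8, 32, 32))
# Width 0.0 keeps some clean images in the mix; the other widths are arbitrary.
noisy = augment_batch(images, widths=[0.0, 0.1, 0.35], rng=rng)
```

Training images for such an experiment would have to be disjoint from the images used in the psychophysics, as noted above.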

(2) Evaluate each model’s overall ability to predict human patterns of confusions. The confusion matrix analyses shed some light on the differences between humans and models. However, it would be good to assess which model’s confusions are most similar to the humans overall. To this end one could consider the off-diagonal elements of the confusion matrix (to render the analysis complementary to the analyses of overall accuracy) and statistically compare the models in terms of their ability to explain patterns of confusions. The off-diagonal entries alone could be compared by correlation (or 0-fixed correlation).
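The comparison suggested in (2) could be sketched as follows. The confusion matrices are made-up 3-category examples, and "0-fixed correlation" is read here as correlation without mean subtraction (cosine similarity), which is an assumption about the intended definition. Statistical comparison between models (e.g. by bootstrapping images or subjects) is not shown.

```python
import numpy as np

def offdiagonal(confusions):
    """Return the off-diagonal entries of a square confusion matrix as a vector."""
    confusions = np.asarray(confusions, dtype=float)
    mask = ~np.eye(confusions.shape[0], dtype=bool)
    return confusions[mask]

def pearson(x, y):
    """Pearson correlation between two vectors."""
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

def zero_fixed_correlation(x, y):
    """Like Pearson correlation, but without subtracting the means."""
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Hypothetical confusion counts (rows: presented category, columns: response).
human   = [[50, 3, 1], [4, 45, 6], [2, 5, 48]]
model_a = [[40, 6, 2], [8, 35, 12], [4, 10, 40]]
model_b = [[40, 1, 9], [2, 45, 1], [8, 2, 40]]

h, a, b = offdiagonal(human), offdiagonal(model_a), offdiagonal(model_b)
scores = {
    "model_a": (pearson(h, a), zero_fixed_correlation(h, a)),
    "model_b": (pearson(h, b), zero_fixed_correlation(h, b)),
}
```

In this toy example, model_a makes the same kinds of errors as the human observers (only more of them), so it scores higher on both measures than model_b.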

Minor comments

(1) “adversarial examples have cast some doubt on the idea of broad-ranging manlike DNN behavior. For any given image it is possible to perturb it minimally in a principled way such that DNNs mis-classify it as belonging to an arbitrary other category (Szegedy et al., 2014). This slightly modified image is then called an adversarial example, and the manipulation is imperceptible to human observers (Szegedy et al., 2014).”

This point is made frequently, although it is not compelling. Any learner uses an inductive bias to infer a model from data. In general, combining the prior (inductive bias) and the data will not yield perfect decision boundaries. An omniscient adversary can always place an example in the misrepresented region of the input space. Adversarial examples are therefore a completely expected phenomenon for any learning algorithm, whether biological or artificial. The misrepresented volume may have infinitesimal probability mass under natural conditions. A visual system could therefore perform perfectly in the real world — until confronted with an omniscient adversary that backpropagates through its brain to fool it. No one knows if adversarial examples can also be constructed for human brains. If so, they might similarly require only slight modifications imperceptible to other observers.

The bigger point that neural networks fall short of human vision in terms of their robustness is almost certainly true, of course. To make that point on the basis of adversarial examples, however, would require considering the literature on black-box attacks that do not rely on omniscient knowledge of the system to be fooled or its training set. It would also require applying these much less efficient methods symmetrically to human subjects.

(2) “One might argue that human observers, through experience and evolution, were exposed to some image distortions (e.g. fog or snow) and therefore have an advantage over current DNNs. However, an extensive exposure to eidolon-type distortions seems exceedingly unlikely. And yet, human observers were considerably better at recognising eidolon-distorted objects, largely unaffected by the different perceptual appearance for different eidolon parameter combinations (reach, coherence). This indicates that the representations learned by the human visual system go beyond being trained on certain distortions as they generalise towards previously unseen distortions. We believe that achieving such robust representations that generalise towards novel distortions are the key to achieve robust deep neural network performance, as the number of possible distortions is literally unlimited.”

This is not a very compelling argument because the space of “previously unseen distortions” hasn’t been richly explored here. Moreover, the Eidolon-distortions are in fact motivated by the idea that they retain information similar to that retained by peripheral vision. They, thus, discard information that the human visual system is well trained to do without in the periphery.

(3) On the calculation of DNNs’ accuracies for the 16 categories: “Since all investigated DNNs, when shown an image, output classification predictions for all 1,000 ImageNet categories, we disregarded all predictions for categories that were not mapped to any of the 16 entry-level categories. Amongst the remaining categories, the entry-level category corresponding to the ImageNet category with the highest probability (top-1) was selected as the network’s response.”

It would seem to make more sense to add up the probabilities of the ImageNet categories corresponding to each of the 16 entry-level categories and use the resulting 16 totals to pick the predicted basic-level category. Alternatively, one may train a new softmax layer with 16 outputs. Please clarify which method was used and how it relates to the other methods.
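The difference between the quoted decision rule and the summed-probability alternative can be illustrated with a toy example. The six fine-grained classes, their probabilities, and the mapping array `category_of` are hypothetical; an entry of -1 marks a class not mapped to any entry-level category.

```python
import numpy as np

def top1_among_mapped(probs, category_of):
    """Entry-level category of the single most probable mapped fine-grained class."""
    mapped = category_of >= 0
    best = np.flatnonzero(mapped)[np.argmax(probs[mapped])]
    return int(category_of[best])

def summed_probability(probs, category_of, n_categories):
    """Entry-level category with the largest total probability over its fine-grained classes."""
    mapped = category_of >= 0
    totals = np.bincount(category_of[mapped], weights=probs[mapped],
                         minlength=n_categories)
    return int(np.argmax(totals)), totals

# Hypothetical softmax output over six fine-grained classes.
probs = np.array([0.30, 0.15, 0.15, 0.15, 0.20, 0.05])
# Classes 0 and 5 map to entry-level category 0, classes 1-3 to category 1, class 4 is unmapped.
category_of = np.array([0, 1, 1, 1, -1, 0])

choice_top1 = top1_among_mapped(probs, category_of)
choice_sum, totals = summed_probability(probs, category_of, n_categories=2)
```

Here the top-1 rule picks category 0 (its best single class has probability 0.30), whereas summing picks category 1 (total 0.45 versus 0.35). The two rules can disagree whenever an entry-level category spreads its probability over many fine-grained classes.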

–Nikolaus Kriegeskorte

Thanks to Tal Golan for sharing his comments on this paper with me.

Incremental Bayesian learning of visual encoding models across subjects exposed to different stimuli

[I7R8]

Realistic models of the primate visual system have many millions of parameters. A vision model needs substantial capacity to store the required knowledge about what things look like. Brain activity data are costly, and so typically do not suffice to set the parameters of these models. Recent progress has benefited from direct learning of the required knowledge from category-labeled image sets. Nevertheless, further fitting with brain-activity data is required to learn about the relative prevalence of the different computational features (and of linear combinations of the features) in each cortical area and to accurately predict representations of novel images (not used in setting model parameters).

Each individual brain is unique. A key challenge is to hold on to what we’ve learned by fitting a visual encoding model to one subject exposed to one set of images when we move on to new experiments. Traditionally, we make inferences about the computational mechanisms with a given data set and hold on to those abstract insights, e.g. that model ResNet beats model AlexNet at predicting ventral visual responses. Ideally, we would be able to hold on to more detailed parametric information learned on one data set as we move on to other data sets.

Wen, Shi, Chen & Liu (pp2017) develop a Bayesian approach to learning encoding models (linear combinations of the features of deep neural networks) incrementally across subjects and stimulus sets. The initial model is fitted with a 0-mean prior on the weights (L2 penalty). The resulting encoding model for each fMRI voxel has a Gaussian posterior over the weights for each feature of the deep net model. The Gaussian posterior is assumed to be isotropic, avoiding the need for a separate variance parameter for each feature (let alone a full covariance matrix).

The results are compelling. Using the posteriors inferred from previous subjects as priors for new subjects substantially increases a model’s prediction performance. This is consistent with the observation that models generalize quite well to new subjects, even without subject-specific fitting. Importantly, the transfer of the weight knowledge from one subject to the next works even when using different stimulus sets in different subjects.

This work takes a first step in the direction of the exciting possibility of incremental learning of complex models across hundreds or thousands of subjects and millions of stimuli (acquired in labs around the world).

It is interesting to consider the implementation of the inference procedure. Although Bayesian in motivation, the implementation uses L2 penalties for deviation of the weights w_v from the previous weights estimate w_v⁰ and from zero. The respective penalty factors α and λ are determined by crossvalidation so as to best predict the new data. This procedure makes a lot of sense. However, it is somewhat in tension with a pure Bayesian approach in two ways: (1) In a pure Bayesian approach, the previous data set should determine the width of the posterior, which becomes the prior for the next data set. Here the width of the prior is adjusted (via α) to optimize prediction performance. (2) In a pure Bayesian approach, the 0-mean prior would be absorbed into the first model’s posterior and would not enter into the inference again with every update of the posterior with new data.

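The cost function for predicting the response profile vector r_v (# stimuli by 1) for fMRI voxel v from deep net feature responses F (# stimuli by # features) is, going by the description above and up to constant scaling factors:

J(w_v) = ‖r_v − F w_v‖² + α ‖w_v − w_v⁰‖² + λ ‖w_v‖²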

While the crossvalidation procedure makes sense for optimizing prediction accuracy on the present data set, I wonder if it is optimal in the bigger picture of integrating the knowledge across many studies. The present data set will reflect only a small portion of stimulus space and one subject, so should not get to downweight a prior based on much more comprehensive data.
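To make the role of the two penalties concrete, here is a minimal sketch assuming the cost is a plain sum of squared prediction errors plus α times the squared deviation of the weights from the previous estimate plus λ times the squared norm of the weights. Any scaling constants in the paper's version are ignored, and the data are synthetic. Under that assumption the minimizer has a closed form.

```python
import numpy as np

def fit_with_prior(F, r, w_prior, alpha, lam):
    """Minimize ||r - F w||^2 + alpha ||w - w_prior||^2 + lam ||w||^2 in closed form.

    Setting the gradient to zero gives (F'F + (alpha + lam) I) w = F'r + alpha w_prior.
    """
    n_features = F.shape[1]
    A = F.T @ F + (alpha + lam) * np.eye(n_features)
    b = F.T @ r + alpha * w_prior
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
F = rng.normal(size=(50, 10))                         # feature responses (stimuli x features)
w_true = rng.normal(size=10)
r = F @ w_true + rng.normal(scale=0.5, size=50)       # one voxel's response profile
w_prior = w_true + rng.normal(scale=0.1, size=10)     # estimate carried over from earlier data

alpha, lam = 5.0, 1.0                                 # arbitrary values; set by crossvalidation in the paper
w = fit_with_prior(F, r, w_prior, alpha, lam)

# As alpha grows, the estimate is pulled toward the previous weights.
w_strong_prior = fit_with_prior(F, r, w_prior, alpha=1e9, lam=0.0)
```

The sketch shows why α acts as the inverse width of the prior: choosing it by crossvalidation on the new data lets the new data set decide how much the earlier evidence counts.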

Strengths

Addresses an important challenge and suggests exciting potential for big-data learning of computational models across studies and labs.
Presents a straightforward and well-motivated method for incremental learning of encoding model weights across studies with different subjects and different stimuli.
Results are compelling: Using the prior information helps the performance of an encoding model a lot when the training data for the new subject is limited.

Weaknesses

The posterior over the weights vector is modeled as isotropic. It would be good to allow different degrees of certainty for different features and, better yet, to model the dependencies between the weights of different features. (However, such richer models might be challenging to estimate in practice.)
The prior knowledge transferred from previous studies consists only in the MAP estimate of the weight vector for each voxel.
The method assumes that a precise intersubject spatial-correspondence mapping is given. Such mappings might not exist and are costly to approximate with functional data.

Suggestions for improvement

(1) Explore and/or discuss if a prior with feature-specific variance might be feasible. Explore whether inferring a posterior distribution over weights using a mean weight vector and feature-specific variances brings even better results. I guess this is hard when there are millions of features.

(2) Consider dropping the assumption that a precise correspondence mapping is given and infer a multinormal posterior over local weight vectors. The model assumes that we have a precise intersubject spatial-correspondence mapping (from cortical alignment based either on anatomical or functional data). It seems more versatile and statistically preferable not to rely on a precise (i.e. voxel-to-voxel) correspondence mapping, but to simultaneously address the correspondence and incremental weight-learning problem. We could assume that an imprecise correspondence mapping is given. For corresponding brain locations in the previous and current subject (subjects 1 and 2), subject-1 encoding models within a small spherical region around the target location could be used to define a prior for fitting an encoding model to the target voxel for subject 2. Such a prior should be a probability distribution over weight vectors, which could be characterized by the second moment of the weight vector distribution. Regularization, such as optimal shrinkage to a diagonal target or (when there are too many features) simply the assumption that the second moment is diagonal, could be used to make this approach feasible. In either case, the goal would be to pool the posterior distributions across voxels within the small sphere and summarize the resulting distribution (e.g. as a multinormal). I realize that this might be beyond the scope of the current study. It is not a requirement for this paper.
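A rough sketch of the idea in (2), under simplifying assumptions: only point estimates of the subject-1 weights are pooled (not full posteriors), the prior is summarized by a mean and a covariance, and the shrinkage intensity is a fixed number rather than an optimally estimated one. All names and data are hypothetical.

```python
import numpy as np

def pooled_prior(W, shrinkage):
    """Summarize subject-1 weight vectors within a sphere as a multinormal prior.

    W has shape (n_voxels_in_sphere, n_features). The sample covariance is
    shrunk toward its own diagonal; shrinkage=1 gives a purely diagonal prior.
    """
    mean = W.mean(axis=0)
    cov = np.cov(W, rowvar=False)
    target = np.diag(np.diag(cov))
    return mean, (1.0 - shrinkage) * cov + shrinkage * target

def map_weights(F, r, mean, cov, noise_var):
    """MAP weights for a Gaussian prior N(mean, cov) and i.i.d. Gaussian noise."""
    precision = np.linalg.inv(cov)
    A = F.T @ F / noise_var + precision
    b = F.T @ r / noise_var + precision @ mean
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
W = rng.normal(size=(30, 5)) + 1.0          # subject-1 weights for 30 voxels in the sphere
mean, cov = pooled_prior(W, shrinkage=0.5)

F = rng.normal(size=(40, 5))                # subject-2 feature responses (stimuli x features)
r = F @ mean + rng.normal(scale=0.5, size=40)
w = map_weights(F, r, mean, cov, noise_var=0.25)
```

With many features, the explicit matrix inverse would be impractical, which is where the diagonal assumption mentioned above becomes attractive.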

(3) Clarify the terminology used for the estimation procedures. What is referred to as “maximum likelihood estimation” uses an L2 penalty on the weights, amounting to Bayesian inference of the weights with a 0-mean Gaussian prior. This is not a maximum likelihood estimator. Please correct this (or explain in case I am mistaken).

(4) Consider how to ensure that the prior has an appropriate width (and the prior evidence thus appropriate weight). Should a more purely Bayesian approach be taken, where the width of the posterior is explicitly inferred and becomes the width of the prior? Should the crossvalidation setting of the hyperparameters use a very varied test set to prevent the current (possibly narrowly specialized) data set from being given too much weight? Should the amount of data contributing to the prior model and the amount of data in the present set (and optionally the noise level) be used to determine the relative weighting?

Do coarser spatial patterns represent coarser categories in visual cortex?

[I7R5]

Wen, Shi, Chen, and Liu (pp2017) used a deep residual neural network (trained on visual object classification) as an encoding model to explain human cortical fMRI responses to movies. The deep net together with the encoding weights of the cortical voxels was then used to predict human cortical response patterns to 64K object images from 80 categories. This prediction serves, not to validate the model, but to investigate how cortical patterns (as predicted by the model) reflect the categorical hierarchy.

The authors report that the predicted category-average response patterns fall into three clusters corresponding to natural superordinate categories: biological things, nonbiological things, and scenes. They argue that these superordinate categories characterize the large-scale organization of human visual cortex.

For each of the three superordinate categories, the authors then thresholded the average predicted activity pattern and investigated the representational geometry within the supra-threshold volume. They find that biological things elicit patterns (within the subvolume responsive to biological things) that fall into four subclusters: humans, terrestrial animals, aquatic animals, and plants. Patterns in regions activated by scenes clustered into artificial and natural scenes. The patterns in regions activated by non-biological things did not reveal clear subdivisions.

The authors argue that this shows that superordinate categories are represented in global patterns across higher visual cortex, and finer-grained categorical distinctions are represented in finer-grained patterns within regions responding to superordinate categories.

This is an original, technically sophisticated, and inspiring paper. However, the title claim is not compellingly supported by the evidence. The fact that finer grained distinctions become apparent in pattern correlation matrices after restricting the volume to voxels responsive to a given category is not evidence for an association between brain-spatial scales and conceptual scales. To understand this, consider the fact that the authors’ analyses do not take the spatial positions of the voxels (and thus the spatial structure) into account at all. The voxel coordinates could be randomly permuted and the analyses would give the same results.
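The invariance argument can be checked numerically. The sketch below uses synthetic patterns (random numbers standing in for predicted category-average responses) and shows that the pattern correlation matrix is unchanged when the voxel order is shuffled identically for all patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 80 category-average patterns across 5000 voxels.
patterns = rng.normal(size=(80, 5000))

# Scramble voxel positions, applying the same permutation to every pattern.
perm = rng.permutation(patterns.shape[1])
shuffled = patterns[:, perm]

corr_original = np.corrcoef(patterns)
corr_shuffled = np.corrcoef(shuffled)

# The two matrices agree up to floating-point rounding.
same = np.allclose(corr_original, corr_shuffled)
```

Any analysis that depends on the patterns only through this matrix (clustering, modularity) therefore carries no information about where in cortex the responses occur.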

The original global representational dissimilarity (or similarity) matrices likely contain distinctions not only at the superordinate level, but also at finer-grained levels (as previously shown). When pattern correlation is used, these divisions might not be prominent in the matrices because the component shared among all exemplars within a superordinate category dominates. Recomputing the pattern correlation matrix after reducing the patterns to voxels responding strongly to a given superordinate category will render the subdivisions within the superordinate categories more prominent. This results from the mean removal implicit to the pattern correlation, which will decorrelate patterns that share high responses on many of the included voxels. Such a result does not indicate that the subdivisions were not present (e.g. significantly decodable from fMRI or even clustered) in the global patterns.
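A toy simulation of this argument, with all numbers chosen arbitrarily for illustration: patterns for two subcategories of one superordinate category share a strong component on the voxels that respond to that category. The subcategory distinction is present in the global patterns, but it only becomes prominent in the correlation matrix once the analysis is restricted to the responsive voxels.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vox = 200
responsive = np.arange(n_vox) < 100        # voxels driven by the superordinate category

shared = np.where(responsive, 2.0, 0.0)    # component common to all exemplars
sub1 = rng.normal(0.0, 0.5, n_vox)         # pattern specific to subcategory 1
sub2 = rng.normal(0.0, 0.5, n_vox)         # pattern specific to subcategory 2

def exemplar(sub):
    """One noisy response pattern: shared component + subcategory pattern + noise."""
    return shared + sub + rng.normal(0.0, 0.2, n_vox)

patterns = np.stack([exemplar(sub1), exemplar(sub1), exemplar(sub2), exemplar(sub2)])

def within_minus_between(p):
    """Mean within-subcategory correlation minus mean between-subcategory correlation."""
    c = np.corrcoef(p)
    within = (c[0, 1] + c[2, 3]) / 2
    between = (c[0, 2] + c[0, 3] + c[1, 2] + c[1, 3]) / 4
    return within - between

gap_global = within_minus_between(patterns)
gap_restricted = within_minus_between(patterns[:, responsive])
```

In the global analysis all four patterns correlate highly because of the shared component, so the gap is small but positive. Within the responsive voxels the shared component is nearly constant and is removed by the implicit mean subtraction, so the gap becomes large, without any change in the underlying information.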

A simple way to take spatial structure into account would be to restrict the analysis to a single spatially contiguous cluster at a time, e.g. FFA. This is in fact the approach taken in a large number of previous studies that investigated the representations in category-selective regions (LOC, FFA, PPA, RSC, etc.). Another way would be to spatially filter the patterns and investigate whether finer semantic distinctions are associated with finer spatial scales. This approach has also been used in previous studies, but can be confounded by the presence of an unknown pattern of voxel gains (Freeman et al. 2013; Alink et al. 2017, Scientific Reports).

The approach of creating a deep net model that explains the data and then analyzing the model instead of the data is a very interesting idea, but also raises some questions. Clearly we need deep nets with millions of parameters to understand visual processing. If a deep net explains visual responses throughout the visual system and shares at least some architectural similarities with the visual hierarchy, then it is reasonable to assume that it might capture aspects of the computational mechanism of vision. In a sense, we have “uploaded” aspects of the mechanism of vision into the model, whose workings we can more efficiently study. This is always subject to consideration of alternative models whose architecture might better match what is known about the primate visual system and which might predict visual responses even better. Despite this caveat, I believe that developing deep net models that explain visual responses and studying their computational mechanisms is a promising approach in general.

In the present context, however, the goal is to relate conceptual levels of categories to spatial scales of cortical response patterns, which can be directly measured. Is the deep net really needed to address this? To study how categories map onto cortex, why not just directly study measured response patterns? This in fact is what the existing literature has done for years. The deep net functions as a fancy interpolator that imputes data where we have none (response patterns for 64K images). However, the 80 category-average response patterns could have been directly measured. Would this not be more compelling? It would not require us to believe that the deep net is an accurate model.

Although the authors have gotten off to a fresh start on the intriguing questions of the spatial organization of higher-level visual cortex, the present results do not yet go significantly beyond what is known and the novel and interesting methods introduced in the paper (perhaps the major contribution) raise a number of questions that should be addressed in a revision.

Figure: ResNet provides a better basis for human-fMRI voxel encoding models than AlexNet.

Strengths

Presents several novel and original ideas for the use of deep neural net models to understand the visual cortex.
Uses 50-layer ResNet model as encoding model and shows that this model performs better than the simpler AlexNet model.
Tests deep net models trained on movie data for generalization to other movie data and prediction of responses in category-selective-region localizer experiments.
Attempts to address the interesting hypothesis that larger scales of cortical organization serve to represent larger conceptual scales of categorical representation.
The analyses are implemented at a high level of technical sophistication.

Weaknesses

The central claim about spatial structure of cortical representations is not supported by evidence about the spatial structure. In fact, analyses are invariant to the spatial structure of the cortical response patterns.
Unclear what added value is provided by the deep net for addressing the central claim that larger spatial scales in the brain are associated with larger conceptual scales.
Uses a definition of “modularity” from network theory to analyze response pattern similarity structure, which will confuse cognitive scientists and cognitive neuroscientists to whom modularity is a computational and brain-spatial notion. Fails to resolve the ambiguities and confusions pervading the previous literature (“nested hierarchy”, “module”).
Follows the practice in cognitive neuroscience of averaging response patterns elicited by exemplars of each category, although the deep net predicts response patterns for individual images. This creates ambiguity in the interpretation of the results.
The central concepts modularity and semantic similarity are not properly defined, either conceptually or in terms of the mathematical formulae used to measure them.
The BOLD fMRI measurements are low in resolution with isotropic voxels of 3.5 mm width.

Suggestions for improvements

(1) Analyze to what extent different spatial scales in cortex reflect information about different levels of categorization (or change the focus of the paper)

The ResNet encoding model is interesting from a number of perspectives, so the focus of the paper does not have to be on the association of spatial cortical and conceptual scales. If the paper is to make claims about this difficult, but important question, then analyses should explicitly target the spatial structure of cortical activity patterns.

The current analyses are invariant to where responses are located in cortex and thus fundamentally cannot address to what extent different categorical levels are represented at different spatial scales. While the ROIs (Figure 8a) show prominent spatial clustering, this doesn’t go beyond previous studies and doesn’t amount to showing a quantitative relationship.

The emergence of subdivisions within the regions driven by superordinate-category images could be entirely due to the normalization (mean removal) implicit to the pattern correlation. Similar subdivisions could exist in the complementary set of voxels unresponsive to the superordinate category, and/or in the global patterns.

Note that spatial filtering analyses might be interesting, but are also confounded by gain-field patterns across voxels. Previous studies have struggled to address this issue; see Alink et al. (2017, Scientific Reports) for a way to detect fine-grained pattern information not caused by a fine-grained voxel gain field.

(2) Analyze measured response patterns during movie or static-image presentation directly, or better motivate the use of the deep net for this purpose

The question of how spatial scales in cortex relate to conceptual scales of categories could be addressed directly by measuring activity patterns elicited by different images (or categories) with fMRI. It would be possible, for instance, to measure average response patterns to the 80 categories. In fact previous studies have explored comparably large sets of images and categories.

Movie fMRI data could also be used to address the question of the spatial structure of visual response patterns (and how it relates to semantics), without the indirection of first training a deep net encoding model. For example, the frames of the movies could be labeled (by a human or a deep net) and measured response patterns could directly be analyzed in terms of their spatial structure.

This approach would circumvent the need to train a deep net model and would not require us to trust that the deep net correctly predicts response patterns to novel images. The authors do show that the deep net can predict patterns for novel images. However, these predictions are not perfect and they combine prior assumptions with measurements of response patterns. Why not drop the assumptions and base hypothesis tests directly on measured response patterns?

In case I am missing something and there is a compelling case for the approach of going through the deep net to address this question, please explain.

(3) Use clearer terminology

Module: The term module refers to a functional unit in cognitive science (Fodor) and to a spatially contiguous cortical region that corresponds to a functional unit in cognitive neuroscience (Kanwisher). In the present paper, the term is used in the sense of network theory. However, it is applied not to a set of cortical sites on the basis of their spatial proximity or connectivity (which would be more consistent with the meaning of module in cognitive neuroscience), but to a set of response patterns on the basis of their similarity. A better term for this is clustering of response patterns in the multivariate response space.

Nested hierarchy: I suspect that by “nested” the authors mean that there are representations within the subregions responding to each of the superordinate categories and that by “hierarchy” they refer to the levels of spatial inclusion. However, the categorical hierarchy also corresponds to clusters and subclusters in response-pattern space, which could similarly be considered a “nested hierarchy”. Finally, the visual system is often characterized as a hierarchy (referring to the sequence of stages of ventral-stream processing). The paper is not sufficiently clear about these distinctions. In addition, terms like “nested hierarchy” have a seductive plausibility that belies their lack of clear definition and the lack of empirical evidence in favor of any particular definition. Either clearly define what does and does not constitute a “nested hierarchy” and provide compelling evidence in favor of it, or drop the concept.

(4) Define indices measuring “modularity” (i.e. response-pattern clustering) and semantic similarity

You cite papers on the Q index of modularity and the LCH semantic similarity index. These indices are central to the interpretation of the results, so the reader should not have to consult the literature to determine how they are mathematically defined.
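For orientation, the textbook forms of the two indices can be written down in a few lines. This is a sketch of the standard definitions (Newman's modularity for a nonnegative weighted graph, and the Leacock-Chodorow similarity as the negative log of the path length divided by twice the taxonomy depth). Whether the paper uses exactly these forms, for instance a variant of Q that handles negative correlations or a particular convention for counting path length, is precisely what the authors should state.

```python
import numpy as np

def modularity_q(A, labels):
    """Newman's modularity Q for a symmetric nonnegative weight matrix A.

    Q = (1 / 2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j),
    where k_i is the summed weight of node i and 2m is the sum of all entries of A.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    two_m = A.sum()
    k = A.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def lch_similarity(path_length, taxonomy_depth):
    """Leacock-Chodorow similarity: -log(path_length / (2 * taxonomy_depth))."""
    return float(-np.log(path_length / (2.0 * taxonomy_depth)))

# Toy graph: two disconnected pairs of nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
q_two_clusters = modularity_q(A, [0, 0, 1, 1])   # 0.5 for this graph
q_one_cluster = modularity_q(A, [0, 0, 0, 0])    # 0.0: no structure beyond chance
```

Note that LCH takes only a small number of distinct values, one per possible path length in the taxonomy, which is relevant to the next point.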

(5) Clarify results on semantic similarity

The correlation between LCH semantic similarity and cortical pattern correlation is amazing (r=0.93). Of course this has a lot to do with the fact that LCH takes a few discrete values and cortical similarity was first averaged within each LCH value.
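The effect of averaging within each discrete similarity value can be illustrated with synthetic data. All numbers below are arbitrary: a weak linear relationship between a five-valued "semantic similarity" and a noisy "cortical similarity" yields a modest correlation across category pairs, but a correlation near 1 once the cortical similarities are averaged within each value.

```python
import numpy as np

rng = np.random.default_rng(2)
levels = np.array([0.5, 1.0, 1.5, 2.0, 2.5])      # hypothetical discrete similarity values
semantic = rng.choice(levels, size=3000)          # one value per category pair
cortical = 0.2 * semantic + rng.normal(0.0, 0.5, size=3000)

# Correlation across individual category pairs.
r_pairs = np.corrcoef(semantic, cortical)[0, 1]

# Correlation after averaging cortical similarity within each semantic value.
bin_means = np.array([cortical[semantic == v].mean() for v in levels])
r_binned = np.corrcoef(levels, bin_means)[0, 1]
```

Averaging removes almost all of the pair-level noise, so the binned correlation says little about how much variance in cortical similarity the semantic index explains. Reporting the pair-level correlation alongside it would address this.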

What is the correlation between cortical pattern similarity and semantic similarity…

for each of the layers of ResNet before remixing to predict human fMRI responses?
after remixing to predict human fMRI responses for each of a number of ROIs (V1-3, LOC, FFA, PPA)?
for other, e.g. word-co-occurrence-based, semantic similarity measures (e.g. word2vec, latent semantic analysis)?

(6) Clarify the methods details

I didn’t understand all the methods details.

How were the layer-wise visual feature sets defined? Was each layer refitted as an encoding model? Or were the weights from the overall encoding model used, but other layers omitted?
I understand that the sub-divisions of the three superordinate categories were defined by k-means clustering and that the Q index (which is not defined in the paper) was used. How was the number k of clusters determined? Was k chosen to maximize the Q index?
How were the category-associated cortical regions defined, i.e. how was the threshold chosen?

(7) Cite additional previous studies

Consider discussing the work of Lorraine Tyler’s lab on semantic representations and Thomas Carlson’s paper on semantic models for explaining similarity structure in visual cortex (Carlson et al. 2013, Journal of Cognitive Neuroscience).

A brief overview of classification models in vision science

[I5R6]

Majaj & Pelli (pp2017) give a brief overview of classification models in vision science, leading from linear discriminants and the perceptron to deep neural networks. They discuss some of the perks and perils of using machine learning, and deep learning in particular, in the study of biological vision.

This is a brief and light-footed review that will be of interest to vision scientists wondering whether and why to engage machine learning and deep learning in their own work. I enjoyed some of the thoughtful notes on the history of classification models and the sketch of the progression toward modern deep learning.

The present draft lists some common arguments for and against deep learning models, but falls short of presenting a coherent perspective on why deep learning is important for vision science, or not; or which aspects are substantial and which are hype. It also doesn’t really explain deep learning or how it relates to the computational challenge of vision.

The overall conclusion is that machine learning and deep learning are useful modern tools for the vision scientist. In particular, the authors argue that deep neural networks provide a “benchmark” to compare human performance to, replacing the optimal linear filter and signal detection theory as the normative benchmark for vision. This misses what I would argue is the bigger point: deep neural networks provide an entry point for modeling brain information processing and engaging the real problem of vision, rather than a toy version of the problem that lacks all of vision’s essential challenges.

Suggestions for improvements

(1) Clearly distinguish deep learning within machine learning

The abstract doesn’t mention deep learning at all. As I was reading the introduction, I was wondering if deep learning had been added to the title of a paper about machine learning in vision science at the very end. Deep learning is defined as “the latest version of machine learning”. This is incorrect. Rather than a software product that is updated in a sequence of versions, machine learning is a field that explores a wide variety of models and inference algorithms in parallel. The fact that deep learning (which refers to learning of deep neural network models) is getting a lot of attention at the moment does not mean that other approaches, notably Bayesian nonparametric models, have lost appeal. How is deep learning different? Does it matter more for vision than other approaches? If so, why?

(2) Explain why depth matters

The multiple stages of nonlinear transformation that define deep learning models are essential for many real-world applications, including vision. I think this point should be central as it explains why vision science needs deep models.

(3) Clearly distinguish the use of machine learning models to (a) analyze data and to (b) model brain information processing

The current draft largely fails to distinguish two ways of using machine learning in vision science: to analyze data (e.g. decode neuronal population codes) and to model brain information processing. Both are important, but the latter more fundamentally advances the field.

(4) Relate classification to machine learning more broadly and to vision

The present draft presents a brief history of classification models. Classification is a small (though perhaps arguably key?) problem within both machine learning and vision. Why is this particular problem the focus of such a large literature and of this review? How does it relate to other problems in machine learning and in vision?

(5) Separate the substance from the hype and present a coherent perspective

Arguments for and against deep learning are listed without evaluation or a coherent perspective. For example, is it true that deep learning models have “too many parameters”? Should we strive to model vision with a handful of parameters? Or do models need to be complex because vision requires complex domain knowledge? Do tests of generalization performance address the issue of overfitting? (No, no, yes, yes.) Note that the modern version of statistical modeling, which is touted as more rigorous, is Bayesian nonparametrics – defined by no limits on the parametric complexity of a model.

(6) Consider addressing my particular comments below.

Particular comments

“Many perception scientists try to understand recognition by living organisms. To them, machine learning offers a reference of attainable performance based on learned stimuli.”

It’s not really a normative reference. There is an infinity of neural network models and performance of a particular one can never be claimed to be “ideal”. Deep learning is worse in this respect than the optimal linear filter (which provides a normative reference for a task – with the caveat that the task is not vision).

“Deep learning is the latest version of machine learning, distinguished by having more than three layers.”

It’s not the “latest version”, rather it’s an old variant of machine learning that is currently very successful and popular. Also, a better definition of deep is that there is more than one hidden layer intervening between input and output layers.

“It is ubiquitous in the internet.”

How is this relevant?

“Machine learning shifts the emphasis from how the cells encode to what they encode, i.e. from how they encode the stimulus to what that code tells us about the stimulus. Mapping a receptive field is the foundation of neuroscience (beginning with Weber’s 1834/1996 mapping of tactile “sensory circles”), but many young scientists are impatient with the limitations of single-cell recording: looking for minutes or hours at how one cell responds to each of perhaps a hundred different stimuli. New neuroscientists are the first generation for whom it is patently clear that characterization of a single neuron’s receptive field, which was invaluable in the retina and V1, fails to characterize how higher visual areas encode the stimulus. Statistical learning techniques reveal “how neuronal responses can best be used (combined) to inform perceptual decision-making” (Graf, Kohn, Jazayeri, & Movshon, 2010).”

This is an important passage. It’s true that single neurons in inferior temporal cortex, for example, might be (a) difficult to characterize singly with tuning functions, (b) idiosyncratic to a particular animal, and (c) so many in number and variety that characterizing them one by one seems hopeless. It therefore appears more productive to focus on understanding the population code. However, it is not only what is encoded in the population, but also how it is encoded. The format determines what inferences are easy given the code. For example, we can ask what information could be gleaned by a single downstream neuron computing a linear or radial-basis-function readout of the code.

“For psychophysics, Signal Detection Theory (SDT) proved that the optimal classifier for a signal in noise is a template matcher (Peterson, Birdsall, & Fox, 1954; Tanner & Birdsall, 1958).”

Detecting chihuahuas in complex scenes can be considered an example of detecting “signal in noise”, and it is an example of a visual task. A template matcher is certainly not optimal for this problem (in fact it will fail severely at this problem). It would help here to define signal and noise.

The problem of detecting a fixed pattern in Gaussian noise needs to be explained first in any course of vision, so as to inoculate students against the misconstrual of the problem of vision it represents. On a more conciliatory note, one could argue that although detecting a fixed pattern in noise is a misleading oversimplification of vision, it captures a component of the problem. The optimal solution to this problem, template matching, captures a component of the solution to vision. Deep feedforward neural networks could be described as hierarchical template matchers, and they do seem to capture some aspects of vision.
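
To make the fixed-pattern case concrete, here is a minimal simulation of template matching for a known pattern in white Gaussian noise. The pattern, image size, noise level, and trial count are arbitrary assumptions. The comparison value is the standard ideal-observer accuracy for equal priors, Φ(d′/2) with d′ = ‖template‖/σ.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

n_pixels, n_trials, sigma = 64, 20000, 1.0

# Hypothetical fixed pattern, scaled so that d' = ||template|| / sigma = 2.
template = rng.normal(size=n_pixels)
template *= 2.0 / np.linalg.norm(template)

# Half the trials contain the pattern; all trials contain white Gaussian noise.
present = rng.random(n_trials) < 0.5
images = rng.normal(0.0, sigma, size=(n_trials, n_pixels))
images[present] += template

# Template matching: correlate each image with the template and compare with
# the criterion midway between the two expected responses.
response = images @ template
criterion = 0.5 * template @ template
accuracy = np.mean((response > criterion) == present)

# Expected accuracy of the ideal observer for equal priors: Phi(d'/2).
d_prime = np.linalg.norm(template) / sigma
ideal = 0.5 * (1.0 + math.erf(d_prime / 2.0 / math.sqrt(2.0)))

print(f"simulated accuracy = {accuracy:.3f}, ideal = {ideal:.3f}")
```

The same matcher would fail as soon as the pattern varies in position, scale, or appearance from trial to trial, which is the sense in which this task misrepresents vision.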

“SDT has been a very useful reference in interpreting human psychophysical performance (e.g. Geisler, 1989; Pelli et al., 2006). However, it provides no account of learning. Machine learning shows promise of guiding today’s investigations of human learning and may reveal the constraints imposed by the training set on learning.”

In addition to offering learning algorithms that might relate to how brains learn, machine learning enables us to use realistically complex models at all.

“It can be hard to tell whether behavioral performance is limited by the set of stimuli, or the neural representation, or the mismatch between the neural decision process and the stimulus and task. Implications for classification performance are not readily apparent from direct inspection of families of stimuli and their neural responses.”

Intriguing, but cryptic. Please clarify.

“Some biologists complain that neural nets do not match what we know about neurons (Crick, 1989; Rubinov, 2015).”

It is unclear how the ideal “match” should even be defined. All models abstract, and that is their purpose. Stating a feature of biology that is absent in the model does not amount to a valid criticism. But there is a more detailed case to be made for incorporating more biologically realistic dynamic components, so please elaborate.

“In particular, it is not clear, given what we know about neurons and neural plasticity, whether a backpropagation network can be implemented using biologically plausible circuits (but see Mazzoni et al., 1991, and Bengio et al., 2015).”

Neural net models can be good models of perception without being good models of learning. There has also been a recent resurgence in work exploring how backpropagation, or a closely related form of credit assignment, might be implemented in brains. Please discuss the work along these lines by Senn, Richards, Bogacz, and Bengio.

“Some biological modelers complain that neural nets have alarmingly many parameters. Deep neural networks continue to be opaque”

Why are many parameters “alarming” from the more traditional perspective on modeling? Do you feel that the alarm is justified? My view is that the history of AI has shown that intelligence requires rich domain knowledge. Simple models therefore will not be able to explain brain information processing. Machine learning has taught us how to learn complex models and avoid their pitfalls (overfitting).

“Some statisticians worry that rigorous statistical tools are being displaced by machine learning, which lacks rigor (Friedman, 1998; Matloff, 2014, but see Breiman, 2001; Efron & Hastie, 2016).”

The classical simple models can’t cut it, so their rigor doesn’t help us. Machine learning has boldly engaged complex models as are required for AI and brain science. To be able to do this, it initially took a pragmatic computational, rather than a formal probabilistic approach. However, machine learning and statistics have since grown together in many ways, providing a very general perspective on probabilistic inference that combines complexity and rigor.

“It didn’” (p. 9) Fragment.

“Unproven convexity. A problem is convex if there are no local minima other than the global minimum.”

I think this is not true. Here’s my current understanding: If a problem is convex, then any local minimum is a global minimum. Convexity is convenient for optimization, but the cost functions of neural networks are provably not convex. However, the reverse implication does not hold: if every local minimum is a global minimum, the function is not necessarily convex. There is a category of cost functions that are not convex, but whose every local minimum is a global minimum. Neural networks appear to fall in this category (at least under certain conditions that tend to hold in practice).

Note that there can be multiple global minima. In fact, the error function of a neural network over the weight domain typically has many symmetries, with any given set of weights having many computationally equivalent twins (i.e. the model computes the same overall function for different parameter settings). The high dimensionality, however, is not a curse, but a blessing for gradient descent: In a very high-dimensional weight space, it is unlikely that we find ourselves trapped, with the error surface rising in all directions. There are too many directions to escape in. Several papers have argued that local minima are not an issue for deep learning. In particular, it has been argued that every local minimum is a global minimum and that every other critical point is a saddle point, and that saddle points are the real challenge. Moreover, deep nets with sufficient parameters can fit the training data perfectly (interpolating), while generalizing well (which, surprisingly, some people find surprising). There is also evidence that stochastic gradient descent finds flat minima corresponding to robust solutions.
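
A toy example may help here. The cost f(x, y) = (xy − 1)² is my own illustrative choice (a two-weight “linear network” fitting the target 1), not an example from the paper under review. It is not convex, its global minima form the curve xy = 1, and its only other critical point, the origin, is a saddle. The sketch below checks non-convexity at one midpoint and runs plain gradient descent from random starting points. The step size and iteration count are arbitrary.

```python
import numpy as np

def loss(x, y):
    # Hypothetical toy cost: a two-weight "linear network" fitting the target 1.
    return (x * y - 1.0) ** 2

def grad(x, y):
    r = 2.0 * (x * y - 1.0)
    return r * y, r * x

# Non-convexity: the midpoint of two global minima has a higher cost.
midpoint_violation = loss(0.0, 0.0) > 0.5 * (loss(1.0, 1.0) + loss(-1.0, -1.0))

# Plain gradient descent from many random starting points.
rng = np.random.default_rng(0)
x = rng.uniform(-1.5, 1.5, size=200)
y = rng.uniform(-1.5, 1.5, size=200)
for _ in range(5000):
    gx, gy = grad(x, y)
    x, y = x - 0.01 * gx, y - 0.01 * gy

final_losses = loss(x, y)
print(midpoint_violation, final_losses.max())
```

Different starting points end at different weights on the curve xy = 1, all of which compute the same function. This also illustrates the point about computationally equivalent twins.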

ScreenShot2204 — Example of a non-convex error function whose every local minimum is a global minimum (Dauphin et al. pp2014).

“This [convexity] guarantees that gradient-descent will converge to the global minimum. As far as we know, classifiers that give inconsistent results are not useful.”

That doesn’t follow. A complex learner, such as an animal or neural net model, with idiosyncratic and stochastic initialization and experience may converge to an idiosyncratic solution that is still “useful” – for example, classifying with high accuracy and a small proportion of idiosyncratic errors.

“Conservation of a solution across seeds and algorithms is evidence for convexity.”

No, but it may be evidence for a minimum with a large basin of attraction. Would need to define what counts as conservation of a solution: (1) identical weights, (2) computationally equivalent weights (same input-output mapping). Definition 2 seems more helpful and relevant.

““Adversarial” examples have been presented as a major flaw in deep neural networks. These slightly doctored images of objects are misclassified by a trained network, even though the doctoring has little effect on human observers. The same doctored images are similarly misclassified by several different networks trained with the same stimuli (Szegedy, et al., 2013). Humans too have adversarial examples. Illusions are robust classification errors. […] The existence of adversarial examples is intrinsic to classifiers trained with finite data, whether biological or not.”

I agree. We will know whether humans, too, are susceptible to the type of adversarial example described in the cited paper, as soon as we manage to backpropagate through the human visual system so as to construct comparable adversarial examples for humans.

“SDT solved detection and classification mathematically, as maximum likelihood. It was the classification math of the sixties. Machine learning is the classification math of today. Both enable deeper insight into how biological systems classify. In the old days we used to compare human and ideal classification performance. Today, we can also compare human and machine learning.”

“…the performance of current machine learning algorithms is a useful benchmark”

SDT is classification math for linear models, ML is classification math for more complex models. These models enable us to tackle the real problem of vision. Rather than comparing human performance to a normative ideal of performance on a toy task, we can use deep neural networks to model the brain information processing underlying visual recognition. We can evaluate the models by comparing their internal representations to brain representations and their behavior to human behavior, including not only the ways they shine, but also the ways they stumble and fail.

Recurrent neural net model trained on 20 classical primate decision and working memory tasks predicts compositional neural architecture

[I8R8]

Yang, Song, Newsome, and Wang (pp2017) trained a rate-coded recurrent neural network with 256 hidden units to perform a variety of classical cognitive tasks. The tasks combine a number of component processes including evidence accumulation over time, multisensory integration, working memory, categorization, decision making, and flexible mapping from stimuli to responses. The tasks include:

speeded response indicating the direction of the stimulus (stimulus-response mapping)
speeded response indicating the opposite of the direction of the stimulus (flexible stimulus-response mapping)
response indicating the direction of a stimulus after a delay during which the stimulus is not visible (working memory)
decision indicating which of two noisy stimulus inputs is stronger (evidence accumulation)
decision indicating which of two ranges of the stimulus variable the stimulus falls in (categorization)

The 20 distinct tasks result from combining in various ways the requirements of accumulating stimulus evidence from two sensory modalities, maintaining stimulus evidence in working memory during a delay, deciding which category the stimulus fell in, and flexible mapping to responses.

The tasks reduce cognition to its bare bones and the model abstracts from the real-world challenges of perception (pattern recognition) and motor control, so as to focus on the flexible linkage between perception and action that we call cognition. The input to the model includes a “fixation” signal, sensory stimuli varying along a single circular dimension, and a rule input that specifies a task index.

The fixation signal is given through a special unit, whose activity corresponds to the presence of a fixation dot on the screen in front of a primate subject. The fixation signal accompanies the perceptual and maintenance phases of the task, and its disappearance indicates that the primate or model should respond. The sensory stimulus (“direction of stimulus from fixation”) is encoded in a set of direction-tuned units representing the circular dimension. Each of two sensory modalities is represented by such a set of units. The task rule is entered in one-hot format through a set of task units that receive the task index throughout performance of a task (no need to store the current task in working memory). The motor output is a “saccade direction” encoded, similarly to the stimulus, by a set of direction-tuned units.

Such tasks have long been used in nonhuman primate cell recording and human imaging studies, and also in rodent studies, in order to investigate how basic building blocks of cognition are implemented in the brain. This paper provides an important missing link between primate cognitive neurophysiology and rate-coded neural networks, which are known to scale to real-world artificial intelligence challenges.

Unsurprisingly, the authors find that the network learns to perform all 20 tasks after interleaved training on all of them. They then perform a number of well-motivated analyses to dissect the trained network and understand how it implements its cognitive feats.

An important question is whether particular units serve task-specific or task-general functions. One extreme hypothesis is that each task is implemented in a separate set of units. The opposite hypothesis is that all tasks employ all units. In order to address the degree of task-generality of the units, the authors measure the extent to which each unit conveys relevant information in each task. This is measured by the variance of a unit’s activity across different conditions within a task (termed the task variance). The authors find that the network learns to share some of the dynamic machinery it learns among different tasks.

ScreenShot2201 — Figure 4 from the paper shows the extent to which two tasks are subserved by disjoint or overlapping sets of units. Each panel shows a comparison between two tasks (decision making about modality 1, DM1; delayed decision making about modality 1, Dly DM 1; Context-dependent decision making about modality 1, Ctx DM 1; delayed match to category, DMC; delayed non-match to category, DNMC). The histograms show how the 256 units are distributed in terms of their “fractional task variance” (FTV), which measures the degree to which a unit conveys information in task 1 (FTV = -1), in task 2 (FTV = 1) or in both equally (FTV = 0).
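
As a rough sketch of how these two measures could be computed, assuming that task variance is the variance across conditions averaged over time points, and that fractional task variance is the normalized difference with the sign convention described in the caption. The paper’s exact definitions may differ, and the activity arrays below are made up.

```python
import numpy as np

def task_variance(activity):
    """Variance across conditions, averaged over time, for each unit.

    activity: array of shape (n_conditions, n_timepoints, n_units).
    """
    return activity.var(axis=0).mean(axis=0)

def fractional_task_variance(tv_task1, tv_task2, eps=1e-12):
    # Sign convention as in the caption: -1 = task 1 only, +1 = task 2 only.
    return (tv_task2 - tv_task1) / (tv_task1 + tv_task2 + eps)

# Hypothetical activity of 3 units in 4 conditions over 10 time points.
cond = np.arange(4.0)[:, None, None] * np.ones((4, 10, 1))
zero = np.zeros_like(cond)
task1 = np.concatenate([cond, zero, cond], axis=2)  # units 0 and 2 are modulated
task2 = np.concatenate([zero, cond, cond], axis=2)  # units 1 and 2 are modulated

ftv = fractional_task_variance(task_variance(task1), task_variance(task2))
print(ftv)
```

Applied to all 256 units for a pair of tasks, a histogram of `ftv` would correspond to one panel of the figure.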

The authors find evidence for a compositional implementation of the tasks in the trained network. Compositionality here means that the tasks employ overlapping sets of functional components of the network. Rather than learning a separate dynamical system for each task, the network appears to learn dynamic components serving different functions that can be flexibly combined to enable performance of a wide range of tasks.

The authors’ argument in favor of a compositional architecture is based on two observations: (1) Pairs of tasks that share cognitive component functions tend to involve overlapping sets of units. (2) Task-rule inputs, though trained in one-hot format, can be linearly combined (e.g. Delay Anti = Anti + Delay Go – Go) and the network given such a task specification (which it has never been trained on) will perform the implied task with high accuracy.

ScreenShot2202 — Figure 6 from the paper supports the argument that the network learns a compositional architecture. During training, the task rule index is given in the form of a one-hot vector (a). The trained network can be given a linear combination of the trained task rules (c), such that adding and subtracting component functions (e.g. anti-mapping of stimuli to responses, working memory maintenance over delay, speeded reaction) according to the weights specifies a different task (Delay Anti = Anti + Delay Go – Go). The network then performs the compositionally specified task with high performance, although the task rule input corresponding to that task was 0.
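
The rule arithmetic can be written out explicitly. The sketch below uses a hypothetical four-task subset in place of the model’s 20 rule units. The task ordering and the helper name `one_hot` are mine.

```python
import numpy as np

# Hypothetical subset of the task set; the actual model has 20 rule units.
tasks = ["Go", "Anti", "Delay Go", "Delay Anti"]

def one_hot(task):
    vec = np.zeros(len(tasks))
    vec[tasks.index(task)] = 1.0
    return vec

# Composite rule input: Delay Anti = Anti + Delay Go - Go.
composite = one_hot("Anti") + one_hot("Delay Go") - one_hot("Go")

# The rule unit for Delay Anti itself stays at 0.
delay_anti_input = composite[tasks.index("Delay Anti")]
print(composite, delay_anti_input)
```

The composite vector lies outside the set of one-hot inputs seen during training, which is what makes the network’s performance on it informative about compositionality.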

These analyses are interesting because they help us understand how the network works and because they can also be applied to primate cell recordings and help us compare models to brains.

When the network is sequentially trained on one task at a time, the learning of new tasks interferes with previously acquired tasks, reducing performance. However, a continual learning technique that selectively protects certain learned connections enabled sequential acquisition of multiple tasks.

Overall, this is a highly original paper presenting a simple, yet well-motivated model and several useful analysis methods for understanding biological and artificial neural networks. The model extends the authors’ previous work on the neural implementation of some of these components of cognition. Importantly, the paper helps strengthen the link between rate-coded neural network models and primate (and rodent) cognitive neuroscience.

Strengths

The model is simple and well-designed and helps us imagine how basic components of cognition might be implemented in a recurrent neural network. It is essential that we build task-performing models to complement our fallible intuitions as to the signatures of cognitive processes we should expect in neuronal recordings.
The paper links primate cognitive neurophysiology to rate-coded neural networks trained with stochastic gradient descent. This might help boost future interactions between neurophysiologists and engineers.
The measures and analyses introduced to dissect the network are well-motivated, straightforward, and imaginative. Several of them can be equally applied to models and neuronal recordings.
The paper is well-written, clear, and tells an interesting story.
The figures are of high quality.

Weaknesses

The tasks are so simple that they do not pose substantial computational challenges. This is a strength because it makes it easier to understand neuronal responses in primate brains and unit responses in models. We have to start from the simplest instances of cognition. However, it is also a weakness. Consider the comparison to understanding the visual system. One approach is to reduce vision to discriminating two predefined images. The optimal algorithm for this task is a linear filter applied to the image. The intuitive reduction of vision to this scenario supports the template-matching model. However, this task and its optimal solution fundamentally misconstrue the challenge of visual recognition in the real world, which has to deal with complex accidental variation within each category to be recognized. The dominant current vision model is provided by deep neural networks, which perform multiple stages of nonlinear transformation and learn rich knowledge about the world. Simple cognitive tasks provide a starting point, but – like the two-image discrimination task in vision – abstract away many essential features of cognition. In vision, models are tested in terms of their performance on never seen images – a generalization challenge at the heart of what vision is all about. In cognition as well, we ultimately have to engage complex tasks and test models in terms of their ability to generalize to new instances drawn randomly from a very complex space. The paper leaves me wondering how we can best take small steps from the simple tasks dominating the literature toward real-world cognitive challenges.
The paper does not compare a variety of models. Can we learn about the mechanism the brain employs without comparing alternative models? Rate-coded recurrent neural networks are universal approximators of dynamical systems. This property is independent of particular choices defining the units. It is entirely unsurprising that such a model, trained with stochastic gradient descent, can learn these tasks (and the supertask of performing all 20 of them). Given the simplicity of the tasks, it is also not surprising that 256 recurrent units suffice. In fact, the authors report that the results are robust between 128 and 512 recurrent units. The value of this project consists in the way it extends our imagination and generates hypotheses (to be tested with neuronal recordings) about the distributions of task-specific and task-general units. The simplicity of the model and its gradient descent training provides a compelling starting point. However, there are infinitely many ways a recurrent neural network might implement performance at these tasks. It will be important to contrast alternative task-performing models and adjudicate between them with brain and behavioral data.
The paper does not include analyses of biological recordings or behavioral data, which could help us understand the degree to which the model resembles or differs from the primate brain in the way it implements task performance.

Addressing all of these weaknesses could be considered beyond the scope of the current paper. But the authors should consider if they can go toward addressing some of them.

Suggested improvements

(1) It might be useful to explicitly model the 20 tasks in terms of cognitive component functions (multisensory integration, evidence accumulation, working memory, inversion of stimulus-response mapping, etc.). The resulting matrix could be added to Table 1 or shown separately. This compositional cognitive description of the tasks could be used to explain the patterns of unit involvement in different tasks (e.g. as measured by task variance) using a linear model. The compositional model could then be inferentially compared to a non-compositional model in which each task has a single cognitive component function. This more hypothesis-driven approach might help to address the question of compositionality inferentially.

(2) The depiction of the neural network model in Figure 1 could give a better sense of the network complexity and architecture. Instead of the three-unit icon in the middle, how about a directed graph with 256 dots, one for each recurrent unit, and separate circular arrangements of input and output units (how many were there?). Instead of the network-unit icon with the cartoon of the nonlinear activation, why not show the actual softplus function?

(3) It would be good to see the full 256² connectivity matrix (ordered by clusters) and the network as a graph with nodes arranged by proximity in the connectivity matrix and edges colored to indicate the weights.

(4) The paper states that “the network can maintain information throughout a delay period of up to five seconds.” What does time in seconds mean in the context of the model? Is time meaningful because the units have time constants similar to biological neurons? It would be good to add supplementary text and perhaps a figure that explains how the pace of processing is matched to biological neural networks. If the pace is not compellingly matched, on the other hand, then perhaps real time units (e.g. seconds) should not be used when describing the model results.

(5) Please clarify whether the hidden units are fully recurrently connected. It would also be good to extend the paper to report how the density of recurrent connectivity affects task performance, learning, clustering and compositionality.

(6) The initial description of task variance is not entirely clear. State explicitly that one task variance estimate is computed for each task, reflecting the response variance across conditions within that task, and thus providing a measure of the stimulus-information conveyed during the task.

(7) Clustering is useful here as an exploratory and descriptive technique for dissecting the network, carving the model at its joints. However, clustering methods like k-means always output clusters, even when the data are drawn from a unimodal continuous distribution. The title claim of “clusters” thus should ideally be substantiated (by inferential comparison to a continuous model) or dropped.

(8) The clustering will depend on the multivariate signature used to characterize each unit. Instead of task variance patterns, a unit’s connectivity (incoming and outgoing) could be used as a signature and basis of clustering. How do results compare for this method? My guess is that using the task variance pattern across tasks tends to place units in the same cluster if they contribute to the same task, although they might represent different stimulus information in the task. If this is the motivation, it would be good to explain it more explicitly.

(9) It is an interesting question whether units in the same cluster serve the same function. (It seems unlikely in the present analyses, but would be more plausible if clustering were based on incoming and outgoing weights.) The hypothesis that units in a cluster serve the same function could be made precise by saying that the units in a cluster share the same patterns of incoming and outgoing connections, except for weight noise resulting from the experiential and internal noise during training. Under this hypothesis incoming weights are exchangeable among units within the same cluster. The same holds for outgoing weights. The hypothesis could, thus, be tested by shuffling the incoming and the outgoing weights within each cluster and observing performance. I would expect performance to drop after shuffling and would interpret this as a reminder that the cluster-level summary is problematic. Alternatively, to the extent that clusters do summarize the network well, one might try to compress the network down to one unit per cluster, by combining incoming and outgoing weights (with appropriate scaling), or by training a cluster-level network to approximate the dynamics of the original network.

(10) The method of t-SNE is powerful, but its results strongly depend on the parameter settings, creating an issue of researcher degrees of freedom. Moreover, the objective function is difficult to state precisely in a single sentence (if you disagree, please try). Multidimensional scaling by contrast uses a range of objective functions that are easy to define in a single sentence. I wonder why t-SNE should be preferred in this particular context.

(11) Another way to address compositionality would be to assess whether a new task can be more rapidly acquired if its components have been trained as part of other tasks previously.

(12) In Fig. 3 c and e, label the horizontal axis (cluster).

(13) It is great that the TensorFlow implementation will be shared. It would be good if the model data could also be shared in formats useful to people using Python as well as MATLAB. This could be a great resource for students and researchers. Please state more completely in the main paper exactly what (Python code? Task and model code? Model data?) will be available where (GitHub?).

(14) After sequential training, performance at multisensory delayed decision making does not appear to suffer compared to interleaved training. Was this because multisensory delayed decision making was always the last task (thus not overwritten) or is it more robust because it shares more components with other tasks?

(15) A better word for “linear summation” is “sum”.
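
The shuffling test proposed in point (9) could be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single feedforward hidden layer (the network under review is recurrent), and the names `shuffle_within_clusters`, `forward`, `shuffle_cost`, and the toy cluster labels are hypothetical. Incoming and outgoing weights are permuted independently among the units of each cluster, and the resulting change in network output is measured.

```python
import numpy as np

def shuffle_within_clusters(w_in, w_out, labels, rng):
    """Independently permute incoming and outgoing weights among the units of each cluster.

    w_in: (n_inputs, n_hidden); w_out: (n_hidden, n_outputs); labels: (n_hidden,).
    """
    w_in_s, w_out_s = w_in.copy(), w_out.copy()
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        w_in_s[:, idx] = w_in[:, rng.permutation(idx)]
        w_out_s[idx, :] = w_out[rng.permutation(idx), :]
    return w_in_s, w_out_s

def forward(x, w_in, w_out):
    """Toy network (assumption): one tanh hidden layer with a linear readout."""
    return np.tanh(x @ w_in) @ w_out

def shuffle_cost(w_in, w_out, labels, x, rng):
    """Mean squared change in the output caused by the within-cluster shuffle."""
    w_in_s, w_out_s = shuffle_within_clusters(w_in, w_out, labels, rng)
    return np.mean((forward(x, w_in, w_out) - forward(x, w_in_s, w_out_s)) ** 2)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 5)   # 3 hypothetical clusters of 5 units each
x = rng.normal(size=(200, 4))

# Network A: units in a cluster share their weights up to small noise.
proto_in, proto_out = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
w_in_a = proto_in[:, labels] + 0.01 * rng.normal(size=(4, 15))
w_out_a = proto_out[labels, :] + 0.01 * rng.normal(size=(15, 2))

# Network B: same cluster labels, but unrelated weights within each cluster.
w_in_b, w_out_b = rng.normal(size=(4, 15)), rng.normal(size=(15, 2))

cost_a = shuffle_cost(w_in_a, w_out_a, labels, x, rng)
cost_b = shuffle_cost(w_in_b, w_out_b, labels, x, rng)
print(f"output change, exchangeable units: {cost_a:.4f}; non-exchangeable units: {cost_b:.4f}")
```

If the units of a cluster really are exchangeable, the shuffle should leave the output almost unchanged (network A); a large change (network B) would indicate that the cluster-level summary hides functionally relevant differences between units.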

Deep convolutional networks explain substantial variance in fMRI responses during movie viewing

[I7R8]

Wen, Shi, Zhang, Lu & Liu (pp2016) used a deep feedforward convolutional neural network (CNN) as an encoding model for fMRI data acquired while human subjects viewed movies. Previous studies (Yamins et al. 2014; Khaligh-Razavi & Kriegeskorte 2014; Güçlü & van Gerven 2015; Eickenberg et al. 2016) found that deep convolutional networks provide good models of the representation of natural images in higher ventral-stream areas. Wen et al.’s deep-net prediction of fMRI responses during movie viewing extends these findings to dynamic viewing conditions (see also Eickenberg et al. 2016 for initial deep net analyses of movie fMRI data).

The deep net was similar to AlexNet in architecture and had been trained to classify static images. The model processed each frame separately through its purely feedforward hierarchy and had no mechanism for recognising visual motion and more complex visual dynamics. Nevertheless, it did quite well at explaining visual fMRI responses during movie viewing. This is perhaps expected based on previous studies of fMRI responses during movie viewing (Hasson et al. 2001), which showed that category-selective regions respond as expected during movie viewing, when their preferred categories are present in the scene.

The authors conduct a series of original and well-motivated analyses, including decoding and deconvolution, to explore the degree to which the deep network captures the dynamic representations at multiple levels. Overall, this is a technically excellent study providing further evidence that deep feedforward neural networks constitute good models of ventral-stream responses, with layers roughly corresponding to stages of processing along the ventral stream.

Results are largely consistent with expectations from previous studies, but the study is a good contribution because it replicates and generalises earlier findings. We are wondering whether more novel insights could be gained by empirically demonstrating the limitations of the model (feedforward, no motion or dynamic perception). This could be achieved by showing that the model does not explain all the explainable variance in the data and where in the brain the deep convolutional feedforward model falls short. Eventually, the field will need to compare multiple deep neural network models (including feedforward models that take multiple frames as input, and recurrent models that can dynamically compress the recent stimulus history as biological visual systems probably do).

Figure: Visualising the image features that drive fMRI voxel responses in the context of particular images. Using the deconvolutional visualisation technique of Zeiler & Fergus on voxel encoding models (deep convolutional network + ridge regression) enables the authors to visualise to what extent adjusting pixels of a particular image (photos on the left) would change the encoding model’s prediction about voxel activity. Results shown here are for three voxels at different stages of the hierarchy. Voxel 4 has a localised central small receptive field. Voxels 5 and 6 have larger receptive fields.

Main findings

Voxel-to-units matching characterises cortical responses

The paper demonstrates the functional alignment between the visual cortex and the CNN using a simple and effective method: each voxel’s time course during movie viewing is correlated with the output of each CNN unit when exposed to the frames of the movie (after convolution with a kernel to account for the delay and smoothness of the hemodynamic response). The highest-correlating units are then considered as matches. Note that this simple matching of each voxel to a set of units is easier to interpret than fitting a linear encoding model, which would have continuous weights spread over a large number of units. A disadvantage of simple voxel-to-unit matching might be that it doesn’t account for the averaging in fMRI voxels.
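
The matching procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the gamma-shaped kernel in `hrf_kernel`, the repetition time of 2 s, the assumption that unit outputs are already sampled once per volume, and the simulated data are all hypothetical.

```python
import numpy as np

def hrf_kernel(tr=2.0, duration=20.0, peak=4.0, shape=3.0):
    """Hypothetical gamma-shaped response kernel, sampled every `tr` seconds.

    t**shape * exp(-t / scale) peaks at t = shape * scale, so scale = peak / shape.
    """
    t = np.arange(0.0, duration, tr)
    h = t ** shape * np.exp(-t * shape / peak)
    return h / h.sum()

def match_voxels_to_units(unit_out, voxel_ts, kernel):
    """Return, for each voxel, the index of the best-correlating unit and that correlation.

    unit_out: (time, units), assumed to be sampled once per fMRI volume.
    voxel_ts: (time, voxels).
    """
    n_t = unit_out.shape[0]
    # Convolve each unit's time course with the kernel and trim to the scan length.
    pred = np.column_stack([np.convolve(u, kernel)[:n_t] for u in unit_out.T])
    zp = (pred - pred.mean(0)) / pred.std(0)
    zv = (voxel_ts - voxel_ts.mean(0)) / voxel_ts.std(0)
    r = zp.T @ zv / n_t                      # Pearson r, shape (units, voxels)
    best = r.argmax(axis=0)
    return best, r[best, np.arange(r.shape[1])]

# Simulated example: voxel 0 is driven by unit 3, voxel 1 by unit 7.
rng = np.random.default_rng(0)
kernel = hrf_kernel()
unit_out = rng.random(size=(240, 10))
voxel_ts = np.column_stack([
    np.convolve(unit_out[:, 3], kernel)[:240],
    np.convolve(unit_out[:, 7], kernel)[:240],
]) + 0.05 * rng.normal(size=(240, 2))

best_unit, best_r = match_voxels_to_units(unit_out, voxel_ts, kernel)
print(best_unit, np.round(best_r, 2))
```

Each voxel ends up with a small discrete set of matched units (here just one), which is what makes the result easier to read than a dense vector of regression weights.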

For early visual areas, the authors estimate population receptive field properties of each voxel by averaging the matching CNN units’ receptive field properties. The authors restrict this analysis to layer 1 of the CNN and recover plausible retinotopic maps from the movie responses with this method.
Across the stages of the visual hierarchy, the authors label each voxel with the CNN layer that best explains it (see also Güçlü & van Gerven 2015). Here the authors assigned each voxel to the CNN layer containing the unit that achieved the maximum correlation with the voxel response. Consistent with the previous studies, this reveals a largely monotonic correspondence between the stages of the visual hierarchy and the layers of the CNN.

The CNN layer-8 face detector unit is correlated with the fusiform face area

The authors correlated voxels with an output unit trained to detect faces. The resulting univariate brain map highlights face-selective regions, including the fusiform face area (FFA). We already knew that FFA responds to faces, and does so during natural vision (movies; e.g. work by Hasson et al.). We also know that CNNs can recognise faces. Putting these together, CNNs must be able to predict responses of the FFA to some degree. Indeed previous studies have already explicitly shown that CNNs explain variance in FFA (Khaligh-Razavi & Kriegeskorte 2014, Fig. S3b; Eickenberg et al. 2016). Being able to detect faces is necessary for a computational account of FFA, but it is not sufficient. We are left wondering whether the model can explain the FFA profile of activation within faces and within inanimate images (Mur et al. 2012) or the representational geometry within those two categories.

Encoding models linearly combining CNN units can predict voxel responses to novel movie segments

The authors used CNN units as nonlinear image features of ridge regression models of voxel responses. They logarithmically transformed the CNN outputs to model a saturating response that better matches the distribution of fMRI response amplitudes across images. They convolved the unit responses to movie frames with a hemodynamic response function to model the smoothness and delay of fMRI responses. As in previous studies, these models were tested for generalisation to novel stimuli (different movie segments here). They explained significant variance across large swaths of visual cortex.
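
The pipeline described above (logarithmic transform, hemodynamic convolution, ridge regression, test on a held-out segment) might look roughly like this. It is a sketch on simulated data: the use of `np.log1p` as the logarithmic transform, the gamma-shaped kernel, the penalty `lam=1.0`, and all array sizes are assumptions, not details taken from the paper.

```python
import numpy as np

def hrf_kernel(tr=2.0, duration=20.0, peak=4.0, shape=3.0):
    """Hypothetical gamma-shaped kernel peaking `peak` seconds after onset."""
    t = np.arange(0.0, duration, tr)
    h = t ** shape * np.exp(-t * shape / peak)
    return h / h.sum()

def make_features(unit_out, kernel):
    """Log-transform unit outputs (assumed form: log(1 + x)), then convolve each with the kernel."""
    logged = np.log1p(unit_out)
    n_t = logged.shape[0]
    return np.column_stack([np.convolve(col, kernel)[:n_t] for col in logged.T])

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X'X + lam * I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
kernel = hrf_kernel()
n_units = 20
w_true = rng.normal(size=n_units)            # simulated ground-truth voxel weights

def simulate_segment(n_t):
    """Simulate one movie segment: non-negative unit outputs and a noisy voxel response."""
    unit_out = rng.gamma(2.0, 1.0, size=(n_t, n_units))
    X = make_features(unit_out, kernel)
    y = X @ w_true + 0.5 * rng.normal(size=n_t)
    return X, y

X_train, y_train = simulate_segment(400)
X_test, y_test = simulate_segment(200)       # novel segment, never used for fitting

# Centre with training-set means so that no intercept term is penalised.
x_mean, y_mean = X_train.mean(0), y_train.mean()
w = ridge_fit(X_train - x_mean, y_train - y_mean, lam=1.0)
y_pred = (X_test - x_mean) @ w + y_mean
r_test = np.corrcoef(y_pred, y_test)[0, 1]
print(f"prediction accuracy on the held-out segment: r = {r_test:.2f}")
```

In practice the penalty would be chosen by cross-validation within the training data, and the accuracy would be reported per voxel across the cortex.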

FFA encoding model prefers face images

The authors attempted to visualise the preferences of the FFA by presenting 20,400 images from the same 15 categories to the encoding model. These images were not used for training the deep net and were not part of the fMRI experiments used to train the encoding model. Of the 1,000 images (out of 20,400) for which the FFA voxel’s encoding model predicted the greatest response, 90.4% were faces.

The authors then averaged the images most strongly driving the FFA encoding model. They state: “Strikingly, the average visualization showed a blurred but discernable picture of a human face (Fig. 3.c, middle). For the first time, this result provides the direct visualization of the highly face-selective functional representation at FFA.”

This result is not particularly compelling and invites incorrect interpretations. It reflects the known fact that FFA responds to faces (Kanwisher et al. 1997) and the central/frontal-view bias for faces in the image set used to make the visualisation. Visualising the FFA response as an image template misses the point of a deep hierarchy of nonlinear transformation: If an image template could characterise the response of FFA, multiple nonlinear transformations would not be needed.

To test for the necessity of multiple nonlinear stages, the encoding model could be replaced by a template matcher. More generally, model compression could be used to attempt to explain the FFA response with a shallower model. This might yield deeper insights into the computations underlying FFA responses.

Deep convolutional encoding models can be combined with deconvolutional inversion to visualise what image features drive individual voxels

The authors use the deconvolutional visualisation technique of Zeiler & Fergus on voxel encoding models (deep convolutional network + ridge regression). This is a good idea. It enables the authors to visualise, for a particular image, what pixels most strongly affect the encoding model’s prediction of voxel activity. Again, a single template image cannot characterise a deep convolutional encoding model. However, looking at the encoding model’s gradient in image space in the context of many probe images might help us understand the model’s computation.

Images and semantic categories can be decoded not only within, but also between subjects

The authors analyse to what extent encoding and decoding models fitted with one subject’s data generalise to other subjects. It’s good to have these analyses in the paper because they give us a more quantitative sense of the degree to which the anatomical intersubject alignment succeeds at matching up functionally similar responses. As expected, within-subject encoding and decoding predictions are more accurate, but generalisation across subjects does work somewhat.

Decoding models can reconstruct natural movie stimuli and reveal semantic categories

The authors estimated multivariate regression models to predict time-varying feature maps of CNN layer 1 based on distributed cortical fMRI signals. Training was performed using an L1 (sparsity encouraging) penalty on the weights and random dropping out of a subset of the voxels during training with stochastic gradient descent. The layer-1 representation was then converted back to an image. As expected from previous studies (Thirion et al., Nishimoto et al., Miyawaki et al.) this works to some extent, highlighting regions of the image with high edge energy. The authors also decode semantic categories, which works quite well (again, not unexpectedly).

Strengths

tests a task-performing neurobiologically plausible (though abstract) deep neural network model
replicates previous findings on deep convolutional nets as models of fMRI visual responses and generalises these findings to movie viewing
analyses technically excellent in many respects
analyses are comprehensive and well motivated
brings the deconvolutional technique of Zeiler & Fergus for visualization of the model unit preferences (as gradient in image space in the context of particular preferred images) into the analysis of brain data
high-quality figures
well written

Weaknesses

The authors follow the widespread approach of looking to confirm what the field already believes. The rationale of this approach could be described as follows: “Here’s a really cool and novel demonstration of what you already believe.” It is as though we are afraid to learn something new or do not trust our methods unless they confirm current opinion. If our methods are sound, we should be able to apply them to questions whose answers we don’t know yet. As a result, the analyses here largely confirm what we know (or think we know) rather than providing surprising or contradictory new leads.
Some of the analyses shown are affected by selection bias. This is a widespread problem in systems neuroscience (Kriegeskorte et al. 2009). While the authors are certainly aware of the related challenges of overfitting, circularity, and selection bias (using separate training and test sets for both deep net training, and the fitting of encoding models), selective presentation of results in Figs. 2d, 3a, 7 might be misleading. It’s important to note that this problematic practice is widespread, motivated by a rationale that could be stated as follows: “Let me show what you already believe in a beautiful graph based on data selected to conform with what you believe and therefore biased (to an unknown degree) to confirm your belief.”

Points the authors may wish to address in revision

Selection bias

State clearly what data and criteria have been used to select voxels and ROIs whose results are shown (Figs. 2d, 3a, 3c, 7). Use independent data sets (1) for selection and ROI definition and (2) for estimating and plotting effects in the ROIs.
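
A small simulation illustrates why independent data are needed. It is a sketch on pure noise with hypothetical sizes (`n_t`, `n_vox`, `n_select`): voxels are selected for their high correlation with a regressor, and the effect is then estimated either on the same data (circular) or on a second, independent data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, n_vox, n_select = 100, 5000, 50

def corr_with(regressor, data):
    """Pearson correlation of one regressor (time,) with every column of data (time, voxels)."""
    zr = (regressor - regressor.mean()) / regressor.std()
    zd = (data - data.mean(0)) / data.std(0)
    return zr @ zd / len(regressor)

# Pure noise: no voxel is truly related to the regressor in either data set.
regressor_1, regressor_2 = rng.normal(size=n_t), rng.normal(size=n_t)
data_1 = rng.normal(size=(n_t, n_vox))
data_2 = rng.normal(size=(n_t, n_vox))

r_1 = corr_with(regressor_1, data_1)
r_2 = corr_with(regressor_2, data_2)

# Select the voxels with the highest correlations in data set 1.
selected = np.argsort(r_1)[-n_select:]

circular_estimate = r_1[selected].mean()      # same data for selection and estimation
independent_estimate = r_2[selected].mean()   # independent data for estimation
print(f"circular: {circular_estimate:.2f}, independent: {independent_estimate:.2f}")
```

The circular estimate is clearly positive although no effect exists, whereas the independent estimate stays near zero.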

Inference must account for serial autocorrelation of fMRI time courses

Time points in fMRI are highly dependent, so cannot be considered exchangeable under H0. You could use prewhitening (e.g. the Cochrane-Orcutt method) or simulate a realistic null distribution by block permutation. Temporal block permutation could use long blocks and just enough of them to have a sufficient number of permutations. Some contexts where this applies are as follows:

“The significance of the cross correlation was assessed by calculating the p-value with the degree of freedom equal to the number of time points minus 2 (DOF=238, p<0.001, Bonferroni correction for the number of voxels).”

“The significance of the cross correlation was assessed by calculating the p-value with the degree of freedom equal to the number of time points minus 2 (DOF=238 with Bonferroni correction to account for the number of voxels and p<0.001).”

“To test the statistical significance of the average prediction accuracy, we performed a permutation test. From the fMRI-estimated feature maps, a large set of permuted feature maps were created by randomly and temporally shuffling the estimated feature maps for 10000 times.”
The shuffled entities should not be single time points.

“For example, a ‘face’ neuron in the CNN was significantly correlated with the fusiform face area (FFA) (r=0.25±0.057, p<0.01, corrected) (Fig. 2.d, right).” Does this take serial autocorrelation into account?
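
A block permutation test of the kind suggested above could be sketched as follows. The function name `block_permutation_pvalue`, the block length of 20 time points, and the simulated AR(1) time courses are assumptions for illustration. Contiguous blocks, rather than single time points, are the exchangeable units.

```python
import numpy as np

def block_permutation_pvalue(x, y, block_len, n_perm=1000, seed=0):
    """Two-sided p-value for corr(x, y), permuting contiguous blocks of x.

    Permuting whole blocks preserves the serial autocorrelation within each block.
    """
    rng = np.random.default_rng(seed)
    n_blocks = len(x) // block_len
    n = n_blocks * block_len                  # drop any incomplete final block
    x, y = np.asarray(x[:n], float), np.asarray(y[:n], float)
    r_obs = np.corrcoef(x, y)[0, 1]
    blocks = x.reshape(n_blocks, block_len)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = blocks[rng.permutation(n_blocks)].ravel()
        null[i] = np.corrcoef(shuffled, y)[0, 1]
    # Count the observed value as one member of the null set.
    p = (1 + np.sum(np.abs(null) >= abs(r_obs))) / (n_perm + 1)
    return r_obs, p

def ar1(n, phi, rng):
    """Simulated serially autocorrelated time course (first-order autoregressive)."""
    out = np.zeros(n)
    for t in range(1, n):
        out[t] = phi * out[t - 1] + rng.normal()
    return out

rng = np.random.default_rng(1)
unit = ar1(240, 0.8, rng)
voxel_related = unit + 0.3 * rng.normal(size=240)
voxel_unrelated = ar1(240, 0.8, rng)

r_rel, p_rel = block_permutation_pvalue(unit, voxel_related, block_len=20)
r_unrel, p_unrel = block_permutation_pvalue(unit, voxel_unrelated, block_len=20)
print(f"related: r={r_rel:.2f}, p={p_rel:.3f}; unrelated: r={r_unrel:.2f}, p={p_unrel:.3f}")
```

With 240 time points and blocks of 20, there are 12 blocks and therefore far more distinct permutations than are sampled here. Longer blocks preserve more of the autocorrelation at the cost of fewer possible permutations.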

Analyse where the deep convolutional net falls short

It would be great to visualise the ways the deep convolutional encoding model falls short.

To what extent can the replicable movie-responses in higher parietal, temporal and frontal cortices that appear in Fig. 2a be explained by the deep net? What layers best explain them?
In addition to the whole-brain cortical surface map of where the deep net explains significant variance, it would be good to see where it explains significantly less variance than another measurement for the same subject and the same movie segment. This would reveal where in the brain the deep net fails to explain all explainable variance.
It would also be good to show what aspects of visual responses the deep net misses in regions where it performs well overall. One way to do this would be to compare the fMRI-based image reconstruction to an image reconstruction obtained with the same method from the encoding-model-predicted fMRI responses. We might expect, for example, that the encoding model misses the attentional mechanisms of the brain. The reconstructions from the encoding model prediction might therefore reflect the edge energy in the image, equally representing foreground object and background scene, whereas the fMRI reconstruction reflects attentional selection of the foreground object.

Deconvolution cannot invert a deep convolutional feedforward network

Several passages in the paper currently suggest that deconvolution exactly inverts the convolutional feedforward processing, for example:

“It is worth noting that nonlinear features coded in the CNN were not isolated functions, but connected through hierarchical networks that are fully computable either bottom-up or top-down.”
“In addition, the model is fully observable and computable both forward and backward (Zeiler and Fergus, 2014), such that the extracted features can be transformed, either top-down or bottom-up, to visualize their internal representations, to reconstruct the visual input, as well as to deduce its semantic categorization.”

However, rectified linear units and max-pooling are noninvertible, image-dependent switches and “transpose as inverse” only provide approximations, and many dimensions are intentionally discarded across stages of the feedforward hierarchy. Please clarify these approximations and how they relate to (a) the goal of reconstruction of the image and (b) the goal to visualise the image features driving particular model responses.
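
A toy example makes the loss of information concrete. It is a one-dimensional sketch with hypothetical helper names (`max_pool_with_switches`, `unpool`), not the Zeiler & Fergus implementation: two different inputs yield identical pooled outputs and identical switches, so the switch-based unpooling returns the same approximation for both.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_with_switches(x, size=2):
    """1-D max-pooling that also records the position ('switch') of each maximum."""
    blocks = x.reshape(-1, size)
    return blocks.max(axis=1), blocks.argmax(axis=1)

def unpool(pooled, switches, size=2):
    """Place each pooled value back at its recorded position; all other entries are zero."""
    out = np.zeros((len(pooled), size))
    out[np.arange(len(pooled)), switches] = pooled
    return out.ravel()

a = np.array([1.0, -2.0, 0.5, 3.0])
b = np.array([1.0, -7.0, 2.5, 3.0])

pooled_a, switches_a = max_pool_with_switches(relu(a))
pooled_b, switches_b = max_pool_with_switches(relu(b))

# Both inputs map to the same pooled output and the same switches ...
print(pooled_a, pooled_b, switches_a, switches_b)
# ... so the "inverse" is the same approximation for both, and equals neither input.
approx = unpool(pooled_a, switches_a)
print(approx)
```

The negative values removed by rectification and the non-maximal values removed by pooling cannot be recovered from the output, even with the switches.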

Open science

Will data and code be available for other researchers to build on the findings?

Image reconstruction with error-flow from all layers

Reconstruction of natural movie stimuli is performed based on the decoding from fMRI of the 1st layer of the CNN only. Higher layers of the CNN should be better decodable from higher cortical regions. The ideal reconstruction would simultaneously decode all layers of the CNN from all cortical regions, using this information as a joint constraint for image reconstruction. Image reconstruction could rely on backpropagation down to the image input, with error flow from all layers of the CNN.

fMRI-like preprocessing of the CNN outputs

The authors take the logarithm of CNN outputs and convolve with an HRF model kernel peaking at 4 s latency. It would be good to know how this reasonable approach performs in comparison to other approaches, e.g. leaving out the logarithmic transform and/or estimating the HRF and its peak latency for each voxel (e.g. Pedregosa et al. pp2014).

Motivation for different weight priors implicit to regularisation in encoding and decoding

In the encoding models, the weight regularizer is an L2 penalty (ridge regression), whereas in the decoding models it is an L1 penalty (LASSO).

What’s the motivation for these choices?

Were both approaches tried for both encoding and decoding? If results were different, what would be the implications?

What is the motivation for dropout (another approach to regularisation) in fitting the decoders? Is it to prevent informative voxels from being pushed down because the L1 penalty encourages sparsity?

Language

There are some typos and minor grammatical errors (e.g. “in details” -> “in detail”).

– Nikolaus Kriegeskorte & Johannes Mehrer

Can the contents of consciousness be decoded from patterns of integrated information?

[I8R5]

Consciousness is fascinating and elusive. There is “the hard problem” of how the dynamics of matter can give rise to subjective experience. “The hard problem” (Chalmers) is how some philosophers describe their own job, a job that is both appropriately glamorous and career safe, because it is not about to be taken away from them anytime soon and so difficult that lack of progress in our lifetime cannot reasonably be held against them. Brain scientists are left with “the easy problem” of explaining how the brain supports perception, cognition, and action. What’s taking so long?

Transcending this division of labour, at the intersection between philosophy and brain science, researchers are working on what Anil Seth has called “the real problem”:

“how to account for the various properties of consciousness in terms of biological mechanisms; without pretending it doesn’t exist (easy problem) and without worrying too much about explaining its existence in the first place (hard problem).”

There is a range of interesting ideas toward a theory of consciousness, from metaphors like “fame in the brain” to detailed accounts like the “neuronal global workspace” (Baars, Dehaene). In my mind, it remains unclear to what extent existing proposals are alternative theories that are mutually exclusive or complementary descriptions of the same set of phenomena.

One of the more inspiring sets of ideas about consciousness is integrated information theory (Tononi). IIT posits that consciousness arises from the interactions between the parts of a physical system and allows of degrees. The degree of consciousness of a system can be measured by an index of the overall interactivity among the parts.

States of heightened consciousness are states in which we experience an enhanced capacity to bring together in the present moment all we perceive and all we know with our needs and goals, toward adaptive action.

Our brains, mysteriously, perform an amazing feat of flexible integration of information across many scales of time (from long-term memories to our current situational model and to the momentary glimpse, in which we sense the states of motion of the objects around us), across our peripersonal space (from the scene surrounding us, in memory, to the fixated point), and across sensory modalities (as we combine what we see, hear, feel, smell and taste into an amodal percept of the scene).

And this is just the perceptual part of the process, which is integrated with our sense of current needs and goals to guide our action. It is plausible that this feat of intelligence, which is unmatched by current AI systems, requires high-bandwidth interactions between the brain components that sustain it. IIT suggests that those pieces of information, from perception or memory, that are currently most richly interrelated are the conscious ones. This doesn’t follow, but it is an interesting idea.

Intuitions about social interaction similarly suggest that interactivity is essential for efficient information processing. For example, it is difficult to imagine a team of people working together optimally efficiently on a complex task, if a subset of them is not integrated, i.e. does not interact with the rest of the group. Of course, there are simple tasks, for which independent toiling is optimal. I’m thinking here of tasks that do not require considering all the relationships between subsets of the input. But for complex tasks, like writing a paper, we might expect substantial interactivity to be required.

In computer science, NP-hard tasks are those for which no trick is known that would enable us to partition the elements into a manageable set of subsets and tackle each in turn. Instead, relationships among elements may need to be considered for all subsets, and the number of subsets is exponential in the number of elements. The elements have to be brought into contact somehow, so we expect the system that can solve the task efficiently to be highly interactive.

A key idea of IIT is that a conscious system should be well integrated in the sense that no matter how we partition it, the partitions are highly interactive. IIT uses information theoretic measures to quantify integrated information. These measures are related to Granger causality. For two components A and B, A is said to Granger-cause B if the past values of A help predict B, beyond what can be achieved by considering only the past of B itself. For the same system composed of parts A and B, a measure of integrated information would assess to what extent taking the interactions between A and B into account enables us to better predict the state of the system (comprising both A and B) than ignoring the interactions.
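
The definition of Granger causality given above can be made concrete with a short simulation. This sketch illustrates Granger causality only, not φ or any IIT measure, and the simulated coupling (A drives B with a lag of one step, with no influence from B to A) is an assumption chosen for the example.

```python
import numpy as np

def residual_variance(target, predictors):
    """Variance of the residuals of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.var(target - X @ beta)

def granger(a, b, lag=1):
    """Strength of 'a Granger-causes b': log ratio of residual variances.

    Restricted model: predict b from its own past.
    Full model: predict b from its own past and the past of a.
    """
    n = len(b)
    target = b[lag:]
    past_b = [b[lag - k : n - k] for k in range(1, lag + 1)]
    past_a = [a[lag - k : n - k] for k in range(1, lag + 1)]
    v_restricted = residual_variance(target, past_b)
    v_full = residual_variance(target, past_b + past_a)
    return np.log(v_restricted / v_full)

# Simulated system: A evolves on its own; B depends on its own past and on the past of A.
rng = np.random.default_rng(0)
n = 2000
a, b = np.zeros(n), np.zeros(n)
for t in range(1, n):
    a[t] = 0.5 * a[t - 1] + rng.normal()
    b[t] = 0.5 * b[t - 1] + 0.8 * a[t - 1] + rng.normal()

gc_a_to_b = granger(a, b)
gc_b_to_a = granger(b, a)
print(f"A -> B: {gc_a_to_b:.3f}, B -> A: {gc_b_to_a:.3f}")
```

The past of A improves the prediction of B, so the first value is clearly positive, whereas the past of B adds almost nothing to the prediction of A.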

For a more complex system, integrated information measures consider all subsets. The integrated information of the whole is the maximum of the integrated information values of the subsets. In other words, the system inherits its level of integrated information φ^max from its most strongly interactive clique of components. Each subset’s interactivity is judged by the degree to which it cannot be partitioned (and interactions across partitions ignored) in predicting the current state from the past. A system is considered highly interactive if any partitioning greatly reduces an estimate of the mutual information between its past and present states.

Note that to achieve high integrated information, the information flow must not simply spread the information, rendering it redundant across the parts. Rather the information in different parts must be complementary and must be encoded such that it needs to be considered jointly to reveal its meaning.

For example, consider binary variables X, Y, and Z. X and Y are independent uniform random variables and Z = X xor Y, i.e. Z=1 if either X or Y is 1, but not both. Each variable then has an entropy of one bit. X and Y each singly contain no information about Z. Being told X does not tell us anything about Z, because Y is needed to interpret the information X conveys about Z. Conversely, X is needed to interpret the information Y conveys about Z. X and Y together perfectly determine Z. (The mutual information I(X;Z) = 0, the mutual information I(Y;Z)=0, but the mutual information I(X,Y;Z) = 1 bit.)
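
The numbers in this example can be checked directly from the joint distribution. The helper `mutual_information` below is a hypothetical name; it computes the mutual information in bits between two groups of variables from a table of outcome probabilities.

```python
from collections import defaultdict
from itertools import product
from math import log2

def mutual_information(joint, a_idx, b_idx):
    """I(A;B) in bits, for a joint distribution given as {outcome tuple: probability}.

    a_idx and b_idx list the positions in the outcome tuple that make up A and B.
    """
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] += p
        pb[b] += p
        pab[(a, b)] += p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

# X and Y are independent fair coin flips; Z = X xor Y. Outcomes are (x, y, z).
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

i_xz = mutual_information(joint, [0], [2])       # I(X;Z)
i_yz = mutual_information(joint, [1], [2])       # I(Y;Z)
i_xyz = mutual_information(joint, [0, 1], [2])   # I(X,Y;Z)
print(i_xz, i_yz, i_xyz)  # 0.0 0.0 1.0
```

Neither X nor Y alone carries any information about Z, yet together they determine it completely: the information is present only in the combination.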

Figure | The continuous flash suppression paradigm used by the authors. A stimulus presented to one eye is rendered invisible by a sequence of Mondrian images presented to the other eye.

In a new paper, Haun, Oizumi, Kovach, Kawasaki, Oya, Howard, Adolphs, and Tsuchiya (pp2016) derive some interesting predictions from integrated information theory and test them with electrocorticography, measuring neuronal activity in human patients with implanted subdural electrodes. The authors use the established psychophysical paradigms of continuous flash suppression and backward masking to render stimuli that are processed in cortex subjectively invisible and their representations, thus, unconscious.

The paper uses the previously described measure φ* of integrated information. This measure uses estimates of mutual information between past and present states of a set of measurement channels. The mutual information is estimated on the basis of multivariate Gaussian assumptions. Computing φ* involves estimating the effects of partitioning the set of channels, by modelling the partition distributions as independent (i.e. the joint distribution is obtained as the product of the partitions’ distributions). φ* is the loss in system past-to-present predictability incurred by the least destructive partitioning.

The paper introduces the concept of the φ* pattern, the pattern of φ* estimates across subsets of components of the system (where electrodes pragmatically serve to define the components). The φ* pattern is hypothesized to reflect the compositional structure of the conscious percept.

Results suggest that stronger φ* values for certain sets of electrodes in the fusiform gyrus, which pick up on face-selective responses, are associated with conscious percepts of faces (as opposed to Mondrian images or visual noise). This association holds even across sets of trials, where the physical stimulus was identical and only the internal dynamics rendered the face representation conscious or unconscious. The authors argue that these results support IIT and suggest that the φ* pattern reflects information about the conscious percept.

Strengths

The ideas in the paper are creative, provocative, and inspiring.
The paper uses well-established psychophysical paradigms to control the contents of consciousness and disentangle conscious perception from stimulus representation.
The φ* measure is well motivated by IIT and has been introduced in earlier work involving some of the authors – even if its relationship to consciousness is speculative.

Weaknesses

The authors introduce the φ* pattern and hypothesize that it reflects the compositional structure of conscious content. However, theoretically, it is unclear why it should be that pattern across subsets of components, rather than simply the pattern across components that reflects the compositional structure of conscious content. Empirically, results are most parsimoniously summarised by saying that φ* tends to be larger when the content represented by the underlying neuronal population is conscious. The evidence for a reflection of the compositional structure of conscious content in the φ* pattern is weak.
It is unclear how φ* is related to the overall activity in sets of neurons selective for the perceptual content in question (faces here). This leaves open the possibility that face selective neurons are simply more active when the face percept is conscious and this greater activity is associated with greater interactivity among the neurons, reflecting their structural connectivity.
The finding that the alternative measures, state entropy H and (past-present) mutual information I, are less predictive of conscious percepts does not provide strong constraints on theory, because these measures are not particularly plausible to begin with and no compelling theoretical motivation is given for them.
IIT suggests that integrated information across the entire brain supports consciousness. An unavoidable challenge for empirical studies, as the authors appropriately discuss, is the limitation of the φ* estimates to small sets of empirical measurements of brain activity.

Particular points the authors may wish to address in revision

**(1) Are face-selective populations of neurons simply more active when a face is consciously perceived and φ* rises as an epiphenomenon of greater activity in the interconnected set of neurons?**

It is left unclear whether the level of percept-specific neuronal activity provides a comparably good or better neural correlate of conscious content. The data presented have been analysed with more conventional activity-based pattern classification in Baroni et al. (pp2016) and results suggest that this also works. What if the substrate of consciousness is simply strong activity or activity in certain frequency bands and the φ* just happens to be a measure correlated with those simpler measures in a population of neurons? After all, we would expect an interconnected neuronal population to exhibit greater dynamic interactivity when it is strongly driven by a stimulus. The key challenge left unaddressed is to demonstrate that φ* cannot be reduced to this classical neuronal correlate of perceptual content. Do the two tend to be correlated? Can they be disentangled experimentally?

A compelling demonstration would be to show that φ* (or another IIT-motivated measure) captures variance in conscious content that is not explained by conventional decoding features. For example, two populations of neurons – one coding a face, the other a Mondrian – might be equally activated overall by a stimulus containing both a face and a Mondrian, but φ* computed for each population might enable us to predict the consciously perceived stimulus on a trial-by-trial basis.

**(2) Does the φ* pattern reflect the conscious percept and its compositional structure?**

A demonstration that the φ* pattern (across subsets) reflects the compositional structure of the content of consciousness would require an experiment eliciting a wider range of conscious percepts that are composed of a set of elements in different combinations.

The authors’ hypothesis would then have to be compared to a range of simpler hypotheses about the neural correlates of compositional conscious content (NC⁴), including the following:

the pattern of activity across content-selective neural sites (rather than across subsets of sites)
the pattern of activity across subsets of sites
the connectivity across content-selective neural sites (where connectivity could be measured by synchrony, coherence, Granger causality or any other measure of the relationship between two sites)
the connectivity across content-selective neural subsets of sites

This list could be expanded indefinitely and could include a variety of IIT-inspired but distinct NC⁴s. There are many ideas that are similarly theoretically plausible, so empirical tests might be the best way forward.

In the discussion the authors argue that integrated information has greater a priori theoretical support than arbitrary alternative neural correlates of consciousness. There is some truth to that. However, the theoretical motivation, while plausible and interesting, is not so uniquely compelling that it supports lowering the bar of empirical confirmation for IIT measures.

**(3) Might the selection of channels by maximum φ* have introduced a bias to the analyses?**

I understand that the selection was performed without using the conscious/unconscious trial labels. However, conscious percepts are likely to be associated with greater activity, and φ* might be confounded by greater activity. More generally, selection biases are often complicated, and without a compelling demonstration that there can be no selection bias, it is difficult to be confident. A simple way to rule out selection bias is to use independent data for selection and selective analysis.
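The split-data remedy can be illustrated with a minimal simulation. The sketch below uses pure noise in place of real recordings; the trial and channel counts, and the use of the mean response as the selection statistic, are assumptions made for illustration and are not the authors' pipeline. Selecting channels and estimating the effect on the same data produces a spurious positive effect, whereas estimating it on independent data does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_channels, n_select = 200, 500, 20

# Pure-noise data: no channel carries a true effect (illustrative assumption).
data = rng.standard_normal((n_trials, n_channels))

# Split the trials into two independent halves.
select_half, test_half = data[:100], data[100:]

# Select the channels with the largest mean response in the selection half.
selected = np.argsort(select_half.mean(axis=0))[-n_select:]

# Circular estimate: the same data are used for selection and analysis.
biased_effect = select_half[:, selected].mean()

# Independent estimate: the held-out half is used for the selective analysis.
unbiased_effect = test_half[:, selected].mean()

print(f"same-data estimate:   {biased_effect:.3f}")
print(f"independent estimate: {unbiased_effect:.3f}")
```

The same logic would apply if the selection statistic were φ* rather than mean activity: any quantity used for selection should be recomputed on data that played no role in the selection.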

– Nikolaus Kriegeskorte

Acknowledgement

I thank Kate Storrs for discussing integrated information theory with me.

Discrete-event-sequence model reveals the multi-time-scale brain representation of experience and recall

[I8R7]

Baldassano, Chen, Zadbood, Pillow, Hasson & Norman (pp2016) investigated brain representations of event sequences with fMRI. The paper argues in favour of an intriguing and comprehensive account of the representation of event sequences in the brain as we experience them, their storage in episodic memory, and their later recall.

The overall story is quite amazing and goes like this: Event sequences are represented at multiple time scales across brain regions during experience. The brain somehow parses the continuous stream of experience into discrete pieces, called events. This temporal segmentation occurs at multiple temporal scales, corresponding perhaps to a tree of higher-level (longer) events and subevents. Whether the lower-level events precisely subdivide higher-level events (rendering the multiscale structure a tree) is an open question, but at least different regions represent event structure at different scales. Each brain region has its particular time scale and represents an event as a spatial pattern of activity. The encoding in episodic memory does not occur continuously, but rather in bursts following the event transitions at one of the longer time scales. During recall from memory, the event representations are reinstated, initially in the higher-level regions, from which the more detailed temporal structure may come to be associated in the lower-level regions. Event representations can arise from perceptual experience (a movie here), recall (telling the story), or from listening to a narration. If the event sequence of a narration is familiar, memory recall can help reinstate representations upcoming in the narration in advance.

There’s previous evidence for event segmentation (Zacks et al. 2007) and multi-time-scale representation (from regional-mean activation to movies that are temporally scrambled at different temporal scales; Hasson et al. 2008; see also Hasson et al. 2015) and for increased hippocampal activity at event boundaries (Ben-Yakov et al. 2013). However, the present study investigates pattern representations and introduces a novel model for discovering the inherent sequence of event representations in regional multivariate fMRI pattern time courses.

The model assumes that a region represents each event k = 1..K as a static spatial pattern m_k of activity that lasts for the duration of the event and is followed by a different static pattern m_{k+1} representing the next event. This idea is formalised in a Hidden Markov Model with K hidden states arranged in sequence with transitions (to the next time point) leading either to the same state (remain) or to the next state (switch). Each state k is associated with a regional activity pattern m_k, which remains static for the duration of the state (the event). The number of events for a given region’s representation of, say, 50 minutes’ experience of a movie is chosen so as to maximise within-event minus between-event pattern correlation on a held-out subject.
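To make the remain-or-switch structure concrete, here is a minimal sketch of the decoding step only. It assumes the event patterns m_k are already known and that the noise is isotropic Gaussian; the function name, the `p_switch` parameter, and the simulated data are hypothetical, and the paper's model additionally has to estimate the patterns from the data.

```python
import numpy as np

def viterbi_left_to_right(data, patterns, p_switch=0.1):
    """Most likely state sequence for a left-to-right HMM.

    data: (T, V) pattern time course; patterns: (K, V) static event patterns.
    The path starts in state 0, ends in state K-1, and at each time point
    either remains in the current state or switches to the next one.
    """
    T, K = data.shape[0], patterns.shape[0]
    # Log-likelihood of each time point under each state (up to a constant),
    # assuming isotropic unit-variance Gaussian noise.
    loglik = -0.5 * ((data[:, None, :] - patterns[None, :, :]) ** 2).sum(axis=2)
    log_stay, log_switch = np.log(1 - p_switch), np.log(p_switch)

    score = np.full((T, K), -np.inf)
    switched = np.zeros((T, K), dtype=bool)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        stay = score[t - 1] + log_stay
        switch = np.full(K, -np.inf)
        switch[1:] = score[t - 1, :-1] + log_switch
        switched[t] = switch > stay
        score[t] = np.maximum(stay, switch) + loglik[t]

    # Backtrack from the final state.
    states = np.empty(T, dtype=int)
    states[-1] = K - 1
    for t in range(T - 1, 0, -1):
        states[t - 1] = states[t] - 1 if switched[t, states[t]] else states[t]
    return states

# Simulated example: three events of ten time points each.
rng = np.random.default_rng(0)
true_states = np.repeat([0, 1, 2], 10)
patterns = rng.standard_normal((3, 50))
data = patterns[true_states] + 0.5 * rng.standard_normal((30, 50))
decoded = viterbi_left_to_right(data, patterns)
print(decoded)
```

Because every admissible path contains exactly K-1 switches, the transition probabilities do not favour one segmentation over another in this simplified version; the boundaries are determined by the fit of the static patterns alone.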

It’s a highly inspired paper and a fun read. Many of the analyses are compelling. The authors argue for such a comprehensive set of claims that it’s a tall order for any single paper to fully substantiate all of them. My feeling is that the authors are definitely onto something. However, as usual there may be alternative explanations for some of the results and I am left with many questions.

Strengths

The paper is very ambitious, both in terms of brain theory and in terms of analysis methodology.
The Hidden Markov Model of event sequence representation is well motivated, original, and exciting. I think this has great potential for future studies.
The overall account of multi-time-scale event representation, episodic memory encoding, and recall is plausible and fascinating.

Weaknesses

Incomplete description and validation of the new method: The Hidden Markov Model is great and quite well described. However, the paper covers a lot of ground in terms of the different data sets, the range of phenomena tackled (experience, memory, recall, multimodal representation, memory-based prediction), the brain regions analysed (many regions across the entire brain), and the methodology (novel complex method). This is impressive, but it also means that there is not much space to fully explain everything. As a result there are several important aspects of the analysis that I am not confident I fully understood. It would be good to describe the new method in a separate paper where there is enough space to validate and discuss it in detail. In addition, the present paper needs a methods figure and a more step-by-step description to explain the pattern analyses.
The content and spatial grain of the event representations is unclear. The analyses focus on the sequence of events and the degree to which the measured pattern is more similar within than between inferred event boundaries. Although this is a good idea, I would have more confidence in the claims if the content of the representations was explicitly investigated (e.g. representational patterns that recur during the movie experience could represent recurring elements of the scenes).
Not all claims are fully justified. The paper claims that events are represented by static patterns, but this is a model assumption, not something demonstrated with the data. It’s also claimed that event boundaries trigger storage in long-term memory, but hippocampal activity appears to rise before event boundaries (with the hemodynamic peak slightly after the boundaries). The paper could even more clearly explain exactly what previous studies showed, what was assumed in the model (e.g. static spatial activity patterns representing the current event) and what was discovered from the data (event sequence in each region).

Particular points the authors may wish to address in revision

(1) Do the analyses reflect fine-grained pattern representations?

The description of exactly how evidence is related between subjects is not entirely clear. However, several statements suggest that the analysis assumes that representational patterns are aligned across subjects, such that they can be directly compared and averaged across subjects. The MNI-based intersubject correspondency is going to be very imprecise. I would expect that the assumption of intersubject spatial correspondence lowers the de facto resolution from 3 mm to about a centimetre. The searchlight was a very big (7 voxels = 2.1 cm)³ cube, so perhaps still contained some coarse-scale pattern information.

However, even if there is evidence for some degree of intersubject spatial correspondence (as the authors say results in Chen et al. 2016 suggest), I think it would be preferable to perform the analyses in a way that is sensitive also to fine-grained pattern information that does not align across subjects in MNI space. To this end patterns could be appended, instead of averaged, across subjects along the spatial (i.e. voxel) dimension, or higher-level statistics, such as time-by-time pattern dissimilarities, could be averaged across subjects.
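The two alternatives suggested here can be sketched as follows. This is an illustration on simulated data, under the assumption that each subject has its own, unaligned voxel patterns for a shared sequence of events; all variable names and sizes are hypothetical.

```python
import numpy as np

def time_by_time_rdm(patterns):
    """Correlation-distance matrix between the patterns at all time points.

    patterns: (T, V) array with one spatial pattern per time point.
    """
    return 1.0 - np.corrcoef(patterns)

rng = np.random.default_rng(1)
n_subjects, n_voxels = 4, 50
event_of_timepoint = np.repeat([0, 1, 2], 10)  # 30 time points, 3 events

subject_data = []
for _ in range(n_subjects):
    # Subject-specific fine-grained patterns that do not align across subjects.
    event_patterns = rng.standard_normal((3, n_voxels))
    noise = 0.5 * rng.standard_normal((30, n_voxels))
    subject_data.append(event_patterns[event_of_timepoint] + noise)

# Option 1: append subjects along the voxel dimension, then compute one RDM.
concatenated = np.concatenate(subject_data, axis=1)  # (30, 200)
rdm_concat = time_by_time_rdm(concatenated)

# Option 2: compute an RDM within each subject, then average the RDMs.
rdm_mean = np.mean([time_by_time_rdm(d) for d in subject_data], axis=0)

same_event = event_of_timepoint[:, None] == event_of_timepoint[None, :]
off_diagonal = ~np.eye(30, dtype=bool)
within = rdm_mean[same_event & off_diagonal].mean()
between = rdm_mean[~same_event].mean()
print(f"within-event distance {within:.2f}, between-event distance {between:.2f}")
```

Neither option requires voxel-to-voxel correspondence between subjects; only the time axis has to be shared.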

If the analyses really rely on MNI intersubject correspondency, then the term “fine-grained” seems inappropriate. In either case, the question of the grain of the representational patterns should be explicitly discussed.

(2) What is the content of the event representations?

The Hidden Markov Model is great for capturing the boundaries between events. However, it does not capture the meaning and relationships between the event representations. It would be great to see the full time-by-time representational dissimilarity matrices (RDMs; or pattern similarity matrices) for multiple regions (and for single subjects and averaged across subjects). It would also be useful to average the dissimilarities within each pair of events to obtain event-by-event RDMs. These should reveal when events recur in the movie and how similar different events are in each brain region. If each event were unique in the movie experience, these RDMs would have a diagonal structure. Analysing the content of the event representations in some way seems essential to the interpretation that the patterns represent events.
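The averaging step that turns a time-by-time RDM into an event-by-event RDM might look like the sketch below. The helper name and the simulated movie, in which the third event revisits the scene of the first, are hypothetical and serve only to show how a recurring event would appear off the diagonal.

```python
import numpy as np

def event_by_event_rdm(time_rdm, event_labels):
    """Average a (T, T) dissimilarity matrix within each pair of events.

    Assumes every event spans at least two time points, so that the
    within-event average can exclude the zero self-dissimilarities.
    """
    events = np.unique(event_labels)
    rdm = np.zeros((len(events), len(events)))
    for i, a in enumerate(events):
        for j, b in enumerate(events):
            block = time_rdm[np.ix_(event_labels == a, event_labels == b)]
            if i == j:
                # Drop the diagonal (each time point compared with itself).
                block = block[~np.eye(block.shape[0], dtype=bool)]
            rdm[i, j] = block.mean()
    return rdm

rng = np.random.default_rng(2)
n_voxels = 60
event_labels = np.repeat([0, 1, 2], 10)
scene_a, scene_b = rng.standard_normal((2, n_voxels))
patterns = np.stack([scene_a, scene_b, scene_a])  # event 2 revisits scene A
data = patterns[event_labels] + 0.5 * rng.standard_normal((30, n_voxels))

time_rdm = 1.0 - np.corrcoef(data)
event_rdm = event_by_event_rdm(time_rdm, event_labels)
print(np.round(event_rdm, 2))
```

In this toy case the dissimilarity between events 0 and 2 is as low as the within-event dissimilarities, which is the kind of structure that would indicate recurring content.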

(3) Why do the time-by-time pattern similarity matrices look so low-dimensional?

The pattern correlations shown in Figure 2 for precuneus and V1 are very high in absolute value and seem to span the entire range from -1 to 1. (Are the patterns averaged across all subjects?) It looks like two events either have highly correlated or highly anticorrelated patterns. This would suggest that there are only two event representations and each event falls into one of two categories. Perhaps there are intermediate values, but the structure of these matrices looks low-dimensional (essentially 1 dimensional) to me. The strong negative correlations might be produced by the way the data are processed, which could be more clearly described. For example, if the ensemble of patterns were centered in the response space by subtracting the mean pattern from each pattern, then strong negative correlations would arise.

I am wondering to what extent these matrices might reflect coarse-scale overall activation fluctuations rather than detailed representations of individual events. The correlation distance removes the mean from each pattern, but usually different voxels respond with different gains, so activation scales rather than translates the pattern up. When patterns are centered in response space, 1-dimensional overall activation dynamics can lead to the appearance of correlated and anticorrelated pattern states (along with intermediate correlations) as seen here.
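This concern can be reproduced in a small simulation. The sketch below assumes a single overall activation time course that scales a fixed vector of voxel gains, plus a little noise; the specific time course, gain range, and noise level are arbitrary choices for illustration. Before centering, all patterns are highly correlated; after subtracting the mean pattern, the same one-dimensional dynamics yield strongly correlated and strongly anticorrelated pattern states.

```python
import numpy as np

rng = np.random.default_rng(3)
n_timepoints, n_voxels = 100, 80

# Each voxel responds to the same overall activation with its own gain.
gains = rng.uniform(0.5, 1.5, size=n_voxels)
activation = 1.0 + 0.5 * np.sin(np.linspace(0, 4 * np.pi, n_timepoints))
data = np.outer(activation, gains) + 0.02 * rng.standard_normal((n_timepoints, n_voxels))

# Time-by-time pattern correlations of the raw patterns.
raw_corr = np.corrcoef(data)

# Centre the ensemble of patterns by subtracting the mean pattern.
centered = data - data.mean(axis=0, keepdims=True)
centered_corr = np.corrcoef(centered)

print(f"raw correlations:      min {raw_corr.min():.2f}, max {raw_corr.max():.2f}")
print(f"centered correlations: min {centered_corr.min():.2f}, max {centered_corr.max():.2f}")
```

Time points near the mean activation level carry little signal after centering, which produces the intermediate correlations.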

This concern relates also to points (1) and (2) above and could be addressed by analysing fine-grained within-subject patterns and the content of the event representations.

Detail from Figure 2: Time-by-time regional spatial-pattern correlation matrices.
Precuneus (top) and V1 (bottom).

(4) Do brain regions really represent a discrete sequence of events by a discrete sequence of patterns?

The paper currently claims to show that brain regions represent events as static patterns, with sudden switches at the event boundaries. However, this is not something that is demonstrated from the data, rather it is the assumption built into the Hidden Markov Model.

I very much like the Hidden Markov Model, because it provides a data-driven way to discover the event boundaries. The model assumptions of static patterns and sudden switches are fine for this purpose because they may provide an approximation to what is really going on. Sudden switches are plausible, since transitions between events are sudden cognitive phenomena. However, it seems unlikely that patterns are static within events. This claim should be removed or substantiated by an inferential comparison of the static-pattern sequence model with an alternative model that allows for dynamic patterns within each event.

(5) Why use the contrast of within- and between-event pattern correlation in held-out subjects as the criterion for evaluating the performance of the Hidden Markov Model?

If patterns are assumed to be aligned between subjects, the Hidden Markov Model could be used to directly predict the pattern time course in a held-out subject. (Predicting the average of the training subjects’ pattern time courses would provide a noise ceiling.) The within- minus between-event pattern correlation has the advantage that it doesn’t require the assumption of intersubject pattern alignment, but this advantage appears not to be exploited here. The within- minus between-event pattern correlation seems problematic here because patterns acquired closer in time tend to be more similar (Henriksson et al. 2015). First, the average within-event correlation should always tend to be larger than the average between-event correlation (unless the average between-event correlation were estimated from the same distribution of temporal lags). Such a positive bias would be no problem for comparing between different segmentations. However, if temporally close patterns are more similar, then even in the absence of any event structure, we expect that a certain number of events best captures the similarity among temporally nearby patterns. The inference of the best number of events would then be biased toward the number of events that best captures the continuous autocorrelation.
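The positive bias can be demonstrated with simulated data that have temporal autocorrelation but no event structure at all. The sketch below uses a first-order autoregressive process per voxel and arbitrary equal-length segmentations; the autoregressive coefficient and the candidate numbers of events are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_timepoints, n_voxels, rho = 120, 50, 0.9

# Smoothly drifting patterns: an AR(1) process per voxel, with no events.
data = np.zeros((n_timepoints, n_voxels))
data[0] = rng.standard_normal(n_voxels)
for t in range(1, n_timepoints):
    data[t] = rho * data[t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n_voxels)

corr = np.corrcoef(data)

def within_minus_between(corr, n_events):
    """Contrast for an arbitrary segmentation into equal-length 'events'."""
    n = corr.shape[0]
    labels = (np.arange(n) * n_events) // n
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(n, dtype=bool)
    return corr[same & off_diagonal].mean() - corr[~same].mean()

contrast = {k: within_minus_between(corr, k) for k in (2, 6, 12, 30)}
for k, value in contrast.items():
    print(f"{k:2d} arbitrary events: within minus between = {value:.2f}")
```

Every segmentation yields a positive contrast, and the size of the contrast depends on the number of events, although the data contain no events. Matching the distribution of temporal lags between the within-event and between-event pairs would be one way to remove this dependence.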

(6) More details on the recall reactivation

Fig. 5a is great. However, this is a complex analysis and it would be good to see this in all subjects and to also see the movie-to-recall pattern similarity matrix, with the human annotations-based and Hidden Markov Model-based time-warp trajectories superimposed. This would enable us to better understand the data and how the Hidden Markov Model arrives at the assignment of corresponding events.

In addition, it would be good to show statistically that the Hidden Markov Model predicts the content correspondence between movie and recall representations consistently with the human annotations.

(7) fMRI is a hemodynamic measure, not “neural data”.

“Using a data-driven event segmentation model that can identify temporal structure directly from neural measurements”; “Our results are the first to demonstrate a number of key predictions of event segmentation theory (Zacks et al., 2007) directly from neural data”

There are a couple of other places where “neural data” is used. Better terms include “fMRI data” and “brain activity patterns”.

(8) Is the structure of the multi-time-scale event segmentation a tree?

Do all regions that represent the same time-scale have the same event boundaries? Or do they provide alternative temporal segmentations? If it is the former, do short-time-scale regions strictly subdivide the segmentation of longer-time-scale regions, thus making the event structure a tree? Fig. 1 appears to be designed so as not to imply this claim. Data, of course, is noisy, so we don’t expect a perfect tree to emerge in the analysis, even if our brains did segment experience into a perfect tree. It would be good to perform an explicit statistical comparison between the temporal-tree event segmentation hypothesis and the more general multi-time-scale event segmentation hypothesis.

(9) Isn’t it a foregone conclusion that longer-time-scale regions’ temporal boundaries will match better to human-annotated boundaries?

“We then measured, for each brain searchlight, what fraction of its neurally-defined boundaries were close to (within three time points of) a human-labeled event boundary.”

For a region with twice as many boundaries as another region, this fraction is halved even if both regions match all human-labeled events. This analysis therefore appears strongly confounded by the number of events a region represents.
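A toy calculation shows the confound. The function below is a hypothetical implementation of the quoted measure (the fraction of a region's boundaries lying within three time points of a human-labeled boundary), and the boundary positions are made up for illustration.

```python
import numpy as np

def fraction_near(boundaries, reference, tolerance=3):
    """Fraction of `boundaries` within `tolerance` time points of any reference boundary."""
    boundaries = np.asarray(boundaries)
    reference = np.asarray(reference)
    nearest = np.abs(boundaries[:, None] - reference[None, :]).min(axis=1)
    return float((nearest <= tolerance).mean())

# Hypothetical human-labeled boundaries, one every 50 time points.
human = np.arange(50, 1000, 50)

# A long-time-scale region whose boundaries coincide with the human ones.
coarse_region = human.copy()

# A short-time-scale region with the same boundaries plus one extra
# boundary halfway between each pair, i.e. twice as many in total.
fine_region = np.sort(np.concatenate([human, human + 25]))

coarse_fraction = fraction_near(coarse_region, human)
fine_fraction = fraction_near(fine_region, human)

# Reversed measure: fraction of human boundaries matched by each region.
human_matched_by_coarse = fraction_near(human, coarse_region)
human_matched_by_fine = fraction_near(human, fine_region)

print(coarse_fraction, fine_fraction)
print(human_matched_by_coarse, human_matched_by_fine)
```

Both regions capture every human-labeled boundary, yet the reported fraction is 1.0 for the first region and 0.5 for the second, purely because of the difference in the number of boundaries.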

The confound could be removed by having humans segment the movie at multiple scales (or having them segment at a short time scale and assign saliency ratings to the boundaries). The number of events could then be matched before comparing segmentations between human observers and brain regions.

Conversely, and without requiring more human annotations, the HMM could be constrained to the number of events labelled by humans for each searchlight location. This would ensure that the fraction of matches to human observers’ boundaries can be compared between regions.

(10) Hippocampus response does not appear to be “triggered” by the end of the event, but starts much earlier.

The hemodynamic peak is about 2-3 s after the event boundary, so we should expect the neural activity to begin well before the event boundary.

(11) Is the time scale a region represents reflected in the temporal power spectrum of spontaneous fluctuations?

The studies presenting such evidence are cited, but it would be good to look at the temporal power spectrum also for the present data and relate these two perspectives. I don’t think the case for event representation by static patterns is quite compelling (yet). Looking at the data also from this perspective may help us get a fuller picture.

(12) The title and some of the terminology are ambiguous

The title “Discovering event structure in continuous narrative perception and memory” is, perhaps intentionally, ambiguous. It is unclear who or what “discovers” the event structure. On the one hand, the brain discovers event structure in the stream of experience. On the other hand, the Hidden Markov Model discovers good segmentations of regional pattern time courses. Although both interpretations work in retrospect, I would prefer a title that makes a point that’s clear from the beginning.

On a related note, the phrase “data-driven event segmentation model” suggests that the model performs the task of segmenting the sensory stream into events. This was initially confusing to me. In fact, what is used here is a brain-data-driven pattern time course segmentation model.

(13) Selection bias?

I was wondering about the possibility of selection bias (showing the data selected by brain mapping, which is biased by the selection process) for some of the figures, including Figs. 2, 4, and 7. It’s hard to resist illustrating the effects by showing selected data, but it can be misleading. Are the analyses for single searchlights? Could they be crossvalidated?

(14) Cubic searchlight

A spherical or surface-based searchlight would be better than a (2.1 cm)³ cube.

– Nikolaus Kriegeskorte

Acknowledgement

I thank Aya Ben-Yakov for discussing this paper with me.

How will the neurosciences be transformed by machine learning and big data?

[R8I7]

Machine learning and statistics have been rapidly advancing in the past decade. Boosted by big data sets, new methods for inference and prediction are transforming many fields of science and technology. How will these developments affect the neurosciences? Bzdok & Yeo (pp2016) take a stab at this question in a wide-ranging review of recent analyses of brain data with modern methods.

Their review paper is organised around four key dichotomies among approaches to data analysis. I will start by describing these dichotomies from my own perspective, which is broadly – though not exactly – consistent with Bzdok & Yeo’s.

Generative versus discriminative models: A generative model is a model of the process that generated the data (mapping from latent variables to data), whereas a discriminative model maps from the data to selected variables of interest.
Nonparametric versus parametric models: Parametric models are specified using a finite number of parameters and thus their flexibility is limited and cannot grow with the amount of data available. Nonparametric models can grow in complexity with the data: The set of numbers identifying a nonparametric model (which may still be called “parameters”) can grow without a predefined limit.
Bayesian versus frequentist inference: Bayesian inference starts by defining a prior over all models believed possible and then infers the posterior probability distribution over the models and their parameters on the basis of the data. Frequentist inference identifies variables of interest that can be computed from data and constructs confidence intervals and decision rules that are guaranteed to control the rate of errors across many hypothetical experimental analyses.
Out-of-sample prediction and generalisation versus within-sample explanation of variance: Within-sample explanation of variance attempts to best explain a given data set (relying on assumptions to account for the noise in the data and control overfitting). Out-of-sample prediction integrates empirical testing of the generalisation of the model to new data (and optionally to different experimental conditions) into the analysis, thus testing the model, including all assumptions that define it, more rigorously.

Generative models are more ambitious than discriminative models in that they attempt to account for the process that generated the data. Discriminative models are often leaner – designed to map directly from data to variables of interest, without engaging all the complexities of the data-generating process.

Nonparametric models are more flexible than parametric models and can adapt their complexity to the amount of information contained in the data. Parametric models can be more stable when estimated with limited data and can sometimes support more sensitive inference when their strong assumptions hold.

Philosophically, Bayesian inference is more attractive than frequentist inference because it computes the probability of models (and model parameters) given the givens (or in Latin: the data). In real life, also, Bayesian inference is what I would aim to roughly approximate to make important decisions, combining my prior beliefs with current evidence. Full Bayesian inference on a comprehensive generative model is the most rigorous (and glamorous) way of making inferences. Explicate all your prior knowledge and uncertainties in the model, then infer the probability distribution over all states of the world deemed possible given what you’ve been given: the data. I am totally in favour of Bayesian analysis for scientific data analysis from the armchair in front of the fireplace. It is only when I actually have to analyse data, at the very last moment, that I revert to frequentist methods.

My only problem with Bayesian inference is my lack of skill. I never finish enumerating the possible processes that might have given rise to the data. When I force myself to stop enumerating, I don’t know how to implement the (incomplete) list of possible processes in models. And if I forced myself to make the painful compromises to implement some of these processes in models, I wouldn’t know how to do approximate inference on the incomplete list of badly implemented models. I would know that many of the decisions I made along the way were haphazard and inevitably subjective, that all of them constrain and shape the posterior, and that few of them will be transparent to other researchers. At that point frequentist inference with discriminative models starts looking attractive. Just define easy-to-understand statistics of interest that can efficiently be computed from the data and estimate them with confidence intervals, controlling error probability without relying on subjective priors. Generative-model comparisons, as well, are often easier to implement in the frequentist framework.

Regarding the final dichotomy, out-of-sample prediction using fitted models provides a simple empirical test of generalisation. It can be used to test for generalisation to new measurements (e.g. responses to the same stimuli as in decoding) or to new conditions (as in cross-decoding and in encoding models, e.g. Kay et al. 2008; Naselaris et al. 2009). Out-of-sample prediction can be applied in crossvalidation, where the data are repeatedly split to maximise statistical efficiency (trading off computational efficiency).

Out-of-sample prediction tests are useful because they put assumptions to the test, erring on the safe side. Let’s say we want to test if two patterns are distinct and we believe the noise is multinormal and equal for both conditions. We could use a within-sample method like multivariate analysis of variance (MANOVA) to perform this test, relying on the assumption of multinormal noise. Alternatively, we could use out-of-sample prediction. Since we believe that the noise is multinormal, we might fit a Fisher linear discriminant, which is the Bayes-optimal classifier in this scenario. This enables us to project held-out data onto a single dimension, the discriminant, and use simpler inference statistics and fewer assumptions to perform the test. If multinormality were violated, the classifier would no longer be optimal, making prediction of the labels for held-out data work worse. We would more frequently err on the safe side of concluding that there is no difference, and the false-positives rate would still be controlled. MANOVA, by contrast, relies on multinormality for the validity of the test and violations might inflate the false-positives rate.
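The out-of-sample variant of this test can be sketched in a few lines. The example assumes simulated multinormal data with a true difference between the two conditions; the sample sizes, the dimensionality, and the use of an exact one-sided binomial test on held-out classification accuracy are illustrative choices, not the only way to do inference on the discriminant.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n_per_class, n_dims = 100, 10
true_shift = np.full(n_dims, 0.5)  # assumed true pattern difference

def simulate(n):
    """Draw n patterns per condition with equal, multinormal noise."""
    a = rng.standard_normal((n, n_dims))
    b = rng.standard_normal((n, n_dims)) + true_shift
    return a, b

train_a, train_b = simulate(n_per_class)
test_a, test_b = simulate(n_per_class)

# Fit the Fisher linear discriminant on the training set only.
mean_a, mean_b = train_a.mean(axis=0), train_b.mean(axis=0)
pooled_cov = (np.cov(train_a, rowvar=False) + np.cov(train_b, rowvar=False)) / 2
weights = np.linalg.solve(pooled_cov, mean_b - mean_a)
threshold = weights @ (mean_a + mean_b) / 2

# Project the held-out data onto the discriminant and classify.
n_correct = int((test_a @ weights < threshold).sum() + (test_b @ weights >= threshold).sum())
n_test = 2 * n_per_class
accuracy = n_correct / n_test

# Exact one-sided binomial test of the accuracy against chance (0.5).
p_value = sum(math.comb(n_test, k) for k in range(n_correct, n_test + 1)) / 2**n_test
print(f"held-out accuracy {accuracy:.2f}, p = {p_value:.2g}")
```

The inference on the held-out data involves only a count of correct predictions, so its validity does not depend on the multinormality assumption that motivated the choice of classifier.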

More generally, using separate training and test sets is a great way to integrate the cycle of exploration (hypothesis generation, fitting) and confirmation (testing) into the analysis of a single data set. The training set is used to select or fit, thus restricting the space of hypotheses to be tested. We can think of this as generating testable hypotheses. The training set helps us go from a vague hypothesis we don’t know how to test to a specifically defined hypothesis that is easy to test. The reason separate training and test sets are standard practice in machine learning and less widely used in statistics is that machine learning has more aggressively explored complex models that cannot be tested rigorously any other way. More on this here.

Out-of-sample prediction is not the alternative to p values

One thing I found uncompelling (though I’ve encountered it before, see Fig. 1) is the authors’ suggestion that out-of-sample prediction provides an alternative to null-hypothesis significance testing (NHST). Perhaps I’m missing something. In my view, as outlined above, out-of-sample prediction (the use of independent test sets) enables us to use one data set to restrict the hypothesis space (training) and another data set to do inference on the more specific hypotheses, which are easier to test with fewer assumptions. The prediction is the restricted hypothesis space. Just like within-sample analyses, out-of-sample prediction requires a framework for performing inference on the hypothesis space. This framework can be either frequentist (e.g. NHST) or Bayesian.

For example, after fitting encoding models to a training data set, we can measure the accuracy with which they predict the test set. The fitted models have all their parameters fixed, so are easy to test. However, we still need to assess whether the accuracy is greater than 0 for each model and whether one model has greater accuracy than another.
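As an illustration of that remaining inferential step, the sketch below fits two linear encoding models on simulated training data and then bootstraps the test stimuli to obtain confidence intervals for one model's accuracy and for the accuracy difference. The simulated features, the least-squares fit, and the percentile bootstrap are assumptions made for illustration; a permutation test or a Bayesian analysis could take the place of the bootstrap.

```python
import numpy as np

rng = np.random.default_rng(6)
n_train, n_test, n_features = 200, 100, 5

def simulate(n):
    """Responses depend on the 'relevant' features but not on the 'irrelevant' ones."""
    relevant = rng.standard_normal((n, n_features))
    irrelevant = rng.standard_normal((n, n_features))
    response = relevant @ np.ones(n_features) + rng.standard_normal(n)
    return relevant, irrelevant, response

rel_train, irr_train, y_train = simulate(n_train)
rel_test, irr_test, y_test = simulate(n_test)

def fit_and_predict(x_train, y_train, x_test):
    """Fit a linear encoding model on the training set, predict the test set."""
    weights, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return x_test @ weights

pred_a = fit_and_predict(rel_train, y_train, rel_test)  # model A
pred_b = fit_and_predict(irr_train, y_train, irr_test)  # model B

def accuracy(predicted, measured):
    return np.corrcoef(predicted, measured)[0, 1]

# Bootstrap the test stimuli; the fitted models stay fixed.
boot_a, boot_diff = [], []
for _ in range(2000):
    idx = rng.integers(0, n_test, size=n_test)
    acc_a = accuracy(pred_a[idx], y_test[idx])
    acc_b = accuracy(pred_b[idx], y_test[idx])
    boot_a.append(acc_a)
    boot_diff.append(acc_a - acc_b)

ci_a = np.percentile(boot_a, [2.5, 97.5])
ci_diff = np.percentile(boot_diff, [2.5, 97.5])
print(f"model A accuracy, 95% CI: [{ci_a[0]:.2f}, {ci_a[1]:.2f}]")
print(f"A minus B accuracy, 95% CI: [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]")
```

Out-of-sample prediction supplies the accuracies; the confidence intervals (or p values) are still needed to decide whether an accuracy exceeds 0 or whether one model outperforms another.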

Using up one part of the data to restrict the hypothesis space (fitting) and then using another to perform inference on the restricted hypothesis space (so as to avoid the bias of training set accuracy that results from overfitting) could be viewed as a crude approximation (vacillating between overfitting on one set and using another to correct) to the rigorous thing to do: updating the current probability distribution over all possibilities as each data point is encountered.

Figure 1: I don’t understand why some people think of out-of-sample prediction as an alternative to p values.

Bzdok & Yeo do a good job of appreciating the strengths and weaknesses of either choice of each of these dichotomies and considering ways to combine the strengths even of apparently opposed choices. Four boxes help introduce the key dichotomies to the uninitiated neuroscientist.

The paper provides a useful tour through recent neuroscience research using advanced analysis methods, with a particular focus on neuroimaging. The authors make several reasonable suggestions about where things might be going, suggesting that future data analyses will…

leverage big data sets and be more adaptive (using nonparametric models)
incorporate biological structure
combine the strengths of Bayesian and frequentist techniques
integrate out-of-sample generalisation (e.g. implemented in crossvalidation)

Weaknesses of the paper in its current form

The main text moves over many example studies and adds remarks on analysis methodology that will not be entirely comprehensible to a broad audience, because they presuppose very specialised knowledge.

Too dense: The paper covers a lot of ground, making it dense in parts. Some of the points made are not sufficiently developed to be fully compelling. It would also be good to reflect on the audience. If the audience is supposed to be neuroscientists, then many of the technical concepts would require substantial unpacking. If the audience were mainly experts in machine learning, then the neuroscientific concepts would need to be more fully explained. This is not easy to get right. I will try to illustrate these concerns further below in the “particular comments” section.

Too uncritical: A positive tone is a good thing, but I feel that the paper is a little too uncritical of claims in the literature. Many of the results cited, while exciting for the sophisticated models that are being used, stand and fall with the model assumptions they are based on. Model checking and comparison of many alternative models are not standard practice yet. It would be good to carefully revise the language, so as not to make highly speculative results sound definitive.

No discussion of task-performing models: The paper doesn’t explain what from my perspective is the most important distinction among neuroscience models: Do they perform cognitive tasks? That this distinction is not discussed in detail reflects the fact that such models are still rare in neuroscience. We use a lot of different kinds of model, but even when they are generative and causal and constrained by biological knowledge, they are still just data-descriptive models in the sense that they do not perform any interesting brain information processing. Although they may be stepping stones toward computational theory, such models do not really explain brain computation at any level of abstraction. Task-performing computational models, as introduced by cognitive science, are first evaluated by their ability to perform an information-processing function. Recently, deep neural networks that can perform feats of intelligence such as object recognition have been used to explain brain and behavioural data (Yamins et al. 2013; 2014; Khaligh-Razavi et al. 2014; Cadieu et al. 2014; Guclu & van Gerven 2015). For all their abstractions, many of their architectural features are biologically plausible and at least they pass the most basic test for a computational model of brain function: explaining a computational function (for reviews, see Kriegeskorte 2015; Yamins & DiCarlo 2015).

Figure 2: Shades of Bayes. The authors follow Kevin Murphy’s textbook in defining degrees of Bayesianity of inference, ranging from maximum likelihood estimation (top) to full Bayesian inference on parameters and hyperparameters (bottom). Above is my slightly modified version.

Comments on specific statements

“Following many new opportunities to generate digitized brain data, uncertainties about neurobiological phenomena henceforth required assessment in the statistical arena.”

Noise in the measurements, not uncertainties about neurobiological phenomena, created the need for statistical inference.

“Finally, it is currently debated whether increasingly used “deep” neural network algorithms with many non-linear hidden layers are more accurately viewed as parametric or non-parametric.”

This is an interesting point. Perhaps the distinction between nonparametric and parametric becomes unhelpful when a model with a finite, fixed, but huge number of parameters is tempered by flexible regularisation. It would be good to add a reference on where this is “debated”.

“neuroscientists often conceptualize behavioral tasks as recruiting multiple neural processes supported by multiple brain regions. This century-old notion (Walton and Paul, 1901) was lacking a formal mathematical model. The conceptual premise was recently encoded with a generative model (Yeo et al., 2015). Applying the model to 10,449 experiments across 83 behavioral tasks revealed heterogeneity in the degree of functional specialization within association cortices to execute diverse tasks by flexible brain regions integration across specialized networks (Bertolero et al., 2015a; Yeo et al., 2015).”

This is an example of a passage that is too dense and lacks the information required for a functional understanding of what was achieved here. The model somehow captures recruitment of multiple regions, but how? “Heterogeneity in the degree of functional specialisation”, i.e. not every region is functionally specialised to exactly the same degree, sounds both plausible and vacuous. I’m not coming away with any insight here.

“Moreover, generative approaches to fitting biological data have successfully reverse engineered i) human facial variation related to gender and ethnicity based on genetic information alone (Claes et al., 2014)”

Fitting a model doesn’t amount to reverse engineering.

“Finally, discriminative models may be less potent to characterize the neural mechanisms of information processing up to the ultimate goal of recovering subjective mental experience from brain recordings (Brodersen et al., 2011; Bzdok et al., 2016; Lake et al., 2015; Yanga et al., 2014).”

Is the ultimate goal to “recover mental experience”? What does that even mean? Do any of the cited studies attempt this?

“Bayesian inference is an appealing framework by its intimate relationship to properties of firing in neuronal populations (Ma et al., 2006) and the learning human mind (Lake et al., 2015).”

Uncompelling. If the brain and mind did not rely on Bayesian inference, Bayesian inference would be no less attractive for analysing data from the brain and mind.

“Parametric linear regression cannot grow more complex than a stiff plane (or hyperplane when more input dimensions Xn) as decision boundary, which entails big regions with identical predictions Y.”

The concept of decision boundary does not make sense in a regression setting.

“Typical discriminative models include linear regression, support vector machines, decision-tree algorithms, and logistic regression, while generative models include hidden Markov models, modern neural network algorithms, dictionary learning methods, and many non-parametric statistical models (Teh and Jordan, 2010).”

Linear regression is not inherently either discriminative or generative, nor are neural networks. A linear regression model is generative when it predicts the data (either in a within-sample framework, such as classical univariate activation-based brain mapping, or in an out-of-sample predictive framework, such as encoding models). It is discriminative when it takes the data as input to predict other variables of interest (e.g. stimulus properties in decoding, or subject covariates).

“Box 4: Null-hypothesis testing and out-of-sample prediction”

As discussed above, this seems to me a false dichotomy. We can perform out-of-sample predictions and test them with null-hypothesis significance testing (NHST). Moreover, Bayesian inference (the counterpart to NHST) can operate on a single data set. It makes more sense to me to contrast out-of-sample prediction versus within-sample explanation of variance.

Brain representations of animal videos are surprisingly stable across tasks and only subtly modulated by attention

[R7I7]

Nastase et al. (pp2016) presented video clips (duration: 2 s) to 12 human subjects during fMRI. In a given run, a subject performed one of two tasks: detecting repetitions of either the animal’s behaviour (eating, fighting, running, swimming) or the category of animal (primate, ungulate, bird, reptile, insect). They performed region-of-interest and searchlight-based pattern analyses. Results suggest that:

The animal behaviours are associated with clearly distinct patterns of activity in many regions, whereas different animal taxa are less discriminable overall. Within-animal-category representational dissimilarities (correlation distances) are about as large as between-animal-category representational dissimilarities, indicating little clustering by these (very high-level) animal categories. However, animal-category decoding is above chance in a number of visual regions and generalises across behaviours, indicating some degree of linear separability. For the behaviours, there is strong evidence for both category clustering and linear separability (decodability generalising across animal taxa).

Representations are remarkably stable across attentional tasks, but subtly modulated by attention in higher regions. There is some evidence for subtle attentional modulations, which (as expected) appear to enhance task-relevant sensory signals.

Overall, this is a beautifully designed experiment and the analyses are comprehensive and sophisticated. The interpretation in the paper focusses on the portion of the results that confirms the widely accepted idea that task-relevant signals are enhanced by attention. However, the stability of the representations across attentional tasks is substantial and deserves deeper analyses and interpretation.

screenshot1195

Spearman correlations between regional RDMs and a behaviour-category RDM (top) and an animal-category RDM (bottom). These correlations measure category clustering in the representation. Note (1) that clustering is strong for behaviours but weak for animal taxa, and (2) that modulations of category clustering are subtle, but significant in several regions, notably in the left postcentral sulcus (PCS) and ventral temporal (VT) cortex.

Strengths

The experiment is well motivated and well designed. The movie stimuli are naturalistic and likely to elicit vivid impressions and strong responses. The two attentional tasks are well chosen as both are quite natural. There are 80 stimuli in total: 5 taxa * 4 behaviours * 2 particular clips * 2 horizontally flipped versions. It’s impossible to control confounds perfectly with natural video clips, but this seems to strike quite a good balance between naturalism and richness of sampling and experimental control.

The analyses are well motivated, sophisticated, well designed, systematic and comprehensive. Analyses include both a priori ROIs (providing greater power through fewer tests) and continuous whole-brain maps of searchlight information (giving rich information about the distribution of information across the brain). Surface-based searchlight hyperalignment based on a separate functional dataset ensures good coarse-scale alignment between subjects (although detailed voxel pattern alignment is not required for RSA). The cortical parcellation based on RDM clustering is also an interesting feature. The combination of threshold-free cluster enhancement and searchlight RSA is novel, as far as I know, and a good idea.

Weaknesses

The current interpretation mainly confirms prevailing bias. The paper follows the widespread practice in cognitive neuroscience of looking to confirm expected effects. The abstract tells us what we already want to believe: that the representations are not purely stimulus driven, but modulated by attention and in a way that enhances the task-relevant distinctions. There is evidence for this in previous studies, for simple controlled stimuli, and in the present study, for more naturalistic stimuli. However, the stimulus, and not the task, explains the bulk of the variance. It would be good to engage the interesting wrinkles and novel information that this experiment could contribute, and to describe the overall stability and subtle task-flexibility in a balanced way.

Behavioural effects confounded with species: Subjects saw a chimpanzee eating a fruit, but they never saw that chimpanzee, or in fact any chimpanzee fighting. The videos showed different animal species in the primate category. Effects of the animal’s behaviour, thus, are confounded with species effects. There is no pure comparison between behaviours within the same species and/or individual animal. It’s impossible to control for everything, but the interpretation requires consideration of this confound, which might help explain the pronounced distinctness of clips showing different behaviours.

Asymmetry of specificity between behaviours and taxa: The behaviours were specific actions, which correspond to linguistically frequent action concepts (eating, fighting, running, swimming). However, the animal categories were very general (primate, ungulate, bird, reptile, insect), and within each animal category, there were different species (corresponding roughly to linguistically frequent basic-level noun concepts). The fact that the behavioural but not the animal categories corresponded to linguistically frequent concepts may help explain the lack of animal-category clustering.

Representational distances were measured with the correlation distance, creating ambiguity. Correlation distances are ambiguous. If they increase (e.g. for one task as compared to another) this could mean (1) the patterns are more discriminable (the desired interpretation), (2) the overall regional response (signal) was weaker, or (3) the noise was greater; or any combination of these. To avoid this ambiguity, a crossvalidated pattern dissimilarity estimator could be used, such as the LD-t (Kriegeskorte et al. 2007; Nili et al. 2014) or the crossnobis estimator (Walther et al. 2015; Diedrichsen et al. pp2016; Kriegeskorte & Diedrichsen 2016). These estimators are also more sensitive (Walther et al. 2015) because, like the Fisher linear discriminant, they benefit from optimal weighting of the evidence distributed across voxels and from noise cancellation between voxels. Like decoding accuracies, these estimators are crossvalidated, and therefore unbiased (in particular, the expected value of the distance estimate is zero under the null hypothesis that the patterns for two conditions are drawn from the same distribution). Unlike decoding accuracies, these distance estimators are continuous and nonsaturating, providing a more sensitive and undistorted characterisation of the representational geometry.
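To make the unbiasedness point concrete, here is a minimal sketch of a crossvalidated squared Euclidean distance in Python with NumPy. It corresponds to the crossnobis idea without the noise-whitening step. The function names, the simulated Gaussian data, and the choice of four partitions are assumptions for illustration only and are not taken from the cited papers.

```python
import numpy as np

def crossvalidated_distance(patterns_a, patterns_b):
    """Crossvalidated squared Euclidean distance between two conditions.

    patterns_a, patterns_b: arrays of shape (n_partitions, n_voxels), one
    pattern estimate per independent data partition (e.g. per run).
    Difference vectors from different partitions are multiplied, so that
    independent noise cancels in expectation.
    """
    diffs = np.asarray(patterns_a) - np.asarray(patterns_b)
    n_partitions, n_voxels = diffs.shape
    gram = diffs @ diffs.T                      # inner products between partitions
    off_diagonal = gram.sum() - np.trace(gram)  # drop same-partition products
    n_pairs = n_partitions * (n_partitions - 1)
    return off_diagonal / (n_pairs * n_voxels)

rng = np.random.default_rng(0)

def simulate(true_difference, n_sims=2000, n_partitions=4, n_voxels=50):
    """Average the crossvalidated and the naive estimate over simulated data."""
    cv, naive = [], []
    for _ in range(n_sims):
        a = true_difference + rng.standard_normal((n_partitions, n_voxels))
        b = rng.standard_normal((n_partitions, n_voxels))
        cv.append(crossvalidated_distance(a, b))
        # Naive estimate: squared distance between partition-averaged patterns.
        naive.append(np.mean((a.mean(axis=0) - b.mean(axis=0)) ** 2))
    return float(np.mean(cv)), float(np.mean(naive))

# Null case: both conditions have the same true pattern.
cv_null, naive_null = simulate(true_difference=0.0)
# Signal case: the true patterns differ by 1 in every voxel.
cv_signal, naive_signal = simulate(true_difference=1.0)
```

In this simulation the crossvalidated estimate averages close to zero under the null and close to the true squared difference when there is signal, whereas the naive estimate is inflated by the noise variance in both cases.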

Some statistical analysis details are missing or unclear. The analyses are complicated and not everything is fully and clearly described. In several places the paper states that permutation tests were used. This is often a good choice, but not a sufficient description of the procedure. What was the null hypothesis? What entities are exchangeable under that null hypothesis? What was permuted? What exactly was the test statistic? The contrasts and inferential thresholds could be more clearly indicated in the figures. I did not understand in detail how searchlight RSA and threshold-free cluster enhancement were combined and how map-level inference was implemented. A more detailed description should be added.

Spearman RDM correlation is not optimally interpretable. Spearman RDM correlation is used to compare the regional RDMs with categorical RDMs for either behavioural categories or animal taxa. Spearman correlation is not a good choice for model comparisons involving categorical models, because of the way it deals with ties, of which there are many in categorical model RDMs (Nili et al. 2014). This may not be an issue for comparing Spearman RDM correlations for a single category-model RDM between the two tasks. However, it is still a source of confusion. Since these model RDMs are binary, I suspect that Spearman and Pearson correlation are equivalent here. However, for either type of correlation coefficient, the resulting effect size depends on the proportion of large distances in the model matrix (the small, within-category distances make up 30 of 190 pairs for the taxonomy and 40 of 190 for the behavioural model). Although I think it is equivalent for the key statistical inferences here, analyses would be easier to understand and effect sizes more interpretable if differences between averages of dissimilarities were used.
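The pair counts and the suggested contrast can be sketched in a few lines of Python. With 5 taxa and 4 behaviours there are 20 conditions and 190 condition pairs, of which 30 are within-taxon and 40 are within-behaviour pairs. The function names and the 0/1 coding of the model are assumptions chosen for illustration.

```python
from itertools import combinations, product

taxa = ["primate", "ungulate", "bird", "reptile", "insect"]
behaviours = ["eating", "fighting", "running", "swimming"]
conditions = list(product(taxa, behaviours))            # 20 conditions
pairs = list(combinations(range(len(conditions)), 2))   # 190 condition pairs

def model_rdm(factor_index):
    """Binary model dissimilarities: 0 within category, 1 between categories."""
    return [int(conditions[i][factor_index] != conditions[j][factor_index])
            for i, j in pairs]

taxonomy_model = model_rdm(0)
behaviour_model = model_rdm(1)

n_within_taxon = taxonomy_model.count(0)
n_within_behaviour = behaviour_model.count(0)

def category_contrast(dissimilarities, model):
    """Mean between-category minus mean within-category dissimilarity."""
    between = [d for d, m in zip(dissimilarities, model) if m == 1]
    within = [d for d, m in zip(dissimilarities, model) if m == 0]
    return sum(between) / len(between) - sum(within) / len(within)

# Hypothetical regional RDM in which between-behaviour pairs are 0.2 larger.
toy_rdm = [0.5 + 0.2 * m for m in behaviour_model]
toy_contrast = category_contrast(toy_rdm, behaviour_model)
```

Unlike a correlation coefficient, this contrast is expressed in the units of the dissimilarity measure and does not depend on the proportion of within-category pairs in the model.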

Suggestions

In general, the paper is already at a high level, but the authors may consider making improvements addressing some of the weaknesses listed above in a revision. I have a few additional suggestions.

Open data: This is a very rich data set that cannot be fully analysed in a single paper. The positive impact on the field would be greatest if the data were made publicly available.

Structure and clarify the results section: The writing is good in general. However, the results section is a long list of complex analyses whose full motivation remains unclear in some places. Important details for basic interpretation of the results should be given before stating the results. It would be good to structure the results section according to clear claims. In each subsection, briefly state the hypothesis, how the analysis follows from the hypothesis, and what assumptions it depends on, before describing the results.

Compare regional RDMs between tasks without models: It would be useful to assess whether representational geometries change across tasks without relying on categorical model RDMs. To this end the regional RDMs (20×20 stimuli) could be compared between tasks. A good index to be computed for each subject would be the between-task RDM correlation minus the within-task RDM correlation (both across runs and matched to equalise the between-run temporal separation). Inference could use across-subject nonparametric methods (subject as random effect). This analysis would reveal the degree of stability of the representational geometry across tasks.
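One way this index and the across-subject inference might look is sketched below in plain Python. The sketch assumes one RDM vector per run, uses Pearson correlation, and omits the matching of between-run temporal separation. The function names and the exact sign-flip test are illustrative choices, not a description of any existing analysis.

```python
from itertools import combinations, product
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def task_stability_index(run_rdms, run_tasks):
    """Between-task minus within-task RDM correlation for one subject.

    run_rdms: list of RDM vectors (upper triangle), one per run.
    run_tasks: list of task labels, one per run.
    Values near zero mean the geometry is as similar between tasks as it
    is between different runs of the same task.
    """
    between, within = [], []
    for i, j in combinations(range(len(run_rdms)), 2):
        r = pearson(run_rdms[i], run_rdms[j])
        (within if run_tasks[i] == run_tasks[j] else between).append(r)
    return sum(between) / len(between) - sum(within) / len(within)

def sign_flip_p_value(subject_indices):
    """Exact two-sided sign-flip test of the group mean against zero.

    Null hypothesis: each subject's index is symmetrically distributed
    around zero, so its sign is exchangeable across subjects.
    """
    n = len(subject_indices)
    observed = abs(sum(subject_indices) / n)
    count = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * v for s, v in zip(signs, subject_indices)) / n)
        if flipped >= observed - 1e-12:
            count += 1
    return count / 2 ** n

# Hypothetical toy subject: two runs per task, geometry reversed between tasks.
toy_rdms = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1], [5, 3, 2, 1]]
toy_tasks = ["behaviour", "behaviour", "taxonomy", "taxonomy"]
toy_index = task_stability_index(toy_rdms, toy_tasks)
```

With 12 subjects the exact test enumerates 4096 sign assignments, which is cheap. A clearly negative group mean would indicate that the geometry changes with the task, and a mean near zero would indicate stability.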

Linear decoding generalising across tasks: It would be good to train linear decoders for behaviours and taxa in one task and test for generalisation to the other task (and simultaneously across the other factor).
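As a rough sketch of such an analysis, the following Python/NumPy code trains a behaviour decoder on one task and tests it on the other task, leaving out one taxon at a time so that the test taxon never appears in training. The simulated patterns, the noise levels, and the use of a nearest-centroid classifier (a simple linear decoder) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_taxa, n_behaviours, n_voxels = 5, 4, 30

# Hypothetical simulated signals: each condition pattern is the sum of a
# behaviour component and a taxon component.
behaviour_signal = rng.standard_normal((n_behaviours, n_voxels))
taxon_signal = 0.5 * rng.standard_normal((n_taxa, n_voxels))
task_offset = 0.3 * rng.standard_normal(n_voxels)  # additive task effect

def simulate_task(offset):
    """Return patterns of shape (n_taxa, n_behaviours, n_voxels) for one task."""
    signal = taxon_signal[:, None, :] + behaviour_signal[None, :, :]
    noise = 0.5 * rng.standard_normal(signal.shape)
    return signal + offset + noise

patterns_task_a = simulate_task(0.0)
patterns_task_b = simulate_task(task_offset)

def cross_task_behaviour_accuracy(train, test):
    """Decode behaviour across tasks and across taxa.

    Centroids are estimated from the training task using four taxa, and
    the decoder is tested on the held-out taxon in the other task.
    """
    n_correct, n_total = 0, 0
    for held_out in range(n_taxa):
        train_taxa = [t for t in range(n_taxa) if t != held_out]
        centroids = train[train_taxa].mean(axis=0)   # (n_behaviours, n_voxels)
        test_patterns = test[held_out]               # (n_behaviours, n_voxels)
        # Squared Euclidean distance from each test pattern to each centroid.
        dists = ((test_patterns[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        predicted = dists.argmin(axis=1)
        n_correct += int((predicted == np.arange(n_behaviours)).sum())
        n_total += n_behaviours
    return n_correct / n_total

accuracy = cross_task_behaviour_accuracy(patterns_task_a, patterns_task_b)
chance_level = 1 / n_behaviours
```

The same function can decode taxa across behaviours if the first two axes of the pattern arrays are swapped. Accuracy well above the chance level in both directions would support a representation that is stable across tasks.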

Independent definition of ROIs: Might the functionally driven parcellation of the cortex and ROI selection based on intersubject searchlight RDM reliability not bias the ROI analyses? It seems safer to use independently defined ROIs.

Task decoding: It would be interesting to see searchlight maps of task decodability. Training and test sets should always consist of different runs. One could assess generalisation to new runs and ideally also generalisation across behaviours and taxa (leaving out one animal category or one behavioural category from the training set).

Further investigate the more prominent distinctions among behaviours than among taxa: Is this explained by a visual similarity confound? Cross-decoding of behaviour between taxa sheds some light on this. However, it would be good also to analyse the videos with motion-energy models and look at the representational geometries in such models.

Additional specific comments and questions

Enhancement and collapse have not been independently measured. The abstract states: “Attending to animal taxonomy while viewing the same stimuli increased the discriminability of distributed animal category representations in ventral temporal cortex and collapsed behavioural information.” Similarly, on page 12, it says: “… accentuating task-relevant distinctions and reducing unattended distinctions.”
This description is intuitive, but it incorrectly suggests that the enhancement and collapse have been independently measured. This is not the case: It would require a third, e.g. passive-viewing condition. Results are equally consistent with the interpretation that attention just enhances the task-relevant distinctions (without collapsing anything). Conversely, the reverse might also be consistent with the results shown: that attention just collapses the task-irrelevant distinctions (without enhancing anything).

You explain in the results that this motivates the use of the term categoricity, but then don’t use that term consistently. Instead you describe it as separate effects, e.g. in the abstract.

The term categoricity may be problematic for other reasons. A better description might be “relative enhancement of task-relevant representational distinctions”. Results would be more interpretable if crossvalidated distances were used, because this would enable assessment of changes in discriminability. By contrast, larger correlation distance can also result from reduced responses or noisier data.

Map-inferential thresholds are not discernable: In Fig. 2, all locations with positively model-correlated RDMs are marked in red. The locations exceeding the map-inferential threshold are not visible because the colour scale uses red for below- and above-threshold values. The legend (not just the methods section) should also clarify whether multiple testing was corrected for and if so how. The Fig. 2 title “Effect of attention on local representation of behaviour and taxonomy” is not really appropriate, since the inferential results on that effect are in Fig. S3. Fig. S3 might deserve to be in the main paper, given that the title claim is about task effects.

Videos on YouTube: To interpret these results, one really has to watch the 20 movies. How about putting them on YouTube?

Previous work: The work of Peelen and Kastner and of Sigala and Logothetis on attentional dependence of visual representations should be cited and discussed.

Colour scale: The jet colour scale is not optimal in general and particularly confusing for the model RDMs. The category model RDMs for behaviour and taxa seem to contain zeroes along the diagonal, ones for within-category comparisons and twos for between-category comparisons. Is the diagonal excluded from the model? In that case the matrix is binary, but contains ones and twos instead of zeroes and ones. While this doesn’t matter for the correlations, it is a source of confusion for readers.
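That the coding does not matter for the correlations follows from the invariance of the correlation coefficient under shifting and positive scaling of either variable. A small Python check, with made-up model and neural dissimilarity values, is given below.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Off-diagonal model entries only (the diagonal is excluded).
model_zero_one = [0, 0, 1, 1, 1, 0, 1, 1]        # 0 within, 1 between
model_one_two = [m + 1 for m in model_zero_one]  # 1 within, 2 between

# Hypothetical neural dissimilarities for the same eight pairs.
neural = [0.2, 0.3, 0.9, 0.7, 0.8, 0.1, 0.6, 0.95]

r_zero_one = pearson(model_zero_one, neural)
r_one_two = pearson(model_one_two, neural)
```

Both codings yield the same coefficient here. If the zero-valued diagonal entries were included, the model would have three levels and the coefficient would in general change.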

Show RDMs: To make results easier to understand, why not show RDMs? Certain sets of values could be averaged for clarity.

Statistical details

“When considering all surviving searchlights for both attention tasks, the mean regression coefficient for the behavioural category target RDM increased significantly from 0.100 to 0.129 (p = .007, permutation test).”
Unclear: What procedure do these searchlights “survive”? Also: what is the null hypothesis? What is permuted? Are these subject RFX tests?

The linear mixed effects model of Spearman RDM correlations suggests differences between regions. However, given the different noise levels between regions, I’m not sure these results are conclusive (cf. Diedrichsen et al. 2011).

“To visualize attentional changes in representational geometry, we first computed 40 × 40 neural RDMs based on the 20 conditions for both attention tasks and averaged these across participants.”
Why is the 40×40 RDM (including, I understand, both tasks) ever considered? The between-task pattern comparisons are hard to interpret because they were measured in different runs (Henriksson et al. 2015; Alink et al. pp2015).

“Permutation tests revealed that attending to animal behaviour increased correlations between the observed neural RDM and the behavioural category target RDM in vPC/PM (p = .026), left PCS (p = .005), IPS (p = .011), and VT (p = .020).”
What was the null? What was permuted? Do these survive multiple testing correction? How many regions were analysed?

Fig. 3: Bootstrapped 95% confidence intervals. What was bootstrapped? Conditions?

page 14: Mean of searchlight regression coefficients – why only select those searchlights that survive TFCE in both attention conditions?

page 15: Parcellation of ROIs based on the behaviour attention data only. Why?

SI text: Might eye movements constitute a confound? (Free viewing during video clips)

“more unconfounded” -> “less confounded”

Acknowledgement
Thanks to Marieke Mur for discussing this paper with me and sharing her comments, which are included above.

— Nikolaus Kriegeskorte
