Different categorical divisions become prominent at different latencies in the human ventral visual representation



[Below is my secret peer review of Cichy, Pantazis & Oliva (2014). The review below applies to the version as originally submitted, not to the published version that the link refers to. Several of the concrete suggestions for improvements below were implemented in revision. Some of the more general remarks on results and methodology remain relevant and will require further studies to completely resolve. For a brief summary of the methods and results of this paper, see Mur & Kriegeskorte (2014).]

This paper describes an excellent project, in which Cichy et al. analyse the representational dynamics of object vision using human MEG and fMRI on a set of 96 object images whose representation in cortex has previously been studied in monkeys and humans. The previous studies provide a useful starting point for this project. However, the use of MEG in humans and the combination of MEG and fMRI enables the authors to characterise the emergence of categorical divisions at a level of detail that has not previously been achieved. The general approaches of MEG-decoding and MEG-RSA pioneered by Thomas Carlson et al. (2013) are taken to new level here by using a richer set of stimuli (Kiani et al. 2007; Kriegeskorte et al. 2008). The experiment is well designed and executed, and the general approach to analysis is well-motivated and sound. The results are potentially of interest to a broad audience of neuroscientists. However, the current analyses lack some essential inferential components that are necessary to give us full confidence in the results, and I have some additional concerns that should be addressed in a major revision as detailed below.



(1) Confidence-intervals and inference for decoding-accuracy, RDM-correlation time courses and peak-latency effects

Several key inferences depend on comparing decoding accuracies or RDM correlations as a function of time, but the reliability of these estimates is not assessed. The paper also currently gives no indication of the reliability of the peak latency estimates. Latency comparisons are not supported by statistical inference. This makes it difficult to draw firm conclusions. While the descriptive analyses presented are very interesting and I suspect that most of the effects the authors highlight are real, it would be good to have statistical evidence for the claims. For example, I am not confident that the animate-inanimate category division peaks at 285 ms. This peak is quite small and on top of a plateau. Moreover, the time the category clustering index reaches the plateau (140 ms) appears more important. However, interpretation of this feature of the time course, as well, would require some indication of the reliability of the estimate.

I am also not confident that the RDM-correlation between the MEG and V1-fMRI data really has a significantly earlier peak than the RDM-correlation between the MEG and IT-fMRI data. This confirms our expectations, but it is not a key result. Things might be more complicated. I would rather see unexpected result of a solid analysis than an expected result of an unreliable analysis.

Ideally, adding 7 more subjects would allow random effects analyses. All time courses could then be presented with error margins (across subjects, supporting inference to the population by treating subjects as a random-effect dimension). This would also lend additional power to the fixed-effects inferential analyses.

However, if the cost of adding 7 subjects is considered too great, I suggest extending the approach of bootstrap resampling of the image set. This would provide reliability estimates (confidence intervals) for all accuracy estimates and peak latencies and support testing peak-latency differences. Importantly, the bootstrap resampling would simulate the variability of the estimates across different stimulus samples (from a hypothetical population of isolated object images of the categories used here). It would, thus, provide some confidence that the results are not dependent on the image set. Bootstrap resampling each category separately would ensure that all categories are equally represented in each resampling.

In addition, I suggest enlarging the temporal sliding window in order to stabilise the time courses, which look a little wiggly and might give unstable estimates of magnitudes and latencies across bootstrap samples otherwise – e.g. the 285 ms animate-inanimate discrimination peak. This will smooth the time courses appropriately and increase the power. A simple approach would be to use a bigger time steps as well, e.g. 10- or 20-ms bins. This would provide more power in Bonferroni correction across time. Alternatively, the false-discovery rate could be used to control false positives. This would work equally well for overlapping temporal windows (e.g. 20-ms window, 1 ms steps).


(2) Testing linear separability of categories

The present version of the analyses uses averages of pairwise stimulus decoding accuracies. The decoding accuracies serve as measures of representational discriminability (a particular representational distance measure). This is fine and interpretable. The average between minus the average within discriminability is a measure of clustering, which is a stronger result in a sense than linear decodability. However, it would be good to see is linear decoding of each category division reveals additional or earlier effects. While your clustering index essentially implies linear separability, the opposite is not true. For example, two category regions arranged like stacked pancakes could be perfectly linearly discriminable while having no significant clustering (i.e. difference between the within and between category discriminabilities of image pairs). Like this:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Each number indexes a category and each repetition represents an exemplar. The two lines illustrate the pancake situation. If the vertical separation of the pancakes is negligible, they are perfectly linearly discriminable, despite a negligible difference between the average within and the average between distance. It would be interesting to see these linear decoding analyses performed using either indendent response measurements for the same stimuli or an independent (held out) set of stimuli as the test set. This would more profoundly address different aspects of the representational geometry.

For the same images in the test set, pairwise discriminability in a high-dimensional space strongly suggests that any category dichotomy can be linearly decoded. Intuitively, we might expect the classifier to generalise well to the extent that the categories cluster (within distances < between distances) – but this need not be the case (e.g. the pancake scenario might also afford good generalization to new exemplars).


(3) Circularity of peak-categorical MDS arrangements

The peak MDS plots are circular in that they serve to illustrate exactly the effect (e.g. animate-inanimate separation) that the time point has been chosen to maximise. This circularity could easily be removed by selecting the time point for each subject based on the other subjects’ data. The accuracy matrices for the selected time points could then be averaged across subjects for MDS.


(4) Test individual-level match of MEG and fMRI

It is great that fMRI and MEG data was acquired in the same participants. This suggests an analysis of the consistent reflection of individual idiosyncrasies in object processing in fMRI and MEG. One way to investigate this would be to correlate single-subject RDMs between MEG and fMRI, within and between subjects. If the within-subject MEG-fMRI RDM correlation is greater (at any time point), then MEG and fMRI consistently reflect individual differences in object processing.



Why average trials? Decoding accuracy means nothing then. Single-trial decoding accuracy and information in bits would be interesting to see and useful to compare to later studies.

The stimulus-label randomisation test for category clustering (avg(between)-avg(within)) is fine. However, the bootstrap test as currently implemented might be problematic.

“Under the null hypothesis, drawing samples with replacement from D0 and calculating their mean decoding accuracy daDo, daempirial should be comparable to daD0. Thus, assuming that D has N labels (e.g. 92 labels for the whole matrix, or 48 for animate objects), we draw N samples with replacement and compute the mean daD0 of the drawn samples.”

I understand the motivation for this procedure and my intuition is that this test is likely to work, so this is a minor point. However, subtracting the mean empirical decoding accuracy might not be a valid way of simulating the null hypothesis. Accuracy is a bounded measure and its distribution is likely to be wider under the null than under H1. The test is likely to be valid, because under H0 the simulation will be approximately correct. However, to test if some statistic significantly exceeds some fixed value by bootstrapping, you don’t need to simulate the null hypothesis. Instead, you simulate the variability of the estimate and obtain a 95%-confidence interval by bootstrapping. If the fixed value falls outside the interval (which is only going to happen in about 5% of the cases under H0), then the difference is significant. This seems to me a more straightforward and conventional test and thus preferable. (Note that this uses the opposite tail of the distribution and is not equivalent here because the distribution might not be symmetrical.)

Fig. 1: Could use two example images to illustrate the pairwise classification.

It might be good here to see real data in the RDM on the upper right (for one time point) to illustrate.

“non-metric scaling (MDS)”: add multidimensional

Why non-metric? Metric MDS might more accurately represent the relative representational distances.

“The results (Fig. 2a, middle) indicate that information about animacy arises early in the object processing chain, being significant at 140 ms, with a peak at 285 ms.”
I would call that late – relative to the rise of exemplar discriminability.

“We set the confidence interval at p=”
This is not the confidence interval, this is the p value.

“To our knowledge, we provide here the first multivariate (content-based) analysis of face- and body-specific information in EEG/MEG.”

The cited work by Carlson et al. (2012) shows very similar results – but for fewer images.

“Similarly, it has not been shown previously that a content-selective investigation of the modulation of visual EEG/MEG signals by cognitive factors beyond a few everyday categories is possible24,26.”
Carlson et al. (2011, 2012; refs 22,23) do show similar results.

I don’t understand what you mean by “cognitive factors” here.

Fig S3d
What are called “percentiles”, are not percentiles: multiply by 100.

“For example, a description of the temporal dynamics of face representation in humans might be possible with a large and rich parametrically modulated stimulus set as used in monkey electrophysiology 44.”
Should cite Freiwald (43), not Shepard (44) here.



Although the overall argument is very compelling, there were a number of places in the manuscript where I came across weaknesses of logic, style, and grammar. The paper also had quite a lot of typos. I list some of these below, to illustrate, but I think the rest of the text could use more work as well to improve precision and style.

One stylistic issue is that the paper switches between present and past tense without proper motivation (general statement versus report of procedures used in this study).

Abstract: “individual image classification is resolved within 100ms”

Who’s doing the classification here? The brain or the researchers? Also: discriminating two exemplars is not classification (except in a pattern-classifier sense). So this is not a good way to state this. I’d say individual images are discriminated by visual representations within 100 ms.

“Thus, to gain a detailed picture of the brain’s [] in object recognition it is necessary to combine information about where and when in the human brain information is processed.”

Some phrase missing there.

“Characterizing both the spatial location and temporal dynamics of object processing

demands innovation in neuroimaging techniques”

A location doesn’t require characterisation, only specification. But object processing does not take place in one location. (Obviously, the reader may guess what you mean — but you’re not exactly rewarding him or her for reading closely here.)

“In this study, using a similarity space that is common to both MEG and fMRI,”
Style. I don’t know how a similarity space can be common to two modalities. (Again, the meaning is clear, but please state it clearly nevertheless.)

“…and show that human MEG responses to object correlate with the patterns of neuronal spiking in monkey IT”

“1) What is the time course of object processing at different levels of categorization?”
Does object processing at a given level of categorisation have a single time course? If not, then this doesn’t make sense.

“2) What is the relation between spatially and temporally resolved brain responses in a content-selective manner?”
This is a bit vague.

“The results of the classification (% decoding accuracy, where 50% is chance) are stored in a 92 × 92 matrix, indexed by the 92 conditions/images images.”
images repeated

“Can we decode from MEG signals the time course at which the brain processes individual object images?”
Grammatically, you can say that the brain processes images “with” a time course, not “at” a time course. In terms of content, I don’t know what it means to say that the brain processes image X with time course Y. One neuron or region might respond to image X with time course Y. The information might rise and fall according to time course Y. Please say exactly what you mean.
“A third peek at ~592ms possibly indicates an offset-response”
Peek? Peak.

“Thus, multivariate analysis of MEG signals reveal the temporal dynamics of visual content processing in the brain even for single images.”

“This initial result allows to further investigate the time course at which information about membership of objects at different levels of categorization is decoded, i.e. when the subordinate, basic and superordinate category levels emerge.”
Unclear here what decoding means. Are you suggesting that the categories are encoded in the images and the brain decodes them? And you can tell when this happens by decoding? This is all very confusing. 😉

“Can we determine from MEG signals the time course at which information about membership of an image to superordinate-categories (animacy and naturalness) emerges in the brain?”
Should say “time course *with* which”. However, all the information is there in the retina. So it doesn’t really make sense to say that the information emerges with a time course. What is happening is that category membership becomes linearly decodable, and thus in a sense explicit, according to this time course.

“If there is information about animacy, it should be mirrored in more decoding accuracy for comparisons between the animate and inanimate division than within the average of the animate and inanimate division.”

more decoding accuracy -> greater decoding accuracy

“within the average” Bad phrasing.

“utilizing the same date set in monkey electrophysiology and human MRI”
–> stimulus set

“Boot-strapping labels tests significance against chance”
Determining significance is by definition a comparison of the effect estimate against what would be expected by chance. It therefore doesn’t make sense to “test significance against chance”.

“corresponding to a corresponding to a p=2.975e-5 for each tail”
Redundant repetition.

“Given thus [?] a fMRI dissimilarity matrices [grammar] for human V1 and IT each, we calculate their similarity (Spearman rank-order correlation) to the MEG decoding accuracy matrices over time, yielding 2nd order relations (Fig. 4b).”

“We recorded brain responses with fMRI to the same set of object images used in the MEG study *and the same participants*, adapting the stimulation paradigm to the specifics of fMRI (Supplementary Fig. 1b).”

The effect is significant in IT (p<0.001), but not in V1 (p=0.04). Importantly, the effect is significantly larger in IT than in V1 (p<0.001).

p=0.04 is also significant, isn’t it? This is very similar to Kriegeskorte et al. (2008, Fig. 5A), where the animacy effect was also very small, but significant in V1.

“boarder” -> “border”


Using performance-driven deep learning models to understand sensory cortex

In a new perspective piece in Nature Neuroscience, Yamins & Dicarlo (2016) discuss the emerging methodology and initial results in the literature of using deep neural nets with millions of parameters optimised for task performance to explain representations in sensory cortical areas. These are important developments. The authors explain the approach very well, also covering the historical progression toward it and its future potential.  Here are the key features of the approach as outlined by the authors.

(1) Complex models with multiple stages of nonlinear transformation from stimulus to response are used to explain high-level brain representations. The models are “stimulus computable” in the sense of fully specifying the computations from a physical description of the stimulus to the brain responses (avoiding the use of labels or judgments provided by humans).

(2) The models are neurobiologically plausible and “mappable”, in the sense that their components are thought to be implemented in specific locations in the brain. However, the models abstract from many biological details (e.g. spiking, in the reviewed studies).

(3) The parameters defining a model are specified by optimising the model’s performance at a task (e.g. object recognition). This is essential because deep models have millions of parameters, orders of magnitude too many to be constrained by the amounts of brain-activity data that can be acquired in a typical current study.

(4) Brain-activity data may additionally be used to define affine transformations of the model representations, so as to (a) fine-tune the model to explain the brain representations and/or (b) define the relationship between model units and measured brain responses in a particular individual.

(5) The resulting model is tested by evaluating the accuracy with which it predicts the representation of a set of stimuli not used in fitting the model. Prediction accuracy can be assessed at different levels of description:

  1. as the accuracy of prediction of a stimulus-response matrix,
  2. as the accuracy of prediction of a representational dissimilarity matrix, or
  3. as the accuracy of prediction of task-information decodability (i.e. are the decoding accuracies for a set of stimulus dichotomies correlated between model and neural population?).

A key insight is that the neural-predictive success of the models derives from combining constraints on architecture and function.

  • Architecture: Neuroanatomical and neurophysiological findings suggest (a) that units should minimally be able to compute linear combinations followed by static nonlinearities and (b) that their network architecture should be deep with rich multivariate representations at each stage. 
  • Function: Biological recognition performance and informal characterisations of high-level neuronal response properties suggest that the network should perform a transformation that retains high-level sensory information, but also emphasises behaviourally relevant categories and semantic dimensions. Large sets of labelled stimuli provide a constraint on the function to be computed in the form of a lookup table.

Bringing these constraints together has turned out to enable the identification of models that predict neural responses throughout the visual hierarchies better than any other currently available models. The models, thus, generalise not just to novel stimuli (Yamins et al. 2014; Khaligh-Razavi & Kriegeskorte 2014; Cadieu et al. 2014), but also from the constraints imposed on the mapping (e.g. mapping images to high-level categories) to intermediate-level representational stages (Güçlü & van Gerven 2015; Seibert et al. PP2016). Similar results are beginning to emerge for auditory representations.

The paper contains a useful future outlook, which is organised into sections considering improvements to each of the three components of the approach:

  • model architecture: How deep, what filter sizes, what nonlinearities? What pooling and local normalisation operations?
  • goal definition: What task performance is optimised to determine the parameters?
  • learning algorithm: Can learning algorithms more biologically plausible than backpropagation and potentially combining unsupervised and supervised learning be used?

In exploring alternative architectures, goals, and learning algorithms, we need to be guided by the known neurobiology and by the computational goals of the system (ultimately the organism’s survival and reproduction). The recent progress with neural networks in engineering provides the toolbox for combining neurobiologically plausible components and setting their parameters in a way that supports task performance. Alternative architectures, goals, and learning algorithms will be judged by their ability to predict neural representations of novel stimuli and biological behaviour.

The final section reflects on the fact that the feedfoward deep convolutional models currently very successful in this area only explain the feedforward component of sensory processing. Recurrent neural net models, which are also rapidly conquering increasingly complex tasks in engineering applications, promise to address these limitations of the initial studies using deep nets to explain sensory brain representations.

This perspective paper will be of interest to a broad audience of neuroscientists not themselves working with complex computational models, who are looking for a big-picture motivation of the approach and review of the most important findings. It will also be of interest to practitioners of this new approach, who will value the historical review and the careful motivation of each of the components of the methodology.


Do view-invariant brain representations of actions arise within 200 ms of viewing?


Humans can rapidly visually recognise the actions people around them are engaged in and this ability is important for successful behaviour and social interaction. Isik et al. presented human subjects with 2-second video clips of humans performing actions while measuring brain activity with MEG. The clips comprised 5 actions (walk, run, jump, eat, drink) performed by each of five different actors and video-recorded from each of five different views (only frontal and profile used in MEG). Results show that action can be decoded from MEG signals arising about 200 ms after the onset of the video, with decoding accuracy peaking after about 500 ms and then decaying while the stimulus is still on, with a rebound after stimulus offset. Moreover, decoders generalise across actors and views. The authors conclude that the brain rapidly computes a representation that is invariant to view and actor.


Figure from the paper. Legend from the paper with my modifications in brackets: [Accuracy of action decoding (%) from MEG data as a function of time after video onset]. We can decode [which of five actions was being performed in the video clip] by training and testing on the same view (‘within-view’ condition), or, to test viewpoint invariance, training on one view (0 degrees [frontal, I think, but this should be clarified] or 90 degrees [profile]) and testing on the second view (‘across view’ condition). Results are each [sic] from the average of eight different subjects. Error bars represent standard deviation [across what?]. Horizontal line indicates chance decoding accuracy. […] Lines at the bottom of plot indicate significance with p<0.01 permutation test, with the thickness of the line indicating [for how many of the 8 subjects decoding was significant]. [Note the significant offset response after the 2-s video (whose duration should be indicated by a stimulus bar).]


The rapid view-invariant action decoding is really quite amazing. It would be good to see more detailed analyses to assess the nature of the signals enabling this decoding feat. Of course, 200 ms already allows for recurrent computations and the decodability peak is at 500 ms, so this is not strong evidence for a pure feedforward account.

The generalisation across actors is less surprising. This was a very controlled data set. Despite some variation in the appearance of the actors, it seems plausible that there would be some clustering of the vectors of pixels across space and time (or of features of a low-level visual representation) corresponding to different actors performing the same action seen from the same angle.

In separate experiments, the authors used static single frames taken from the videos and dynamic point-light figures as stimuli. These reduced form-only and motion-only stimuli were associated with diminished separation of actions in the human brain and in model representations, and with diminished human action recognition, suggesting that form and motion information are both essential to action recognition.

I’m wondering about the role of task-related priors. Subjects were performing an action recognition task on this controlled set of brief clips during MEG while freely viewing the clips (though this is not currently clearly stated). This task is likely to give rise to strong prior expectations about the stimulus (0 deg or 90 deg, one of five actions, known scale and positions of key features for action discrimination). Primed to attend to particular diagnostic features and to fixate in certain positions, the brain will configure itself for rapid dynamic discrimination among the five possible actions. The authors present a group-average analysis of eye movements, suggesting that these do not provide as much information about the actions as the MEG signal. However, the low-dimensional nature of the task space is in contrast to natural conditions, where a wider variety of actions can be observed and view, actor, size, and background vary more. The precise prior expectations might contribute to the rapid discriminability of the actions in the brain signals.

The authors model the results in the framework of feedforward processing in a Hubel-and-Wiesel/Poggio-style model that alternates convolution and max-pooling to simulate responses resembling simple and complex cells, respectively. This model is extended here to process video using spatiotemporal filter templates. The first layer uses Gabor filters, higher layers use templates in the first layer matching video clips in the stimulus set. The authors argue that this model supports invariant decoding and largely accounts for the MEG results.

Like the subjects, the model is set up to process the restricted stimulus space. The internal features of the model were constructed using representational fragments from samples from the same parametric space of videos. The exact videos used to test the models were not used for constructing the feature set. However, if I understood correctly, videos from the same restricted space (5 actions, 5 actors, 5 views) were used. Whether the model can be taken to explain (at a high level of abstraction) the computations performed by the brain depends critically on the degree to which the model is not specifically constructed for the (necessarily very limited) 5-action controlled stimulus space used in the study.

As the authors note, humans still outperform computer vision models at action recognition. How does the authors’ own model perform on less controlled action videos? If it the model cannot perform the task on real-world sensory input, can we be confident that it captures the way that the human brain performs the task? This is a concern in many studies and not trivial to address. However, the interpretation of the results should engage this issue.



  • Controlled stimulus set: The set of video stimuli (5 actions x 5 actors x 5 views x 26 clips = 3250 2-sec clips) is impressive. Assembling this set is an important contribution to the field. The set is condition-rich (compared to typical stimulus sets used in cognitive neuroscience studies) and seems to strike a good balance between control and realism. This set could be a driver of progress if it were to be used in multiple modelling and empirical studies.
  • Combination of brain-activity measurements and a simple computational model, which provides a useful starting point for modelling the recognition of dynamic actions, as it is minimal and standard in many respects: a feedforward model in the HMAX framework, extended from spatial to spatiotemporal filters.



  • Controlled stimulus set: The set of video stimuli is very restricted compared to real-world action recognition. For the brain data, this means that subjects might have rapidly formed priors about the stimuli, enabling them to configure their visual systems (e.g. attentional templates, fixation targets) for more rapid recognition of the 5 actions than is possible in real-world settings. This limitation is shared with many studies in our field and difficult to overcome without giving up control (which is a strength, see above). I therefore suggest addressing this problem in the discussion.
  • The model uses features based on spatiotemporal patterns sampled from the same restricted stimulus space. Although non-identical clips were used, the videos underlying the representational space appear to share a lot with the experimental stimuli (same 5 actions, same 5 views, same background?, same actors?). I would therefore not expect this model to work well on arbitrary real-world action video clips. This is in contrast to recent studies using deep convolutional neural nets (e.g. Khaligh-Razavi & Kriegeskorte 2014), where the models were trained without any information about the (necessarily restricted) brain-experimental stimulus set and can perform recognition under real-world conditions.
  • Only one model (in two variants) is tested. In order to learn about computational mechanism, it would be good to test more models.
  • MEG data were acquired during viewing of only 50 of the clips (5 actions x 5 actors x 2 views).
  • Missing inferential analyses: While the authors employ inferential analyses in single subjects and report number of significant subjects, few hypotheses of interest are formally statistically tested. The effects interpreted appear quite strong, so the results described above appear solid nevertheless (interpretational caveats notwithstanding).


Overall evaluation

This is an ambitious study describing results of a well-designed experiment using a stimulus set that is a major step forward. The results are an interesting and substantial contribution to the literature. However, the analyses could be much richer than they currently are and the interpretation of the results is not straightforward. Stimulus-set-induced priors may have affected both the neural processing measured and the model (which used templates from stimuli within the controlled video set). Results should be interpreted more cautiously in this context.

Although feedforward processing is an important part of the story, it is not the whole story. Recurrent signal flow is ubiquitous in the brain and essential to brain function. In engineering, similarly, recurrent neural networks are beginning to dominate spatiotemporal processing challenges such as speech and video recognition. The fact that the MEG data are presented as time courses, revealing a rich temporal structure, and the model analyses are bar graphs illustrates the key limitation of the model.

It would be great to extend the analyses to reveal a richer picture of the temporal dynamics. This should include an analysis of the extent to which each model layer can explain the representational geometry at each latency from stimulus onset.


Future directions

In revision or future studies, this line of work could be extended in a number of ways:

  • Use multiple models that can handle real-world action videos. The authors’ controlled video set is extremely useful for testing human and model representations, and for comparing humans to models. However, to be able to draw strong conclusions, the models, like the humans, would have to be trained to recognise human actions under real-world conditions (unrestricted natural video). In addition, it would be good to compare the biological representational dynamics to both feedforward and recurrent computational models.
  • To overcome the problem of stimulus-set related priors, which make it difficult to compare representational dynamics measured for restricted stimulus sets to real-world recognition in biological brains, one could present a large set of stimuli without ever presenting a stimulus twice to the same subject. Would the action category still be decodable at 150 ms with generalisation across views? Would a feedforward computer vision model trained on real-world action videos be able to predict the representational dynamics?
  • The MEG analyses could use source reconstruction to enable separate analyses of the representational dynamics in different brain regions.
  • It would be useful to have MEG data for the full stimulus set of 5 actions x 5 actors x 5 viewpoints = 125 conditions. The representational geometries could be analysed in detail to reveal which particular action pairs become discriminable when with what level of invariance.



Particular suggestions for improvements of this paper

(1) Present more detailed results

It would be good to see results separately for each pair of actions and each direction of crossdecoding (0 deg training -> 90 deg testing, and 90 deg training -> 0 deg testing). Regarding the former, eating and drinking involve very similar body postures and motions. Is this reflected in the discriminability of these actions?

Regarding, the decoding generalisation across views, you state:

“We decoded by training only on one view (0 degrees or 90 degrees), and testing on a second view (0 degrees or 90 degrees).”

Was the training set exclusively composed on 0 degree (frontal?) and the test set exclusively of 90 degree (side view?), and vice versa? In case the test set contained instances of both views (though of course, not for the same actor and action), results are more difficult to interpret.


(2) Discuss the caveats to the current interpretation of the results

Discuss the question whether priors resulting from subjects understanding of the restricted stimulus set might have affected the processing of the stimuli. Consider the involvement of recurrent computations within 200 ms and discuss the continuing rise of decodability until 500 ms. Discuss the possibility that the model will not generalise to action recognition in the wild.


(3) Test several control models

Can Gabor, HMAX, and deep convolutional neural net models support similarly invariant action decoding? These models are relatively easy to test, so I think it’s worth considering this for revision. Computer vision models trained on dynamic action recognition could be left to future studies.


(4) Test models by comparison of its representations with the brain representations

The computational model is currently only compared to the data at the very abstract level of decoding accuracy. Can the model predict the representations and representational dynamics in detail? It might be difficult to use the model to predict the measured channels. This would require the fitting of a linear model predicting the measured channels from the model units and the MEG data (acquired for only 5 actions x 5 actors x 2 views = 50 conditions) might be insufficient. However, representational dynamics could be investigated in the framework of representational similarity analysis (50 x 50 representational dissimilarity matrices) following Carlson et al. (2013) and Cichy et al. (2014). Note that this approach does not require fitting a prediction model and so appears applicable here. Either approach would reveal the dynamic prediction of the feedforward model (given dynamic inputs) and where its prediction diverges from the more complex and recurrent processes in the brain. This would promise to give us a richer and less purely confirmatory picture of the data and might show the merits and limitations of a feedforward account.


(5) Perform temporal cross-decoding

Temporal crossdecoding (Carlson et al. 2013, Cichy et al. 2014) could be used to more richly characterise the representational dynamics. This would reveal whether representations stabilise in certain time windows, or keep tumbling through the representational space even as stimuli are continuously decodable.


(6) Improve the inferential analyses

I don’t really understand the inference procedure in detail from the description in the methods section.

“We recorded the peak decoding accuracy for each time bin,…”

What is the peak decoding accuracy for each time bin? Is this the maximum accuracy across subjects for each time bin?

“…and used the null distribution of peak accuracies to select a threshold where decoding results performing above all points in the null distribution for the corresponding time point were deemed significant with P < 0.01 (1/100).”

I’m confused after reading this, because I don’t understand what is meant by “peak”.

The inference procedure for the decoding-accuracy time courses seems to lack formal multiple-testing correction across time points. Given enough subjects, inference could be performed with subject as a random effect. Alternatively, fixed-effects inference could be performed by permutation, averaging across subjects. Multiple testing across latencies should be formally corrected for. A simple way to do this is to relabel the experimental events once, compute an entire decoding time course, and record the peak decoding accuracy across time (or if this is what was done, it should be clearly described). Through repeated permutations, a null distribution of peak accuracies can be constructed and a threshold selected that is exceeded anywhere under H0 with only 5% probability, thus controlling the familywise error rate at 5%. This threshold could be shown as a line or as the upper edge of a transparent rectangle that partially obscures the insignificant part of the curve.

For each inferential analysis, please describe exactly what the null hypothesis was, what event-labels are exchangeable under this null hypothesis, and how the null distribution was computed. Also, explain how the permutation test interacted with the crossvalidation procedure. The crossvalidation should ideally generalise to new stimuli and label permutation be wrapped around this entire procedure.

“Decoding analysis was performed using cross validation, where the classifier was trained on a randomly selected subset of 80% of data for each stimulus and tested on the held out 20%, to assess the classifier’s decoding accuracy.”

Does this apply only to the within-view decoding? In the critical decoding analysis with generalisation across views, it cannot have been 20% of the data in the held-out set, since 0-deg views were used for training and 90-deg views for testing (and vice versa). If only 50% of the data were used for training there, why didn’t performance suffer given the smaller training set compared to the within-view decoding?

It would also be good to have estimates and inferential comparisons of the onset and peak latencies of the decoding time courses. Inference could be performed on a set of single-subject latency differences between two conditions modelling subject as a random effect.


(7) Qualify claims about biological fidelity of the model

The model is not really “designed to closely mimic the biology of the visual system”, rather its architecture is inspired by some of the features of the feedforward component of the visual hierarchy, such as local receptive fields of increasing size across a hierarchy of representations.


(8) Open stimuli and data

This work would be especially useful to the community if the video stimuli and the MEG data were made openly available. To fully interpret the findings, it would also be good to be able to view the movie clips online.


(9) Further clarify the title

The title “Fast, invariant representation for human action in the visual system” is somewhat unclear. What is meant are representations of perceived human actions, not representations for action. “Fast, invariant representation of visually perceived human actions” would be better, for example.


(10) Clarify what stimuli MEG data were acquired for

The abstract states “We use magnetoencephalography (MEG) decoding and a computational model to study action recognition from a novel dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by five actors at five viewpoints.” This suggests that MEG data were acquired for all these conditions. The methods section clarifies that MEG data were only recorded for 50 conditions (5 actions x 5 actors x 2 views). Here and in the legend of Fig. 1, it would be better to use the term “stimulus set” in place of “data set”.


(11) Clarify whether subjects were fixating or free viewing

Were subjects free viewing or fixating? This should be explicitly stated and the choice motivated in either case.


(12) Make figures more accessible

The figures are not optimal. Every panel showing decoding results should be clearly labelled to state what variables the crossvalidation procedure tested for generalisations across. For example, a label (in the figure itself!) could be: “decoding brain representations of actions with invariance to actor and view”. The reader shouldn’t have to search in the legend to find this essential information. Also every figure should have a stimulus bar depicting the period of stimulus presence. This is important especially to assess stimulus-offset-related effects, which appear to be present and significant.

Fig. 3 is great. I think it would be clearer to replace “space-time dot product” with “space-time convolution”.


(13) Clarify what the error bars represent

“Error bars represent standard deviation.”

Is this the standard deviation across the 8 subjects? Is it really the standard deviation or the standard error?


 (14) Clarify what we learn from the comparison between the structured and the unstructured model

For the unstructured model, won’t the machine learning classifier learn to select random combinations that tend to pool across different views of one action? This would render the resulting system computationally similar.




Imagining and seeing objects elicits consistent category-average activity patterns in the ventral stream


Horikawa and Kamitani report results of a conceptually beautiful and technically sophisticated study decoding the category of imagined objects. They trained linear models to decode visual image features from fMRI voxel patterns. The visual features are computed from images by computational models including GIST and the AlexNet deep convolutional neural net. AlexNet provides features spanning the range from visual to semantic. A subject is then scanned while imagining images from a novel object category (not used in training the fMRI decoder). The decoder is used to predict the computational-model representation for the imagined category (averaged across exemplars of that category). This predicted model representation is then compared to the actual model representation for many categories, including the imagined one. The model representation predicted from fMRI during imagery is shown to be significantly more similar to the model representation of images from the imagined category than to the model representation of images from other categories.


Figure from Horikawa & Kamitani (2015)

The methods are sophisticated and will give experts much to think about and draw from in developing better decoders. Comprehensive supplementary analyses, which I did not have time to fully review, complement and extend the thorough analyses provided. This is a great study. As usual in our field, a difficult question is what exactly it means for brain computational theory.

A few results that might speak to the computational mechanism of the ventral stream are as follows.

When predicting computational features of *single images* (which was only done for seen, not for imagined objects):

  • Lower layers of AlexNet are better predicted from voxels in lower ventral-stream areas.
  • Higher layers of AlexNet are better predicted from voxels in higher ventral-stream areas.
  • GIST features are best predicted from V1-3, but also significantly from higher areas.

This is consistent with the recent findings (Yamins, Khaligh-Razavi, Cadieu, Guclu) showing that deep convolutional neural nets explain lower- and higher-level ventral-stream areas with a rough correspondence of lower model layers to lower brain areas and higher model layers to higher brain areas. It is also consistent with previous findings that GIST, like many visual feature models, explains significant representational variance even in the higher ventral-stream representation (Khaligh-Razavi, Rice), but does not reach the noise ceiling (indicating that a data set is fully explained), as deep neural net models do (Khaligh-Razavi).

When predicting *category-averages* of computational features (which was done for seen and imagined objects):

  • Higher-level visual areas better predict features in all layers of AlexNet.
  • Higher layers of AlexNet are better predicted from voxels in all visual areas.

This is confusing, until we remember that it is category averages that are being predicted. Category averaging will retain a major portion of the representational variance of category-sensitive higher-level representations, while reducing the representational variance of low-level representations that are less related to categories. This may boost both predictions from category-related visual areas, as well as predictions of category-related model features.

Subjects imagined many different images from a given category in an experimental block during fMRI. The category-average imagery activity of the voxels was then used to predict the corresponding category-averages of the computational-model features. As expected, category-average computational-feature prediction is worse for mental imagery than for perception. The pattern across visual areas and AlexNet layers is similar for imagery and perception, with higher predictions resulting when the predicting visual area is category-related and when the predicted model feature is category-related. However, V1 and V2 did not consistently enable imagery decoding into the format of any of the layers of AlexNet. Interestingly, computational features more related to categories were better decodable. This supports the view that higher ventral-stream features might be optimised to emphasise categorical divisions (cf Jozwik et al. 2015).


Suggested improvements

(1) Clarify any evidence about the representational format in which the imagined content is represented. The authors’ model predicts both visual and semantic features of imagined object categories. This suggests that imagery involves both semantic and visual representations. However, the evidence for lower- or even mid-level visual representation of imagined objects is not very compelling here, because the imagery was not restricted to particular images. Instead the category-average imagery activity was measured. Each category is, of course, associated with particular visual features to some extent. We therefore expect to be able to predict category-average visual features from category-average voxel patterns better than chance. A strong claim that imagery paints low-level visual features into early visual representations would require imagery of particular images within each category. For relevant evidence, see Naselaris et al. (2015).

(2) Go beyond the decoding spin: what do we learn about computations in the ventral stream? Being able to decode brain representations is cool because it demonstrates unambiguously that a certain kind of information is present in a brain region. It’s even cooler to be able to decode into an open space of features or categories and to decode internally generated representations as done here. Nevertheless, the approach of decoding is also scientifically limiting. From the present version of the paper, the message I take is summarised in the title of the review: “Imagining and seeing objects elicits consistent category-average activity patterns in the ventral stream”. This has been shown previously (e.g. Stokes, Lee), but is greatly generalised here and is a finding so important that it is good to have it replicated and generalised in multiple studies. The reason why I can’t currently take a stronger computational claim from the paper is that we already know that category-related activity patterns cluster hierarchically in the ventral stream (Kriegeskorte et al. 2008) and may be continuously and smoothly related to a semantic space (Mitchell et al. 2008; Huth et al. 2012). In the context of these two pieces of knowledge, consistent category-average activity for perception and imagery is all that is needed to explain the present findings of decodability of novel imagined categories. The challenge to the authors: Can you test specific computational hypotheses and show something more on the basis of this impressive experiment? The semantic space analysis goes in this direction, but did not appear to me to support totally novel theoretical conclusions.

(3) Why decode computational features? Decoding of imagined content could be achieved either by predicting measured activity patterns from model representations of the stimuli (e.g. Kay et al. 2008) or by predicting model representations  from measured activity patterns (the present approach). The former approach is motivated by the idea that the model should predict the data and lends itself to comparing multiple models, thus contributing to computational theory. We will see below that the latter approach (chosen here) is less well suited to comparing alternative computational models. Why did Horikawa & Kamitani choose this approach? One argument might be that there are many model features and predicting the smaller number of voxels from these many features requires strong prior assumptions (implicit to regularisation), which might be questionable. The reverse prediction from voxels to features requires estimating the same total number of weights (# voxels * # model features), but each univariate linear model predicting a feature only has # voxels (i.e. typically fewer than # features) weights. Is this why you preferred this approach? Does it outperform the voxel-RF modelling approach of Kay et al. (2008) for decoding?

An even more important question is what we can learn about brain computations from feature decoding. If V4, say, perfectly predicted CNN1, this would suggest that V4 contains features similar to those in CNN1. However, it might additionally contain more complex features unrelated to CNN1. CNN1 predictability from V4, thus, would not imply that CNN1 can account for V4. Another example: CNN8 and GIST features are similarly predictable from voxel data across brain areas, and most predictable from V4 voxels. Does this mean GIST is as good a model as CNN8 for explaining the computational mechanism of the ventral stream? No. Even if the ventral-stream voxels perfectly predicted GIST, this would not imply that GIST perfectly predicts the ventral-stream voxels.

The important theoretical question is what computational mechanism gives rise to the representation in each area. For the human inferior temporal cortex, Khaligh-Razavi & Kriegeskorte (2015) showed that both GIST and the CNN representation explain significant variance. However, the GIST representation leaves a large portion of the explainable variance unexplained, whereas the CNN fully explains the explainable variance.

(4) Further explore the nature of the semantic space. To understand what drives the decoding of imagined categories, it would be helpful to see the performance of simpler analyses. Following Mitchell et al. (2008), one could use a text-corpus based semantic embedding to represent each of the categories. Decoding into this semantic embedding would similarly enable novel seen and imagined test categories (not used in training) to be decoded. It would be interesting, then, to successively reduce the dimensionality of the semantic embedding to estimate the complexity of the semantic space underlying the decoding. Alternatively, the authors’ WordNet distance could be used for decoding.

(5) Clarify that category-average patterns were used. The terms “image-based information” and “object-based information” are not ideal. By “image-based”, you are referring to a low-level visual representation and by “object-based”, to a categorical representation. Similarly, in many places where you say “objects” (as in “decoding objects”) it would be clearer to say “object categories”. Use clearer language throughout to clarify when it was category-average patterns that were used for prediction (brain representations) and that were predicted (model representations). This concerns the text and the figures. For example, the title of Fig. 4 should be: “Object-category-average feature decoding”. If this detracts the casual reader too lazy to even read the legends too much, at least the text of the legend should clearly state that category-average brain activity patterns are used to predict category-average model features.

(6) What are the assumptions implicit to sparse linear regression and is this approach optimal? L2 regularisation would spread the weights out over more voxels and might benefit from averaging out the noise component. Please comment on this choice and on any alternative performance results you may have.


Minor points

(7) The work is related to Mitchell et al. (2008), who predicted semantic semantic brain representations of novel stimuli using a semantic space model. This paper should be cited.

(8) “These studies showed a high representational similarity between the top layer of a convolutional neural network and visual cortical activity in the inferior temporal (IT) cortex of humans [24,25] and non-human primates [22,23].”

Ref 24 showed this for both human fMRI and macaque cell-recording data.

(9) “Interestingly, mid-level features were the most useful in identifying object categories, suggesting the significant contributions of mid-level representations in accurate object identification.”

This sentence repeats the same point after “suggesting”.