[Below is my secret peer review of Cichy, Pantazis & Oliva (2014). The review below applies to the version as originally submitted, not to the published version that the link refers to. Several of the concrete suggestions for improvements below were implemented in revision. Some of the more general remarks on results and methodology remain relevant and will require further studies to completely resolve. For a brief summary of the methods and results of this paper, see Mur & Kriegeskorte (2014).]
This paper describes an excellent project, in which Cichy et al. analyse the representational dynamics of object vision using human MEG and fMRI on a set of 96 object images whose representation in cortex has previously been studied in monkeys and humans. The previous studies provide a useful starting point for this project. However, the use of MEG in humans and the combination of MEG and fMRI enables the authors to characterise the emergence of categorical divisions at a level of detail that has not previously been achieved. The general approaches of MEG-decoding and MEG-RSA pioneered by Thomas Carlson et al. (2013) are taken to new level here by using a richer set of stimuli (Kiani et al. 2007; Kriegeskorte et al. 2008). The experiment is well designed and executed, and the general approach to analysis is well-motivated and sound. The results are potentially of interest to a broad audience of neuroscientists. However, the current analyses lack some essential inferential components that are necessary to give us full confidence in the results, and I have some additional concerns that should be addressed in a major revision as detailed below.
(1) Confidence-intervals and inference for decoding-accuracy, RDM-correlation time courses and peak-latency effects
Several key inferences depend on comparing decoding accuracies or RDM correlations as a function of time, but the reliability of these estimates is not assessed. The paper also currently gives no indication of the reliability of the peak latency estimates. Latency comparisons are not supported by statistical inference. This makes it difficult to draw firm conclusions. While the descriptive analyses presented are very interesting and I suspect that most of the effects the authors highlight are real, it would be good to have statistical evidence for the claims. For example, I am not confident that the animate-inanimate category division peaks at 285 ms. This peak is quite small and on top of a plateau. Moreover, the time the category clustering index reaches the plateau (140 ms) appears more important. However, interpretation of this feature of the time course, as well, would require some indication of the reliability of the estimate.
I am also not confident that the RDM-correlation between the MEG and V1-fMRI data really has a significantly earlier peak than the RDM-correlation between the MEG and IT-fMRI data. This confirms our expectations, but it is not a key result. Things might be more complicated. I would rather see unexpected result of a solid analysis than an expected result of an unreliable analysis.
Ideally, adding 7 more subjects would allow random effects analyses. All time courses could then be presented with error margins (across subjects, supporting inference to the population by treating subjects as a random-effect dimension). This would also lend additional power to the fixed-effects inferential analyses.
However, if the cost of adding 7 subjects is considered too great, I suggest extending the approach of bootstrap resampling of the image set. This would provide reliability estimates (confidence intervals) for all accuracy estimates and peak latencies and support testing peak-latency differences. Importantly, the bootstrap resampling would simulate the variability of the estimates across different stimulus samples (from a hypothetical population of isolated object images of the categories used here). It would, thus, provide some confidence that the results are not dependent on the image set. Bootstrap resampling each category separately would ensure that all categories are equally represented in each resampling.
In addition, I suggest enlarging the temporal sliding window in order to stabilise the time courses, which look a little wiggly and might give unstable estimates of magnitudes and latencies across bootstrap samples otherwise – e.g. the 285 ms animate-inanimate discrimination peak. This will smooth the time courses appropriately and increase the power. A simple approach would be to use a bigger time steps as well, e.g. 10- or 20-ms bins. This would provide more power in Bonferroni correction across time. Alternatively, the false-discovery rate could be used to control false positives. This would work equally well for overlapping temporal windows (e.g. 20-ms window, 1 ms steps).
(2) Testing linear separability of categories
The present version of the analyses uses averages of pairwise stimulus decoding accuracies. The decoding accuracies serve as measures of representational discriminability (a particular representational distance measure). This is fine and interpretable. The average between minus the average within discriminability is a measure of clustering, which is a stronger result in a sense than linear decodability. However, it would be good to see is linear decoding of each category division reveals additional or earlier effects. While your clustering index essentially implies linear separability, the opposite is not true. For example, two category regions arranged like stacked pancakes could be perfectly linearly discriminable while having no significant clustering (i.e. difference between the within and between category discriminabilities of image pairs). Like this:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Each number indexes a category and each repetition represents an exemplar. The two lines illustrate the pancake situation. If the vertical separation of the pancakes is negligible, they are perfectly linearly discriminable, despite a negligible difference between the average within and the average between distance. It would be interesting to see these linear decoding analyses performed using either indendent response measurements for the same stimuli or an independent (held out) set of stimuli as the test set. This would more profoundly address different aspects of the representational geometry.
For the same images in the test set, pairwise discriminability in a high-dimensional space strongly suggests that any category dichotomy can be linearly decoded. Intuitively, we might expect the classifier to generalise well to the extent that the categories cluster (within distances < between distances) – but this need not be the case (e.g. the pancake scenario might also afford good generalization to new exemplars).
(3) Circularity of peak-categorical MDS arrangements
The peak MDS plots are circular in that they serve to illustrate exactly the effect (e.g. animate-inanimate separation) that the time point has been chosen to maximise. This circularity could easily be removed by selecting the time point for each subject based on the other subjects’ data. The accuracy matrices for the selected time points could then be averaged across subjects for MDS.
(4) Test individual-level match of MEG and fMRI
It is great that fMRI and MEG data was acquired in the same participants. This suggests an analysis of the consistent reflection of individual idiosyncrasies in object processing in fMRI and MEG. One way to investigate this would be to correlate single-subject RDMs between MEG and fMRI, within and between subjects. If the within-subject MEG-fMRI RDM correlation is greater (at any time point), then MEG and fMRI consistently reflect individual differences in object processing.
Why average trials? Decoding accuracy means nothing then. Single-trial decoding accuracy and information in bits would be interesting to see and useful to compare to later studies.
The stimulus-label randomisation test for category clustering (avg(between)-avg(within)) is fine. However, the bootstrap test as currently implemented might be problematic.
“Under the null hypothesis, drawing samples with replacement from D0 and calculating their mean decoding accuracy daDo, daempirial should be comparable to daD0. Thus, assuming that D has N labels (e.g. 92 labels for the whole matrix, or 48 for animate objects), we draw N samples with replacement and compute the mean daD0 of the drawn samples.”
I understand the motivation for this procedure and my intuition is that this test is likely to work, so this is a minor point. However, subtracting the mean empirical decoding accuracy might not be a valid way of simulating the null hypothesis. Accuracy is a bounded measure and its distribution is likely to be wider under the null than under H1. The test is likely to be valid, because under H0 the simulation will be approximately correct. However, to test if some statistic significantly exceeds some fixed value by bootstrapping, you don’t need to simulate the null hypothesis. Instead, you simulate the variability of the estimate and obtain a 95%-confidence interval by bootstrapping. If the fixed value falls outside the interval (which is only going to happen in about 5% of the cases under H0), then the difference is significant. This seems to me a more straightforward and conventional test and thus preferable. (Note that this uses the opposite tail of the distribution and is not equivalent here because the distribution might not be symmetrical.)
Fig. 1: Could use two example images to illustrate the pairwise classification.
It might be good here to see real data in the RDM on the upper right (for one time point) to illustrate.
“non-metric scaling (MDS)”: add multidimensional
Why non-metric? Metric MDS might more accurately represent the relative representational distances.
“The results (Fig. 2a, middle) indicate that information about animacy arises early in the object processing chain, being significant at 140 ms, with a peak at 285 ms.”
I would call that late – relative to the rise of exemplar discriminability.
“We set the confidence interval at p=”
This is not the confidence interval, this is the p value.
“To our knowledge, we provide here the first multivariate (content-based) analysis of face- and body-specific information in EEG/MEG.”
The cited work by Carlson et al. (2012) shows very similar results – but for fewer images.
“Similarly, it has not been shown previously that a content-selective investigation of the modulation of visual EEG/MEG signals by cognitive factors beyond a few everyday categories is possible24,26.”
Carlson et al. (2011, 2012; refs 22,23) do show similar results.
I don’t understand what you mean by “cognitive factors” here.
What are called “percentiles”, are not percentiles: multiply by 100.
“For example, a description of the temporal dynamics of face representation in humans might be possible with a large and rich parametrically modulated stimulus set as used in monkey electrophysiology 44.”
Should cite Freiwald (43), not Shepard (44) here.
LANGUAGE, LOGIC, STYLE, AND GRAMMAR
Although the overall argument is very compelling, there were a number of places in the manuscript where I came across weaknesses of logic, style, and grammar. The paper also had quite a lot of typos. I list some of these below, to illustrate, but I think the rest of the text could use more work as well to improve precision and style.
One stylistic issue is that the paper switches between present and past tense without proper motivation (general statement versus report of procedures used in this study).
Abstract: “individual image classification is resolved within 100ms”
Who’s doing the classification here? The brain or the researchers? Also: discriminating two exemplars is not classification (except in a pattern-classifier sense). So this is not a good way to state this. I’d say individual images are discriminated by visual representations within 100 ms.
“Thus, to gain a detailed picture of the brain’s  in object recognition it is necessary to combine information about where and when in the human brain information is processed.”
Some phrase missing there.
“Characterizing both the spatial location and temporal dynamics of object processing
demands innovation in neuroimaging techniques”
A location doesn’t require characterisation, only specification. But object processing does not take place in one location. (Obviously, the reader may guess what you mean — but you’re not exactly rewarding him or her for reading closely here.)
“In this study, using a similarity space that is common to both MEG and fMRI,”
Style. I don’t know how a similarity space can be common to two modalities. (Again, the meaning is clear, but please state it clearly nevertheless.)
“…and show that human MEG responses to object correlate with the patterns of neuronal spiking in monkey IT”
“1) What is the time course of object processing at different levels of categorization?”
Does object processing at a given level of categorisation have a single time course? If not, then this doesn’t make sense.
“2) What is the relation between spatially and temporally resolved brain responses in a content-selective manner?”
This is a bit vague.
“The results of the classification (% decoding accuracy, where 50% is chance) are stored in a 92 × 92 matrix, indexed by the 92 conditions/images images.”
“Can we decode from MEG signals the time course at which the brain processes individual object images?”
Grammatically, you can say that the brain processes images “with” a time course, not “at” a time course. In terms of content, I don’t know what it means to say that the brain processes image X with time course Y. One neuron or region might respond to image X with time course Y. The information might rise and fall according to time course Y. Please say exactly what you mean.
“A third peek at ~592ms possibly indicates an offset-response”
“Thus, multivariate analysis of MEG signals reveal the temporal dynamics of visual content processing in the brain even for single images.”
“This initial result allows to further investigate the time course at which information about membership of objects at different levels of categorization is decoded, i.e. when the subordinate, basic and superordinate category levels emerge.”
Unclear here what decoding means. Are you suggesting that the categories are encoded in the images and the brain decodes them? And you can tell when this happens by decoding? This is all very confusing. 😉
“Can we determine from MEG signals the time course at which information about membership of an image to superordinate-categories (animacy and naturalness) emerges in the brain?”
Should say “time course *with* which”. However, all the information is there in the retina. So it doesn’t really make sense to say that the information emerges with a time course. What is happening is that category membership becomes linearly decodable, and thus in a sense explicit, according to this time course.
“If there is information about animacy, it should be mirrored in more decoding accuracy for comparisons between the animate and inanimate division than within the average of the animate and inanimate division.”
more decoding accuracy -> greater decoding accuracy
“within the average” Bad phrasing.
“utilizing the same date set in monkey electrophysiology and human MRI”
–> stimulus set
“Boot-strapping labels tests significance against chance”
Determining significance is by definition a comparison of the effect estimate against what would be expected by chance. It therefore doesn’t make sense to “test significance against chance”.
“corresponding to a corresponding to a p=2.975e-5 for each tail”
“Given thus [?] a fMRI dissimilarity matrices [grammar] for human V1 and IT each, we calculate their similarity (Spearman rank-order correlation) to the MEG decoding accuracy matrices over time, yielding 2nd order relations (Fig. 4b).”
“We recorded brain responses with fMRI to the same set of object images used in the MEG study *and the same participants*, adapting the stimulation paradigm to the specifics of fMRI (Supplementary Fig. 1b).”
The effect is significant in IT (p<0.001), but not in V1 (p=0.04). Importantly, the effect is significantly larger in IT than in V1 (p<0.001).
p=0.04 is also significant, isn’t it? This is very similar to Kriegeskorte et al. (2008, Fig. 5A), where the animacy effect was also very small, but significant in V1.
“boarder” -> “border”