Imagining and seeing objects elicits consistent category-average activity patterns in the ventral stream


Horikawa and Kamitani report results of a conceptually beautiful and technically sophisticated study decoding the category of imagined objects. They trained linear models to decode visual image features from fMRI voxel patterns. The visual features are computed from images by computational models including GIST and the AlexNet deep convolutional neural net. AlexNet provides features spanning the range from visual to semantic. A subject is then scanned while imagining images from a novel object category (not used in training the fMRI decoder). The decoder is used to predict the computational-model representation for the imagined category (averaged across exemplars of that category). This predicted model representation is then compared to the actual model representation for many categories, including the imagined one. The model representation predicted from fMRI during imagery is shown to be significantly more similar to the model representation of images from the imagined category than to the model representation of images from other categories.
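To make the logic of the analysis concrete, here is a minimal simulated sketch of the decode-then-identify pipeline as I understand it. Everything in it is an assumption for illustration: the data are random, the sizes are arbitrary, the names (`fit_ridge`, `identify`) are hypothetical, and ridge regression stands in for the authors' sparse linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_voxels, n_features, n_candidates = 200, 50, 40, 10  # arbitrary sizes

# Simulated training data: voxel patterns and noisy model features (assumption).
W_true = rng.normal(size=(n_voxels, n_features))
X_train = rng.normal(size=(n_train, n_voxels))
Y_train = X_train @ W_true + 0.5 * rng.normal(size=(n_train, n_features))

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: one weight column per model feature."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)

def identify(predicted, candidates):
    """Correlate a predicted feature vector with each candidate category."""
    p = predicted - predicted.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    r = (c @ p) / (np.linalg.norm(c, axis=1) * np.linalg.norm(p))
    return int(np.argmax(r)), r

W_hat = fit_ridge(X_train, Y_train)

# Candidate categories not used in training; one of them is "imagined".
candidate_voxels = rng.normal(size=(n_candidates, n_voxels))
candidate_features = candidate_voxels @ W_true  # category-average model features
imagined_idx = 3
imagined_pattern = candidate_voxels[imagined_idx] + 0.3 * rng.normal(size=n_voxels)

# Decode features from the imagery pattern, then pick the best-matching category.
predicted_features = imagined_pattern @ W_hat
best, correlations = identify(predicted_features, candidate_features)
print(best, np.round(correlations, 2))
```

The identification step only requires the predicted feature vector to be more similar to the correct category's features than to the alternatives, which is a weaker requirement than accurate feature prediction.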


Figure from Horikawa & Kamitani (2015)

The methods are sophisticated and will give experts much to think about and draw from in developing better decoders. Comprehensive supplementary analyses, which I did not have time to fully review, complement and extend the thorough analyses provided. This is a great study. As usual in our field, a difficult question is what exactly it means for brain computational theory.

A few results that might speak to the computational mechanism of the ventral stream are as follows.

When predicting computational features of *single images* (which was only done for seen, not for imagined objects):

  • Lower layers of AlexNet are better predicted from voxels in lower ventral-stream areas.
  • Higher layers of AlexNet are better predicted from voxels in higher ventral-stream areas.
  • GIST features are best predicted from V1-3, but also significantly from higher areas.

This is consistent with the recent findings (Yamins, Khaligh-Razavi, Cadieu, Guclu) showing that deep convolutional neural nets explain lower- and higher-level ventral-stream areas with a rough correspondence of lower model layers to lower brain areas and higher model layers to higher brain areas. It is also consistent with previous findings that GIST, like many visual feature models, explains significant representational variance even in the higher ventral-stream representation (Khaligh-Razavi, Rice), but, unlike deep neural net models, does not reach the noise ceiling (Khaligh-Razavi). Reaching the noise ceiling would indicate that a data set is fully explained.

When predicting *category-averages* of computational features (which was done for seen and imagined objects):

  • Higher-level visual areas better predict features in all layers of AlexNet.
  • Higher layers of AlexNet are better predicted from voxels in all visual areas.

This is confusing, until we remember that it is category averages that are being predicted. Category averaging will retain a major portion of the representational variance of category-sensitive higher-level representations, while reducing the representational variance of low-level representations that are less related to categories. This may boost both predictions from category-related visual areas and predictions of category-related model features.
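The effect of category averaging can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of the authors' data: the two "representations" are hypothetical Gaussian populations that differ only in how much of their variance is tied to category, and all sizes and standard deviations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_categories, n_exemplars, n_units = 30, 20, 100  # arbitrary sizes

def variance_retained_by_averaging(category_sd, exemplar_sd):
    """Fraction of single-exemplar variance that survives category averaging."""
    centers = category_sd * rng.normal(size=(n_categories, 1, n_units))
    noise = exemplar_sd * rng.normal(size=(n_categories, n_exemplars, n_units))
    exemplars = centers + noise
    single_var = exemplars.reshape(-1, n_units).var(axis=0).mean()
    averaged_var = exemplars.mean(axis=1).var(axis=0).mean()
    return float(averaged_var / single_var)

# Hypothetical category-sensitive (higher-level) representation.
high_level = variance_retained_by_averaging(category_sd=1.0, exemplar_sd=0.5)
# Hypothetical exemplar-dominated (low-level) representation.
low_level = variance_retained_by_averaging(category_sd=0.2, exemplar_sd=1.0)
print(round(high_level, 2), round(low_level, 2))
```

In this toy setting, averaging removes most of the variance of the exemplar-dominated representation while leaving most of the variance of the category-sensitive one, which is the asymmetry invoked above.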

Subjects imagined many different images from a given category in an experimental block during fMRI. The category-average imagery activity of the voxels was then used to predict the corresponding category-averages of the computational-model features. As expected, category-average computational-feature prediction is worse for mental imagery than for perception. The pattern across visual areas and AlexNet layers is similar for imagery and perception, with higher predictions resulting when the predicting visual area is category-related and when the predicted model feature is category-related. However, V1 and V2 did not consistently enable imagery decoding into the format of any of the layers of AlexNet. Interestingly, computational features more related to categories were better decodable. This supports the view that higher ventral-stream features might be optimised to emphasise categorical divisions (cf Jozwik et al. 2015).


Suggested improvements

(1) Clarify any evidence about the representational format in which the imagined content is represented. The authors’ model predicts both visual and semantic features of imagined object categories. This suggests that imagery involves both semantic and visual representations. However, the evidence for lower- or even mid-level visual representation of imagined objects is not very compelling here, because the imagery was not restricted to particular images. Instead the category-average imagery activity was measured. Each category is, of course, associated with particular visual features to some extent. We therefore expect to be able to predict category-average visual features from category-average voxel patterns better than chance. A strong claim that imagery paints low-level visual features into early visual representations would require imagery of particular images within each category. For relevant evidence, see Naselaris et al. (2015).

(2) Go beyond the decoding spin: what do we learn about computations in the ventral stream? Being able to decode brain representations is cool because it demonstrates unambiguously that a certain kind of information is present in a brain region. It’s even cooler to be able to decode into an open space of features or categories and to decode internally generated representations as done here. Nevertheless, the approach of decoding is also scientifically limiting. From the present version of the paper, the message I take is summarised in the title of the review: “Imagining and seeing objects elicits consistent category-average activity patterns in the ventral stream”. This has been shown previously (e.g. Stokes, Lee), but is greatly generalised here and is a finding so important that it is good to have it replicated and generalised in multiple studies. The reason why I can’t currently take a stronger computational claim from the paper is that we already know that category-related activity patterns cluster hierarchically in the ventral stream (Kriegeskorte et al. 2008) and may be continuously and smoothly related to a semantic space (Mitchell et al. 2008; Huth et al. 2012). In the context of these two pieces of knowledge, consistent category-average activity for perception and imagery is all that is needed to explain the present findings of decodability of novel imagined categories. The challenge to the authors: Can you test specific computational hypotheses and show something more on the basis of this impressive experiment? The semantic space analysis goes in this direction, but did not appear to me to support totally novel theoretical conclusions.

(3) Why decode computational features? Decoding of imagined content could be achieved either by predicting measured activity patterns from model representations of the stimuli (e.g. Kay et al. 2008) or by predicting model representations from measured activity patterns (the present approach). The former approach is motivated by the idea that the model should predict the data and lends itself to comparing multiple models, thus contributing to computational theory. We will see below that the latter approach (chosen here) is less well suited to comparing alternative computational models. Why did Horikawa & Kamitani choose this approach? One argument might be that there are many model features and predicting the smaller number of voxels from these many features requires strong prior assumptions (implicit in regularisation), which might be questionable. The reverse prediction from voxels to features requires estimating the same total number of weights (# voxels * # model features), but each univariate linear model predicting a feature only has # voxels (i.e. typically fewer than # features) weights. Is this why you preferred this approach? Does it outperform the voxel-RF modelling approach of Kay et al. (2008) for decoding?
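The weight-count argument can be spelled out with a few lines of arithmetic. The sizes below are hypothetical and chosen only to illustrate the case of more model features than voxels; they are not taken from the paper.

```python
n_voxels, n_features = 500, 4096  # hypothetical sizes, not from the paper

# Encoding direction: one linear model per voxel, with one weight per feature.
encoding = {"models": n_voxels, "weights_per_model": n_features}
# Decoding direction: one linear model per feature, with one weight per voxel.
decoding = {"models": n_features, "weights_per_model": n_voxels}

def total_weights(spec):
    """Total number of weights estimated across all univariate models."""
    return spec["models"] * spec["weights_per_model"]

for name, spec in (("encoding", encoding), ("decoding", decoding)):
    print(name, spec["weights_per_model"], total_weights(spec))
```

The totals are identical; only the number of weights each univariate model must estimate from the same number of training samples differs between the two directions.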

An even more important question is what we can learn about brain computations from feature decoding. If V4, say, perfectly predicted CNN1, this would suggest that V4 contains features similar to those in CNN1. However, it might additionally contain more complex features unrelated to CNN1. CNN1 predictability from V4, thus, would not imply that CNN1 can account for V4. Another example: CNN8 and GIST features are similarly predictable from voxel data across brain areas, and most predictable from V4 voxels. Does this mean GIST is as good a model as CNN8 for explaining the computational mechanism of the ventral stream? No. Even if the ventral-stream voxels perfectly predicted GIST, this would not imply that GIST perfectly predicts the ventral-stream voxels.
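The asymmetry between the two prediction directions can be shown in a toy simulation. This is a sketch under explicit assumptions: the "voxels" are a noisy linear mixture of a small set of simple features (a stand-in for a model such as GIST) and a larger set of additional features the simple model lacks. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_simple, n_extra, n_voxels = 400, 5, 15, 40  # arbitrary sizes

simple = rng.normal(size=(n_samples, n_simple))  # stand-in for a simple feature model
extra = rng.normal(size=(n_samples, n_extra))    # features the simple model lacks
latent = np.hstack([simple, extra])
voxels = latent @ rng.normal(size=(n_simple + n_extra, n_voxels))
voxels += 0.1 * rng.normal(size=voxels.shape)    # measurement noise

def heldout_r2(X, Y, n_train=300):
    """Fit least squares on the first n_train rows; return mean R^2 on the rest."""
    X_tr, X_te, Y_tr, Y_te = X[:n_train], X[n_train:], Y[:n_train], Y[n_train:]
    W, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)
    resid = ((Y_te - X_te @ W) ** 2).sum(axis=0)
    total = ((Y_te - Y_te.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - resid / total))

decoding_r2 = heldout_r2(voxels, simple)  # voxels -> simple model features
encoding_r2 = heldout_r2(simple, voxels)  # simple model features -> voxels
print(round(decoding_r2, 2), round(encoding_r2, 2))
```

Here the simple features are decoded almost perfectly from the voxels, yet they account for only a minority of the voxel variance. High decodability of a model from an area therefore says little about how completely the model explains that area.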

The important theoretical question is what computational mechanism gives rise to the representation in each area. For the human inferior temporal cortex, Khaligh-Razavi & Kriegeskorte (2015) showed that both GIST and the CNN representation explain significant variance. However, the GIST representation leaves a large portion of the explainable variance unexplained, whereas the CNN fully explains the explainable variance.

(4) Further explore the nature of the semantic space. To understand what drives the decoding of imagined categories, it would be helpful to see the performance of simpler analyses. Following Mitchell et al. (2008), one could use a text-corpus based semantic embedding to represent each of the categories. Decoding into this semantic embedding would similarly enable novel seen and imagined test categories (not used in training) to be decoded. It would be interesting, then, to successively reduce the dimensionality of the semantic embedding to estimate the complexity of the semantic space underlying the decoding. Alternatively, the authors’ WordNet distance could be used for decoding.
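As a sketch of what successively reducing the dimensionality could look like, the following uses principal component analysis (via SVD) on a category-by-dimension embedding matrix. The embedding here is random and purely hypothetical; in practice it would come from a text corpus, and decoding accuracy would be evaluated at each dimensionality rather than variance retained.

```python
import numpy as np

rng = np.random.default_rng(3)
n_categories, n_dims = 60, 25  # arbitrary sizes

# Hypothetical semantic embedding with a few dominant dimensions.
embedding = rng.normal(size=(n_categories, n_dims)) * np.linspace(3.0, 0.1, n_dims)

def reduce_embedding(E, k):
    """Project a centred embedding onto its first k principal components."""
    E_centred = E - E.mean(axis=0)
    U, s, Vt = np.linalg.svd(E_centred, full_matrices=False)
    coords = U[:, :k] * s[:k]  # category coordinates in the reduced space
    retained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return coords, retained

for k in (1, 2, 5, 10, 25):
    coords, retained = reduce_embedding(embedding, k)
    print(k, coords.shape, round(retained, 3))
```

Plotting decoding performance against k would give an estimate of how many semantic dimensions are needed to support identification of novel categories.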

(5) Clarify that category-average patterns were used. The terms “image-based information” and “object-based information” are not ideal. By “image-based”, you are referring to a low-level visual representation and by “object-based”, to a categorical representation. Similarly, in many places where you say “objects” (as in “decoding objects”) it would be clearer to say “object categories”. Use clearer language throughout to clarify when it was category-average patterns that were used for prediction (brain representations) and that were predicted (model representations). This concerns the text and the figures. For example, the title of Fig. 4 should be: “Object-category-average feature decoding”. If this would put off the casual reader who does not read the legends, at least the text of the legend should clearly state that category-average brain activity patterns are used to predict category-average model features.

(6) What are the assumptions implicit in sparse linear regression and is this approach optimal? L2 regularisation would spread the weights out over more voxels and might benefit from averaging out the noise component. Please comment on this choice and on any alternative performance results you may have.


Minor points

(7) The work is related to Mitchell et al. (2008), who predicted brain representations of novel stimuli using a semantic space model. This paper should be cited.

(8) “These studies showed a high representational similarity between the top layer of a convolutional neural network and visual cortical activity in the inferior temporal (IT) cortex of humans [24,25] and non-human primates [22,23].”

Ref 24 showed this for both human fMRI and macaque cell-recording data.

(9) “Interestingly, mid-level features were the most useful in identifying object categories, suggesting the significant contributions of mid-level representations in accurate object identification.”

This sentence repeats the same point after “suggesting”.