A cortico-thalamic model of multi-scale incremental feature binding

Our retinae sample the images in our eyes discretely, conveying a million local measurements through the optic nerve to our brains. Given this piecemeal mess of signals, our brains infer the structure of the scene, giving us an almost instant sense of the geometry of the environment and of the objects and their relationships.

We see the world in terms of objects. But how our visual system defines what an object is and how it represents objects is not well understood. Two key properties thought to define what an object is in philosophy and psychology are spatiotemporal continuity and cohesion (Scholl 2007). An object can be thought of as a constellation of connected parts, such that if we were to pull on one part, the other parts would follow along, while other objects might stay put. Because the parts cohere, the region of spacetime that corresponds to an object is continuous. The decomposition of the scene into potentially movable objects is a key abstraction that enables us to perceive, not just the structure and motion of our surroundings, but also the proclivities of the objects (what might drop, collapse, or collide) and their affordances (what might be pushed, moved, taken, used as a tool, or eaten).

An important computational problem our visual system must solve, therefore, is to infer what pieces of a retinal image belong to a single object. This problem has been amply studied in humans and nonhuman primates using behavioral experiments and measurements of neural activity. A particular simplified task that has enabled highly controlled experiments is mental line tracing. A human subject or macaque fixating on a central cross is presented with a display of multiple curvy lines, one of which begins at the fixation point. The task is to judge whether a peripheral red dot is on that line or on another line (called a distractor). Behavioral experiments show that the task is easy to the extent that the target line is short or isolated from any distractors. Adding distractor lines in the vicinity of the target line to clutter up the scene and making the target line long and curvy makes the task more difficult. If the target snakes its way through complex clutter closeby, it is no longer instantly obvious where it leads and attention and time are required to judge whether the red dot is on the target or on a distractor line.

Our reaction time is longer when the red dot is farther from fixation along the target line. This suggests that the cognitive process required to make the judgment involves tracing the line with a sequential algorithm, even when fixation is maintained at the central cross. However, the reaction time is not in general linear in the distance, measured along the line, between the fixation point and the dot, as would be predicted by sequential tracing of the line at constant speed. Instead, the speed of tracing is variable depending on the presence of distracting lines in the vicinity of the current location of the tracing process along the target line. Tracing proceeds more slowly when there are distracting lines close by and more quickly when the distracting lines are far away.

The hypothesis that the primate visual system traces the line sequentially from the fixation point is supported by seminal electrophysiological experiments by Pieter Roelfsema and colleagues, which have shown that neurons in early visual cortex that represent particular pieces of the line emanating from the fixation point are upregulated in sequence, consistent with a sequential tracing process. This sequential upregulation of activity of neurons representing progressively more distal portions of the line is often interpreted as the neural correlate of attention spreading from fixation along the attended line during task performance.

The variation in speed of the tracing process can be explained by the attentional growth-cone hypothesis (Pooresmaeili & Roelfsema 2014) which posits that attention spreads not only in the primary visual cortex but also at higher levels of cortical representation. This hypothesis can explain the variation in tracing speed: At higher levels of cortical visual representation, neurons have larger receptive fields and offer a coarser-scale summary of the image, enabling the tracing to proceed at greater speed along the line in the image. In the absence of distractors, tracing can proceed quickly at a high-level of representation. However, in the presence of distractors, the higher-level representations may not be able to resolve the scene at a sufficient grain, and tracing must proceed more slowly in lower-level representations.

Higher-level neurons are more likely to suffer from interference from distractor lines within their larger receptive fields. If a distractor line is present in a neuron’s receptive field, the neuron may not respond as strongly to the line being traced, effectively blocking the path for sequential tracing in the high-level representation. However, tracing can continue – more slowly – at lower levels, where receptive fields are small enough to discern the line without interference.

Detail from Fig. 6 in Pooresmaeili et al. (2014) illustrating the single-scale tracing model (left) and the growth-cone model (right), in which the attentional label is propagated from the fixation point (small red dot) at all levels of representation where receptive fields (circles) do not overlap with the distractor curve. Tracing proceeds rapidly at coarse scales (orange, blue) where the target line is far from the distractor and slowly at fine scales (yellow, green) where the target curve comes close to the distractor.

Now Schmid & Neumann (pp2024) offer a brain-computational model explaining in detail how this multiscale algorithm for attentional selection of the line emanating from the fixation point might be implemented in the primate brain. They describe a mechanistic model and demonstrate by simulation that it can explain how mental line tracing might be implemented in the primate brain.

Pyramidal neurons at multiple levels of the visual hierarchy (corresponding to cortical areas V1, V2, V4) detect local oriented line segments on the basis of the bottom-up signals arriving at their basal dendritic integration sites. These line segments are pieces of the target and distractor lines, represented in each area at a different scale of representation. The pyramidal neurons also receive lateral and top-down input providing contextual information at their apical dendritic integration sites, enabling them to sense whether the line segment they are representing is part of a longer continuous line.

The attentional “label” indicating that a neuron represents a piece of the target line is encoded by an upregulation of the activity of the pyramidal neurons, consistent with neural recording results from Roelfsema and colleagues (1998). The upregulation of activity, i.e. the attention label, can spread laterally within a single area such as V1. Connectivity between neurons representing approximately collinear line segments implements an inductive bias that favors interpretations conforming to the Gestalt principle of good continuation. However, the upregulation will spread only to pyramidal neurons that (1) are activated by the stimulus, (2) receive contextual input from pyramidal neurons representing approximately collinear line segments, and (3) receive thalamic input indicating the local presence of the attentional marker.

Each step of propagation is conditioned on the conjunction of these three criteria. The neural computations could be implemented exploiting the intracellular dynamics in layer-5 pyramidal neurons, where dendritic inputs entering at apical integration sites cannot drive a response by themselves but can modulate responses to inputs entering at basal integration sites. An influential theory suggests that contextual inputs arriving at the apical dendritic integration sites modulate the response to bottom-up stimulus inputs arriving at the basal dendritic integration sites (Larkum 2013, BrainInspired podcast). Schmid and Neumann’s model further posits that the apical inputs are gated by thalamic inputs (Halassa & Kastner 2017), implementing a test of the third criterion for propagation of the attentional label.

The attentional label is propagated locally from already labeled pyramidal neurons to pyramidal neurons at all levels of the visual hierarchy that represent closeby line segments sufficiently aligned in orientation to be consistent with their being part of the target line. To enable the coarser-scale representations in higher cortical areas to speed the process, neurons representing the same patch of the visual field at different scales are connected through thalamocortical loops. Through the thalamus, each level is connected to all other levels, enabling label propagation to bypass the stages of the hierarchy. The thalamic component (possibly in the pulvinar region of the visual thalamus) represents a map of the labeled locations, but not detailed orientation information.

Imagine a mechanical analogy, in which tube elements represent local segments of the lines. The stimulus-driven bottom-up signals align the orientations of the tube elements with the orientations of the line segments they represent, so the tube elements turn to form long continuous tunnels depicting the lines. A viscous liquid is injected into the tube element representing the fixation point and spreads. Adjacent tube elements need to be aligned for the liquid to flow from one into the other. In addition, there are valves between the tube elements, which open only in the presence of thalamic input. Importantly, the viscous liquid can flow not only at the V1 level of representation, where the tube elements represent tiny pieces of the lines and the viscous liquid needs to flow through many elements to reach the end of the line. Rather, the liquid can also take shortcuts through higher-level representations, where long stretches of the line are represented by few tube elements. This enables the liquid to reach the end of the line much more quickly – to the extent that there are stretches sufficiently isolated from the distractors for coarse-scale representation at higher levels of the hierarchy.

Since the information about (1) the presence of oriented line segments, (2) their compatibility according to the Gestalt principle of good continuation, and (3) the attentional label are all available in the cortical hierarchy, a growth-cone algorithm could be implemented without thalamocortical loops. However, Schmid and Neumann argue that the non-orientation-specific thalamic representation reduces the complexity of the circuit. Fewer connections are required by decomposing the question “Are there upregulated compatible signals in the neighborhood?” into two simpler questions: “Are there compatible signals in the neighborhood?” (answered by cortex) and “Are there upregulated signals in the neighborhood?” (answered by the thalamic input). Because there could be compatible signals in the neighborhood that are not upregulated, and upregulated signals that are not compatible, yeses to both questions of the decomposition do not in general imply a yes to the original question. However, if we assume that there is only one line segment per location, then two yeses do imply a yes to the original question.

Schmid and Neumann argue that thalamic label map enables a simpler circuit that works in the simulations presented, even tracing a line as it crosses another line without spillover. We wonder if, in addition to requiring fewer connections, the thalamic label map might have functional advantages in the context of a system that must be able to perform not just line tracing but many other binding tasks, where the thalamus might have the same role, but the priors defining compatibility could differ.

Why is this model important? Line tracing is a type of computational problem that is prototypical of vision and yet challenging for both of our favorite modes of thinking about visual computations: deep feedforward neural networks and probabilistic inference. These two approaches (discriminative and generative to a first approximation) form diametrically opposed corners in a vast space of visual algorithms that has only begun to be explored (Peters et al. pp2023). Line tracing is a simple example of a visual cognition task that can be rendered intractable for both approaches by making the line snaking its way through the clutter sufficiently long and the clutter sufficiently close and confusing. Feedforward deep neural networks have trouble with this kind of problem because there are no hints in the local texture revealing the long-range connectivity of the lines. The combinatorics creates too rich a space of possible curves to represent with a hierarchy of features in a neural network. Although any recurrent computation (including the model of Schmid and Neumann and a recent line tracing model from Linsley & Serre, 2019) can be unfolded into a feedforward computational graph, the feedforward network would have to be very deep, and its parameters might be hard to learn without the inductive bias that iterating the same local propagation rule is the solution to the puzzle (van Bergen & Kriegeskorte 2020). From a probabilistic inference perspective, similarly, the problem is likely intractable in its general form because of the exponential number of possible groupings we would need to compute a posterior distribution over.

By assuming that we can be certain about the way things connect locally, we can avoid having to maintain a probability distribution over all possible line continuations from the fixation point. Binarizing the probabilities turns the problem into a region growing (or graph search) problem requiring a sequential procedure, because later steps depend on the result of earlier steps.

Schmid and Neumann’s paper describes how the previously proposed growth-cone algorithm, which solves an important computational challenge at the heart of visual cognition (Roelfsema 2006), might be implemented in the primate brain. The paper seriously engages both the neuroscience (at least at a qualitative level) and the computational problem, and it connects the two. The authors simulate the model and demonstrate its predictions of the key behavioral and neurophysiological results from the literature. They use model-ablation experiments to establish the necessity of different components. They also describe the model at a more abstract level: reducing the operations to sequential logical operations and systematically considering different possible implementations in a circuit and their costs in terms of connections. This resource-cost perspective deepens our understanding of the algorithm and reveals that the proposed model is attractive not only for its consistency with neuroanatomical, neurophysiological, and behavioral data, but also for the efficiency of implementation in a physical network.

Strengths

Offers a candidate explanation for how an important cognitive function might be implemented in the primate brain, using an algorithm that combines parallel computation, hierarchical abstraction, and sequential inference.
Motivated by a large body of experimental evidence from neurophysiological and behavioral experiments, the model is consistent with primate neuroanatomy, neural connectivity, neurophysiology, and subcellular dynamics in multi-compartment pyramidal neurons.
Describes a class of related algorithms and network implementations at an abstract level, providing a deeper understanding of alternative possible neural mechanisms that could perform this cognitive function and their network complexity.

Weaknesses

The model operates on a toy version of the task, using abstracted stimuli with few orientations and predefined Gabor filter banks as model representations, rather than more general visual representations learned from natural images. An important question is to what extent the algorithm will be able to perform visual tasks on natural images. Given the complexity of the paper as is, this question should be considered beyond the scope, but related work connecting these ideas to computer vision could be discussed in more detail.

Major suggestions

(1) Illustrate the computational mechanism and operation of the model more intuitively. In Fig. 1b, colors code for the level of representations. It would therefore be better to not use green to code for the selection tag. Thicker black contours or some other non-color marker could be used. It is also hard to see that the no-interference and the interference cases have different stimuli. Only the bottom panels with the stimuli show a slight difference. The top panels should be distinct as well since different neurons would be driven by the two stimuli. Alternatively, you could consider using only one stimulus, where the distractor distance variation is quite pronounced, but showing time frames to illustrate the variation of the speed of the progression of attentional tagging.

(2) Discuss challenges in scaling and calibrating the model for application to natural continuous curves. The stimuli analyzed have only a few orientations with sudden transitions from one to the other. Would the model as implemented also work for continuous curves such as those used in the neurophysiological and behavioral experiments or would a finer tiling of orientations be required? Under what conditions would attention spill over to nearby distractor curves? It would be good to elaborate on the roles of surround suppression, inhibition among detectors, and the excitation/inhibition balance.

(3) Discuss challenges in scaling the model to computer vision tasks on natural images. To be viable as brain-computational theories, models ultimately need to scale to natural tasks. Please address the challenges of extending the model for application to natural images and computer-vision tasks. This will likely require the representations to be learned through backpropagation. The cited complementary work by Linsley and Serre on the pathfinder task using horizontal gated recurrent units and incremental segmentation for computer vision is relevant here and deserves to be elaborated on in the Discussion. In particular, do the growth-cone model and your modeling results suggest an alternative neural network architecture for learning incremental binding operations?

Minor suggestions

(1) Please make sure that the methods section contains all the details of the model architecture needed for replication of the work. Much of the math is described well. But some additional technical details on maps and connectivity may be needed. What are the sizes of the maps? What do they look like for a given input? Do they appear like association fields? What is the excitatory and inhibitory connectivity as a function of spatial locations and orientations of the source and target unit?

(2) Discuss how the model relates to the results of Chen et al. (2014) who described the interplay between V1 and V4 during incremental contour integration on the basis of simultaneous recordings in monkeys.

(3) Although the paper is well-written and clear, the English is a bit rocky throughout with many grammatical errors and some typos. These could be fixed using a proofreader or suitable software.

– Nikolaus Kriegeskorte & Hossein Adeli

References

Chen M, Yan Y, Gong X, Gilbert CD, Liang H, Li W (2014) Incremental integration of global contours through interplay between visual cortical areas Neuron.

Halassa MM, Kastner S. Thalamic functions in distributed cognitive control. Nature neuroscience. 2017 Dec;20(12):1669-79.

Lamme VA, Roelfsema PR (2000) The distinct modes of vision offered by feedforward and recurrent processing Trends in Neurosciences.

Larkum M. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in neurosciences. 2013 Mar 1;36(3):141-51.

Larkum ME, Zhu JJ, Sakmann B (1998) A new cellular mechanism for coupling inputs arriving at different cortical layers Nature.

Linsley D, Kim J, Veerabadran V, Serre T (2019) Learning long-range spatial dependencies with horizontal gated-recurrent units NeurIPS. arxiv.org/abs/1805.08315.

Peters B, Kriegeskorte N (2021) Capturing the objects of vision with neural networks. Nature human behaviour Nature Human Behavior.

Peters B, DiCarlo JJ, Gureckis T, Haefner R, Isik L, Tenenbaum J, Konkle T, Naselaris T, Stachenfeld K, Tavares Z, Tsao D, Yildirim I, Kriegeskorte N (under review) How does the primate brain combine generative and discriminative computations in vision? CCN GAC paper. arXiv preprint arXiv:2401.06005.

Pooresmaeili A, Roelfsema PR (2014) A growth-cone model for the spread of object-based attention during contour grouping Current Biology.

Roelfsema PR, Lamme VA, Spekreijse H (1998) Object-based attention in the primary visual cortex of the macaque monkey Nature.

Roelfsema PR (2006) Cortical algorithms for perceptual grouping Annu Rev Neurosci.

Scholl BJ (2007) Object persistence in philosophy and psychology Mind & Language.

van Bergen RS, Kriegeskorte N (2020) Going in circles is the way forward: the role of recurrence in visual inference Current Opinion in Neurobiology.

Deep net representational geometries become more similar to the ventral stream as performance is optimised

[R6I7]

Seibert, Yamins, Ardila, Hong, DiCarlo, and Gardner compared a deep convolutional neural network for visual object recognition to human ventral-stream representations as measured with fMRI (PP). The network was similar to the one described in Krizhevsky et al. (2012), the network that won the ImageNet competition that year with a large increase in performance compared to previous computer vision systems. The representations in the layers of the Krizhevsky deep net and similar models have been compared to human and monkey brain representations at different stages of the ventral stream previously (Yamins et al. 2013, Yamins et al. 2014; Khaligh-Razavi & Kriegeskorte 2014; Cadieu et al. 2014; Güçlü & van Gerven 2015). The present study is consistent with the previous results, generalises this line work to an interesting new set of test images, and investigates how the representational similarity of the model layers to the brain areas evolves as model performance is optimised. Results suggest that the optimisation of recognition performance increases representational similarity to visual areas, even for early and mid-level visual areas.

Model architecture: The convolutional network was inspired by that of Krizhevsky et al. (2012), using similar convolutional filter sizes, rectified linear units, the same pooling and local normalisation procedures, and data from ImageNet for training on 1000-class categorisation. However, the input images were downsampled to a substantially smaller size (120 x 120 pixels, instead of 224 x 224 pixels). Another major modification was that two intermediate fully connected layers (which contain most of the parameters in Krizhevsky et al.’s net) were omitted. This is reported to have no significant effect on recognition performance on an independent ImageNet test set.

Training and test stimuli: Like Krizhevsky et al., Seibert et al. trained the network by backpropagation to classify objects into 1000 categories. They used the very large ImageNet set of labelled images for model training and then presented the network and two human subjects with a different set of more controlled images: 1,785 grayscale images of 3D renderings of objects in many positions and views, superimposed to random natural backgrounds.

Representational similarity analysis: The authors compared the representational dissimilarity matrices (RDMs) between model layers and brain areas. They first randomly selected 1000 model features from a given layer, then reweighted these features, stretching and squeezing the representational space along its original axes, so as to maximise the RDM correlation between the model layer and the brain region. The maximisation of the RDM correlation was performed on the basis of 15 of the images for each of the 64 objects (different positions, views, and backgrounds). Using the fitted weights, they then re-estimated the model RDMs on the basis of the other 12 position-view combinations for the same 64 objects and computed the RDM correlation (Spearman) between model layer and brain region.

ScreenShot741

Detail of Figure 1 from the paper: Grayscale stimulus images were created by superimposing 3D models to natural backgrounds. The set strikes an interesting balance between naturalness and control. There were 8 objects from each of 8 categories (animals, boats, cars, chairs, faces, fruits, planes, tables) and each object was presented in 27 or 28 different combinations of position (including entirely nonoverlapping positions), view, and natural background image. For each of the 8 x 8 = 64 objects, they averaged response patterns to all the images that contained it, so as to compute 64 x 64-entry representational dissimilarity matrices (RDMs) using 1-Pearson correlation as the distance measure.

Related previous work: This work is closely related to recent papers by Yamins et al. (2013; 2014), Khaligh-Razavi et al. (2014), Cadieu et al. (2014), and Güçlü & van Gerven (2015). Yamins et al. showed that performance-optimised convolutional network models explain primate-IT neuronal recordings, with models performing better at object recognition also better explaining IT. Khaligh-Razavi et al. compared 37 computational model representations, including the layers of the Krizhevsky et al. (2012) model and a range of popular computer vision features, to human fMRI and monkey recording data (Kiani et al. 2007) and found that only the deep convolutional net, which was extensively trained to emphasize categorical divisions, could fully explain the IT data. They also showed that early visual cortex is well accounted for by earlier layers of the deep convolutional network (and by Gabor representations and other computer vision features). Cadieu et al. (2014) showed that among 6 different models, only Krizhevsky et al. (2012) and an even more powerful deep convolutional network by Zeiler & Fergus (2013) separate the categories in the representational space to a degree comparable to IT cortex. Güçlü & van Gerven (2015) investigated to what extent each layer of the model could explain the representations in each visual area of the ventral stream, finding rough correspondences between lower, intermediate, and higher model representations and early, mid-level, and higher ventral-stream regions, respectively.

How does the present work go beyond previous studies? The most striking novel contribution of this study is the characterisation of how representational similarity to visual areas develops as the neural net’s performance is optimised from a random initialisation. Unlike Yamins et al. (2014) and Cadieu et al. (2014), this study compares a convolutional network to the human ventral stream and, unlike Khaligh-Razavi & Kriegeskorte (2014), each image was presented in many positions and views and with many different backgrounds. The data is from only two subjects, but each subject underwent 9 sessions, so the total data set is substantial. The human fMRI data set is exciting in that it systematically varies category, exemplar, and accidental properties (position, view, background). However, the authors averaged across different images of each of the objects. I wonder if this data set has further potential for future analyses that don’t average across responses to different images.

Comparing many model representations to each of the areas of the visual system is a challenge requiring multiple studies. It’s great to see another study comparing the layers (including pooling layers and intermediate convolutional stages) alongside several control models (V1-like, V2-like, HMAX), which hadn’t been compared to deep convolutional networks before.

ScreenShot756

Figure 2 from the paper: Successive stages of the human ventral stream (V1, V2, V4, LOC) are best explained by successive layers of a deep convolutional neural net model. The representational geometry in V1 most resembles that of a lower and an intermediate layer of the network. The representational geometry in V2 most resembles that of an intermediate layer. And the representational geometries of V4 and LOC most resemble that of a higher layer of the network. Categories are reflected in clusters of response patterns in V4 and even more strongly in LOC. The same holds for higher layers of the network model.

Strengths

Model predictions of brain representational geometries are analysed as a function of model performance. This nicely demonstrates that it is not just the architecture, but performance optimisation that drives successful predictions of representations across all levels of the ventral stream.
Adds to the evidence that deep convolutional neural networks can explain the feedforward component of the stagewise representational transformations in the ventral visual stream.
Rich stimulus set of 1785 images that strikes an interesting balance between naturalism and control, independently varying objects and accidental properties.
Multiple data sets in each subject. This fMRI data set could in the future support tests of a wide variety of models.

Weaknesses

Statistical procedures are not clearly described and not fully justified. What type of generalisation does the crossvalidation scheme test for? What is being bootstrapped? Why are normal and independence assumptions relied on for inference, when bootstrapping the objects would enable straightforward tests that don’t require these assumptions?
The analysis is based on average response patterns across many different images for each object. This renders results more difficult to interpret.
Only two subjects.

Overall, this is a very nice study and a substantial contribution to the literature. However, the averaging across responses elicited by different images complicates the interpretation of the results and the statistical analyses need to be improved, better described, and fully justified – as detailed below. Although the overall results described in this review appear likely to hold up, I am not confident that the inferential results for particular model comparisons are reliable. (If concerns detailed below were substantively addressed, I would consider adjusting the reliability rating.)

Issues to consider addressing in a revision

(1) Can averaged response patterns elicited by different individual images be interpreted?

If we knew a priori that a region represents the objects with perfect invariance to position, view and background, then averaging across many images of the same objects that differ in these variables would make sense. However, we know that none of the regions is really invariant to position, view, and background, and gradually achieving some tolerance is one of the central computational challenges. The averaging will have differential effects in different regions as tolerance increases along the ventral stream. I don’t understand how to interpret the RDM for V1 given that it is based on averaged patterns. The object positions and backgrounds vary widely. Presumably different images of the same object are represented totally differently in V1. The averages should then form a tighter cluster of patterns (by factor 4 after averaging 16 images). Isn’t it puzzling then that the resulting RDM is still significantly correlated with the model? To explain this, do we have to assume that V1 actually represents the objects somewhat tolerantly (perhaps through feedback)? In a high-level representation tolerant to variation of accidental properties and sensitive to categorical differences, we expect the representations of the different images for a given object to be much more similar, so the averaging would have a smaller effect. All this confusion could be avoided by analysing patterns evoked by individual images. In addition, the emergence of tolerance across stages of processing could then be characterised.

(2) What type of generalisation does the crossvalidation scheme support?

Ideally, the crossvalidation should estimate the generalisation performance of the RDM prediction from the model for new images showing different objects. This is not the case here.

First, it appears that the brain data used for training (model weight fitting) and test (estimation of RDM correlation) are responses to the same set of images (all images). The weighting of the model features is estimated using a subset of 15 of the images for each object, and the RDM correlation between model and brain data assessed using 12 different images (different poses and backgrounds) of the same objects. This would seem to fall short of a test of generalisation to new images (even of the same objects) because all images are used (on the side of the brain responses) in the training procedure. Please clarify this issue.
Second, even if there was no overlap in the images used in training and test (on the side of either the model fitting or the brain data), the models are overfitted to the object set. Ideally, nonoverlapping sets of images of different objects should be used for training and testing. How about using a random subset of 4 of the objects in each category (32 in total) for fitting the weights and the other 32 objects for estimating the RDM correlation?

Overall, it seems unclear what type of generalisation these analyses test for. Let’s consider the issue of overfitting to the object set more closely. Currently, the weights w are fitted to 15 of the 27 images for a given object. In an idealised high-level representation invariant to the accidental variation, the two image sets will be identically represented. We expect the object representations to be in general position (no two on a point, no three on a line, no four on a plane and so on). Even if the 64 object representations were not at all clustered by category, but instead distributed randomly, we could linearly read out any categorical distinction and the decoder would generalise to the other 12 images. This is just to illustrate the expected effect of overfitting to the object set. In the present study, weights were fitted to predict RDMs not to discriminate categories. Fitting the 1000 scaling parameters to explain an RDM with 64*(64-1)/2 = about 2K dissimilarities should enable us to fit any RDM quite precisely. I would not be surprised if a noncategorical representation could fit a clearly categorical representation (block-diagonal RDM) in this context. The test-set correlations would then really just be a measure of the replicability of the brain RDMs – rather than a measure of the fit of the model. Regularisation might help ensure that different models are still distinguishable, but it also further complicates interpretation (see below).

Since higher regions are more tolerant, the training and test images are more similarly represented in these regions, and so we would expect greater positive overfitting bias on the estimated RDM correlation for higher regions regardless of the model. It is reassuring that the models still perform differently in LOC. However, the overfitting to the object set complicates interpretation.

The category decoding performance measure is similarly compromised by averaging across different images. Decoding performance as well (if I understood correctly) was tested by averaging different images for each object and training and testing on the same set of objects with different particular images in the test set. So the test is not a test of generalisation to different objects but to different images of the same objects. Again any representation uniquely representing each object (and having at least as many dimensions as the number of objects in both classes combined minus one, which is the case here) will appear to support linear category decoding, even if the distributions in representational space corresponding to the two categories (including the entire populations of objects they comprise) were not at all linearly separable and across-object generalisation performance were at chance level.

(3) Clarify the bootstrapping procedure used in model comparison

The first 6 times the term bootstrap is used, it is entirely unclear what entities are being resampled with replacement. The sampling of 1000 model units is explained in this context, and suggests that this is the resampling with replacement referred to as bootstrapping. Only on page 15 it says: “Our approach bootstraps over independent stimulus samples”. I’m not sure what multiple independent stimulus samples are meant here. Are the objects (averaged across images) resampled? Or are the images resampled? (The latter would necessitate re-estimating the object-average voxel responses to each object for each bootstrap sample.)

(4) Clarify and justify the test used for model comparison

The methods section states:

“Using the bootstrapping above, we computed p-values testing if Layer A better explained visual area X’s RDM than Layer B”

This suggests that a bootstrap test was used to compare models with respect to their RDM prediction performance. But then the model comparison test is described as follows:

“We use Fisher’s r-to-z transformation using Steiger (1980)’s approach to compute p-values for difference in correlation values (Lee and Preacher, 2013). The approach tests for equality of two correlation values from the same sample where one variable is held in common between the two coefficients (in our case, an RDM of a given visual area).”

The Steiger (1980) method for comparing two dependent correlations assumes that the elements of the correlated vectors are sampled independently. As you acknowledge, this is violated for dissimilarities in an RDM. But why then is the Steiger method appropriate? You mention bootstrapping, but don’t explain how your bootstrapping procedure interacts with the Steiger method. Kriegeskorte et al. (2008) and Nili et al. (2014) describe a bootstrapping approach to RDM model comparison that takes the dependencies between dissimilarities into account and does not rely on the Steiger method. Ideally, the objects, not the images, should be resampled with replacement, to simulate variation across objects (not across images of the same objects) and to avoid re-estimation of object-average patterns. Finally, it would be good to use model-comparative inference to support the improvement the RDM explanation of ventral stream regions as performance is optimised by training with backpropagation.

(5) Reconsider the regularisation used in feature-weight fitting

The one-iteration optimisation (motivated as a variant of early stopping) is a very ad-hoc choice of regulariser. I have no idea what prior is implicit to this method. However, this implicit prior is part of the model you are testing and affects the model comparison results. It is even a key component of the model because you are fitting so many parameters that different models might not be distinguishable without this prior.

(6) Show full inferential results with correction for multiple testing

It would be great if the figures showed which RDM correlations are significant and which pairs are significantly different. In addition, it would be good to account for multiple testing. Nonparametric methods for testing and comparing RDM correlations are described in Nili et al. (2014).

(7) Show noise ceilings

It would be good to see whether the model layers fully or only partially explain the explainable component of the variance in the RDMs. This could yield the insight that the model does fall short given the present study’s data set. It would be interesting then to learn how it falls short and this would motivate future changes to the model. Alternatively, if the model reaches the noise ceiling, we would learn that we need to get better or more data to find out how the model still falls short. The methods section suggests that a noise ceiling was estimated for Figure 6, stating:

“To avoid the problem of finding linear re-weightings using smaller sub-sets of our data, we instead computed noise ceilings and percent explained variance values (Figure 6) without using the weighting procedure described above. Noise ceilings for each visual area were computed by splitting the runs of our data into two non-overlapping groups. With each group, we estimated stimulus responses (beta weights) using the procedure described above (see the Image responses section) and computed object-averaged RDMs for each visual area. We used the correlation between the RDM from each of the two groups as our noise ceiling for percent explained variance estimates (Figure 6).”

However, Figure 6 and its legend don’t mention a noise ceiling. What is the noise ceiling in these analyses? Figure 2 would also benefit from noise ceilings for each of the brain areas. In addition, the split-half correlation should underestimate the noise ceiling because half the data is used and both RDMs are affected by noise. The noise ceiling computation should instead give an estimate of the expected performance of a noiseless true model or upper and lower bounds on this performance (Nili et al. 2014, Khaligh-Razavi & Kriegeskorte 2015).

(8) What is the function of the two branches of the model?

Clarify the function of the two branches of the model in the legend of Figure 5 and in the methods section. A single GPU was used for training here. Did this serve to keep the architecture consistent with Krizhevsky et al. (2012)?

(9) Why are the ROIs so big?

As far as I remember, a normal size for LO or FFA is below 1 ml. LO1, LO2, and FFA have 234, 299, and 292 voxels (pooled across two subjects), corresponding to 3 to 4 ml on average across subjects (given that voxels were 3 mm isotropic).

(10) Add a colour legend to Figure 4

This would help the reader quickly understand the meaning of the lines without having to refer to the text description.

Do view-invariant brain representations of actions arise within 200 ms of viewing?

[R7I7]

Humans can rapidly visually recognise the actions people around them are engaged in and this ability is important for successful behaviour and social interaction. Isik et al. presented human subjects with 2-second video clips of humans performing actions while measuring brain activity with MEG. The clips comprised 5 actions (walk, run, jump, eat, drink) performed by each of five different actors and video-recorded from each of five different views (only frontal and profile used in MEG). Results show that action can be decoded from MEG signals arising about 200 ms after the onset of the video, with decoding accuracy peaking after about 500 ms and then decaying while the stimulus is still on, with a rebound after stimulus offset. Moreover, decoders generalise across actors and views. The authors conclude that the brain rapidly computes a representation that is invariant to view and actor.

ScreenShot738

Figure from the paper. Legend from the paper with my modifications in brackets: [Accuracy of action decoding (%) from MEG data as a function of time after video onset]. We can decode [which of five actions was being performed in the video clip] by training and testing on the same view (‘within-view’ condition), or, to test viewpoint invariance, training on one view (0 degrees [frontal, I think, but this should be clarified] or 90 degrees [profile]) and testing on the second view (‘across view’ condition). Results are each [sic] from the average of eight different subjects. Error bars represent standard deviation [across what?]. Horizontal line indicates chance decoding accuracy. […] Lines at the bottom of plot indicate significance with p<0.01 permutation test, with the thickness of the line indicating [for how many of the 8 subjects decoding was significant]. [Note the significant offset response after the 2-s video (whose duration should be indicated by a stimulus bar).]

The rapid view-invariant action decoding is really quite amazing. It would be good to see more detailed analyses to assess the nature of the signals enabling this decoding feat. Of course, 200 ms already allows for recurrent computations and the decodability peak is at 500 ms, so this is not strong evidence for a pure feedforward account.

The generalisation across actors is less surprising. This was a very controlled data set. Despite some variation in the appearance of the actors, it seems plausible that there would be some clustering of the vectors of pixels across space and time (or of features of a low-level visual representation) corresponding to different actors performing the same action seen from the same angle.

In separate experiments, the authors used static single frames taken from the videos and dynamic point-light figures as stimuli. These reduced form-only and motion-only stimuli were associated with diminished separation of actions in the human brain and in model representations, and with diminished human action recognition, suggesting that form and motion information are both essential to action recognition.

I’m wondering about the role of task-related priors. Subjects were performing an action recognition task on this controlled set of brief clips during MEG while freely viewing the clips (though this is not currently clearly stated). This task is likely to give rise to strong prior expectations about the stimulus (0 deg or 90 deg, one of five actions, known scale and positions of key features for action discrimination). Primed to attend to particular diagnostic features and to fixate in certain positions, the brain will configure itself for rapid dynamic discrimination among the five possible actions. The authors present a group-average analysis of eye movements, suggesting that these do not provide as much information about the actions as the MEG signal. However, the low-dimensional nature of the task space is in contrast to natural conditions, where a wider variety of actions can be observed and view, actor, size, and background vary more. The precise prior expectations might contribute to the rapid discriminability of the actions in the brain signals.

The authors model the results in the framework of feedforward processing in a Hubel-and-Wiesel/Poggio-style model that alternates convolution and max-pooling to simulate responses resembling simple and complex cells, respectively. This model is extended here to process video using spatiotemporal filter templates. The first layer uses Gabor filters, higher layers use templates in the first layer matching video clips in the stimulus set. The authors argue that this model supports invariant decoding and largely accounts for the MEG results.

Like the subjects, the model is set up to process the restricted stimulus space. The internal features of the model were constructed using representational fragments from samples from the same parametric space of videos. The exact videos used to test the models were not used for constructing the feature set. However, if I understood correctly, videos from the same restricted space (5 actions, 5 actors, 5 views) were used. Whether the model can be taken to explain (at a high level of abstraction) the computations performed by the brain depends critically on the degree to which the model is not specifically constructed for the (necessarily very limited) 5-action controlled stimulus space used in the study.

As the authors note, humans still outperform computer vision models at action recognition. How does the authors’ own model perform on less controlled action videos? If it the model cannot perform the task on real-world sensory input, can we be confident that it captures the way that the human brain performs the task? This is a concern in many studies and not trivial to address. However, the interpretation of the results should engage this issue.

Strengths

Controlled stimulus set: The set of video stimuli (5 actions x 5 actors x 5 views x 26 clips = 3250 2-sec clips) is impressive. Assembling this set is an important contribution to the field. The set is condition-rich (compared to typical stimulus sets used in cognitive neuroscience studies) and seems to strike a good balance between control and realism. This set could be a driver of progress if it were to be used in multiple modelling and empirical studies.
Combination of brain-activity measurements and a simple computational model, which provides a useful starting point for modelling the recognition of dynamic actions, as it is minimal and standard in many respects: a feedforward model in the HMAX framework, extended from spatial to spatiotemporal filters.

Weaknesses

Controlled stimulus set: The set of video stimuli is very restricted compared to real-world action recognition. For the brain data, this means that subjects might have rapidly formed priors about the stimuli, enabling them to configure their visual systems (e.g. attentional templates, fixation targets) for more rapid recognition of the 5 actions than is possible in real-world settings. This limitation is shared with many studies in our field and difficult to overcome without giving up control (which is a strength, see above). I therefore suggest addressing this problem in the discussion.
The model uses features based on spatiotemporal patterns sampled from the same restricted stimulus space. Although non-identical clips were used, the videos underlying the representational space appear to share a lot with the experimental stimuli (same 5 actions, same 5 views, same background?, same actors?). I would therefore not expect this model to work well on arbitrary real-world action video clips. This is in contrast to recent studies using deep convolutional neural nets (e.g. Khaligh-Razavi & Kriegeskorte 2014), where the models were trained without any information about the (necessarily restricted) brain-experimental stimulus set and can perform recognition under real-world conditions.
Only one model (in two variants) is tested. In order to learn about computational mechanism, it would be good to test more models.
MEG data were acquired during viewing of only 50 of the clips (5 actions x 5 actors x 2 views).
Missing inferential analyses: While the authors employ inferential analyses in single subjects and report number of significant subjects, few hypotheses of interest are formally statistically tested. The effects interpreted appear quite strong, so the results described above appear solid nevertheless (interpretational caveats notwithstanding).

Overall evaluation

This is an ambitious study describing results of a well-designed experiment using a stimulus set that is a major step forward. The results are an interesting and substantial contribution to the literature. However, the analyses could be much richer than they currently are and the interpretation of the results is not straightforward. Stimulus-set-induced priors may have affected both the neural processing measured and the model (which used templates from stimuli within the controlled video set). Results should be interpreted more cautiously in this context.

Although feedforward processing is an important part of the story, it is not the whole story. Recurrent signal flow is ubiquitous in the brain and essential to brain function. In engineering, similarly, recurrent neural networks are beginning to dominate spatiotemporal processing challenges such as speech and video recognition. The fact that the MEG data are presented as time courses, revealing a rich temporal structure, and the model analyses are bar graphs illustrates the key limitation of the model.

It would be great to extend the analyses to reveal a richer picture of the temporal dynamics. This should include an analysis of the extent to which each model layer can explain the representational geometry at each latency from stimulus onset.

Future directions

In revision or future studies, this line of work could be extended in a number of ways:

Use multiple models that can handle real-world action videos. The authors’ controlled video set is extremely useful for testing human and model representations, and for comparing humans to models. However, to be able to draw strong conclusions, the models, like the humans, would have to be trained to recognise human actions under real-world conditions (unrestricted natural video). In addition, it would be good to compare the biological representational dynamics to both feedforward and recurrent computational models.
To overcome the problem of stimulus-set related priors, which make it difficult to compare representational dynamics measured for restricted stimulus sets to real-world recognition in biological brains, one could present a large set of stimuli without ever presenting a stimulus twice to the same subject. Would the action category still be decodable at 150 ms with generalisation across views? Would a feedforward computer vision model trained on real-world action videos be able to predict the representational dynamics?
The MEG analyses could use source reconstruction to enable separate analyses of the representational dynamics in different brain regions.
It would be useful to have MEG data for the full stimulus set of 5 actions x 5 actors x 5 viewpoints = 125 conditions. The representational geometries could be analysed in detail to reveal which particular action pairs become discriminable when with what level of invariance.

Particular suggestions for improvements of this paper

(1) Present more detailed results

It would be good to see results separately for each pair of actions and each direction of crossdecoding (0 deg training -> 90 deg testing, and 90 deg training -> 0 deg testing). Regarding the former, eating and drinking involve very similar body postures and motions. Is this reflected in the discriminability of these actions?

Regarding, the decoding generalisation across views, you state:

“We decoded by training only on one view (0 degrees or 90 degrees), and testing on a second view (0 degrees or 90 degrees).”

Was the training set exclusively composed on 0 degree (frontal?) and the test set exclusively of 90 degree (side view?), and vice versa? In case the test set contained instances of both views (though of course, not for the same actor and action), results are more difficult to interpret.

(2) Discuss the caveats to the current interpretation of the results

Discuss the question whether priors resulting from subjects understanding of the restricted stimulus set might have affected the processing of the stimuli. Consider the involvement of recurrent computations within 200 ms and discuss the continuing rise of decodability until 500 ms. Discuss the possibility that the model will not generalise to action recognition in the wild.

(3) Test several control models

Can Gabor, HMAX, and deep convolutional neural net models support similarly invariant action decoding? These models are relatively easy to test, so I think it’s worth considering this for revision. Computer vision models trained on dynamic action recognition could be left to future studies.

(4) Test models by comparison of its representations with the brain representations

The computational model is currently only compared to the data at the very abstract level of decoding accuracy. Can the model predict the representations and representational dynamics in detail? It might be difficult to use the model to predict the measured channels. This would require the fitting of a linear model predicting the measured channels from the model units and the MEG data (acquired for only 5 actions x 5 actors x 2 views = 50 conditions) might be insufficient. However, representational dynamics could be investigated in the framework of representational similarity analysis (50 x 50 representational dissimilarity matrices) following Carlson et al. (2013) and Cichy et al. (2014). Note that this approach does not require fitting a prediction model and so appears applicable here. Either approach would reveal the dynamic prediction of the feedforward model (given dynamic inputs) and where its prediction diverges from the more complex and recurrent processes in the brain. This would promise to give us a richer and less purely confirmatory picture of the data and might show the merits and limitations of a feedforward account.

(5) Perform temporal cross-decoding

Temporal crossdecoding (Carlson et al. 2013, Cichy et al. 2014) could be used to more richly characterise the representational dynamics. This would reveal whether representations stabilise in certain time windows, or keep tumbling through the representational space even as stimuli are continuously decodable.

(6) Improve the inferential analyses

I don’t really understand the inference procedure in detail from the description in the methods section.

“We recorded the peak decoding accuracy for each time bin,…”

What is the peak decoding accuracy for each time bin? Is this the maximum accuracy across subjects for each time bin?

“…and used the null distribution of peak accuracies to select a threshold where decoding results performing above all points in the null distribution for the corresponding time point were deemed significant with P < 0.01 (1/100).”

I’m confused after reading this, because I don’t understand what is meant by “peak”.

The inference procedure for the decoding-accuracy time courses seems to lack formal multiple-testing correction across time points. Given enough subjects, inference could be performed with subject as a random effect. Alternatively, fixed-effects inference could be performed by permutation, averaging across subjects. Multiple testing across latencies should be formally corrected for. A simple way to do this is to relabel the experimental events once, compute an entire decoding time course, and record the peak decoding accuracy across time (or if this is what was done, it should be clearly described). Through repeated permutations, a null distribution of peak accuracies can be constructed and a threshold selected that is exceeded anywhere under H0 with only 5% probability, thus controlling the familywise error rate at 5%. This threshold could be shown as a line or as the upper edge of a transparent rectangle that partially obscures the insignificant part of the curve.

For each inferential analysis, please describe exactly what the null hypothesis was, what event-labels are exchangeable under this null hypothesis, and how the null distribution was computed. Also, explain how the permutation test interacted with the crossvalidation procedure. The crossvalidation should ideally generalise to new stimuli and label permutation be wrapped around this entire procedure.

“Decoding analysis was performed using cross validation, where the classifier was trained on a randomly selected subset of 80% of data for each stimulus and tested on the held out 20%, to assess the classifier’s decoding accuracy.”

Does this apply only to the within-view decoding? In the critical decoding analysis with generalisation across views, it cannot have been 20% of the data in the held-out set, since 0-deg views were used for training and 90-deg views for testing (and vice versa). If only 50% of the data were used for training there, why didn’t performance suffer given the smaller training set compared to the within-view decoding?

It would also be good to have estimates and inferential comparisons of the onset and peak latencies of the decoding time courses. Inference could be performed on a set of single-subject latency differences between two conditions modelling subject as a random effect.

(7) Qualify claims about biological fidelity of the model

The model is not really “designed to closely mimic the biology of the visual system”, rather its architecture is inspired by some of the features of the feedforward component of the visual hierarchy, such as local receptive fields of increasing size across a hierarchy of representations.

(8) Open stimuli and data

This work would be especially useful to the community if the video stimuli and the MEG data were made openly available. To fully interpret the findings, it would also be good to be able to view the movie clips online.

(9) Further clarify the title

The title “Fast, invariant representation for human action in the visual system” is somewhat unclear. What is meant are representations of perceived human actions, not representations for action. “Fast, invariant representation of visually perceived human actions” would be better, for example.

(10) Clarify what stimuli MEG data were acquired for

The abstract states “We use magnetoencephalography (MEG) decoding and a computational model to study action recognition from a novel dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by five actors at five viewpoints.” This suggests that MEG data were acquired for all these conditions. The methods section clarifies that MEG data were only recorded for 50 conditions (5 actions x 5 actors x 2 views). Here and in the legend of Fig. 1, it would be better to use the term “stimulus set” in place of “data set”.

(11) Clarify whether subjects were fixating or free viewing

Were subjects free viewing or fixating? This should be explicitly stated and the choice motivated in either case.

(12) Make figures more accessible

The figures are not optimal. Every panel showing decoding results should be clearly labelled to state what variables the crossvalidation procedure tested for generalisations across. For example, a label (in the figure itself!) could be: “decoding brain representations of actions with invariance to actor and view”. The reader shouldn’t have to search in the legend to find this essential information. Also every figure should have a stimulus bar depicting the period of stimulus presence. This is important especially to assess stimulus-offset-related effects, which appear to be present and significant.

Fig. 3 is great. I think it would be clearer to replace “space-time dot product” with “space-time convolution”.

(13) Clarify what the error bars represent

“Error bars represent standard deviation.”

Is this the standard deviation across the 8 subjects? Is it really the standard deviation or the standard error?

(14) Clarify what we learn from the comparison between the structured and the unstructured model

For the unstructured model, won’t the machine learning classifier learn to select random combinations that tend to pool across different views of one action? This would render the resulting system computationally similar.

	Anna Ivanova on What type of linear or nonline…
	BCI papers – T… on From bidirectional brain-compu…
	Firsts: Submitting a… on The selfish scientist’s…
	On optimal measures… on What’s the best measure of rep…
	What the “Kany… on What’s the best measure of rep…