Is a cow-mug a cow to the ventral stream, and a mug to a deep neural network?

[I7 R7]

An elegant new study by Bracci, Kalfas & Op de Beeck (pp2018) suggests that the prominent division between animate and inanimate things in the human ventral stream’s representational space is based on a superficial analysis of visual appearance, rather than on a deeper analysis of whether the thing before us is a living thing or a lifeless object.

Bracci et al. assembled a beautiful set of stimuli divided into 9 equivalent triads (Figure 1). Each triad consists of an animal, a manmade object, and a kind of hybrid of the two: an artefact of the same category and function as the object, designed to resemble the animal in the triad.

Figure 1: The entire set of 9 triads = 27 stimuli. Detail from Figure 1 of the paper.

 

Bracci et al. measured response patterns to each of the 27 stimuli (stimulus duration: 1.5 s) using functional magnetic resonance imaging (fMRI) with blood-oxygen-level-dependent (BOLD) contrast and voxels of 3-mm width in each dimension. Sixteen subjects viewed the images in the scanner while performing each of two tasks: categorizing the images as depicting something that looks like an animal or not (task 1) and categorizing the images as depicting a real living animal or a lifeless artefact (task 2).

The authors performed representational similarity analysis, computing representational dissimilarity matrices (RDMs) using the correlation distance (1 – Pearson correlation between spatial response patterns). They averaged representational dissimilarities of the same kind (e.g. between the animal and the corresponding hybrid) across the 9 triads. To compare different kinds of representational distance, they used ANOVAs and t tests to perform inference (treating the subject variable as a random effect). They also studied the representations of the stimuli in the last fully connected layers of two deep neural networks (DNNs; VGG-19, GoogLeNet) trained to classify objects, and in human similarity judgments. For the DNNs and human judgments, they used stimulus bootstrapping (treating the stimulus variable as a random effect) to perform inference.
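The core RDM computation can be sketched in a few lines (a toy example with simulated response patterns; the data, dimensions, and variable names are assumptions for illustration, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_voxels = 27, 100  # 27 stimuli (9 triads), hypothetical ROI size
patterns = rng.standard_normal((n_stimuli, n_voxels))  # stimuli x voxels

# RDM entry (i, j) = 1 - Pearson correlation between patterns i and j;
# np.corrcoef treats each row as a variable, giving a 27 x 27 matrix
rdm = 1.0 - np.corrcoef(patterns)
```

In the actual analysis, each row of `patterns` would be the BOLD response pattern to one stimulus across the voxels of an ROI.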

Results of a series of well-motivated analyses are summarized in Figure 2 below (not in the paper). The most striking finding is that while human judgments and DNN last-layer representations are dominated by the living/nonliving distinction, human ventral temporal cortex (VTC) appears to care more about appearance: the hybrid animal-lookalike objects, despite being lifeless artefacts, fall closer to the animals than to the objects. In addition, the authors find:

  • Clusters of animals, hybrids, and objects: In VTC, animals, hybrids, and objects form significantly distinct clusters (average within-cluster dissimilarity < average between-cluster dissimilarity for all three pairs of categories). In DNNs and behavioral judgments, by contrast, the hybrids and the objects do not form significantly distinct clusters (but animals form a separate cluster from hybrids and from objects).
  • Matching of animals to corresponding hybrids: In VTC, the distance between a hybrid animal-lookalike and the corresponding animal is significantly smaller than that between a hybrid animal-lookalike and a non-matching animal. This indicates that VTC discriminates the animals and animal-lookalikes and (at least to some extent) matches the lookalikes to the correct animals. This effect was also present in the similarity judgments and DNNs. However, the latter two similarly matched the hybrids up with their corresponding objects, which was not a significant effect in VTC.

 

Figure 2: A qualitative visual summary of the results. Connection lines indicate different kinds of representational dissimilarity, illustrated for two triads although estimates and tests are based on averages across all 9 triads. Gray underlays indicate clusters (average within-cluster dissimilarity < average between-cluster dissimilarity, significant). Arcs indicate significantly different representational dissimilarities. It would be great if the authors added a figure like this in the revision of the paper. However, unlike the mock-up above, it should be a quantitatively accurate multidimensional scaling (MDS, metric stress) arrangement, ideally based on unbiased crossvalidated representational dissimilarity estimates.

 

The effect of the categorization task on the VTC representation was subtle or absent, consistent with other recent studies (cf. Nastase et al. 2017, open review). The representation appears to be mostly stimulus driven.

The results of Bracci et al. are consistent with the idea that the ventral stream transforms images into a semantic representation by computing features that are grounded in visual appearance, but correlated with categories (Jozwik et al. 2015). VTC might be 5-10 nonlinear transformations removed from the image. While it may emphasize visual features that help with categorization, it might not be the stage where all the evidence is put together for our final assessment of what we’re looking at. VTC, thus, is fooled by these fun artefacts, and that might be what makes them so charming.

Although this interpretation is plausible enough and straightforward, I am left with some lingering thoughts to the contrary.

What if things were the other way round? Instead of DNNs judging correctly where VTC is fooled, what if VTC had a special ability that the DNNs lack: to see the analogy between the cow and the cow-mug, to map the mug onto the cow? The “visual appearance” interpretation is based on the deceptively obvious assumption that the cow-mug (for example) “looks like” a cow. One might, equally compellingly, argue that it looks like a mug: it’s glossy, it’s conical, it has a handle. VTC, then, does not fail to see the difference between the fake animal and the real animal (in fact these categories do cluster in VTC). Rather it succeeds at making the analogy, at mapping that handle onto the tail of a cow, which is perhaps an example of a cognitive feat beyond current AI.

Bracci et al.’s results are thought-provoking and the study looks set to inspire computational and empirical follow-up research that links vision to cognition and brain representations to deep neural network models.

 

Strengths

  • addresses an important question
  • elegant design with beautiful stimulus set
  • well-motivated and comprehensive analyses
  • interesting and thought-provoking results
  • two categorization tasks, promoting either the living/nonliving or the animal-appearance/non-animal appearance division
  • behavioral similarity judgment data
  • information-based searchlight mapping, providing a broader view of the effects
  • new data set to be shared with the community

 

Weaknesses

  • representational geometry analyses, though reasonable, are suboptimal
  • no detailed analyses of DNN representations (only the last fully connected layers shown, which are not expected to best model the ventral stream) or the degree to which they can explain the VTC representation
  • only three ROIs (V1, posterior VTC, anterior VTC)
  • correlation distance used to measure representational distances (making it difficult to assess which individual representational distances are significantly different from zero, which appears important here)

 

Suggestions for improvement

The analyses are effective and support most of the claims made. However, to push this study from good to excellent, I suggest the following improvements.

 

Major points

Improved representational-geometry analysis

The key representational dissimilarities needed to address the questions of this study are labeled a-g in Figure 2. It would be great to see these seven quantities estimated, tested for deviation from 0, and all 7 choose 2 = 21 pairwise comparisons tested. This would address which distinctions are significant and enable addressing all the questions with a consistent approach, rather than combining many qualitatively different statistics (including clustering index, identity index, and model RDM correlation).
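A sketch of such a consistent testing scheme (the per-subject dissimilarity estimates below are simulated placeholders; only the structure of the tests is the point):

```python
import itertools

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects = 16
labels = list("abcdefg")  # the seven dissimilarities labeled a-g in Figure 2

# simulated per-subject dissimilarity estimates (placeholder numbers)
estimates = {k: 0.5 + rng.standard_normal(n_subjects) for k in labels}

# each dissimilarity tested for deviation from 0, subject as random effect
zero_tests = {k: stats.ttest_1samp(v, 0.0) for k, v in estimates.items()}

# all 7 choose 2 = 21 pairwise comparisons, via paired t tests
pairwise = {(a, b): stats.ttest_rel(estimates[a], estimates[b])
            for a, b in itertools.combinations(labels, 2)}
```

In practice one would of course correct for multiple comparisons across the 21 tests (e.g. Bonferroni or FDR), or use nonparametric alternatives.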

With the correlation distance, this would require a split-data RDM approach, consistent with the present approach, but using the repeated response measurements to the same stimulus to estimate and remove the positive bias of the correlation-distance estimates. However, a better approach would be to use a crossvalidated distance estimator (more details below).

 

Multidimensional scaling (MDS) to visualize representational geometries

This study has 27 unique stimuli, a number well suited for visualization of the representational geometries by MDS. To appreciate the differences between the triads (each of which has unique features), it would be great to see an MDS of all 27 objects and perhaps also MDS arrangements of subsets, e.g. each triad or pairs of triads (so as to reduce distortions due to dimensionality reduction).

Most importantly, the key representational dissimilarities a-g can be visualized in a single MDS as shown in Figure 2 above, using two triads to illustrate the triad-averaged representational geometry (showing average within- and between-triad distances among the three types of object). The MDS could use 2 or 3 dimensions, depending on which variant better visually conveys the actual dissimilarity estimates.
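As a sketch of the proposed visualization (random patterns stand in for the real data; scikit-learn's SMACOF-based metric MDS is one reasonable implementation):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
patterns = rng.standard_normal((27, 50))  # 27 stimuli x hypothetical voxels
rdm = 1.0 - np.corrcoef(patterns)         # correlation-distance RDM

# metric MDS (metric stress) on the precomputed dissimilarity matrix
mds = MDS(n_components=2, dissimilarity="precomputed", metric=True,
          random_state=0)
coords = mds.fit_transform(rdm)           # 2D arrangement for plotting
```

Plotting `coords` with the category markers and connection lines of Figure 2 would give the proposed quantitatively accurate arrangement.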

 

Crossvalidated distance estimators

The correlation distance is not an ideal dissimilarity measure because a large correlation distance does not indicate that two stimuli are distinctly represented. If a region does not respond to either stimulus, for example, the correlation of the two patterns (due to noise) will be close to 0 and the correlation distance will be close to 1, a high value that can be mistaken as indicating a decodable stimulus pair.

Crossvalidated distances such as the linear-discriminant t value (LD-t; Kriegeskorte et al. 2007, Nili et al. 2014) or the crossnobis distance (also known as the linear discriminant contrast, LDC; Walther et al. 2016) would be preferable. Like decoding accuracy, they use crossvalidation to remove bias (due to overfitting) and indicate that the two stimuli are distinctly encoded. Unlike decoding accuracy, they are continuous and nonsaturating, which makes them more sensitive and a better way to characterize representational geometries.

Since the LD-t and the crossnobis distance estimators are symmetrically distributed about 0 under the null hypothesis (H0: response patterns drawn from the same distribution), it would be straightforward to test these distances (and averages over sets of them) for deviation from 0, treating subjects and/or stimuli as random effects, and using t tests, ANOVAs, or nonparametric alternatives. Comparing different dissimilarities or set-average dissimilarities is similarly straightforward.
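A minimal sketch of why crossvalidation removes the positive bias (toy patterns; the estimator below is the crossvalidated squared Euclidean distance, i.e. the crossnobis distance without the noise normalization):

```python
import numpy as np

rng = np.random.default_rng(3)
n_voxels = 60

def crossval_distance(a1, b1, a2, b2):
    """Crossvalidated squared distance between conditions a and b,
    computed across two independent data partitions (1 and 2).
    Unbiased: its expected value is 0 when a and b share a pattern."""
    return (a1 - b1) @ (a2 - b2) / len(a1)

base = rng.standard_normal(n_voxels)
noise = lambda: 0.5 * rng.standard_normal(n_voxels)

# identical true patterns: estimates fluctuate around 0 (no positive bias)
d_null = np.mean([crossval_distance(base + noise(), base + noise(),
                                    base + noise(), base + noise())
                  for _ in range(2000)])

# distinct true patterns: estimates are positive on average
pa, pb = rng.standard_normal(n_voxels), rng.standard_normal(n_voxels)
d_diff = np.mean([crossval_distance(pa + noise(), pb + noise(),
                                    pa + noise(), pb + noise())
                  for _ in range(2000)])
```

Because the noise in the two partitions is independent, its contribution averages out, which is exactly what the correlation distance fails to achieve.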

 

Linear crossdecoding with generalization across triads

An additional analysis that would give complementary information is linear decoding of categorical divisions with generalization across stimuli. A good approach would be leave-one-triad-out linear classification of:

  • living versus nonliving
  • things that look like animals versus other things
  • animal-lookalikes versus other things
  • animals versus animal-lookalikes
  • animals versus objects
  • animal-lookalikes versus objects

This might work even for divisions that do not show clustering (within dissimilarity < between dissimilarity), which would indicate linear separability in the absence of compact clusters.

For the living/nonliving distinction, for example, the linear discriminant would select responses that are not confounded by animal-like appearance (as most VTC responses seem to be), responses that distinguish living things from animal-lookalike objects. This analysis would provide a good test of the existence of such responses in VTC.
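The leave-one-triad-out scheme can be sketched as follows (simulated patterns with a hypothetical "living" signal component; labels, dimensions, and signal strength are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_triads, n_voxels = 9, 80

# hypothetical signal: a "living" pattern component present only in animals
living_axis = rng.standard_normal(n_voxels)
X, y, triad = [], [], []
for t in range(n_triads):
    for is_living in (1, 0, 0):  # animal, hybrid animal-lookalike, object
        X.append(rng.standard_normal(n_voxels) + is_living * living_axis)
        y.append(is_living)
        triad.append(t)
X, y, triad = np.array(X), np.array(y), np.array(triad)

# leave-one-triad-out: train on 8 triads, test on the held-out triad
accs = []
for t in range(n_triads):
    train, test = triad != t, triad == t
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))
mean_acc = float(np.mean(accs))
```

Because the test triad's animal, hybrid, and object are all unseen during training, above-chance accuracy implies a division that generalizes across exemplars rather than memorized stimulus identities.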

 

More layers of the two DNNs

To use the DNNs to assess the hypothesis that VTC computes features that are more visual than semantic, it would be useful to include an analysis of all the layers of each of the two DNNs, and to test whether weighted combinations of layers can explain the VTC representational geometry (cf. Khaligh-Razavi & Kriegeskorte 2014).
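One way to sketch the layer-reweighting analysis (the RDMs here are random stand-ins; in practice they would come from each DNN layer and from VTC, with crossvalidation over stimuli to avoid overfitting the weights; nonnegative least squares is one reasonable fitting choice):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(5)
n = 27
tri = np.triu_indices(n, k=1)   # vectorize the upper triangle of each RDM

def random_rdm_vector():
    p = rng.standard_normal((n, 40))
    return (1.0 - np.corrcoef(p))[tri]

# hypothetical layer RDMs; a "brain" RDM mixing two layers with known weights
layers = np.stack([random_rdm_vector() for _ in range(5)], axis=1)
brain = 0.7 * layers[:, 1] + 0.3 * layers[:, 3]

# nonnegative least squares recovers the mixing weights
weights, residual = nnls(layers, brain)
```

With real data the fit would be evaluated on held-out stimuli, and the weight profile across layers would indicate whether VTC is better explained by intermediate (more visual) or late (more semantic) DNN representations.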

 

More ROIs

How do these effects look in V2, V4, LOC, FFA, EBA, and PPA?

 

Minor points

The use of the term “bias” in the abstract and main text is nonstandard and didn’t make sense to me. Bias only makes sense when we have some definition of what the absence of bias would mean. Similarly the use of “veridical” in the abstract doesn’t make sense. There is no norm against which to judge veridicality.

 

The polar plots are entirely unmotivated. There is no cyclic structure or even a meaningful order to the 9 triads.

 

“DNNs are very good, and even better than human visual cortex, at identifying a cow-mug as being a mug — not a cow.” This is not a defensible claim for several reasons, each of which by itself suffices to invalidate it.

  • fMRI does not reveal all the information in cortex.
  • VTC is not all of visual cortex.
  • VTC does cluster animals separately from animal-lookalikes and from objects.
  • Linear readout of animacy (crossvalidated across triads) might further reveal that the distinction is present (even if it is not dominant in the representational geometry).

 

 

Grammar, typos

“how an object looks like” -> “how an object looks” or “what an object looks like”

“as oppose to” -> “as opposed to”

“where observed” -> “were observed”

 


Is the radial orientation-preference map in V1 an artefact of “vignetting”?

[I6 R8]

The orientation of a visual grating can be decoded from fMRI response patterns in primary visual cortex (Kamitani & Tong 2005, Haynes & Rees 2005). This was surprising because fMRI voxels in these studies are 3 mm wide in each dimension and thus average over many columns of neurons that respond to different orientations. Since then, many studies have sought to clarify why fMRI orientation decoding works so well.

The first explanation given was that even though much of the contrast of the neuronal orientation signals might cancel out in the averaging within each voxel, any given voxel might retain a slight bias toward certain orientations if it didn’t sample all the columns exactly equally (Kamitani & Tong 2005, Boynton 2005). By integrating the evidence across many slightly biased voxels with a linear decoder, it should then be possible to guess, better than chance, the orientation of the stimulus.
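The logic of integrating many slightly biased voxels can be sketched with a simulation (all numbers are assumptions, chosen so that each voxel alone is nearly useless):

```python
import numpy as np

rng = np.random.default_rng(7)
n_voxels, n_train, n_test = 200, 100, 100

# each voxel has a tiny random orientation bias relative to its noise
bias = 0.1 * rng.standard_normal(n_voxels)

def trials(sign, n):
    """n response patterns to one of two orientations (sign = +1 or -1)."""
    return sign * bias + rng.standard_normal((n, n_voxels))

X_train = np.vstack([trials(+1, n_train), trials(-1, n_train)])
y_train = np.repeat([1, -1], n_train)
X_test = np.vstack([trials(+1, n_test), trials(-1, n_test)])
y_test = np.repeat([1, -1], n_test)

# linear decoder: project onto the mean difference pattern (a simple
# stand-in for the linear classifiers used in the decoding studies)
w = X_train[y_train == 1].mean(0) - X_train[y_train == -1].mean(0)
accuracy = float(np.mean(np.sign(X_test @ w) == y_test))
```

Although each voxel's bias is only a tenth of its noise standard deviation, pooling hundreds of voxels yields decoding accuracy well above chance, which is the essence of the Kamitani & Tong / Boynton account.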

Later work explored how random orientation biases might arise in the voxels. If each voxel directly sampled the orientation columns (computing an average within its cuboid boundaries), then decoding success should be very sensitively dependent on the alignment of the voxels between training and test sets. A shift of the voxel grid on the scale of the width of an orientation column would change the voxel biases and abolish decoding success. Several groups have argued that the biases might arise at the level of the vasculature (Gardner et al. 2009, Kriegeskorte et al. 2009). This would make the biases enabling orientation decoding less sensitive to slight shifts of the voxel grid. Moreover, if voxels reflected signals sampled through the fine-grained vasculature, then it would be the vasculature, not the voxel grid, that determines to what extent different spatial frequencies of the underlying neuronal activity patterns are reflected in the fMRI patterns (Kriegeskorte et al. 2009).

Another account (Op de Beeck 2010, Freeman et al. 2011) proposed that decoding may rely exclusively on coarse-scale spatial patterns of activity. In particular, Freeman, Brouwer, Heeger, and Merriam (2011) argued that radial orientations (those aligned with a line that passes through the point of fixation) are over-represented in the neural population. If this were the case, then a grating would elicit a coarse-scale response pattern across its representation in V1, in which the neurons representing edges pointing (approximately) at fixation are more strongly active. There is indeed evidence from multiple studies for a nonuniform representation of orientations in V1 (Furmanski & Engel 2000, Sasaki et al. 2006, Serences et al. 2009, Mannion et al. 2010), perhaps reflecting the nonuniform probability distribution of orientation in natural visual experience. The over-representation of radial orientations might help explain the decodability of gratings. However, opposite-sense spirals (whose orientations are balanced about the radial orientation) are also decodable (Mannion et al. 2009, Alink et al. 2013). This might be due to a simultaneous over-representation of vertical orientations (Freeman et al. 2013, but see Alink et al. 2013).

There’s evidence in favor of a contribution to orientation decoding of both coarse-scale (Op de Beeck 2010, Freeman et al. 2011, Freeman et al. 2013) and fine-scale components of the fMRI patterns (e.g. Shmuel et al. 2010, Swisher et al. 2010, Alink et al. 2013, Pratte et al. 2016, Alink et al. 2017).

Note that both coarse-scale and fine-scale pattern accounts suggest that voxels have biases in favor of certain orientations. An entirely novel line of argument was introduced to the debate by Carlson (2014).

Carlson (2014) argued, on the basis of simulation results, that even if every voxel sampled a set of filters uniformly representing all orientations (i.e. without any bias), the resulting fMRI patterns could still reflect the orientation of a grating confined to a circular annulus (as standardly used in the literature). The reason lies in “the interaction between the stimulus region and the empty background” (Carlson 2014), an effect of the relative orientations of the grating and the edge of the aperture (the annulus within which the grating is visible). Carlson’s simulations showed that the average response of a uniform set of Gabor orientation filters is larger where the aperture edge is orthogonal to the grating. He also showed that the effect does not depend on whether the aperture edge is hard or soft (fading contrast). Because the voxels in this account have no biases in favor of particular orientations, Carlson aptly referred to his account as an “unbiased” perspective.

The aperture edge adds edge energy. The effect is strongest when the edge is orthogonal to the carrier grating orientation. We can understand this in terms of the Fourier spectrum. Whereas a sine grating has a concentrated representation in the 2D Fourier amplitude spectrum, the energy is more spread out when an aperture limits the extent of the grating, with the effect depending on the relative orientations of grating and edge.

For an intuition about how this can happen, consider a particularly simple scenario, in which a coarse rectangular grating is limited by a sharp aperture whose edge is orthogonal to the grating. V1 cells with small receptive fields will respond to the edge itself as well as to the grating. When edge and grating are orthogonal, the widest range of orientation-selective V1 cells is driven. However, the effect is present also for sinusoidal gratings and soft apertures, where contrast fades gradually, e.g. according to a raised half-cosine.
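The spectral spreading described above can be demonstrated numerically (a toy example; image size, aperture radius, and spatial frequency are arbitrary choices):

```python
import numpy as np

size = 256
y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2].astype(float)

# grating whose luminance varies along x (8 cycles across the image)
grating = np.sin(2 * np.pi * 8 * x / size)
aperture = x**2 + y**2 < (size // 3) ** 2   # hard circular aperture
windowed = grating * aperture

def band_energy(img, lo_deg, hi_deg):
    """Amplitude-spectrum energy within an orientation band (mod 180 deg).
    The centered mgrid doubles as frequency coordinates after fftshift."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    ang = np.degrees(np.arctan2(y, x)) % 180.0
    band = (ang >= lo_deg) & (ang < hi_deg)
    return float(spec[band].sum())

# spectral energy far from the grating's own frequency component:
# essentially zero for the full-field grating, substantial once the
# aperture edge spreads the energy across the spectrum
full_offpeak = band_energy(grating, 80, 100)
windowed_offpeak = band_energy(windowed, 80, 100)
```

The full-field grating concentrates all its energy at its own spatial frequency and orientation; the aperture spreads energy into other orientations, which orientation-selective filters (and neurons) will pick up.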

An elegant new study by Roth, Heeger, and Merriam (pp2018) now follows up on the idea of Carlson (2014) with fMRI at 3T and 7T. Roth et al. refer to the interaction between the edge and the content of the aperture as “vignetting” and used apertures composed of either multiple annuli or multiple radial rays. These finer-grained apertures spread the vignetting effect all throughout the stimulated portion of the visual field and so are well suited to demonstrate the effect on fMRI patterns.

Roth et al. present simulations (Figure 1), following Carlson (2014) and assuming that every voxel uniformly samples all orientations. They confirm Carlson’s account and show that the grating stimuli the group used earlier in Freeman et al. (2011) are expected to produce a stronger response to the radial parts of the grating, where the aperture edge is orthogonal to the grating — even without any over-representation of radial orientations by the neurons.

Freeman et al. (2011) used a relatively narrow annulus (inner edge: 4.5º, outer edge: 9.5º eccentricity from fixation), where no part of the grating is far from the edge. This causes the vignetting effect to create the appearance of a radial bias that is strongest at the edges but present even in the central part of the annular aperture (Figure 1, bottom right). Roth et al.’s findings suggest that the group’s earlier result might reflect vignetting, rather than (or in addition to) a radial bias of the V1 neurons.

Figure 1: Vignetting explains findings of Freeman et al. (2011). Top: Voxel orientation preferences and pRF locations. Each element represents a voxel, its position represents the visual-field location of the voxel’s population receptive field (pRF), the orientation of the line segment represents the voxel’s preferred orientation. The size and color of each element reflects the degree to which the voxel showed a reliable orientation-dependent response (coherence). The pattern suggests that many voxels prefer radial orientations, i.e. those pointing at fixation. Bottom: Roth et al. (pp2018), following Carlson (2014), applied a Gabor model to the stimuli of Freeman et al. (2011). They then simulated voxels pooling orientation-selective responses without any bias in favor of particular orientations. The simulation shows that the apparent radial bias arises as an artefact of the edge-effects described by Carlson (2014), termed “vignetting” by Roth et al. (pp2018). Dashed lines show the edges of the stimulus.

 

Roth et al. also use simulations to show that their new stimuli, in which the aperture consists of multiple annuli or multiple radial rays, predict coarse-scale patterns across V1. They then demonstrate in single subjects measured with fMRI at 3T and 7T that V1 responds with the globally modulated patterns predicted by the account of Carlson (2014).

The study is beautifully designed and expertly executed. Results compellingly demonstrate that, as proposed by Carlson (2014), vignetting can account for the coarse-scale biases reported in Freeman et al. (2011). The paper also contains a careful discussion that places the phenomenon in a broader context. Vignetting describes a family of effects related to aperture edges and their interaction with the contents of the aperture. The interaction could be as simple as the aperture edge adding edge energy of a different orientation and thus changing orientation-selective response. It could also involve extra-receptive-field effects such as non-isotropic surround suppression.

The study leaves me with two questions:

  • Is the radial orientation-preference map in V1, as described in Freeman et al. (2011), entirely an artefact of vignetting (or is there still also an over-representation of radial orientations in the neuronal population)?
  • Does vignetting also explain fMRI orientation signals in studies that use larger oriented gratings, where much of the grating is further from the edge of the aperture, as in Kamitani & Tong (2005)?

The original study by Kamitani and Tong (2005) used a wider annular aperture reaching further into the central region, where receptive fields are smaller (inner edge: 1.5°, outer edge: 10° eccentricity from fixation). The interior parts of the stimulus may therefore not be affected by vignetting. Importantly, Wardle, Ritchie, Seymour, and Carlson (2017) already investigated this issue and their results suggest that vignetting is not necessary for orientation decoding.

It would be useful to analyze the stimuli used by Kamitani & Tong (2005) with a Gabor model (with reasonable choices for the filter sizes). As a second step, it would be good to reanalyze the data from Kamitani & Tong (2005), or from a similar design. The analysis should focus on small contiguous ROIs in V1 of the left and right hemisphere that represent regions of the visual field far from the edge of the aperture.

Going forward, perhaps we can pursue the issue in the spirit of open science. We would acquire fMRI data with maximally large gratings, so that regions unaffected by vignetting can be analyzed (Figure 2). The experiments should include localizers for the aperture margins (transparent blue) and for ROIs perched on the horizontal meridian far from the aperture edges (transparent red). The minimal experiment would contain two grating orientations (45º and -45º as shown at the bottom), each presented with many different phases. Note that, for the ROIs shown in Figure 2, these two orientations minimize undesired voxel biases due to radial and vertical orientation preferences (both gratings have equal angle to the radial orientation and equal angle to the vertical orientation). Note also that these two orientations have equal angle to the aperture edge, thus also minimizing any residual long-range vignetting effect that acts across the safety margin.

The analysis of the ROIs should follow Alink et al. (2017): In each ROI (left hemisphere, right hemisphere), we use a training set of fMRI runs to define two sets of voxels: 45º-preferring and -45º-preferring voxels. We then use the test set of fMRI runs to check, independently for the two voxel sets, whether the preferences replicate. We could implement a sensitive test along these lines by training and testing a linear decoder on just the 45º-preferring voxels, and then another linear decoder on just the -45º-preferring voxels. If both of these decoders have significant accuracy on the test set, we have established that voxels of opposite selectivity intermingle within the same small ROI, indicating fine-grained pattern information.

Figure 2: Simple stimuli for benchmarking fMRI acquisition schemes (3T vs 7T, resolutions, sequences) and assessing the grain of fMRI pattern information. Top: Gratings should be large enough to include a safety margin that minimizes vignetting effects. Studies should include localizers for the V1 representations of the regions shown in red, representing regions on the left and right that are perched on the horizontal meridian and far from the edges of the aperture. For these ROIs, gratings of orientations 45º and -45º (bottom) are (1) balanced about the radial orientation (minimizing effects of neuronal overrepresentation of radial orientations), (2) balanced about the vertical orientation (minimizing effects of neuronal overrepresentation of vertical orientations), and (3) balanced about the orientation of the edge (minimizing any residual long-range vignetting effects).

A more comprehensive experiment would contain perhaps 8 or 16 equally spaced orientations and a range of spatial frequencies balanced about the spatial frequency that maximally drives neurons at the eccentricity of the critical ROIs (Henriksson et al. 2008).

More generally, a standardized experiment along these lines would constitute an excellent benchmark for comparing fMRI acquisition schemes in terms of the information they yield about neuronal response patterns. Such a benchmark would lend itself to comparing different spatial resolutions (0.5 mm, 1 mm, 2 mm, 3 mm), different fMRI sequences, and different field strengths (3T, 7T) across different sites and scanner models. The tradeoffs involved (notably between functional contrast to noise and partial volume sampling) are difficult to estimate without directly testing each fMRI acquisition scheme for the information it yields (Formisano & Kriegeskorte 2012). A standard pattern-information benchmark for fMRI could therefore be really useful, especially if pursued as an open-science project (shared stimuli and presentation protocol, shared fMRI data, contributor coauthorships on the first three papers using someone’s openly shared components).

Glad we sorted this out. Who’s up for collaborating?
Time to go to bed.

Strengths

  • Well-motivated and elegant experimental design and analysis
  • 3T and 7T fMRI data from a total of 14 subjects
  • Compelling results demonstrating that vignetting can cause coarse-scale patterns that enable orientation decoding

Weaknesses

  • The paper claims to introduce a novel idea that requires reinterpretation of a large literature. The claim of novelty is unjustified. Vignetting was discovered by Carlson (2014), and in Wardle et al. (2017) Carlson’s group showed that it may be one contributing factor enabling orientation decoding, but not the only one. Carlson and colleagues deserve clearer credit throughout.
  • The experiments show that vignetting compromised the stimuli of Freeman et al. (2011), but they don’t address whether the claim by Freeman et al. of an over-representation of radial orientations in the neuronal population holds regardless.
  • The paper doesn’t attempt to address whether decoding is still possible in the absence of vignetting effects, i.e. far from the aperture boundary.

Particular comments and suggestions

While the experiments and analyses are excellent and the paper well written, the current version is compromised by some exaggerated claims, suggesting greater novelty and consequence than is appropriate. This should be corrected.

 

“Here, we show that a large body of research that purported to measure orientation tuning may have in fact been inadvertently measuring sensitivity to second-order changes in luminance, a phenomenon we term ‘vignetting’.” (Abstract)

“Our results demonstrate that stimulus vignetting can wholly determine the orientation selectivity of responses in visual cortex measured at a macroscopic scale, and suggest a reinterpretation of a well-established literature on orientation processing in visual cortex.” (Abstract)

“Our results provide a framework for reinterpreting a wide-range
of findings in the visual system.” (Introduction)

Too strong a claim of novelty. The effect beautifully termed “vignetting” here was discovered by Carlson (2014), and that study deserves the credit for triggering a reevaluation of the literature, which began four years ago. The present study does place vignetting in a broader context, discussing a variety of mechanisms by which aperture edges might influence responses, but the basic ideas, including that the key factor is the interaction between the edge and the grating orientation and that the edge need not be hard, were all introduced in Carlson (2014). The present study very elegantly demonstrates the phenomenon with fMRI, but the effect has also previously been studied with fMRI by Wardle et al. (2017), so the fMRI component doesn’t justify this claim, either. Finally, while the results compellingly show that vignetting was a strong contributor in Freeman et al. (2011), they don’t show that it is the only factor contributing to orientation decoding. In particular, Wardle et al. (2017) suggest that vignetting is in fact not necessary for orientation decoding.

 

“We and others, using fMRI, discovered a coarse-scale orientation bias in human V1; each voxel exhibits an orientation preference that depends on the region of space that it represents (Furmanski and Engel, 2000; Sasaki et al., 2006; Mannion et al., 2010; Freeman et al., 2011; Freeman et al., 2013; Larsson et al., 2017). We observed a radial bias in the peripheral representation of V1: voxels that responded to peripheral locations near the vertical meridian tended to respond most strongly to vertical orientations; voxels along the peripheral horizontal meridian responded most strongly to horizontal orientations; likewise for oblique orientations. This phenomenon had gone mostly unnoticed previously. We discovered this striking phenomenon with fMRI because fMRI covers the entire retinotopic map in visual cortex, making it an ideal method for characterizing such coarse-scale representations.” (Introduction)

A bit too much chest thumping. The radial-bias phenomenon was discovered by Sasaki et al. (2006). Moreover, the present study negates the interpretation in Freeman et al. (2011). Freeman et al. (2011) interpreted their results as indicating an over-representation of radial orientations in cortical neurons. According to the present study, those results were in fact an artifact of vignetting, and whether neuronal biases played any role is questionable. Freeman et al. used a narrower annulus than other studies (e.g. Kamitani & Tong, 2005), so may have been more susceptible to the vignetting artifact. The authors suggest that a large literature be reinterpreted, but apparently not their own study, the one study for which they specifically and compellingly show how vignetting probably affected the results.

 

“A leading conjecture is that the orientation preferences in fMRI measurements arise primarily from random spatial irregularities in the fine-scale columnar architecture (Boynton, 2005; Haynes and Rees, 2005; Kamitani and Tong, 2005). […] On the other hand, we have argued that the coarse-scale orientation bias is the predominant orientation-selective signal measured with fMRI, and that multivariate decoding analysis methods are successful because of it (Freeman et al., 2011; Freeman et al., 2013). This conjecture remains controversial because the notion that fMRI is sensitive to fine-scale neural activity is highly attractive, even though it has been proven difficult to validate empirically (Alink et al., 2013; Pratte et al., 2016; Alink et al., 2017).” (Introduction)

This passage is a bit biased. First, the present results question the interpretation of Freeman et al. (2011). While the authors’ new interpretation (following Carlson, 2014) also suggests a coarse-scale contribution, it fundamentally changes the account. Moreover, the conjecture that coarse-scale effects play a role is not controversial. What is controversial is the claim that only coarse-scale effects contribute to fMRI orientation decoding. This extreme view is controversial not because it is attractive to think that fMRI can exploit fine-grained pattern information, but because the cited studies (Alink et al. 2013, Pratte et al. 2016, Alink et al. 2017, and additional studies, including Shmuel et al. 2010 and Swisher et al. 2010) present evidence in favor of a contribution from fine-grained patterns. The way the three studies are cited would suggest to an uninformed reader that they provide evidence against a contribution from fine-grained patterns. More evenhanded language is in order here.

 

“the model we use is highly simplified; for example, it does not take into account changes in spatial frequency tuning at greater eccentricities. Yet, despite the multiple sources of noise and the simplified assumptions of the model, the correspondence between the model’s prediction and the empirical measurements are highly statistically significant. From this, we conclude that stimulus vignetting is a primary source of the course[sic] scale bias.”

This argument is not compelling. Even a poor model may explain a portion of the explainable variance that is minuscule, yet highly statistically significant. In the absence of inferential comparisons among multiple models and model checking (or a noise ceiling), it would be better to avoid such claims.
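To see why high statistical significance is compatible with negligible explanatory power, here is a toy simulation (sample size and effect size are arbitrary choices of mine, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)                  # model prediction
y = 0.1 * x + rng.normal(size=n)        # data: model explains only ~1% of variance

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt((n - 2) / (1 - r**2))   # t-statistic for H0: no correlation

print(f"variance explained: {r**2:.3f}, t = {t:.1f}")
```

With n = 10,000, the correlation is overwhelmingly significant (t around 10) even though the model accounts for only about 1% of the variance, so significance alone cannot establish that a model is a primary account of the data.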

 

“One study (Alink et al., 2017) used inner and outer circular annuli, but added additional angular edges, the result of which should be a combination of radial and tangential biases. Indeed, this study reported that voxels had a mixed pattern of selectivity, with a considerable number of voxels reliably preferring tangential gratings, and other voxels reliably favoring radial orientations.” (Discussion)

It’s true that the additional edges between the patches (though subtle) complicate the interpretation of the results of Alink et al. (2017). It would be good to check the strength of the effect by simulation. Happy to share the stimuli if someone wanted to look into this.

 

Minor points

Figure 4A, legend: Top and bottom panels mislabeled as showing angular and radial modulator results, respectively.

course -> coarse

complimentary -> complementary

 

Humans recognize objects with greater robustness to noise and distortions than deep nets

[I7R8]

Deep convolutional neural networks can label images with object categories at superhuman levels of accuracy. Whether they are as robust to noise and distortions as human vision, however, is an open question.

Geirhos, Janssen, Schütt, Rauber, Bethge, and Wichmann (pp2017) compared humans and deep convolutional neural networks in terms of their ability to recognize 16 object categories under different levels of noise and distortion. They report that human vision is substantially more robust to these modifications.

Psychophysical experiments were performed in a controlled lab environment. Human observers fixated a central square at the start of each trial. Each image was presented for 200 ms (3×3 degrees of visual angle), followed by a pink noise mask (1/f spectrum) of 200-ms duration. This type of masking is thought to minimize recurrent computations in the visual system. The authors, thus, stripped human vision of the option to scrutinize the image and focused the comparison on what human vision achieves through the feedforward sweep of processing (although some local recurrent signal flow likely still contributed). Observers then clicked on one of 16 icons to indicate the category of the stimulus.
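For readers unfamiliar with such masks, a pink noise image can be synthesized by shaping white noise in the frequency domain so that amplitude falls off as 1/f. A minimal sketch (my own illustration, not the authors’ stimulus code):

```python
import numpy as np

def pink_noise_image(size, rng=None):
    """Generate a noise image with an approximately 1/f amplitude spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.normal(size=(size, size))
    spectrum = np.fft.fft2(white)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)         # spatial-frequency magnitude
    f[0, 0] = 1.0                      # avoid division by zero at the DC component
    img = np.real(np.fft.ifft2(spectrum / f))
    img -= img.min()
    return img / img.max()             # normalize to [0, 1]
```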

The figure below shows the levels of additive uniform noise (left) and local distortion (right) that were necessary to reduce the accuracy of each system to about 50% (classifying among 16 categories). Careful analyses across levels of noise and distortion show that the deep nets perform similarly to the human observers at low levels of noise or distortion. Both humans and deep nets approach chance level performance at very high levels of distortion. However, human performance degrades much more gracefully, beating deep nets when the image is compromised to an intermediate degree.

Figure: At what level of noise and distortion does recognition break down in each system? Additive noise (left) or Eidolon distortion (right) was ramped up, so as to reduce classification accuracy to 50% for a given system. To cause human performance to drop to 50% accuracy (for classification among 16 categories), substantially higher levels of noise or distortion were required (top row). Modified version of Fig. 4 of the paper.

This is careful and important work that helps characterize how current models still fall short. The authors are making their substantial lab-acquired human behavioral data set openly available. This is great, because the data can be analyzed by other researchers in both brain science and computer science.

What the study does not quite deliver is an explanation of why the deep nets fall short. Is it something about the convolutional feedforward architecture that renders the models less robust? Does human vision employ normalization or adaptive filtering operations that enable it to “see through” the noise and distortion, e.g. by focusing on features less affected by the artefacts?

Humans have massive experience with noisy viewing conditions, such as those arising in bad weather. We also have much experience seeing things distorted, through water, or glass that is not perfectly plane. Moreover, peripheral vision may rely on summary-statistical descriptions that may be somewhat robust to the kinds of distortion used in this study.

To assess whether it is visual experience or something about the architecture that causes the networks to be less robust, I suggest that the networks be trained with noisy and/or distorted images. Data augmentation with noise and distortion may help deep nets learn more robust internal representations for vision.

 

Strengths

  • Careful human psychophysical measurements of classification accuracy for 16 categories for a large set of stimuli (40K categorization trials).
  • Detailed comparisons between human performance and performance of three popular deep net architectures (AlexNet, GoogLeNet, VGG-16).
  • Substantial behavioral data set shared with the community.

 

Weaknesses

  • Networks were not trained with noise and distortion, leaving it ambiguous whether the deep nets’ lack of robustness is due to the architecture or the training.
  • Data are not used to evaluate the three models overall in terms of their ability to capture patterns of confusions.
  • Human-machine comparisons focus on overall accuracy under noise and distortion, and on category-level confusions, rather than the processing of particular images.

 

Suggestions for improvements

(1) Train deep nets with noise and distortion. Humans experience noise and distortions as part of their visual world. Would the networks perform better if they were trained with noisy and distorted images? The authors could train the networks (or at least VGG-16) with some image set (nonoverlapping with the images used in the psychophysics) and augment the training set with noisy and distorted variants. This would help clarify to what extent training can improve robustness and to what extent the architecture is the limiting factor.
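As a sketch of what such augmentation could look like (the noise range is an illustrative assumption of mine, not the paper’s exact parameterization), one might add uniform pixel noise of a random width to each training image:

```python
import numpy as np

def augment_with_noise(images, max_noise_width=0.35, rng=None):
    """Add uniform pixel noise of a random width to each image.

    images: float array of shape (n, height, width), pixel values in [0, 1].
    max_noise_width: upper bound on the noise range (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = np.empty_like(images)
    for i, img in enumerate(images):
        w = rng.uniform(0.0, max_noise_width)           # noise level varies per image
        noise = rng.uniform(-w / 2.0, w / 2.0, size=img.shape)
        augmented[i] = np.clip(img + noise, 0.0, 1.0)   # keep a valid pixel range
    return augmented

# The training set would then mix clean and augmented images, e.g.:
# train_images = np.concatenate([images, augment_with_noise(images)])
```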

(2) Evaluate each model’s overall ability to predict human patterns of confusions. The confusion-matrix analyses shed some light on the differences between humans and models. However, it would be good to assess which model’s confusions are most similar to the humans’ overall. To this end, one could consider the off-diagonal elements of the confusion matrix (to render the analysis complementary to the analyses of overall accuracy) and statistically compare the models in terms of their ability to explain patterns of confusions. The off-diagonal entries alone could be compared by correlation (or 0-fixed correlation).
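One way to implement such a comparison (a sketch; the 0-fixed variant is interpreted here as a correlation computed about zero rather than about the mean, i.e. cosine similarity):

```python
import numpy as np

def offdiag(conf):
    """Extract the off-diagonal entries of a square confusion matrix."""
    mask = ~np.eye(conf.shape[0], dtype=bool)
    return conf[mask]

def offdiag_correlation(conf_a, conf_b):
    """Pearson correlation between off-diagonal confusion entries,
    ignoring the diagonal so the measure is complementary to accuracy."""
    return np.corrcoef(offdiag(conf_a), offdiag(conf_b))[0, 1]

def offdiag_zero_fixed_correlation(conf_a, conf_b):
    """Correlation about zero (no mean removal), i.e. cosine similarity."""
    a, b = offdiag(conf_a), offdiag(conf_b)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))
```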

 

Minor comments

(1) “adversarial examples have cast some doubt on the idea of broad-ranging manlike DNN behavior. For any given image it is possible to perturb it minimally in a principled way such that DNNs mis-classify it as belonging to an arbitrary other category (Szegedy et al., 2014). This slightly modified image is then called an adversarial example, and the manipulation is imperceptible to human observers (Szegedy et al., 2014).”

This point is made frequently, although it is not compelling. Any learner uses an inductive bias to infer a model from data. In general, combining the prior (inductive bias) and the data will not yield perfect decision boundaries. An omniscient adversary can always place an example in the misrepresented region of the input space. Adversarial examples are therefore a completely expected phenomenon for any learning algorithm, whether biological or artificial. The misrepresented volume may have infinitesimal probability mass under natural conditions. A visual system could therefore perform perfectly in the real world — until confronted with an omniscient adversary that backpropagates through its brain to fool it. No one knows if adversarial examples can also be constructed for human brains. If so, they might similarly require only slight modifications imperceptible to other observers.

The bigger point that neural networks fall short of human vision in terms of their robustness is almost certainly true, of course. To make that point on the basis of adversarial examples, however, would require considering the literature on black-box attacks that do not rely on omniscient knowledge of the system to be fooled or its training set. It would also require applying these much less efficient methods symmetrically to human subjects.

 

(2) “One might argue that human observers, through experience and evolution, were exposed to some image distortions (e.g. fog or snow) and therefore have an advantage over current DNNs. However, an extensive exposure to eidolon-type distortions seems exceedingly unlikely. And yet, human observers were considerably better at recognising eidolon-distorted objects, largely unaffected by the different perceptual appearance for different eidolon parameter combinations (reach, coherence). This indicates that the representations learned by the human visual system go beyond being trained on certain distortions as they generalise towards previously unseen distortions. We believe that achieving such robust representations that generalise towards novel distortions are the key to achieve robust deep neural network performance, as the number of possible distortions is literally unlimited.”

This is not a very compelling argument because the space of “previously unseen distortions” hasn’t been richly explored here. Moreover, the Eidolon-distortions are in fact motivated by the idea that they retain information similar to that retained by peripheral vision. They, thus, discard information that the human visual system is well trained to do without in the periphery.

 

(3) On the calculation of DNNs’ accuracies for the 16 categories: “Since all investigated DNNs, when shown an image, output classification predictions for all 1,000 ImageNet categories, we disregarded all predictions for categories that were not mapped to any of the 16 entry-level categories. Amongst the remaining categories, the entry-level category corresponding to the ImageNet category with the highest probability (top-1) was selected as the network’s response.”

It would seem to make more sense to add up the probabilities of the ImageNet categories corresponding to each of the 16 entry-level categories and use the resulting 16 totals to pick the predicted entry-level category. Alternatively, one may train a new softmax layer with 16 outputs. Please clarify which method was used and how it relates to the other methods.
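The summing approach could be implemented along these lines (a sketch; the mapping from ImageNet classes to the 16 entry-level categories is assumed given):

```python
import numpy as np

def entry_level_prediction(probs_1000, category_map):
    """Predict an entry-level category by summing the softmax probabilities
    of all ImageNet classes mapped to each category and taking the argmax
    of the resulting totals.

    probs_1000: (1000,) softmax output over ImageNet classes.
    category_map: dict mapping entry-level name -> list of ImageNet indices.
    """
    totals = {cat: probs_1000[idx].sum() for cat, idx in category_map.items()}
    return max(totals, key=totals.get)
```

Note that summing can change the prediction relative to the top-1 rule: a category whose probability mass is spread over several ImageNet classes can win even when no single one of its classes has the highest probability.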

 

–Nikolaus Kriegeskorte

Thanks to Tal Golan for sharing his comments on this paper with me.

Incremental Bayesian learning of visual encoding models across subjects exposed to different stimuli

[I7R8]

Realistic models of the primate visual system have many millions of parameters. A vision model needs substantial capacity to store the required knowledge about what things look like. Brain-activity data are costly, so they typically do not suffice to set the parameters of these models. Recent progress has benefited from direct learning of the required knowledge from category-labeled image sets. Nevertheless, further fitting with brain-activity data is required to learn about the relative prevalence of the different computational features (and of linear combinations of the features) in each cortical area and to accurately predict representations of novel images (not used in setting model parameters).

Each individual brain is unique. A key challenge is to hold on to what we’ve learned by fitting a visual encoding model to one subject exposed to one set of images when we move on to new experiments. Traditionally, we make inferences about the computational mechanisms with a given data set and hold on to those abstract insights, e.g. that model ResNet beats model AlexNet at predicting ventral visual responses. Ideally, we would be able to hold on to more detailed parametric information learned on one data set as we move on to other data sets.

Wen, Shi, Chen & Liu (pp2017) develop a Bayesian approach to learning encoding models (linear combinations of the features of deep neural networks) incrementally across subjects and stimulus sets. The initial model is fitted with a 0-mean prior on the weights (L2 penalty). The resulting encoding model for each fMRI voxel has a Gaussian posterior over the weights for each feature of the deep net model. The Gaussian posterior is assumed to be isotropic, avoiding the need for a separate variance parameter for each feature (let alone a full covariance matrix).

The results are compelling. Using the posteriors inferred from previous subjects as priors for new subjects substantially increases a model’s prediction performance. This is consistent with the observation that models generalize quite well to new subjects, even without subject-specific fitting. Importantly, the transfer of the weight knowledge from one subject to the next works even when using different stimulus sets in different subjects.

This work takes a first step in the direction of the exciting possibility of incremental learning of complex models across hundreds or thousands of subjects and millions of stimuli (acquired in labs around the world).

It is interesting to consider the implementation of the inference procedure. Although Bayesian in motivation, the implementation uses L2 penalties for the deviation of the weights wv from the previous weights estimate wv0 and from zero. The respective penalty factors α and λ are determined by crossvalidation so as to best predict the new data. This procedure makes a lot of sense. However, it is in tension with a pure Bayesian approach in two ways: (1) In a pure Bayesian approach, the previous data set should determine the width of the posterior, which becomes the prior for the next data set. Here the width of the prior is adjusted (via α) to optimize prediction performance. (2) In a pure Bayesian approach, the 0-mean prior would be absorbed into the first model’s posterior and would not enter into the inference again with every update of the posterior with new data.

The cost function for predicting the response profile vector rv (# stimuli by 1) for fMRI voxel v from deep net feature responses F (# stimuli by # features) is:

L(wv) = ||rv − F wv||² + α ||wv − wv0||² + λ ||wv||²

While the crossvalidation procedure makes sense for optimizing prediction accuracy on the present data set, I wonder if it is optimal in the bigger picture of integrating the knowledge across many studies. The present data set will reflect only a small portion of stimulus space and one subject, so should not get to downweight a prior based on much more comprehensive data.
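For concreteness, under a cost of this form (squared prediction error plus L2 penalties pulling the weights toward the previous estimate wv0, weighted by α, and toward zero, weighted by λ), the MAP weight estimate has a closed form. A minimal numpy sketch (my own illustration, not the authors’ code):

```python
import numpy as np

def map_weights(F, r, w0, alpha, lam):
    """MAP estimate of encoding weights w minimizing
        ||r - F w||^2 + alpha * ||w - w0||^2 + lam * ||w||^2.

    Setting the gradient to zero gives the normal equations:
        (F'F + (alpha + lam) I) w = F'r + alpha * w0

    F:  (n_stimuli, n_features) deep-net feature responses.
    r:  (n_stimuli,) response profile of one voxel.
    w0: (n_features,) weight estimate transferred from previous subjects.
    """
    k = F.shape[1]
    A = F.T @ F + (alpha + lam) * np.eye(k)
    b = F.T @ r + alpha * w0
    return np.linalg.solve(A, b)
```

As α grows, the estimate is pulled toward the transferred weights w0; with α = 0 it reduces to ordinary ridge regression with a 0-mean prior.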

 

Strengths

  • Addresses an important challenge and suggests exciting potential for big-data learning of computational models across studies and labs.
  • Presents a straightforward and well-motivated method for incremental learning of encoding model weights across studies with different subjects and different stimuli.
  • Results are compelling: Using the prior information helps the performance of an encoding model a lot when the training data for the new subject is limited.

 

Weaknesses

  • The posterior over the weights vector is modeled as isotropic. It would be good to allow different degrees of certainty for different features and, better yet, to model the dependencies between the weights of different features. (However, such richer models might be challenging to estimate in practice.)
  • The prior knowledge transferred from previous studies consists only in the MAP estimate of the weight vector for each voxel.
  • The method assumes that a precise intersubject spatial-correspondence mapping is given. Such mappings might not exist and are costly to approximate with functional data.

 

Suggestions for improvement

(1) Explore and/or discuss if a prior with feature-specific variance might be feasible. Explore whether inferring a posterior distribution over weights using a mean weight vector and feature-specific variances brings even better results. I guess this is hard when there are millions of features.

(2) Consider dropping the assumption that a precise correspondence mapping is given and infer a multinormal posterior over local weight vectors. The model assumes that we have a precise intersubject spatial-correspondence mapping (from cortical alignment based either on anatomical or functional data). It seems more versatile and statistically preferable not to rely on a precise (i.e. voxel-to-voxel) correspondence mapping, but to simultaneously address the correspondence and incremental weight-learning problem. We could assume that an imprecise correspondence mapping is given. For corresponding brain locations in the previous and current subject (subjects 1 and 2), subject-1 encoding models within a small spherical region around the target location could be used to define a prior for fitting an encoding model to the target voxel for subject 2. Such a prior should be a probability distribution over weight vectors, which could be characterized by the second moment of the weight vector distribution. Regularization, such as optimal shrinkage to a diagonal target or (when there are too many features) simply the assumption that the second moment is diagonal, could be used to make this approach feasible. In either case, the goal would be to pool the posterior distributions across voxels within the small sphere and summarize the resulting distribution (e.g. as a multinormal). I realize that this might be beyond the scope of the current study. It is not a requirement for this paper.

(3) Clarify the terminology used for the estimation procedures. What is referred to as “maximum likelihood estimation” uses an L2 penalty on the weights, amounting to Bayesian inference of the weights with a 0-mean Gaussian prior. This is not a maximum likelihood estimator. Please correct this (or explain in case I am mistaken).

(4) Consider how to ensure that the prior has an appropriate width (and the prior evidence thus appropriate weight). Should a more purely Bayesian approach be taken, where the width of the posterior is explicitly inferred and becomes the width of the prior? Should the crossvalidation setting of the hyperparameters use a very varied test set to prevent the current (possibly narrowly specialized) data set from being given too much weight? Should the amount of data contributing to the prior model and the amount of data in the present set (and optionally the noise level) be used to determine the relative weighting?

Do coarser spatial patterns represent coarser categories in visual cortex?

[I7R5]

 

Wen, Shi, Chen, and Liu (pp2017) used a deep residual neural network (trained on visual object classification) as an encoding model to explain human cortical fMRI responses to movies. The deep net together with the encoding weights of the cortical voxels was then used to predict human cortical response patterns to 64K object images from 80 categories. This prediction serves, not to validate the model, but to investigate how cortical patterns (as predicted by the model) reflect the categorical hierarchy.

The authors report that the predicted category-average response patterns fall into three clusters corresponding to natural superordinate categories: biological things, nonbiological things, and scenes. They argue that these superordinate categories characterize the large-scale organization of human visual cortex.

For each of the three superordinate categories, the authors then thresholded the average predicted activity pattern and investigated the representational geometry within the supra-threshold volume. They find that biological things elicit patterns (within the subvolume responsive to biological things) that fall into four subclusters: humans, terrestrial animals, aquatic animals, and plants. Patterns in regions activated by scenes clustered into artificial and natural scenes. The patterns in regions activated by non-biological things did not reveal clear subdivisions.

The authors argue that this shows that superordinate categories are represented in global patterns across higher visual cortex, and finer-grained categorical distinctions are represented in finer-grained patterns within regions responding to superordinate categories.

This is an original, technically sophisticated, and inspiring paper. However, the title claim is not compellingly supported by the evidence. The fact that finer grained distinctions become apparent in pattern correlation matrices after restricting the volume to voxels responsive to a given category is not evidence for an association between brain-spatial scales and conceptual scales. To understand this, consider the fact that the authors’ analyses do not take the spatial positions of the voxels (and thus the spatial structure) into account at all. The voxel coordinates could be randomly permuted and the analyses would give the same results.
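This invariance is easy to verify: a correlation-distance RDM does not change when the voxels are shuffled. A minimal demonstration with simulated patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.normal(size=(27, 500))   # 27 stimuli x 500 voxels (simulated)

def correlation_rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between each pair of spatial response patterns."""
    return 1.0 - np.corrcoef(patterns)

perm = rng.permutation(patterns.shape[1])  # shuffle voxel positions
rdm_original = correlation_rdm(patterns)
rdm_shuffled = correlation_rdm(patterns[:, perm])

# The RDMs are identical: the analysis carries no spatial information.
print(np.allclose(rdm_original, rdm_shuffled))  # True
```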

The original global representational dissimilarity (or similarity) matrices likely contain distinctions not only at the superordinate level, but also at finer-grained levels (as previously shown). When pattern correlation is used, these divisions might not be prominent in the matrices because the component shared among all exemplars within a superordinate category dominates. Recomputing the pattern correlation matrix after reducing the patterns to voxels responding strongly to a given superordinate category will render the subdivisions within the superordinate categories more prominent. This results from the mean removal implicit to the pattern correlation, which will decorrelate patterns that share high responses on many of the included voxels. Such a result does not indicate that the subdivisions were not present (e.g. significantly decodable from fMRI or even clustered) in the global patterns.
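A toy simulation of this mean-removal effect (all numbers are arbitrary): patterns sharing a strong superordinate component correlate highly across the whole volume, but restricting to the responsive voxels, where that component is roughly constant, lets the correlation’s implicit mean removal expose the subdivisions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 1000
responsive = np.arange(300)              # voxels responsive to the superordinate category

shared = np.zeros(n_voxels)
shared[responsive] = 3.0                 # strong component shared by all exemplars
sub_a = 0.5 * rng.normal(size=n_voxels)  # subcategory-specific components
sub_b = 0.5 * rng.normal(size=n_voxels)

def pattern(sub):
    return shared + sub + 0.3 * rng.normal(size=n_voxels)

a1, a2 = pattern(sub_a), pattern(sub_a)  # two exemplars of subcategory A
b1 = pattern(sub_b)                      # one exemplar of subcategory B

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Global patterns: the shared component dominates, so A and B correlate highly
print(corr(a1, b1) > 0.5)                        # True

# Restricted to responsive voxels, the shared component is ~constant and is
# removed by the correlation's implicit mean subtraction: subdivisions emerge
r = responsive
print(corr(a1[r], a2[r]) > corr(a1[r], b1[r]))   # True
```

The subdivisions were present in the global patterns all along; only the correlation measure obscured them.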

A simple way to take spatial structure into account would be to restrict the analysis to a single spatially contiguous cluster at a time, e.g. FFA. This is in fact the approach taken in a large number of previous studies that investigated the representations in category-selective regions (LOC, FFA, PPA, RSC, etc.). Another way would be to spatially filter the patterns and investigate whether finer semantic distinctions are associated with finer spatial scales. This approach has also been used in previous studies, but can be confounded by the presence of an unknown pattern of voxel gains (Freeman et al. 2013; Alink et al. 2017, Scientific Reports).

The approach of creating a deep net model that explains the data and then analyzing the model instead of the data is a very interesting idea, but also raises some questions. Clearly we need deep nets with millions of parameters to understand visual processing. If a deep net explains visual responses throughout the visual system and shares at least some architectural similarities with the visual hierarchy, then it is reasonable to assume that it might capture aspects of the computational mechanism of vision. In a sense, we have “uploaded” aspects of the mechanism of vision into the model, whose workings we can more efficiently study. This is always subject to consideration of alternative models whose architecture might better match what is known about the primate visual system and which might predict visual responses even better. Despite this caveat, I believe that developing deep net models that explain visual responses and studying their computational mechanisms is a promising approach in general.

In the present context, however, the goal is to relate conceptual levels of categories to spatial scales of cortical response patterns, which can be directly measured. Is the deep net really needed to address this? To study how categories map onto cortex, why not just directly study measured response patterns? This in fact is what the existing literature has done for years. The deep net functions as a fancy interpolator that imputes data where we have none (response patterns for 64K images). However, the 80 category-average response patterns could have been directly measured. Would this not be more compelling? It would not require us to believe that the deep net is an accurate model.

Although the authors have gotten off to a fresh start on the intriguing questions of the spatial organization of higher-level visual cortex, the present results do not yet go significantly beyond what is known and the novel and interesting methods introduced in the paper (perhaps the major contribution) raise a number of questions that should be addressed in a revision.

 

Figure: ResNet provides a better basis for human-fMRI voxel encoding models than AlexNet.

 

Strengths

  • Presents several novel and original ideas for the use of deep neural net models to understand the visual cortex.
  • Uses 50-layer ResNet model as encoding model and shows that this model performs better than the simpler AlexNet model.
  • Tests deep net models trained on movie data for generalization to other movie data and prediction of responses in category-selective-region localizer experiments.
  • Attempts to address the interesting hypothesis that larger scales of cortical organization serve to represent larger conceptual scales of categorical representation.
  • The analyses are implemented at a high level of technical sophistication.

 

Weaknesses

  • The central claim about spatial structure of cortical representations is not supported by evidence about the spatial structure. In fact, analyses are invariant to the spatial structure of the cortical response patterns.
  • Unclear what added value is provided by the deep net for addressing the central claim that larger spatial scales in the brain are associated with larger conceptual scales.
  • Uses a definition of “modularity” from network theory to analyze response pattern similarity structure, which will confuse cognitive scientists and cognitive neuroscientists to whom modularity is a computational and brain-spatial notion. Fails to resolve the ambiguities and confusions pervading the previous literature (“nested hierarchy”, “module”).
  • Follows the practice in cognitive neuroscience of averaging response patterns elicited by exemplars of each category, although the deep net predicts response patterns for individual images. This creates ambiguity in the interpretation of the results.
  • The central concepts modularity and semantic similarity are not properly defined, either conceptually or in terms of the mathematical formulae used to measure them.
  • The BOLD fMRI measurements are low in resolution with isotropic voxels of 3.5 mm width.

 

Suggestions for improvements

 

(1) Analyze to what extent different spatial scales in cortex reflect information about different levels of categorization (or change the focus of the paper)

The ResNet encoding model is interesting from a number of perspectives, so the focus of the paper does not have to be on the association of spatial cortical and conceptual scales. If the paper is to make claims about this difficult, but important question, then analyses should explicitly target the spatial structure of cortical activity patterns.

The current analyses are invariant to where responses are located in cortex and thus fundamentally cannot address to what extent different categorical levels are represented at different spatial scales. While the ROIs (Figure 8a) show prominent spatial clustering, this doesn’t go beyond previous studies and doesn’t amount to showing a quantitative relationship.

The emergence of subdivisions within the regions driven by superordinate-category images could be entirely due to the normalization (mean removal) implicit to the pattern correlation. Similar subdivisions could exist in the complementary set of voxels unresponsive to the superordinate category, and/or in the global patterns.

Note that spatial filtering analyses might be interesting, but are also confounded by gain-field patterns across voxels. Previous studies have struggled to address this issue; see Alink et al. (2017, Scientific Reports) for a way to detect fine-grained pattern information not caused by a fine-grained voxel gain field.

 

(2) Analyze measured response patterns during movie or static-image presentation directly, or better motivate the use of the deep net for this purpose

The question how spatial scales in cortex relate to conceptual scales of categories could be addressed directly by measuring activity patterns elicited by different images (or categories) with fMRI. It would be possible, for instance, to measure average response patterns to the 80 categories. In fact previous studies have explored comparably large sets of images and categories.

Movie fMRI data could also be used to address the question of the spatial structure of visual response patterns (and how it relates to semantics), without the indirection of first training a deep net encoding model. For example, the frames of the movies could be labeled (by a human or a deep net) and measured response patterns could directly be analyzed in terms of their spatial structure.

This approach would circumvent the need to train a deep net model and would not require us to trust that the deep net correctly predicts response patterns to novel images. The authors do show that the deep net can predict patterns for novel images. However, these predictions are not perfect and they combine prior assumptions with measurements of response patterns. Why not drop the assumptions and base hypothesis tests directly on measured response patterns?

In case I am missing something and there is a compelling case for the approach of going through the deep net to address this question, please explain.

 

(3) Use clearer terminology

Module: The term module refers to a functional unit in cognitive science (Fodor) and to a spatially contiguous cortical region that corresponds to a functional unit in cognitive neuroscience (Kanwisher). In the present paper, the term is used in the sense of network theory. However, it is applied not to a set of cortical sites on the basis of their spatial proximity or connectivity (which would be more consistent with the meaning of module in cognitive neuroscience), but to a set of response patterns on the basis of their similarity. A better term for this is clustering of response patterns in the multivariate response space.

Nested hierarchy: I suspect that by “nested” the authors mean that there are representations within the subregions responding to each of the superordinate categories and that by “hierarchy” they refer to the levels of spatial inclusion. However, the categorical hierarchy also corresponds to clusters and subclusters in response-pattern space, which could similarly be considered a “nested hierarchy”. Finally, the visual system is often characterized as a hierarchy (referring to the sequence of stages of ventral-stream processing). The paper is not sufficiently clear about these distinctions. In addition, terms like “nested hierarchy” have a seductive plausibility that belies their lack of clear definition and the lack of empirical evidence in favor of any particular definition. Either clearly define what does and does not constitute a “nested hierarchy” and provide compelling evidence in favor of it, or drop the concept.

 

(4) Define indices measuring “modularity” (i.e. response-pattern clustering) and semantic similarity

You cite papers on the Q index of modularity and the LCH semantic similarity index. These indices are central to the interpretation of the results, so the reader should not have to consult the literature to determine how they are mathematically defined.
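For instance, the LCH index can be stated in one line; a sketch of my understanding of its definition (the authors should confirm the exact variant they used):

```python
import math

def lch_similarity(path_length, taxonomy_depth):
    """Leacock-Chodorow semantic similarity: -log(p / (2 * D)), where p is
    the shortest is-a path between two concepts in the taxonomy and D is
    the maximum depth of the taxonomy (e.g. WordNet's noun hierarchy)."""
    return -math.log(path_length / (2.0 * taxonomy_depth))

# Nearby concepts (short path) score higher than distant ones:
near = lch_similarity(path_length=2, taxonomy_depth=10)
far = lch_similarity(path_length=15, taxonomy_depth=10)
assert near > far
```

A similarly compact statement of the Q modularity index would let the reader judge exactly what "modularity" means in the results.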

 

(5) Clarify results on semantic similarity

The correlation between LCH semantic similarity and cortical pattern correlation is amazing (r = 0.93). Of course, this has a lot to do with the fact that LCH takes only a few discrete values and cortical similarity was first averaged within each LCH value.
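The inflation from averaging within discrete LCH values is easy to demonstrate by simulation (hypothetical numbers, chosen only to make the point):

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-pair data: a discrete "semantic similarity" (like LCH, taking a few
# values) and a noisy "cortical similarity" weakly correlated with it.
lch = rng.choice([0.8, 1.2, 1.6, 2.0, 2.4], size=1000)
cortical = 0.3 * lch + rng.normal(scale=0.5, size=1000)

r_raw = np.corrcoef(lch, cortical)[0, 1]

# Average cortical similarity within each discrete LCH value, then correlate.
levels = np.unique(lch)
means = np.array([cortical[lch == v].mean() for v in levels])
r_binned = np.corrcoef(levels, means)[0, 1]

# Averaging out the within-level noise inflates the correlation.
assert r_binned > r_raw
```

Reporting the pre-averaging correlation alongside r = 0.93 would make the strength of the relationship easier to judge.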

What is the correlation between cortical pattern similarity and semantic similarity…

  • for each of the layers of ResNet before remixing to predict human fMRI responses?
  • after remixing to predict human fMRI responses for each of a number of ROIs (V1-3, LOC, FFA, PPA)?
  • for other, e.g. word-co-occurrence-based, semantic similarity measures (e.g. word2vec, latent semantic analysis)?

 

(6) Clarify the methods details

I didn’t understand all the methods details.

  • How were the layer-wise visual feature sets defined? Was each layer refitted as an encoding model? Or were the weights from the overall encoding model used, but other layers omitted?
  • I understand that the sub-divisions of the three superordinate categories were defined by k-means clustering and that the Q index (which is not defined in the paper) was used. How was the number k of clusters determined? Was k chosen to maximize the Q index?
  • How were the category-associated cortical regions defined, i.e. how was the threshold chosen?
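If the answer to the k question is "k was chosen to maximize Q", the recipe presumably looks something like the following (a guess at the procedure, with a minimal k-means and the Newman-Girvan Q; all names and numbers are mine):

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means, for illustration only."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def modularity(A, labels):
    """Newman-Girvan Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) [c_i = c_j]."""
    deg = A.sum(1)
    two_m = deg.sum()
    same = np.equal.outer(labels, labels)
    return ((A - np.outer(deg, deg) / two_m) * same).sum() / two_m

# Toy response patterns: three well-separated clusters in a 20-D space.
means = rng.normal(size=(3, 20)) * 3.0
X = np.concatenate([m + 0.3 * rng.normal(size=(30, 20)) for m in means])
A = np.clip(np.corrcoef(X), 0, None)   # similarity graph, negatives clipped
np.fill_diagonal(A, 0)

# Sweep k and keep the clustering with the highest modularity.
Q = {k: modularity(A, kmeans(X, k)) for k in range(2, 7)}
best_k = max(Q, key=Q.get)
```

If something else was done (e.g. k fixed a priori), that should be stated, since the reported subdivisions depend on it.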

 

 

(7) Cite additional previous studies

Consider discussing the work of Lorraine Tyler’s lab on semantic representations and Thomas Carlson’s paper on semantic models for explaining similarity structure in visual cortex (Carlson et al. 2013, Journal of Cognitive Neuroscience).

A brief overview of classification models in vision science

[I5R6]

 

Majaj & Pelli (pp2017) give a brief overview of classification models in vision science, leading from linear discriminants and the perceptron to deep neural networks. They discuss some of the perks and perils of using machine learning, and deep learning in particular, in the study of biological vision.

This is a brief and light-footed review that will be of interest to vision scientists wondering whether and why to engage machine learning and deep learning in their own work. I enjoyed some of the thoughtful notes on the history of classification models and the sketch of the progression toward modern deep learning.

The present draft lists some common arguments for and against deep learning models, but falls short of presenting a coherent perspective on why deep learning is important for vision science, or not; or which aspects are substantial and which are hype. It also doesn’t really explain deep learning or how it relates to the computational challenge of vision.

The overall conclusion is that machine learning and deep learning are useful modern tools for the vision scientist. In particular, the authors argue that deep neural networks provide a “benchmark” to compare human performance to, replacing the optimal linear filter and signal detection theory as the normative benchmark for vision. This misses what I would argue is the bigger point: deep neural networks provide an entry point for modeling brain information processing and engaging the real problem of vision, rather than a toy version of the problem that lacks all of vision’s essential challenges.

 

Suggestions for improvements

(1) Clearly distinguish deep learning within machine learning

The abstract doesn’t mention deep learning at all. As I was reading the introduction, I was wondering if deep learning had been added to the title of a paper about machine learning in vision science at the very end. Deep learning is defined as “the latest version of machine learning”. This is incorrect. Rather than a software product that is updated in a sequence of versions, machine learning is a field that explores a wide variety of models and inference algorithms in parallel. The fact that deep learning (which refers to learning of deep neural network models) is getting a lot of attention at the moment does not mean that other approaches, notably Bayesian nonparametric models, have lost appeal. How is deep learning different? Does it matter more for vision than other approaches? If so, why?

 

(2) Explain why depth matters

The multiple stages of nonlinear transformation that define deep learning models are essential for many real-world applications, including vision. I think this point should be central as it explains why vision science needs deep models.

 

(3) Clearly distinguish the use of machine learning models to (a) analyze data and to (b) model brain information processing

The current draft largely fails to distinguish two ways of using machine learning in vision science: to analyze data (e.g. decode neuronal population codes) and to model brain information processing. Both are important, but the latter more fundamentally advances the field.

 

(4) Relate classification to machine learning more broadly and to vision

The present draft presents a brief history of classification models. Classification is a small (though arguably key) problem within both machine learning and vision. Why is this particular problem the focus of such a large literature and of this review? How does it relate to other problems in machine learning and in vision?

 

(5) Separate the substance from the hype and present a coherent perspective

Arguments for and against deep learning are listed without evaluation or a coherent perspective. For example, is it true that deep learning models have "too many parameters"? Should we strive to model vision with a handful of parameters? Or do models need to be complex because vision requires complex domain knowledge? Do tests of generalization performance address the issue of overfitting? (No, no, yes, yes.) Note that the modern version of statistical modeling, which is touted as more rigorous, is Bayesian nonparametrics – defined by no limits on the parametric complexity of a model.

 

(6) Consider addressing my particular comments below.

 

Particular comments

“Many perception scientists try to understand recognition by living organisms. To them, machine learning offers a reference of attainable performance based on learned stimuli.”

It’s not really a normative reference. There is an infinity of neural network models and performance of a particular one can never be claimed to be “ideal”. Deep learning is worse in this respect than the optimal linear filter (which provides a normative reference for a task – with the caveat that the task is not vision).

 

“Deep learning is the latest version of machine learning, distinguished by having more than three layers.”

It’s not the “latest version”, rather it’s an old variant of machine learning that is currently very successful and popular. Also, a better definition of deep is that there is more than one hidden layer intervening between input and output layers.

 

“It is ubiquitous in the internet.”

How is this relevant?

 

“Machine learning shifts the emphasis from how the cells encode to what they encode, i.e. from how they encode the stimulus to what that code tells us about the stimulus. Mapping a receptive field is the foundation of neuroscience (beginning with Weber’s 1834/1996 mapping of tactile “sensory circles”), but many young scientists are impatient with the limitations of single-cell recording: looking for minutes or hours at how one cell responds to each of perhaps a hundred different stimuli. New neuroscientists are the first generation for whom it is patently clear that characterization of a single neuron’s receptive field, which was invaluable in the retina and V1, fails to characterize how higher visual areas encode the stimulus. Statistical learning techniques reveal “how neuronal responses can best be used (combined) to inform perceptual decision-making” (Graf, Kohn, Jazayeri, & Movshon, 2010).”

This is an important passage. It’s true that single neurons in inferior temporal cortex, for example, might be (a) difficult to characterize singly with tuning functions, (b) idiosyncratic to a particular animal, and (c) so many in number and variety that characterizing them one by one seems hopeless. It therefore appears more productive to focus on understanding the population code. However, it is not only what is encoded in the population, but also how it is encoded. The format determines what inferences are easy given the code. For example, we can ask what information could be gleaned by a single downstream neuron computing a linear or radial-basis-function readout of the code.
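The "what a downstream neuron could glean" question translates directly into decoding analyses; a toy sketch with simulated population responses (all numbers invented, and the least-squares decoder is a stand-in, not the method of Graf et al.):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated population code: 50 neurons, 200 trials, two stimulus classes.
# Class information lies along one random direction in population space.
w_true = rng.normal(size=50)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 50)) + np.outer(y - 0.5, w_true)

# A single downstream neuron computing a linear readout:
# least-squares linear decoder of the population response.
w = np.linalg.lstsq(X, y - 0.5, rcond=None)[0]
accuracy = np.mean((X @ w > 0) == (y == 1))
```

The same logic extends to radial-basis-function readouts, probing what the format of the code makes easy or hard for downstream neurons.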

 

“For psychophysics, Signal Detection Theory (SDT) proved that the optimal classifier for a signal in noise is a template matcher (Peterson, Birdsall, & Fox, 1954; Tanner & Birdsall, 1958).”

Detecting chihuahuas in complex scenes can be considered an example of detecting “signal in noise”, and it is an example of a visual task. A template matcher is certainly not optimal for this problem (in fact it will fail severely at this problem). It would help here to define signal and noise.

The problem of detecting a fixed pattern in Gaussian noise needs to be explained first in any course of vision, so as to inoculate students against the misconstrual of the problem of vision it represents. On a more conciliatory note, one could argue that although detecting a fixed pattern in noise is a misleading oversimplification of vision, it captures a component of the problem. The optimal solution to this problem, template matching, captures a component of the solution to vision. Deep feedforward neural networks could be described as hierarchical template matchers, and they do seem to capture some aspects of vision.
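To make the "component of the solution" point concrete, the fixed-pattern-in-Gaussian-noise problem and its matched-filter solution fit in a few lines (a toy illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)

template = rng.normal(size=64)
template /= np.linalg.norm(template)      # the fixed, known signal pattern

def stimulus(signal_present, noise_sd=0.5):
    return signal_present * template + rng.normal(scale=noise_sd, size=64)

# Matched filter / template matcher: project onto the template, threshold.
# SDT says this is optimal here -- but only because the signal is a fixed
# pattern; it fails as soon as the target varies (pose, lighting, category).
labels = rng.integers(0, 2, size=500)
decisions = np.array([stimulus(s) @ template > 0.5 for s in labels])
accuracy = np.mean(decisions == labels.astype(bool))
```

A chihuahua in a scene is not a fixed pattern plus additive noise, which is exactly why the optimality result does not transfer to vision.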

 

“SDT has been a very useful reference in interpreting human psychophysical performance (e.g. Geisler, 1989; Pelli et al., 2006). However, it provides no account of learning. Machine learning shows promise of guiding today’s investigations of human learning and may reveal the constraints imposed by the training set on learning.”

In addition to offering learning algorithms that might relate to how brains learn, machine learning enables us to use realistically complex models at all.

 

“It can be hard to tell whether behavioral performance is limited by the set of stimuli, or the neural representation, or the mismatch between the neural decision process and the stimulus and task. Implications for classification performance are not readily apparent from direct inspection of families of stimuli and their neural responses.”

Intriguing, but cryptic. Please clarify.

 

“Some biologists complain that neural nets do not match what we know about neurons (Crick, 1989; Rubinov, 2015).”

It is unclear how the ideal “match” should even be defined. All models abstract, and that is their purpose. Stating a feature of biology that is absent in the model does not amount to a valid criticism. But there is a more detailed case to be made for incorporating more biologically realistic dynamic components, so please elaborate.

 

“In particular, it is not clear, given what we know about neurons and neural plasticity, whether a backpropagation network can be implemented using biologically plausible circuits (but see Mazzoni et al., 1991, and Bengio et al., 2015).”

Neural net models can be good models of perception without being good models of learning. There has also been a recent resurgence in work exploring how backpropagation, or a closely related form of credit assignment, might be implemented in brains. Please discuss the work along these lines by Senn, Richards, Bogacz, and Bengio.

 

“Some biological modelers complain that neural nets have alarmingly many parameters. Deep neural networks continue to be opaque”

Why are many parameters “alarming” from the more traditional perspective on modeling? Do you feel that the alarm is justified? My view is that the history of AI has shown that intelligence requires rich domain knowledge. Simple models therefore will not be able to explain brain information processing. Machine learning has taught us how to learn complex models and avoid their pitfalls (overfitting).

 

“Some statisticians worry that rigorous statistical tools are being displaced by machine learning, which lacks rigor (Friedman, 1998; Matloff, 2014, but see Breiman, 2001; Efron & Hastie, 2016).”

The classical simple models can't cut it, so their rigor doesn't help us. Machine learning has boldly engaged complex models as are required for AI and brain science. To be able to do this, it initially took a pragmatic computational, rather than a formal probabilistic, approach. However, machine learning and statistics have since grown together in many ways, providing a very general perspective on probabilistic inference that combines complexity and rigor.

 

“It didn’” (p. 9) Fragment.

 

“Unproven convexity. A problem is convex if there are no local minima other than the global minimum.”

I think this is not true. Here’s my current understanding: If a problem is convex, then any local minimum is the global minimum. This is convenient for optimization and provably not the case for neural networks. However, the reverse implication does not hold: if every local minimum is a global minimum, the function is not necessarily convex. There is a category of cost functions that are not convex, but every local minimum is a global minimum. Neural networks appear to fall in this category (at least under certain conditions that tend to hold in practice).

Note that there can be multiple global minima. In fact, the error function of a neural network over the weight domain typically has many symmetries, with any given set of weights having many computationally equivalent twins (i.e. the model computes the same overall function for different parameter settings). The high dimensionality, however, is not a curse, but a blessing for gradient descent: In a very high-dimensional weight space, it is unlikely that we find ourselves trapped, with the error surface rising in all directions. There are too many directions to escape in. Several papers have argued that local minima are not an issue for deep learning. In particular, it has been argued that every local minimum is a global minimum and that every other critical point is a saddle point, and that saddle points are the real challenge. Moreover, deep nets with sufficient parameters can fit the training data perfectly (interpolating), while generalizing well (which, surprisingly, some people find surprising). There is also evidence that stochastic gradient descent finds flat minima corresponding to robust solutions.

Example of a non-convex error function whose every local minimum is a global minimum (Dauphin et al. pp2014).
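A minimal toy example of such a cost function (my own construction, not from Dauphin et al.): the squared error of a two-weight linear "network" w1*w2 fitting the target 1.

```python
def f(w1, w2):
    """Error surface of the two-weight linear chain w1*w2 fitting target 1."""
    return (w1 * w2 - 1.0) ** 2

# Every local minimum lies on the hyperbola w1*w2 = 1, where f = 0, so all
# local minima are global. The only other critical point, (0, 0), is a
# saddle (Hessian eigenvalues +2 and -2). Yet f is not convex: on the
# segment between the global minima (1, 1) and (-1, -1), the midpoint
# (0, 0) has f = 1, above the chord value 0.
assert f(1.0, 1.0) == 0.0 and f(-1.0, -1.0) == 0.0
assert f(0.0, 0.0) == 1.0
```

Note also that the set of global minima here is a curved manifold, not a point, mirroring the continua of equivalent solutions in deep networks.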

“This [convexity] guarantees that gradient-descent will converge to the global minimum. As far as we know, classifiers that give inconsistent results are not useful.”

That doesn’t follow. A complex learner, such as an animal or neural net model, with idiosyncratic and stochastic initialization and experience may converge to an idiosyncratic solution that is still “useful” – for example, classifying with high accuracy and a small proportion of idiosyncratic errors.

 

“Conservation of a solution across seeds and algorithms is evidence for convexity.”

No, but it may be evidence for a minimum with a large basin of attraction. Would need to define what counts as conservation of a solution: (1) identical weights, (2) computationally equivalent weights (same input-output mapping). Definition 2 seems more helpful and relevant.
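Definition 2 is also why comparing raw weights is uninformative: any hidden-unit permutation yields "computationally equivalent twins". A quick numpy check (toy network, my construction):

```python
import numpy as np

rng = np.random.default_rng(5)

# Tiny one-hidden-layer net: y = W2 @ relu(W1 @ x), 8 hidden units.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(1, 8))

def net(x, W1, W2):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Permute the hidden units: reorder rows of W1 and columns of W2 together.
perm = np.roll(np.arange(8), 1)
x = rng.normal(size=4)

assert not np.array_equal(W1, W1[perm])  # definition 1: weights differ
assert np.allclose(net(x, W1, W2),
                   net(x, W1[perm], W2[:, perm]))  # definition 2: same function
```

So "conservation of a solution" across seeds should be tested at the level of the input-output mapping (or modulo such symmetries), not the raw weights.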

 

““Adversarial” examples have been presented as a major flaw in deep neural networks. These slightly doctored images of objects are misclassified by a trained network, even though the doctoring has little effect on human observers. The same doctored images are similarly misclassified by several different networks trained with the same stimuli (Szegedy, et al., 2013). Humans too have adversarial examples. Illusions are robust classification errors. […] The existence of adversarial examples is intrinsic to classifiers trained with finite data, whether biological or not.”

I agree. We will know whether humans, too, are susceptible to the type of adversarial example described in the cited paper, as soon as we manage to backpropagate through the human visual system so as to construct comparable adversarial examples for humans.

 

“SDT solved detection and classification mathematically, as maximum likelihood. It was the classification math of the sixties. Machine learning is the classification math of today. Both enable deeper insight into how biological systems classify. In the old days we used to compare human and ideal classification performance. Today, we can also compare human and machine learning.”

“…the performance of current machine learning algorithms is a useful benchmark”

SDT is classification math for linear models, ML is classification math for more complex models. These models enable us to tackle the real problem of vision. Rather than comparing human performance to a normative ideal of performance on a toy task, we can use deep neural networks to model the brain information processing underlying visual recognition. We can evaluate the models by comparing their internal representations to brain representations and their behavior to human behavior, including not only the ways they shine, but also the ways they stumble and fail.

 

 

Recurrent neural net model trained on 20 classical primate decision and working memory tasks predicts compositional neural architecture

[I8R8]

 

Yang, Song, Newsome, and Wang (pp2017) trained a rate-coded recurrent neural network with 256 hidden units to perform a variety of classical cognitive tasks. The tasks combine a number of component processes including evidence accumulation over time, multisensory integration, working memory, categorization, decision making, and flexible mapping from stimuli to responses. The tasks include:

  • speeded response indicating the direction of the stimulus (stimulus-response mapping)
  • speeded response indicating the opposite of the direction of the stimulus (flexible stimulus-response mapping)
  • response indicating the direction of a stimulus after a delay during which the stimulus is not visible (working memory)
  • decision indicating which of two noisy stimulus inputs is stronger (evidence accumulation)
  • decision indicating which of two ranges of the stimulus variable the stimulus falls in (categorization)

The 20 distinct tasks result from combining in various ways the requirements of accumulating stimulus evidence from two sensory modalities, maintaining stimulus evidence in working memory during a delay, deciding which category the stimulus fell in, and flexible mapping to responses.

The tasks reduce cognition to its bare bones and the model abstracts from the real-world challenges of perception (pattern recognition) and motor control, so as to focus on the flexible linkage between perception and action that we call cognition. The input to the model includes a "fixation" signal, sensory stimuli varying along a single circular dimension, and a rule input that specifies the task index.

The fixation signal is given through a special unit, whose activity corresponds to the presence of a fixation dot on the screen in front of a primate subject. The fixation signal accompanies the perceptual and maintenance phases of the task, and its disappearance indicates that the primate or model should respond. The sensory stimulus (“direction of stimulus from fixation”) is encoded in a set of direction-tuned units representing the circular dimension. Each of two sensory modalities is represented by such a set of units. The task rule is entered in one-hot format through a set of task units that receive the task index throughout performance of a task (no need to store the current task in working memory). The motor output is a “saccade direction” encoded, similarly to the stimulus, by a set of direction-tuned units.
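Concretely, the input format described here might be sketched as follows (ring size and tuning width are my assumptions, not the paper's exact values):

```python
import numpy as np

n_ring, n_tasks = 32, 20
pref = np.linspace(0, 2 * np.pi, n_ring, endpoint=False)  # preferred directions

def ring(theta, on=True):
    """Direction-tuned population: a bump of activity centered on theta."""
    return float(on) * np.exp(2.0 * (np.cos(theta - pref) - 1.0))

def input_vector(fixation, theta1, theta2, task_id, mod1_on=True, mod2_on=True):
    """1 fixation unit + two sensory rings (one per modality) + one-hot rule."""
    rule = np.zeros(n_tasks)
    rule[task_id] = 1.0
    return np.concatenate(([fixation], ring(theta1, mod1_on),
                           ring(theta2, mod2_on), rule))

x = input_vector(fixation=1.0, theta1=np.pi / 2, theta2=0.0,
                 task_id=3, mod2_on=False)
assert x.shape == (1 + 2 * n_ring + n_tasks,)      # 85 input units
assert x[1 + 2 * n_ring + 3] == 1.0                # rule unit for task 3 on
assert np.all(x[1 + n_ring: 1 + 2 * n_ring] == 0)  # modality 2 silent
```

The output is a matching direction-tuned ring plus a fixation unit, so both stimulus and "saccade" live on the same circular dimension.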

Such tasks have long been used in nonhuman primate cell recording and human imaging studies, and also in rodent studies, in order to investigate how basic building blocks of cognition are implemented in the brain. This paper provides an important missing link between primate cognitive neurophysiology and rate-coded neural networks, which are known to scale to real-world artificial intelligence challenges.

Unsurprisingly, the authors find that the network learns to perform all 20 tasks after interleaved training on all of them. They then perform a number of well-motivated analyses to dissect the trained network and understand how it implements its cognitive feats.

An important question is whether particular units serve task-specific or task-general functions. One extreme hypothesis is that each task is implemented in a separate set of units. The opposite hypothesis is that all tasks employ all units. In order to address the degree of task-generality of the units, the authors measure the extent to which each unit conveys relevant information in each task. This is measured by the variance of a unit's activity across different conditions within a task (termed the task variance). The authors find that the network learns to share some of the dynamic machinery it learns among different tasks.

Figure 4 from the paper shows the extent to which two tasks are subserved by disjoint or overlapping sets of units. Each panel shows a comparison between two tasks (decision making about modality 1, DM1; delayed decision making about modality 1, Dly DM 1; Context-dependent decision making about modality 1, Ctx DM 1; delayed match to category, DMC; delayed non-match to category, DNMC). The histograms show how the 256 units are distributed in terms of their “fractional task variance” (FTV), which measures the degree to which a unit conveys information in task 1 (FTV = -1), in task 2 (FTV = 1) or in both equally (FTV = 0).
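In code, the two measures amount to the following (simulated responses; the sign convention follows the caption above, with task 1 at FTV = -1):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated responses of 256 units under 40 conditions per task, with
# unit-specific gains so some units are informative in one task only.
gain1 = rng.uniform(0, 2, size=256)
gain2 = rng.uniform(0, 2, size=256)
resp_task1 = rng.normal(size=(40, 256)) * gain1
resp_task2 = rng.normal(size=(40, 256)) * gain2

# Task variance: a unit's response variance across conditions within a task.
tv1 = resp_task1.var(axis=0)
tv2 = resp_task2.var(axis=0)

# Fractional task variance: -1 (only task 1) ... +1 (only task 2).
ftv = (tv2 - tv1) / (tv2 + tv1)
assert ftv.shape == (256,) and np.all(np.abs(ftv) <= 1)
```

The same computation can be applied unchanged to neuronal recordings, which is part of what makes the measure attractive.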

The authors find evidence for a compositional implementation of the tasks in the trained network. Compositionality here means that the tasks employ overlapping sets of functional components of the network. Rather than learning a separate dynamical system for each task, the network appears to learn dynamic components serving different functions that can be flexibly combined to enable performance of a wide range of tasks.

The authors’ argument in favor of a compositional architecture is based on two observations: (1) Pairs of tasks that share cognitive component functions tend to involve overlapping sets of units. (2) Task-rule inputs, though trained in one-hot format, can be linearly combined (e.g. Delay Anti = Anti + Delay Go – Go), and the network, given such a task specification (on which it has never been trained), will perform the implied task with high accuracy.
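Observation (2) in vector form (toy four-task subset):

```python
import numpy as np

tasks = ["Go", "Delay Go", "Anti", "Delay Anti"]  # subset of the 20 tasks
one_hot = {t: np.eye(len(tasks))[i] for i, t in enumerate(tasks)}

# Compositional rule input: Delay Anti = Anti + Delay Go - Go.
# The network receives this mixed vector instead of a trained one-hot rule
# and (per the paper) still performs Delay Anti with high accuracy.
composite = one_hot["Anti"] + one_hot["Delay Go"] - one_hot["Go"]

# Note that the Delay Anti rule unit itself stays at 0 in the composite:
assert composite[tasks.index("Delay Anti")] == 0.0
assert np.array_equal(composite, np.array([-1.0, 1.0, 1.0, 0.0]))
```

That the network generalizes to such never-trained rule vectors is the strongest evidence that the rule units tap shared functional components rather than task-specific circuits.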

 

Figure 6 from the paper supports the argument that the network learns a compositional architecture. During training, the task rule index is given in the form of a one-hot vector (a). The trained network can be given a linear combination of the trained task rules (c), such that adding and subtracting component functions (e.g. anti-mapping of stimuli to responses, working memory maintenance over delay, speeded reaction) according to the weights specifies a different task (Delay Anti = Anti + Delay Go – Go). The network then performs the compositionally specified task with high accuracy, even though the rule input unit corresponding to that task is set to 0.

These analyses are interesting because they help us understand how the network works and because they can also be applied to primate cell recordings and help us compare models to brains.

When the network is sequentially trained on one task at a time, the learning of new tasks interferes with previously acquired tasks, reducing performance. However, a continual learning technique that selectively protects certain learned connections enabled sequential acquisition of multiple tasks.
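Selective protection of learned connections typically means a quadratic penalty anchoring important weights (in the spirit of elastic weight consolidation or synaptic intelligence; the paper's exact method may differ, and all numbers below are invented):

```python
import numpy as np

def protected_loss(w, loss_B, w_A, omega, c=1.0):
    """Task-B loss plus a penalty for moving weights important for task A:
    L_B(w) + (c/2) * sum_i omega_i * (w_i - w_A_i)^2."""
    return loss_B(w) + 0.5 * c * np.sum(omega * (w - w_A) ** 2)

w_A = np.array([1.0, -2.0])             # weights after learning task A
omega = np.array([10.0, 0.0])           # w[0] important for task A, w[1] not
loss_B = lambda w: w[0] ** 2 + (w[1] - 3.0) ** 2  # task B pulls both weights

# Closed-form optimum per coordinate: w* = (2*target + c*omega*w_A)/(2 + c*omega)
w_star = (2 * np.array([0.0, 3.0]) + omega * w_A) / (2 + omega)

assert abs(w_star[0] - w_A[0]) < abs(0.0 - w_A[0])  # protected weight stays put
assert np.isclose(w_star[1], 3.0)                   # unprotected weight moves
assert protected_loss(w_star, loss_B, w_A, omega) < \
       protected_loss(np.array([0.0, 3.0]), loss_B, w_A, omega)
```

The importance weights omega are what makes the protection selective: only connections that mattered for earlier tasks resist change.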

Overall, this is a highly original paper presenting a simple, yet well-motivated model and several useful analysis methods for understanding biological and artificial neural networks. The model extends the authors’ previous work on the neural implementation of some of these components of cognition. Importantly, the paper helps strengthen the link between rate-coded neural network models and primate (and rodent) cognitive neuroscience.

 

Strengths

  • The model is simple and well-designed and helps us imagine how basic components of cognition might be implemented in a recurrent neural network. It is essential that we build task-performing models to complement our fallible intuitions as to the signatures of cognitive processes we should expect in neuronal recordings.
  • The paper links primate cognitive neurophysiology to rate-coded neural networks trained with stochastic gradient descent. This might help boost future interactions between neurophysiologists and engineers.
  • The measures and analyses introduced to dissect the network are well-motivated, straightforward, and imaginative. Several of them can be equally applied to models and neuronal recordings.
  • The paper is well-written, clear, and tells an interesting story.
  • The figures are of high quality.

 

Weaknesses

  • The tasks are so simple that they do not pose substantial computational challenges. This is a strength because it makes it easier to understand neuronal responses in primate brains and unit responses in models. We have to start from the simplest instances of cognition. However, it is also a weakness. Consider the comparison to understanding the visual system. One approach is to reduce vision to discriminating two predefined images. The optimal algorithm for this task is a linear filter applied to the image. The intuitive reduction of vision to this scenario supports the template-matching model. However, this task and its optimal solution fundamentally misconstrues the challenge of visual recognition in the real world, which has to deal with complex accidental variation within each category to be recognized. The dominant current vision model is provided by deep neural networks, which perform multiple stages of nonlinear transformation and learn rich knowledge about the world. Simple cognitive tasks provide a starting point, but – like the two-image discrimination task in vision – abstract away many essential features of cognition. In vision, models are tested in terms of their performance on never seen images – a generalization challenge at the heart of what vision is all about. In cognition as well, we ultimately have to engage complex tasks and test models in terms of their ability to generalize to new instances drawn randomly from a very complex space. The paper leaves me wondering how we can best take small steps from the simple tasks dominating the literature toward real-world cognitive challenges.
  • The paper does not compare a variety of models. Can we learn about the mechanism the brain employs without comparing alternative models? Rate-coded recurrent neural networks are universal approximators of dynamical systems. This property is independent of particular choices defining the units. It is entirely unsurprising that such a model, trained with stochastic gradient descent, can learn these tasks (and the supertask of performing all 20 of them). Given the simplicity of the tasks, it is also not surprising that 256 recurrent units suffice. In fact, the authors report that the results are robust between 128 and 512 recurrent units. The value of this project consists in the way it extends our imagination and generates hypotheses (to be tested with neuronal recordings) about the distributions of task-specific and task-general units. The simplicity of the model and its gradient descent training provides a compelling starting point. However, there are infinite ways a recurrent neural network might implement performance at these tasks. It will be important to contrast alternative task-performing models and adjudicate between them with brain and behavioral data.
  • The paper does not include analyses of biological recordings or behavioral data, which could help us understand the degree to which the model resembles or differs from the primate brain in the way it implements task performance.

Addressing all of these weaknesses could be considered beyond the scope of the current paper. But the authors should consider if they can go toward addressing some of them.

 

Suggested improvements

(1) It might be useful to explicitly model the 20 tasks in terms of cognitive component functions (multisensory integration, evidence accumulation, working memory, inversion of stimulus-response mapping, etc.). The resulting matrix could be added to Table 1 or shown separately. This compositional cognitive description of the tasks could be used to explain the patterns of unit involvement in different tasks (e.g. as measured by task variance) using a linear model. The compositional model could then be inferentially compared to a non-compositional model in which each task has a single cognitive component function. This more hypothesis-driven approach might help to address the question of compositionality inferentially.

(2) The depiction of the neural network model in Figure 1 could give a better sense of the network complexity and architecture. Instead of the three-unit icon in the middle, how about a directed graph with 256 dots, one for each recurrent unit, and separate circular arrangements of input and output units (how many were there?). Instead of the network-unit icon with the cartoon of the nonlinear activation, why not show the actual softplus function?

(3) It would be good to see the full 256 × 256 connectivity matrix (ordered by clusters) and the network as a graph with nodes arranged by proximity in the connectivity matrix and edges colored to indicate the weights.

(4) The paper states that “the network can maintain information throughout a delay period of up to five seconds.” What does time in seconds mean in the context of the model? Is time meaningful because the units have time constants similar to biological neurons? It would be good to add supplementary text and perhaps a figure that explains how the pace of processing is matched to biological neural networks. If the pace is not compellingly matched, on the other hand, then perhaps real-time units (e.g. seconds) should not be used when describing the model results.
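The mapping from model steps to seconds presumably comes from the discretization of the rate dynamics. A minimal sketch, assuming Euler integration of tau · dr/dt = −r + f(W r + input) with an illustrative tau of 100 ms and step of 20 ms (these particular values are assumptions, not taken from the paper):

```python
import numpy as np

tau, dt = 0.1, 0.02          # time constant and step, in seconds (assumed)
alpha = dt / tau             # leak/update factor per discrete step

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(1)
n = 256
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
r = np.zeros(n)

# Under these assumptions, a 5 s delay period is 5 / dt = 250 update steps.
steps = int(5.0 / dt)
for _ in range(steps):
    r = (1 - alpha) * r + alpha * softplus(W @ r)   # no input during delay
```

The claim "maintains information for five seconds" is then really a claim about 250 recurrent iterations, and its biological meaning rests entirely on whether tau and dt are justified.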

(5) Please clarify whether the hidden units are fully recurrently connected. It would also be good to extend the paper to report how the density of recurrent connectivity affects task performance, learning, clustering and compositionality.
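A density manipulation could be implemented with a fixed binary mask on the recurrent weights, as is common when training sparse RNNs. A sketch under that assumption (the masking scheme and scaling are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, density = 256, 0.25       # e.g. 25% of possible recurrent connections

# Fixed binary mask; applying it after every gradient update would keep
# absent connections at exactly zero throughout training.
mask = (rng.random((n, n)) < density).astype(float)

# Rescale initial weights so the input variance per unit is density-invariant.
W = rng.normal(scale=1.0 / np.sqrt(density * n), size=(n, n)) * mask
```

Sweeping `density` from, say, 0.05 to 1.0 and re-running the clustering analyses would show whether the reported cluster structure depends on full recurrent connectivity.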

(6) The initial description of task variance is not entirely clear. State explicitly that one task-variance estimate is computed for each task, reflecting the response variance across conditions within that task, and thus providing a measure of the stimulus information the unit conveys during that task.
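On that reading, the computation reduces to one variance per (task, unit) pair. A sketch with assumed shapes and placeholder data (responses would in practice be the unit activations, e.g. time-averaged within trials):

```python
import numpy as np

rng = np.random.default_rng(3)
n_tasks, n_conditions, n_units = 20, 40, 256   # shapes are assumptions
responses = rng.random((n_tasks, n_conditions, n_units))

# One task-variance estimate per task and unit: the variance of the unit's
# response across the conditions of that task.
task_variance = responses.var(axis=1)          # shape (n_tasks, n_units)
```

Each unit is then characterized by its 20-dimensional task-variance pattern, which is the signature used for the clustering analyses.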

(7) Clustering is useful here as an exploratory and descriptive technique for dissecting the network, carving the model at its joints. However, clustering methods like k-means always output clusters, even when the data are drawn from a unimodal continuous distribution. The title claim of “clusters” thus should ideally be substantiated (by inferential comparison to a continuous model) or dropped.
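The concern can be demonstrated directly: a minimal Lloyd's k-means run on data drawn from a single Gaussian (no cluster structure whatsoever) still returns k clusters. The shapes mirror the paper's setting, but the data here are pure noise:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(256, 20))   # 256 "units" x 20 "tasks": one Gaussian
k = 12

# Lloyd's algorithm: assign to nearest center, recompute centers.
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(50):
    dists = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
    labels = dists.argmin(axis=1)
    centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
```

A gap-statistic-style test would compare the within-cluster dispersion of the real task-variance patterns against its distribution under such a continuous null model; only if the real dispersion is significantly lower is the claim of discrete clusters substantiated.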

(8) The clustering will depend on the multivariate signature used to characterize each unit. Instead of task variance patterns, a unit’s connectivity (incoming and outgoing) could be used as a signature and basis of clustering. How do results compare for this method? My guess is that using the task variance pattern across tasks tends to place units in the same cluster if they contribute to the same task, although they might represent different stimulus information in the task. If this is the motivation, it would be good to explain it more explicitly.
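A connectivity-based signature is straightforward to construct: concatenate each unit's incoming and outgoing weights. A sketch assuming `W_rec` is the 256 × 256 recurrent weight matrix (input and output weights, which could be appended as well, are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 256
W_rec = rng.normal(size=(n, n))   # placeholder for the trained weights

# Row i holds unit i's incoming weights; column i its outgoing weights.
# Concatenating both gives one 512-dimensional signature per unit.
signature = np.concatenate([W_rec, W_rec.T], axis=1)   # shape (n, 2n)
```

Clustering on `signature` instead of the task-variance patterns would group units by how they are wired rather than by which tasks engage them, and comparing the two partitions would be informative.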

(9) It is an interesting question whether units in the same cluster serve the same function. (It seems unlikely in the present analyses, but would be more plausible if clustering were based on incoming and outgoing weights.) The hypothesis that units in a cluster serve the same function could be made precise by saying that the units in a cluster share the same patterns of incoming and outgoing connections, except for weight noise resulting from the experiential and internal noise during training. Under this hypothesis incoming weights are exchangeable among units within the same cluster. The same holds for outgoing weights. The hypothesis could, thus, be tested by shuffling the incoming and the outgoing weights within each cluster and observing performance. I would expect performance to drop after shuffling and would interpret this as a reminder that the cluster-level summary is problematic. Alternatively, to the extent that clusters do summarize the network well, one might try to compress the network down to one unit per cluster, by combining incoming and outgoing weights (with appropriate scaling), or by training a cluster-level network to approximate the dynamics of the original network.
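The proposed exchangeability test is easy to implement: permute the incoming weight vectors (rows of the recurrent matrix) among the units of each cluster, analogously for outgoing weights (columns), and re-evaluate task performance. A sketch with placeholder weights and a hypothetical cluster assignment (the performance evaluation itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 256
W = rng.normal(size=(n, n))              # placeholder recurrent weights
labels = rng.integers(0, 12, size=n)     # hypothetical cluster assignment

# Shuffle incoming weight vectors (rows) within each cluster.
W_shuffled = W.copy()
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    W_shuffled[members, :] = W[rng.permutation(members), :]

# Outgoing weights would be shuffled analogously on columns: W[:, perm].
```

Under the exchangeability hypothesis, `W_shuffled` should support the same task performance as `W`; a large performance drop would indicate that units within a cluster are not functionally interchangeable.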

(10) The method of t-SNE is powerful, but its results strongly depend on the parameter settings, creating an issue of researcher degrees of freedom. Moreover, the objective function is difficult to state precisely in a single sentence (if you disagree, please try). Multidimensional scaling by contrast uses a range of objective functions that are easy to define in a single sentence. I wonder why t-SNE should be preferred in this particular context.
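For contrast, classical MDS has a one-sentence objective: find low-dimensional points whose pairwise distances best approximate the given distances in the least-squares sense. It is also a short, deterministic computation via double-centering, sketched here on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 20))                    # placeholder signatures
D2 = ((X[:, None] - X[None]) ** 2).sum(axis=2)   # squared distance matrix

n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
B = -0.5 * J @ D2 @ J                            # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)                   # eigenvalues ascending
Y = vecs[:, -2:] * np.sqrt(vals[-2:])            # 2-D embedding
```

There are no perplexity or learning-rate settings to tune, which removes the researcher-degrees-of-freedom concern that t-SNE raises.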

(11) Another way to address compositionality would be to assess whether a new task can be more rapidly acquired if its components have been trained as part of other tasks previously.

(12) In Fig. 3 c and e, label the horizontal axis (cluster).

(13) It is great that the TensorFlow implementation will be shared. It would be good if the model data could also be shared in formats useful to people using Python as well as Matlab. This could be a great resource for students and researchers. Please state more completely in the main paper exactly what (Python code? Task and model code? Model data?) will be available where (GitHub?).

(14) After sequential training, performance at multisensory delayed decision making does not appear to suffer compared to interleaved training. Was this because multisensory delayed decision making was always the last task (thus not overwritten) or is it more robust because it shares more components with other tasks?

(15) A better word for “linear summation” is “sum”.