Reading nonverbal thoughts from the human brain

Tomoyasu Horikawa presents a method called “mind captioning” for decoding perceptual and cognitive content in the form of English text from human brain activity measured with functional MRI (brain-to-text: b2t). Brain-to-text decoding is an important concept because of the versatility and universality of language: It promises to enable us to read out all kinds of brain representations (not just those of linguistic content) and thus has broad potential for neuroscience and applications requiring brain-machine interfaces.

The author relied on human annotators to generate multiple verbal captions describing each of thousands of videos (video-to-text: v2t). He fed these captions to a neural-network language model to obtain a compressed semantic feature vector characterizing the content of each video (text-to-features: t2f). He trained an L2-regularized linear decoder for each semantic feature, predicting that feature from human brain activity measured with functional MRI (fMRI) while subjects watched videos (brain-to-features: b2f). He then converted the decoded features to text (features-to-text: f2t) using an iterative text synthesis procedure that inverts the t2f mapping.
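
To make the b2f stage concrete: an L2-regularized (ridge) linear decoder has a closed-form solution. The sketch below uses random stand-in data with illustrative dimensions (the names `n_trials`, `n_voxels`, `n_features` and the penalty value are my assumptions for illustration, not the study's actual settings or implementation).

```python
# Sketch of the b2f stage: one L2-regularized (ridge) linear mapping
# from fMRI voxel patterns to caption-derived semantic feature vectors.
# Data and dimensions are illustrative stand-ins.
import numpy as np
from numpy.linalg import solve

rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 200, 500, 64

X = rng.standard_normal((n_trials, n_voxels))    # fMRI responses per video
Y = rng.standard_normal((n_trials, n_features))  # t2f feature targets

lam = 10.0  # ridge penalty; in practice chosen by cross-validation
# Closed-form ridge solution W = (X^T X + lam I)^-1 X^T Y, solved jointly
# for all feature dimensions (equivalent to fitting each one independently,
# which is why "one decoder per semantic feature" reduces to one matrix).
W = solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Y)

Y_hat = X @ W  # decoded semantic feature vectors (input to the f2t stage)
```

The ridge penalty matters here because fMRI decoding is typically underdetermined (more voxels than trials), so the unregularized solution would not be unique.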

This iterative evolutionary text synthesis procedure is an important contribution. It proceeds from a seed (such as an uninformative token), iteratively replacing words so as to improve the correlation between the semantic feature vector that the text-to-feature language model assigns to the candidate text and the feature vector decoded from brain activity. The mutated captions are constructed by masking out particular words and generating potential replacements with another language model (RoBERTa-large), which was trained to predict masked tokens. This masked language model provides probable completions and thus constrains the search to natural text descriptions, and the candidate descriptions best matching the decoded features are selected for further optimization.
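
The search logic can be sketched as a mask-propose-select loop. In the toy version below, both models are stand-ins of my own devising: `embed` replaces the neural text-to-feature model with word counts, `propose` replaces RoBERTa-large with a six-word vocabulary, and the match score is a simple negative squared distance rather than the correlation the paper uses. Only the loop structure reflects the described procedure.

```python
# Toy sketch of the iterative f2t text synthesis loop.
import random

TARGET = [1, 1, 1, 0]  # stand-in for a feature vector decoded from the brain
VOCAB = ["man", "dog", "bites", "park", "the", "a"]

def embed(words):
    # Stand-in for the text-to-feature (t2f) language model.
    return [words.count(w) for w in ("man", "dog", "bites", "park")]

def score(words):
    # Stand-in match score; the paper maximizes correlation instead.
    return -sum((a - b) ** 2 for a, b in zip(embed(words), TARGET))

def propose(words, i):
    # Stand-in for masked-LM completions: replace position i with each
    # vocabulary word (RoBERTa-large would propose probable words here).
    return [words[:i] + [w] + words[i + 1:] for w in VOCAB]

def synthesize(seed_words, n_iters=50, rng_seed=0):
    rng = random.Random(rng_seed)
    best = list(seed_words)
    for _ in range(n_iters):
        i = rng.randrange(len(best))           # mask one word position
        cand = max(propose(best, i), key=score)  # best-matching mutation
        if score(cand) >= score(best):           # greedy selection
            best = cand
    return best

result = synthesize(["a", "a", "a"])  # seed of uninformative tokens
```

In this toy setting the loop converges to a word multiset matching the target features; the real procedure additionally relies on the masked language model's completions to keep candidates within natural-sounding text.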

The study applies the decoder not only to brain activity measured while subjects view videos, but also to activity measured while subjects recall and imagine videos they previously viewed. Recall-based imagery can be decoded from high-level visual cortex at levels far above chance, though much lower than perception. Careful encoding and decoding analyses demonstrate that information about the videos is widespread throughout the human cortex, including in the language network. However, excluding the language network from decoding did not substantially reduce decoding performance. This is a key result because the goal of brain-to-text decoding is not the decoding of verbal thoughts, but the use of text to capture the information in all kinds of brain representations, most of which are not verbal. Language is an excellent format for decoding because it can capture concrete as well as abstract information. Unlike a decoder that outputs images, a text decoder can leave out information that is unspecified in the representation being decoded.

A central claim of the study is that the results support the hypothesis that high-level visual cortex contains structured semantic representations that capture not only the sets of objects present in a scene but also their relationships (such as “man bites dog” as opposed to “dog bites man”). In addition, the author suggests that the text synthesis approach enables “faithful” decoding that is unbiased, or at least less biased, by prior knowledge than previous approaches (e.g., those relying on a caption database).

Overall, this is excellent work, tackling a grand decoding challenge with many original and inspiring ideas which are expertly implemented. The analyses in the main paper and the supplementary analyses are careful and comprehensive. The examples of decoded text are impressive. However, the claim of “faithful” or “unbiased” decoding does not make sense to me. Arguably it is not even desirable to decode without prior information (i.e. without bias): To understand what the information in the brain “means”, we need to interpret it in light of what we know about the world. After all, the rest of the brain that is using the representation is also interpreting it in the context of what it knows about the world. The author should either rigorously justify these claims or leave them out.

The claims about structured semantic representation and representation of relationships may also need to be tempered a bit. The word-shuffling analyses supporting these claims may be compromised by the fact that the shuffled text falls outside the distribution the text-to-feature language model was trained on. Rigorously addressing the structured relational semantics hypothesis would require out-of-distribution tests, such as a video of a man biting a dog (an example the author introduces in the discussion), whose decoding might reveal to what extent the decoder relies on the brain representation and to what extent it infers the structure in the decoded text from its prior knowledge of the world. The paper could be further improved by discussing the motivations for the design choices of the decoder, the alternatives considered, and why those alternatives are or are not promising.

Even if some of the claims need adjustment, this is an excellent and highly original contribution that will be of broad interest to neuroscientists and researchers in other fields.  

Suggestions

  1. Fully justify or weaken claims of “faithful” decoding unbiased by prior information. 
  2. Add a figure and table clarifying the different formats of information (video, visual features, captions, semantic features, brain activity) and all the transformations (v2t by humans, t2f by language models, b2f by linear decoder, f2t by iterative text synthesis).
  3. Add a section to the discussion motivating the particular choices for these transformations. For example, why should brain activity and text be aligned at the level of the semantic features? Why not learn to map directly from brain activity to text? Why use an iterative inversion of the t2f model rather than learning a direct f2t mapping? How well does the text-to-feature model preserve the information in the text? If presented with the feature vectors corresponding to a set of independent draws from the training distribution of captions (different captions, but IID), how well does the optimization method recover the descriptions? How much of the information in the recovered verbal description is encoded in the semantic features, and how much comes from the prior implicit in the text-to-feature encoder?
  4. Add a section to the discussion addressing whether “faithful” or “unbiased” decoding is even well-defined as an ideal – whether or not it is achievable in practice. 

Strengths

  • The paper addresses an inspiring and important challenge with scientific and applied dimensions.
  • Decoders are applied not only to data acquired during the viewing of videos, but also to data acquired during memory-recall-driven mental imagery.
  • The iterative text synthesis decoding procedure is original and powerful. 
  • The methods are original and state of the art.
  • The encoding and decoding analyses are comprehensive and careful, with extensive supplementary analyses and single-subject results, presenting a rich picture.
  • The paper uses and compares a wide range of current neural-network language models, which provide alternative semantic feature spaces.

Weaknesses

  • The study attempts something that may be impossible: To “faithfully” reveal the structured semantic information explicitly represented in the brain. Prior information about the language and our world inevitably informs the decoded text. It is unclear what it would even mean to decode into text without prior information.
  • The paper claims that the text synthesis procedure is not biased by knowledge about the world, but both the caption-to-semantic-feature language models and the masked language model used to guide the iterative synthesis have massive knowledge of relational structure in the world, which we should expect to constrain the decoded text.
  • The study does not include strong out-of-distribution probes of the decoders, which could reveal to what extent the relational semantic information originates from compositional brain representations or is inferred by the decoder using its world knowledge.