Summary - Neural Module Networks

Research Paper -

Authors - Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein

Key Idea

Parse questions of visual QA into a description of compositions of functions. These functions are neural networks called Neural Modules. Execute the neural networks and reweigh the resulting label using question representation.

Architecture diagram

Task - Visual Question Answering

Given a question like “What color is the coffee mug?” and an image we want to predict the answer.

4  visual qa examples. One example: Two images having a man and women. One with man wearing glasses and another image with woman wearing glasses.Question, "Who is wearing glasses?" and respective answer below the image

Prior approaches

  • End to End neural networks
    • Use a CNN to vectorize the image and RNN to vectorize the question and use a feed forward network to classify the answer.
    • This is a black box trying to answer in one shot.
  • Semantic Parsing approach
    • Parse the question into logical expressions, image into logical representation of the world and use logic based reasoning to solve the problem.
    • This is more compositional.


  • Combine the representational capacity of neural nets and compositionality of symbolic approach.
  • So, “Rather than thinking of question answering as a problem of learning a single function to map from questions and contexts to answers, it’s perhaps useful to think of it as a highly-multitask learning setting, where each problem instance is associated with a novel task, and the identity of that task is expressed only noisily in language.”
  • Simple example - “Is this a truck?” - Needs single task to be performed, namely truck or not classification.
  • Compositional example - “What is the object to the left of the tea pot?” - Needs one to find the teapot, detect object to its left, then classify the object.


Neural Modules

  • Identify set of modules that can be composed to solve all/most tasks.
  • Modules can be thought of as a function parametrized by a neural network, with a type signature.
  • Data Types - Image, Unnormalized attention map, labels
    • Attention Module
    • Classification Module
    • Reattention module
    • Combination module
    • Measurement Module

Strings -> Modules

  • Parsing
    • Use few rules on dependency parse of the question to convert it into a structured query.
    • e.g. “Is there a circle next to a square?” -> is(circle, next-to(square))
  • Layout
    • “All leaves become attend modules, all internal nodes become re-attend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.”
      • The queries could come from anywhere not just natural language question. As long as they can be converted to a layout in the end.


  • An RNN is used to process the question and predict a label directly without looking into the image.
  • This is combined with the final label from the root node of the Neural Modules using geometric mean to get the final result.
  • This is done for 2 reasons
    • Syntactic Regularity/Prior
      • When converting to structured query, certain syntactic elements are lost.
      • For e.g. What is in the sky? and What are in the sky? both result in what(fly).
      • But answer varies from kite to kites.
    • Semantic Regularity/Prior
      • Some answers are unreasonable just by inspecting the question.
        • For example, What colour is the bear? eliminates all non-colour answers.


  • They try this in vqa dataset - a huge dataset with natural images and questions with answers.
    • Benchmarks table for VQA
  • Since VQA doesn’t have many deep compositional questions, they use shapes a synthetically generated dataset.
    • Synthetic Shapes dataset


  • What colour is his tie? Statue of a man with yellow tie, question parsed to modules 1. find tie  2. describe colour
  • Correct and incorrect predictions