Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
EMNLP 2021

  • Sonia Raychaudhuri
    Simon Fraser University
  • Saim Wani
    IIT Kanpur
  • Shivansh Patel
    IIT Kanpur
  • Unnat Jain
    UIUC
  • Angel X. Chang
    Simon Fraser University
teaser-image
Language-aligned path (blue) vs. goal-oriented path (red), i.e., the shortest path to the goal

Abstract

In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle ‘off the path’ scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.

LAW Supervision

Rather than supervising with the conventional goal-oriented sensor, we supervise with a language-aligned sensor, which helps bring the agent back on the path to the next waypoint if it wanders off.

model-diagram
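
To make the contrast with goal-oriented supervision concrete, here is a minimal sketch of how a language-aligned target could be chosen. It is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, positions are plain coordinate tuples, and straight-line (Euclidean) distance stands in for whatever distance the simulator actually provides. The idea sketched is to find the reference-path waypoint nearest to the agent and supervise toward the waypoint that follows it, whereas goal-oriented supervision always targets the final goal.

```python
import math


def language_aligned_target(agent_pos, reference_path):
    """Index of the reference-path waypoint to supervise toward.

    Assumption: the target is the waypoint after the one nearest
    to the agent, clamped to the last waypoint (the goal).
    Euclidean distance is a stand-in for the simulator's distance.
    """
    nearest = min(
        range(len(reference_path)),
        key=lambda i: math.dist(agent_pos, reference_path[i]),
    )
    return min(nearest + 1, len(reference_path) - 1)


def goal_oriented_target(reference_path):
    """Goal-oriented supervision always targets the final waypoint."""
    return len(reference_path) - 1


# Hypothetical reference path and an agent that has drifted off it.
path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
agent = (1.1, 0.5)
law_idx = language_aligned_target(agent, path)
goal_idx = goal_oriented_target(path)
```

With this toy path, the drifted agent is steered back toward the next waypoint on the described route rather than directly toward the goal.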

Results

Our LAW pano model outperforms the goal-supervised baseline (goal) across all metrics, including the instruction-following metrics. This suggests that language-aligned supervision encourages the agent to follow instructions better than goal-oriented supervision does.

quant_results

Qualitative Analysis

Agent performance binned by the nDTW between the reference path and the shortest path (error bars show 95% confidence intervals) shows that LAW pano outperforms goal, especially on episodes in the lower nDTW range, where the reference path deviates most from the shortest path. This indicates that language-aligned supervision is better suited to the instruction-following task.

law-better-than-goal
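
For readers unfamiliar with nDTW, the sketch below shows how the similarity between two paths can be computed with the standard normalized dynamic time warping formula, nDTW = exp(-DTW(R, Q) / (|R| * d_th)). It is a simplified illustration rather than the evaluation code used for the results: the function names are hypothetical, Euclidean distance is assumed in place of the simulator's distance, and the 3.0 m threshold is an assumed value.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths (lists of points)."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the paths coincide.

    Assumption: `threshold` (d_th) is set to 3.0 m for illustration.
    """
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


# Hypothetical paths: the second runs parallel to the first, 1 m away.
ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
other = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
same_score = ndtw(ref, ref)
offset_score = ndtw(ref, other)
```

A low nDTW between the reference path and the shortest path means the described route differs substantially from the shortest route, which is the regime the plot above highlights.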

The figure below shows an example episode from the R2R unseen split. The agent learns to follow the instruction better with LAW pano (right) than with goal (left), which is reflected in higher nDTW and waypoint accuracy (WA) scores. The figure also shows the mapping of sub-instructions to waypoints, obtained using FG-R2R, for this episode.

qualitative-analysis
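
As a rough illustration of a waypoint-based metric such as WA, the sketch below computes the fraction of reference waypoints that the agent's trajectory passes close to. This is one simplified reading, not the paper's exact definition: the function name, the 0.5 m radius, the use of Euclidean distance, and the choice to ignore visiting order are all assumptions made for the example.

```python
import math


def waypoint_accuracy(agent_path, waypoints, radius=0.5):
    """Fraction of waypoints the agent passes within `radius` of.

    Assumptions: Euclidean distance, a 0.5 m radius, and no
    requirement that waypoints are reached in order.
    """
    reached = sum(
        1
        for w in waypoints
        if any(math.dist(p, w) <= radius for p in agent_path)
    )
    return reached / len(waypoints)


# Hypothetical episode: the agent reaches the first two waypoints
# and then veers away from the third.
waypoints = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
agent_path = [(0.0, 0.1), (2.0, 0.3), (3.0, 2.0)]
wa = waypoint_accuracy(agent_path, waypoints)
```

Because each waypoint corresponds to a sub-instruction in the mapping described above, a score like this reflects how much of the instruction was followed rather than only whether the goal was reached.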

Citation

If you use our LAW method in your research, please cite the following:
@inproceedings{raychaudhuri2021language,
  title={Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments},
  author={Raychaudhuri, Sonia and Wani, Saim and Patel, Shivansh and Jain, Unnat and Chang, Angel},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={4018--4028},
  year={2021}
}

If you use the continuous environment setting of the VLN-CE paper, please additionally cite:
@inproceedings{krantz_vlnce_2020,
  title={Beyond the Nav-Graph: Vision and Language Navigation in Continuous Environments},
  author={Jacob Krantz and Erik Wijmans and Arjun Majumdar and Dhruv Batra and Stefan Lee},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}

Acknowledgements

We thank Jacob Krantz for the VLN-CE code on which this project was based, Erik Wijmans for initial guidance with reproducing the original VLN-CE results, and Manolis Savva for discussions and feedback. We also thank the anonymous reviewers for their suggestions and feedback. This work was funded in part by a Canada CIFAR AI Chair and NSERC Discovery Grant, and enabled in part by support provided by WestGrid and Compute Canada.
