Facebook AI researchers have created a pair of AI systems that can navigate the streets of New York City using only 360-degree images, natural language, and a map with local landmarks like banks and restaurants for guidance. The research task and dataset, named Talk the Walk, are being open-sourced today, and initial results of the real-world training are being published on arXiv.
The two AI systems are trained to complete two specific tasks: the tourist bot must describe its surroundings to the guide bot, which then works out the tourist's location from that description and a map. Agents could only move forward, left, or right at intersections within two city blocks. The tourist agents could only describe their location to the guide, which worked from a map with no street names. The natural language used in the exercise came from transcripts of humans who completed the same task.
"What sets this apart from those other datasets is we have actual natural language annotations, so it's not some kind of artificially templated language, which other people have tried. This is the first instance where it's real language with real visual perception," Facebook AI research scientist Douwe Kiela told VentureBeat in a phone interview.
Talk the Walk involves two AI systems operating within a two-block radius in Hell's Kitchen, the East Village, the Financial District, and the Upper East Side in Manhattan, and the Williamsburg neighborhood in Brooklyn. Complicating matters a bit, each of the neighborhoods follows a grid system, so the maps have no distinctive qualities. A two-block radius with 16 different street corners may seem small, but the original study covered more ground and had to be scaled back because it proved too hard for humans to complete.
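The setup described above can be pictured with a toy sketch. This is not the researchers' code: the real agents exchange natural language and rely on learned models, while this version passes plain sets of landmark names. The 4 x 4 grid follows the article's figure of 16 street corners; the landmark positions, the `Tourist` class, and the `candidates` helper are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass

GRID = 4  # 4 x 4 = 16 street corners, the figure given in the article

# Hypothetical landmark map: (x, y) corner -> landmark types at that corner.
LANDMARKS = {
    (0, 0): {"bank"},
    (1, 2): {"restaurant"},
    (2, 1): {"bank", "coffee shop"},
    (3, 3): {"restaurant"},
}

# Headings as unit steps: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


@dataclass
class Tourist:
    """Toy tourist: knows what it sees, but has no map."""
    x: int
    y: int
    heading: int = 0  # index into HEADINGS, 0 = north

    def act(self, action: str) -> None:
        """Turn left or right in place, or step forward one corner."""
        if action == "left":
            self.heading = (self.heading - 1) % 4
        elif action == "right":
            self.heading = (self.heading + 1) % 4
        elif action == "forward":
            dx, dy = HEADINGS[self.heading]
            nx, ny = self.x + dx, self.y + dy
            # Stay put if the step would leave the grid.
            if 0 <= nx < GRID and 0 <= ny < GRID:
                self.x, self.y = nx, ny
        else:
            raise ValueError(f"unknown action: {action}")

    def describe(self) -> frozenset:
        """Report the landmark types at the current corner (may be empty)."""
        return frozenset(LANDMARKS.get((self.x, self.y), ()))


def candidates(description) -> list:
    """Toy guide: list every corner whose landmarks match the description."""
    wanted = frozenset(description)
    return [
        (x, y)
        for x in range(GRID)
        for y in range(GRID)
        if frozenset(LANDMARKS.get((x, y), ())) == wanted
    ]


if __name__ == "__main__":
    tourist = Tourist(x=1, y=1)
    tourist.act("forward")
    print(candidates(tourist.describe()))
```

In this made-up layout two corners have only a restaurant, so a single description leaves the guide with more than one candidate location. That ambiguity is why a back-and-forth exchange matters, and it is sharper still on a grid where most corners look alike.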
"It's an important task because it brings together a lot of different challenges that we need to solve if we want to make progress with AI research, so things like realistic 360 visual perception, map-based navigation, visual reasoning, natural language communication by dialogue; all of these things are important to solve problems in AI. And what this work is about is trying to bring all these problems together into an overarching, all-encompassing kind of solution," Kiela said.
While 360-degree video and a map were part of the input used to train the systems, the task and benchmark dataset are primarily geared toward advancing conversational AI, said Kiela, whose work has centered on grounding, the practice of using multimodal methods to develop natural language understanding. For the agents to reach one another, both sides must communicate successfully: the tourist has to tell the guide where it is in natural language, and the guide has to interpret the words the tourist generates.
"The long-term vision of this kind of research is improving natural language understanding, and so that of course is interesting to humankind. Basically, if we can achieve artificial intelligence where agents actually understand natural language, then that would be kind of a pivotal moment for AI, and I think we're not even close to that yet," he said. "I really care about this long-term vision, first and foremost, of how can we get to this kind of language understanding and how can we get AI that really has this kind of common sense that has been missing up until now."
An attention mechanism called Masked Attention for Spatial Convolution (MASC) was used to narrow the focus of the agents, and at times it made the agents twice as likely to complete the task. The resulting task and dataset are meant to serve as a benchmark. The work is being open-sourced so that others in the AI community can advance the current state of machine understanding of human communication.
"This is a difficult challenge, and that's also one of the reasons we're open-sourcing it and inviting people to think about this kind of problem. In general we should have more hard challenges in AI research and difficult problems for the community to tackle and realize also what the limitations are of what we can currently do. And so the open-sourcing thing is important to us, and that's why we're happy to share with the scientific community," he said. "In my opinion this really is the way forward with AI. If we don't have this, then it's going to look like we're making a lot of progress, but we're not really making the kind of progress that we should be making."
To view or download the dataset, visit the code.fb.com website.
Virtual guides help a 'lost' AI find its way
As a general rule, AI isn't great at using new info to make better sense of existing info. Facebook thinks it has a clever (if unusual) way to explore solutions to this problem: send AI on a virtual vacation. It recently conducted an experiment in which a "tourist" bot with 360-degree photos tried to find its way around New York City's Hell's Kitchen area with the help of a "guide" bot using 2D maps. The digital tourist had to describe where it was based on what it could see, giving the guide a point of reference it could use to offer directions.
The project focused on collecting info through regular language ("in front of me there's a Brooks Brothers"), but it produced an interesting side discovery: the team learned that the bots were more effective when they used a "synthetic" chat made of symbols to communicate data. In other words, the conversations they'd use to help you find your hotel might need to be different than those used to help, say, a self-driving car.
The research also helped Facebook's AI make sense of visually complex urban environments. A Masked Attention for Spatial Convolution system could quickly parse the most relevant keywords in the bots' responses, so they could more accurately convey where they were or where they needed to go.
As our TechCrunch colleagues observed, this is a research project that could improve AI as a whole rather than an immediate precursor to a navigation product. With that said, it's easy to see practical implications. Self-driving cars could use this to find their way when they can't rely on GPS, or to offer directions to wayward humans using only vague descriptions.