Electrical and Computer Engineering Ph.D. Public Defense

Scalable and Efficient Models for Bidirectional Grounded Language Communication

Jacob Arkin

Supervised by Thomas M. Howard

Tuesday, March 7, 2023
2:32 p.m.–2:30 p.m.

601 Computer Studies Building

One of the major promises of robotics and artificial intelligence is the development of autonomous agents that can work in tandem with human teammates to accomplish a wide variety of tasks in many different kinds of environments. Robust human-robot collaborative task execution requires bidirectional communication to coordinate successfully among teammates while executing and delegating tasks. Natural language is an especially flexible and intuitive candidate; it is thus a long-term goal to provide collaborative robots with the ability to understand and generate natural language pertaining to a task and its execution in the surrounding physical world.

One modern formulation treats this problem as finding an association between language and a robot-interpretable symbolic representation of physical concepts, such as objects or actions. Probabilistic graphical models have been used to factorize this association as a means of simplifying the computation. The work in this thesis focuses on a particular class of models known as Distributed Correspondence Graphs that factorize over both constituents of language (e.g., phrases) and constituents of the symbolic representation.
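As a sketch of this factorization (following the general form of the Distributed Correspondence Graph in the published literature; the notation here is illustrative rather than taken from the thesis itself), inference seeks the most likely assignment of correspondence variables:

\Phi^* = \arg\max_{\phi_{ij} \in \Phi} \prod_{i=1}^{|\Lambda|} \prod_{j=1}^{|\Gamma_i|} p(\phi_{ij} \mid \gamma_{ij}, \lambda_i, \Phi_{c_i}, \Upsilon)

where each \lambda_i is a linguistic constituent (e.g., a phrase), each \gamma_{ij} is a symbol of the representation, \phi_{ij} is a binary variable expressing whether the symbol corresponds to the phrase, \Phi_{c_i} denotes the correspondences of the child phrases of \lambda_i, and \Upsilon is the world model. Factoring over both phrases and symbols keeps each factor small, which is what makes inference tractable.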

These graphs have been successfully deployed in the field as part of robotic intelligence architectures for multi-step human-robot collaboration in complex, diverse environments. However, there remain situations in which the scalability and efficiency of these models are a limitation. Because each graph is constructed for a particular state of the world, a constructed graph becomes invalid when the world changes, limiting the scalability of these models to dynamic environments. For language generation, many such graphs must be constructed to find a good language candidate for the robot’s desired expression of meaning; while inference over a single graph is fast, the accumulated runtime across many graphs limits scalability to large sets of language candidates. Finally, the association between language and symbols of physical concepts is learned in a data-driven manner, and the expert labor required for data annotation is a bottleneck that limits the scalability of these learned models to novel domains. The work presented in this thesis tackles each of these limitations.
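As a rough cost sketch of the language generation case (the notation here is assumed for illustration, not drawn from the thesis): if constructing and evaluating a single graph costs roughly T_build + T_infer, then scoring N candidate utterances independently costs about N (T_build + T_infer), so total runtime grows linearly in the number of candidates even when each individual graph is fast.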

Addressing the scalability to dynamic environments, this thesis presents a novel algorithm that efficiently updates a given graph in the context of a reconfigured world, thereby avoiding repeated graph construction and significantly improving the runtime of inference over the baseline. To address the many language candidates of language generation, this thesis proposes a novel unified graphical model that efficiently shares both the linguistic constituents of different language candidates and the features of the learned model, significantly reducing the runtime of language generation and improving scalability with respect to the number of language candidates. Finally, to address the bottleneck of annotated data, this thesis proposes a semi-supervised annotation process to automatically find the best annotations of a provided partially-labeled corpus and eliminate much of the manual labor, thus improving the scalability of these learned models to novel domains and extended corpora. Included is a discussion of the synergies that exist among these contributions and how the exploitable structure of the graphical model is fundamental to finding opportunities for improved efficiency.
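To make the intuition behind the first contribution concrete, the following is a minimal hypothetical sketch (in Python, with invented names; it is not the algorithm presented in the thesis): factor scores are cached by the phrase-symbol pair they evaluate, and when the world is reconfigured, only the factors whose symbols refer to changed objects are recomputed, rather than rebuilding the entire graph.

# Minimal hypothetical sketch of incremental factor re-evaluation in a
# phrase/symbol factor graph; invented names, not the thesis's algorithm.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Symbol:
    kind: str               # e.g., "object" or "action"
    referents: frozenset    # IDs of world objects this symbol depends on

@dataclass
class FactorGraph:
    phrases: list                               # linguistic constituents
    symbols: dict                               # phrase -> list of Symbol
    scores: dict = field(default_factory=dict)  # (phrase, Symbol) -> float

    def _score(self, symbol, world):
        # Stand-in for the learned factor; a real model would evaluate
        # trained feature weights over the phrase, symbol, and world.
        return float(len(symbol.referents & world))

    def build(self, world):
        # Construct the full graph: one factor per phrase-symbol pair.
        for phrase in self.phrases:
            for symbol in self.symbols[phrase]:
                self.scores[(phrase, symbol)] = self._score(symbol, world)

    def update(self, world, changed):
        # Re-evaluate only factors whose symbols mention a changed object.
        for (phrase, symbol) in list(self.scores):
            if symbol.referents & changed:
                self.scores[(phrase, symbol)] = self._score(symbol, world)

# Usage: build once, then update incrementally after the world changes.
world = {"cup1", "table1"}
graph = FactorGraph(
    phrases=["the cup"],
    symbols={"the cup": [Symbol("object", frozenset({"cup1"}))]},
)
graph.build(world)
graph.update(world | {"cup2"}, changed={"cup2"})  # no factor touches cup2

In a graph with thousands of phrase-symbol factors, a localized world change then touches only a small fraction of the cached scores; the rest are reused as-is, which is the source of the runtime savings.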