Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concept through backpropagation, without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains -- object rearrangement, goal-oriented navigation, motion capture of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that, via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts (1) in unseen environments and (2) in composition with training concepts.
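To make the inference step concrete, below is a minimal PyTorch sketch of this concept-learning procedure, assuming the pretrained model is a conditional denoising diffusion model with an epsilon-prediction interface `model(x_t, t, concept)`. The function names, shapes, and hyperparameters here are illustrative assumptions, not our released code.

```python
import torch

def infer_concept(model, demos, alpha_bars, concept_dim=64, steps=1000, lr=1e-2):
    # model: frozen pretrained denoiser, eps_hat = model(x_t, t, concept) (assumed API)
    # demos: (N, ...) tensor of demonstrations of the new task
    # alpha_bars: (T,) cumulative noise schedule of the pretrained model
    model.requires_grad_(False)                      # pretrained weights stay frozen
    concept = torch.zeros(1, concept_dim, requires_grad=True)
    opt = torch.optim.Adam([concept], lr=lr)
    T = alpha_bars.shape[0]
    for _ in range(steps):
        x0 = demos[torch.randint(len(demos), (1,))]  # sample one demonstration
        t = torch.randint(0, T, (1,))                # random diffusion timestep
        noise = torch.randn_like(x0)
        ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise   # forward noising
        loss = ((model(x_t, t, concept) - noise) ** 2).mean()
        opt.zero_grad()
        loss.backward()                              # gradient flows only into concept
        opt.step()
    return concept.detach()
```

Because only `concept` receives gradients, the pretrained weights remain fixed, and the learned embedding can then condition the same model to generate plans or motion for the new task.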
We collect demonstrations with a Franka Research 3 robot via teleoperation, and generate training and test pushing behavior with our model conditioned on different concept representations.
In the highway environment, the green vehicle is controlled by the model, blue vehicles are controlled by a separate controller, and red indicates a collision. In all scenarios the controlled vehicle must maintain a high speed and avoid collisions. In the highway scenarios it must additionally stay in the rightmost lanes, in exit it must take the exit, in merge it must allow another vehicle to merge, in intersection it must make a left turn, and in roundabout it must take the second exit.
The CMU Motion Capture Database contains motion data recorded from real humans performing actions, such as jumping jacks and the breaststroke.
In the AGENT environment, an agent navigates to one of two targets based on their shape and/or color.
In the Object Rearrangement environment, three objects must be positioned in a configuration that satisfies specified spatial relations among them.
New concepts for Driving, MoCap, and Object Rearrangement that are not explicit compositions of training concepts in the natural-language symbolic space. Hover over data points to view the concept each represents (training concept representations in blue, learned new-concept representations in red).
There is excellent related work on concept learning and conditional generation in computer vision and decision making.
"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" introduces few-shot visual concept inference through inverse generative modeling.
"Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models" demonstrates inferring visual concepts as compositions.
"Is Conditional Generative Modeling All You Need For Decision-Making?" demonstrates generating behavior conditioned on task compositions in decision making.
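As a rough illustration of how such compositions can be realized at sampling time, the sketch below combines several concept conditions through classifier-free-guidance-style score composition. The `model` interface, `null_concept`, and the guidance weights are assumptions for illustration, not the exact formulation of any of the papers above.

```python
import torch

def composed_eps(model, x_t, t, concepts, null_concept, weights):
    # Combine per-concept denoising directions around the unconditional
    # prediction, in the style of classifier-free guidance over multiple
    # conditions; the result is used in place of a single conditional eps.
    eps_uncond = model(x_t, t, null_concept)
    eps = eps_uncond
    for c, w in zip(concepts, weights):
        eps = eps + w * (model(x_t, t, c) - eps_uncond)
    return eps
```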
@inproceedings{netanyahu2024fewshot,
author = {Netanyahu, Aviv and Du, Yilun and Bronars, Antonia and Pari, Jyothish and Tenenbaum, Joshua and Shu, Tianmin and Agrawal, Pulkit},
title = {Few-Shot Task Learning through Inverse Generative Modeling},
booktitle = {Advances in Neural Information Processing Systems},
year = {2024},
}