Video grid: push on surface (2x), push around bowl (2x), pick-and-place on table (2x), pick-and-place on book (2x), and an unlabeled new task (?).
Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning, and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains -- object rearrangement, goal-oriented navigation, motion capture of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts (1) in unseen environments and (2) in composition with training concepts.
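At its core, inference of a new concept amounts to gradient descent on a concept representation through the frozen pretrained generative model. Below is a minimal, hypothetical PyTorch sketch of that idea; the model.loss API, embedding size, and optimizer settings are illustrative assumptions, not the released implementation.

import torch

def infer_new_concept(model, demos, embed_dim=64, steps=1000, lr=1e-2):
    # Learn a new task concept from a few demonstrations by inverting a
    # pretrained conditional generative model; the model weights stay frozen.
    # model: assumed to expose loss(demos, concept), e.g. a conditional
    #        diffusion denoising loss over demonstration trajectories.
    # demos: tensor of demonstrations, shape (num_demos, horizon, state_dim).
    for p in model.parameters():
        p.requires_grad_(False)  # freeze the pretrained weights

    # the only trainable quantity: the latent representation of the new concept
    concept = torch.zeros(embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([concept], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        loss = model.loss(demos, concept)  # how well the demos fit this concept
        loss.backward()                    # gradients flow only to the concept
        optimizer.step()

    return concept.detach()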
We collect demonstrations with a Franka Research 3 robot via teleoperation. We then generate training- and test-task pushing behavior with our model conditioned on different representations.
training task: push on surface; representation: training push on surface (2x)
test task: push on book; representation: inferred from test demonstrations (4x)
test baseline task: push on book; representation: training push on surface (4x)
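A hypothetical usage sketch of the three conditions above (the generation API, scene variables, and representation lookup are assumptions for illustration):

# Condition the frozen pretrained model on different representations to generate pushing behavior.
surface_repr = model.concept_embedding("push on surface")     # assumed lookup of a training representation
book_repr = infer_new_concept(model, demos_push_on_book)       # inferred from a few test demonstrations

plan_train = model.generate(surface_scene, condition=surface_repr)   # training task, training representation
plan_test = model.generate(book_scene, condition=book_repr)          # test task, inferred representation
plan_baseline = model.generate(book_scene, condition=surface_repr)   # baseline: test task, training representation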
In the highway environment, the green vehicle is controlled by the model, blue vehicles are controlled by a separate controller, and red indicates a collision. In all scenarios the controlled vehicle must maintain a high speed and avoid collisions. In the highway scenarios it must also stay in the rightmost lanes, in exit it must take the exit, in merge it must let another vehicle merge, in intersection it must make a left turn, and in roundabout it must take the second exit.
Video grid of driving scenarios (highway, exit, merge, intersection) and methods (BC, VAE, In-Context, Ours).
The CMU Motion Capture Database contains recordings of real humans performing actions.
Video grid of MoCap actions (walk, run, march, jumping jacks, breaststroke) and methods (BC, VAE, In-Context, Language, Ours) alongside the demonstration (Demo).
Video grid: demonstrations (walk, jump, march, jumping jacks), compositions of the learned concept with training concepts (jumping jacks + walk, jumping jacks + jump, jumping jacks + march), jumping jacks (learned), and concepts learned with 1 vs. 2 components (jumping jacks 1 concept, jumping jacks 2 concepts, breaststroke 1 concept, breaststroke 2 concepts).
In the AGENT environment, an agent navigates to one of two targets based on their shape and/or color.
Video grid of AGENT tasks (go to red object, go to yellow object, go to cube, go to bowl) and methods (BC, VAE, In-Context, Ours).
In the Object Rearrangement environment, three objects must be positioned in a configuration that satisfies given spatial relations between them.
triangle above circle
circle above triangle
triangle right of circle
square above circle
triangle right of square + circle above square
square right of triangle + circle above triangle
circle right of square + triangle above square
line: circle right of triangle + triangle right of square
all objects on the circumference of a circle of radius 1.67
square diagonal to triangle
triangle diagonal to square
circle diagonal to triangle
square diagonal to triangle + circle above square
square diagonal to triangle + circle above triangle
square diagonal to triangle + circle right of triangle
square diagonal to triangle + triangle above circle
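As a hypothetical illustration, the spatial relations above could be evaluated from final 2D object positions roughly as follows (coordinate conventions and tolerances are assumptions, not the paper's evaluation code):

import numpy as np

def above(a, b, tol=0.1):
    # a is above b: larger y, roughly aligned in x
    return a[1] > b[1] + tol and abs(a[0] - b[0]) < tol

def right_of(a, b, tol=0.1):
    # a is to the right of b: larger x, roughly aligned in y
    return a[0] > b[0] + tol and abs(a[1] - b[1]) < tol

def diagonal_to(a, b, tol=0.1):
    # a is diagonal to b: offset in both x and y
    return abs(a[0] - b[0]) > tol and abs(a[1] - b[1]) > tol

# e.g. "triangle right of square + circle above square"
triangle, square, circle = np.array([1.0, 0.0]), np.array([0.0, 0.0]), np.array([0.0, 1.0])
assert right_of(triangle, square) and above(circle, square)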
Learned representations of new concepts that are not explicit compositions of training concepts in the natural-language symbolic space, for Driving, MoCap, and Object Rearrangement. Each data point corresponds to a concept (training concept representations in blue, learned new-concept representations in red).
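A minimal sketch of how such a 2D view of the representations could be produced, assuming the training and learned concept vectors are stored as arrays (file names and the use of PCA are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

train_reprs = np.load("train_concepts.npy")     # (num_train, embed_dim), assumed file
new_reprs = np.load("learned_concepts.npy")     # (num_new, embed_dim), assumed file

# project all concept representations to 2D
points = PCA(n_components=2).fit_transform(np.vstack([train_reprs, new_reprs]))
n = len(train_reprs)
plt.scatter(points[:n, 0], points[:n, 1], c="blue", label="training concepts")
plt.scatter(points[n:, 0], points[n:, 1], c="red", label="learned new concepts")
plt.legend()
plt.show()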
There is excellent related work in computer vision and decision making:
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion introduces few-shot visual concept inference through inverse generative modeling.
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models demonstrates inferring visual concepts as compositions.
Is Conditional Generative Modeling All You Need For Decision-Making? demonstrates generating behavior in decision-making conditioned on task compositions.
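A common recipe for generating behavior under a composition of concepts, as in the compositions shown above, is to combine the guidance directions of the individual conditions at sampling time. Below is a minimal sketch assuming a conditional denoiser API model.eps(x_t, t, concept), in the style of classifier-free-guidance composition; it is not necessarily the exact formulation used in any of the papers above.

def composed_eps(model, x_t, t, concepts, weights, null_concept):
    # Compose several concept conditions by summing their guidance directions
    # relative to the unconditional noise prediction.
    eps_uncond = model.eps(x_t, t, null_concept)
    eps = eps_uncond
    for c, w in zip(concepts, weights):
        eps = eps + w * (model.eps(x_t, t, c) - eps_uncond)
    return eps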
@inproceedings{netanyahu2024fewshot,
author = {Netanyahu, Aviv and Du, Yilun and Bronars, Antonia and Pari, Jyothish and Tenenbaum, Joshua and Shu, Tianmin and Agrawal, Pulkit},
title = {Few-Shot Task Learning through Inverse Generative Modeling},
booktitle = {Advances in Neural Information Processing Systems},
year = {2024},
}