Few-Shot Task Learning through Inverse Generative Modeling

Massachusetts Institute of Technology, Harvard University, Johns Hopkins University

NeurIPS 2024


Training tasks (214 demos)

push on surface (2x)

push around bowl (2x)

pick-and-place on table (2x)

pick-and-place on book (2x)


Test task (10 demos)

?


FTL-IGM learns to generate behavior conditioned on task representations (text embeddings of task descriptions). It then learns latent representations of new tasks from a few state-based demonstrations (in this case, videos).
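For intuition, here is a minimal, hypothetical sketch of the pretraining stage: a conditional generative model is trained to reproduce demonstration trajectories given a task representation (a frozen text embedding of the task description). BehaviorGenerator, pretrain_step, and the noise-and-reconstruct loss are illustrative placeholders, not the paper's actual architecture or objective.

import torch
import torch.nn as nn

class BehaviorGenerator(nn.Module):
    # Maps a corrupted trajectory and a task representation to a clean trajectory.
    def __init__(self, traj_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=-1))

def pretrain_step(model, optimizer, trajs, task_embeddings):
    # One training step on (demonstration, task-embedding) pairs.
    noisy = trajs + 0.1 * torch.randn_like(trajs)  # simple corruption as a stand-in
    pred = model(noisy, task_embeddings)
    loss = nn.functional.mse_loss(pred, trajs)     # reconstruction-style loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()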

Abstract

Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains: object rearrangement, goal-oriented navigation, motion capture of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts (1) in unseen environments and (2) in composition with training concepts.
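The few-shot step is concept inference by inversion: the pretrained model stays frozen, and only a new task representation z is optimized by backpropagating a reconstruction loss on the handful of demonstrations of the new concept. A hedged sketch, reusing a placeholder conditional model of the kind sketched above (function and variable names are assumptions):

import torch

def infer_new_concept(model, demos, cond_dim, steps=500, lr=1e-2):
    # Learn a latent task representation z from a few demonstrations.
    # model: pretrained conditional generative model (kept frozen)
    # demos: tensor of shape (num_demos, traj_dim) for the new task
    for p in model.parameters():      # freeze all model weights
        p.requires_grad_(False)

    z = torch.zeros(1, cond_dim, requires_grad=True)  # learnable concept vector
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        noisy = demos + 0.1 * torch.randn_like(demos)  # same corruption as in pretraining
        pred = model(noisy, z.expand(demos.shape[0], -1))
        loss = torch.nn.functional.mse_loss(pred, demos)
        opt.zero_grad()
        loss.backward()               # gradients flow only into z
        opt.step()

    return z.detach()

The learned representation can then be fed back to the model to generate new behavior for the concept, including in unseen environments.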

Real-World Table-Top Manipulation

We collect demonstrations with a Franka Research 3 robot via teleoperation. We then generate training-task and test-task pushing behavior with our model, conditioned on different task representations.

training task: push on surface, representation: training push on surface (2x)

test task: push on book, representation: test inferred representation (4x)

test baseline task: push on book, representation: training push on surface (4x)

Autonomous Driving

In the highway environment, the green vehicle is controlled by the model, blue vehicles are controlled by a separate controller, and red indicates a collision. In all scenarios the controlled vehicle must maintain a high speed and avoid collisions. In the highway scenarios it must additionally stay in the rightmost lanes; in exit, take the exit; in merge, allow another vehicle to merge; in intersection, make a left turn; and in roundabout, take the second exit.


Training tasks (200 demos)

highway

exit

merge

intersection


Test task (5 demos)

BC

VAE

In-Context

Ours


Motion Capture

The CMU Motion Capture Database contains motion recordings of real humans performing actions.


Training tasks (2210 demos)

walk

run

march


Test tasks (3 demos)

jumping jacks

BC

VAE

In-Context

Language

Ours

Demo



breaststroke

BC

VAE

In-Context

Language

Ours

Demo


Compositions of Learned New Concepts and Training Concepts

walk (demo)

jump (demo)

march (demo)

jumping jacks (demo)

jumping jacks + walk

jumping jacks + jump

jumping jacks + march

jumping jacks (learned)
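One plausible way to generate the compositions above, in the spirit of the compositional diffusion work listed under Related Links, is to combine the model's conditional predictions for the learned concept and for a training concept. The sketch below assumes a diffusion-style noise-prediction model; eps_model, the concept vectors, and the guidance weight w are hypothetical, and this is not necessarily the authors' exact sampler.

import torch

def composed_noise_prediction(eps_model, x_t, t, z_learned, z_train, z_null, w=2.0):
    # Combine two concept conditions into a single denoising direction.
    eps_uncond = eps_model(x_t, t, z_null)    # unconditional prediction
    eps_a = eps_model(x_t, t, z_learned)      # learned concept (e.g., jumping jacks)
    eps_b = eps_model(x_t, t, z_train)        # training concept (e.g., walk)
    # Sum the guidance terms of both concepts around the unconditional prediction.
    return eps_uncond + w * (eps_a - eps_uncond) + w * (eps_b - eps_uncond)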


Learning New Concepts as 1 Concept vs. Compositions of 2 Concepts

jumping jacks 1 concept

jumping jacks 2 concepts

breaststroke 1 concept

breaststroke 2 concepts


Goal-Oriented Navigation

In the AGENT environment, an agent navigates to one of two targets based on their shape and/or color.


Training tasks: target defined by single attribute, shape or color (900 demos)

go to red object

go to yellow object

go to cube

go to bowl


Test tasks: target defined by attribute compositions, shape and color (5 demos)

BC

VAE

In-Context

Ours


Object Rearrangement

In the Object Rearrangement environment, three objects need to be positioned in a certain configuration that satisfies spatial relations between them.
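As a rough illustration of what satisfying a pairwise relation could mean here, the snippet below treats "above" and "right of" as coordinate comparisons on 2D object positions; the margin and coordinate convention are assumptions for illustration, not the environment's actual success criterion.

def above(pos_a, pos_b, margin=0.1):
    # True if object a is above object b (larger y) by at least margin.
    return pos_a[1] - pos_b[1] > margin

def right_of(pos_a, pos_b, margin=0.1):
    # True if object a is to the right of object b (larger x) by at least margin.
    return pos_a[0] - pos_b[0] > margin

# Example: a configuration satisfying "triangle above circle".
triangle, circle = (0.0, 1.0), (0.0, 0.0)
assert above(triangle, circle)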


Training tasks: single pairwise relations (11k demos)

triangle above circle

circle above triangle

triangle right of circle

square above circle


Test tasks: new concepts that are compositions of training concepts, i.e., two pairwise relations (5 demos)

triangle right of square + circle above square

square right of triangle + circle above triangle

circle right of square + triangle above square

line: circle right of triangle + triangle right of square


Test tasks: new concepts that are not explicit compositions of training concepts in the natural-language symbolic space (5 demos)

all objects on the circumference of a circle of radius 1.67

square diagonal to triangle

triangle diagonal to square

circle diagonal to triangle


Generating a new concept (diagonal) composed with a training pairwise relation

square diagonal to triangle + circle above square

square diagonal to triangle + circle above triangle

square diagonal to triangle + circle right of triangle

square diagonal to triangle + triangle above circle


t-SNE Analysis

New concepts that are not explicit compositions of training concepts in the natural-language symbolic space, for the Driving, MoCap, and Object Rearrangement domains. Hover over data points to view which concept each represents (training representations are blue, learned new-concept representations are red).

t-SNE Visualization Carousel
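A minimal sketch of this kind of analysis, assuming the task representations are available as arrays: project training and learned new-concept representations to 2D with t-SNE and color them as in the visualization. The array names and dimensions are placeholders.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

train_reps = np.random.randn(40, 64)   # stand-in for training-concept representations
new_reps = np.random.randn(5, 64)      # stand-in for learned new-concept representations

all_reps = np.concatenate([train_reps, new_reps], axis=0)
xy = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_reps)

plt.scatter(xy[:len(train_reps), 0], xy[:len(train_reps), 1], c="blue", label="training concepts")
plt.scatter(xy[len(train_reps):, 0], xy[len(train_reps):, 1], c="red", label="learned new concepts")
plt.legend()
plt.show()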

Related Links

There's excellent related work in computer vision and decision making.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion introduces few-shot visual concept inference through inverse generative modeling.

Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models demonstrates inferring visual concepts as compositions.

Is Conditional Generative Modeling All You Need For Decision-Making? demonstrates generating behavior in decision-making conditioned on task compositions.

BibTeX

@inproceedings{netanyahu2024fewshot,
  author    = {Netanyahu, Aviv and Du, Yilun and Bronars, Antonia and Pari, Jyothish and Tenenbaum, Joshua and Shu, Tianmin and Agrawal, Pulkit},
  title     = {Few-Shot Task Learning through Inverse Generative Modeling},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2024},
}