Playable Environments: Video Manipulation in Space and Time

CVPR 2022

Willi Menapace^☨, Stéphane Lathuilière, Aliaksandr Siarohin,
Christian Theobalt, Sergey Tulyakov, Vladislav Golyanik^*, Elisa Ricci^*

^☨ Work partially done while interning at MPI for Informatics
^* Equal senior contribution


Interactive Sequences

In the following, we show videos produced interactively by a user starting from a single frame. At each frame, the user specifies a discrete action for each player to condition video generation. Our method can produce videos several minutes long. In addition, it handles difficult situations where players are moved outside the frame or into regions that are rarely seen during training, e.g. close to the net in tennis.

Minecraft

Tennis

On the Minecraft dataset, the following actions are learned: bottom-left movement (1), small right movement (2), top-left movement (3), right movement (4), bottom movement (5), small left movement (6), top-right movement (7).

On the Tennis dataset, the following actions are learned for the player closer to the camera: bottom-left movement (1), small right movement (2), forward movement (3), small left movement (4), right movement (5), bottom-right movement (6), left movement (7). For the player farther from the camera, the following actions are learned: right movement (1), backward movement (2), forward-left movement (3), backward-left movement (4), small backward movement (5), forward movement (6), small forward movement (7).
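
The action indices listed above can be kept in simple lookup tables when scripting or logging an interactive session. The following is a minimal sketch, not the authors' code: the labels are copied from the text, while the table names, the `describe_actions` helper, and the representation of a session as one tuple of action indices per frame are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Labels are taken from the lists above; the dictionary layout and
# every name in this block are hypothetical.
MINECRAFT_ACTIONS: Dict[int, str] = {
    1: "bottom-left movement",
    2: "small right movement",
    3: "top-left movement",
    4: "right movement",
    5: "bottom movement",
    6: "small left movement",
    7: "top-right movement",
}

TENNIS_NEAR_PLAYER_ACTIONS: Dict[int, str] = {
    1: "bottom-left movement",
    2: "small right movement",
    3: "forward movement",
    4: "small left movement",
    5: "right movement",
    6: "bottom-right movement",
    7: "left movement",
}

TENNIS_FAR_PLAYER_ACTIONS: Dict[int, str] = {
    1: "right movement",
    2: "backward movement",
    3: "forward-left movement",
    4: "backward-left movement",
    5: "small backward movement",
    6: "forward movement",
    7: "small forward movement",
}


def describe_actions(
    per_frame_actions: Sequence[Sequence[int]],
    tables: Sequence[Dict[int, str]],
) -> List[Tuple[str, ...]]:
    """Translate per-frame action indices (one per player) into labels.

    `per_frame_actions[t][p]` is the index chosen for player `p` at frame `t`;
    `tables[p]` is the lookup table for player `p`.
    """
    described: List[Tuple[str, ...]] = []
    for frame_idx, actions in enumerate(per_frame_actions):
        if len(actions) != len(tables):
            raise ValueError(
                f"frame {frame_idx}: expected {len(tables)} actions, got {len(actions)}"
            )
        labels = []
        for player_idx, (action, table) in enumerate(zip(actions, tables)):
            if action not in table:
                raise ValueError(
                    f"frame {frame_idx}, player {player_idx}: unknown action {action}"
                )
            labels.append(table[action])
        described.append(tuple(labels))
    return described


# Example: two frames of a tennis session, (near player, far player) per frame.
session = [(3, 1), (7, 6)]
tennis_tables = [TENNIS_NEAR_PLAYER_ACTIONS, TENNIS_FAR_PLAYER_ACTIONS]
labels = describe_actions(session, tennis_tables)
```

Note that the same index means different things for different players and datasets (for instance, action 1 is "bottom-left movement" for the near tennis player but "right movement" for the far one), which is why the sketch keeps one table per player rather than a single shared mapping.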