☨ Work partially done while interning at MPI for Informatics
* Equal senior contribution
In the following, we show videos interactively produced by a user starting from a single frame. At each frame, the user specifies a discrete action for each player to condition video generation. Our method can produce videos several minutes long. In addition, it handles difficult situations where the players move out of the frame or into regions that are rarely seen during training, e.g., close to the net in tennis.
On the Minecraft dataset, the following actions are learned: bottom-left movement (1), small right movement (2), top-left movement (3), right movement (4), bottom movement (5), small left movement (6), top-right movement (7).
On the Tennis dataset, the following actions are learned for the player closer to the camera: bottom-left movement (1), small right movement (2), forward movement (3), small left movement (4), right movement (5), bottom-right movement (6), left movement (7). For the player positioned farther from the camera, the following actions are learned: right movement (1), backward movement (2), forward-left movement (3), backward-left movement (4), small backward movement (5), forward movement (6), small forward movement (7).
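For readers who want to script action sequences, the numbering above can be written down as simple lookup tables. The following is only a minimal sketch: the labels are taken from the lists above, but the dictionary names and the `describe_tennis_frame` helper are hypothetical and are not part of the released method or its code.

```python
# Action ids and labels follow the lists above.
# All names below (dictionaries and helper) are hypothetical, for illustration only.

MINECRAFT_ACTIONS = {
    1: "bottom-left movement",
    2: "small right movement",
    3: "top-left movement",
    4: "right movement",
    5: "bottom movement",
    6: "small left movement",
    7: "top-right movement",
}

# Tennis player closer to the camera.
TENNIS_NEAR_PLAYER_ACTIONS = {
    1: "bottom-left movement",
    2: "small right movement",
    3: "forward movement",
    4: "small left movement",
    5: "right movement",
    6: "bottom-right movement",
    7: "left movement",
}

# Tennis player farther from the camera.
TENNIS_FAR_PLAYER_ACTIONS = {
    1: "right movement",
    2: "backward movement",
    3: "forward-left movement",
    4: "backward-left movement",
    5: "small backward movement",
    6: "forward movement",
    7: "small forward movement",
}


def describe_tennis_frame(near_action: int, far_action: int) -> str:
    """Return a readable description of the two per-player actions for one frame."""
    if near_action not in TENNIS_NEAR_PLAYER_ACTIONS:
        raise ValueError(f"unknown action id for the near player: {near_action}")
    if far_action not in TENNIS_FAR_PLAYER_ACTIONS:
        raise ValueError(f"unknown action id for the far player: {far_action}")
    return (
        f"near player: {TENNIS_NEAR_PLAYER_ACTIONS[near_action]}; "
        f"far player: {TENNIS_FAR_PLAYER_ACTIONS[far_action]}"
    )
```

Note that the same id means different things for the two tennis players (for example, action 1 is a bottom-left movement for the near player but a right movement for the far player), so keeping one table per player avoids mixing them up.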