Playable Video Generation

Supplementary Material

CVPR 2021 (Oral)

Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, Elisa Ricci

Interactively Generated Videos

In this section, we show results of live user interaction with CADDY. Each video shows the model's real-time output, alongside a window capturing the user interacting with the keyboard; the current action input is displayed in the top-left corner.

Moreover, we provide a web page where users can interact with the CADDY model directly in the browser and interactively generate video sequences.
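The real-time interaction described above can be sketched as a simple loop that maps key presses to discrete action indices and feeds them to the generator one step at a time. The sketch below is illustrative only: `StubCADDY`, `generate_next_frame`, and the key-to-action mapping are placeholders, not the released CADDY interface.

```python
import numpy as np

# Assumed mapping from key presses to learned discrete actions (illustrative).
KEY_TO_ACTION = {"left": 0, "right": 1, "up": 2, "down": 3}

class StubCADDY:
    """Stand-in for the recurrent generator: it keeps an internal state
    and emits one frame per discrete action input."""
    def __init__(self, frame_shape=(64, 64, 3)):
        self.frame_shape = frame_shape
        self.state = np.zeros(frame_shape, dtype=np.float32)

    def generate_next_frame(self, action: int) -> np.ndarray:
        # A real model would decode the next frame from its latent state;
        # this stub just perturbs the state deterministically per action.
        self.state = 0.9 * self.state + 0.1 * (action + 1)
        return self.state

def interact(model, key_presses):
    """Turn a stream of key presses into a generated video (list of frames)."""
    frames = []
    for key in key_presses:
        action = KEY_TO_ACTION.get(key, 0)  # default action when no key is held
        frames.append(model.generate_next_frame(action))
    return frames

video = interact(StubCADDY(), ["left", "left", "up", "right"])
print(len(video))  # one frame per key press
```

In the browser demo, the same loop would run with the model's decoder in place of the stub, which is what makes the generation playable in real time.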


For each dataset, we show additional samples of videos interactively generated by users starting from random initial frames. The current action input is shown in the top-left corner of each video.


Atari Breakout


Action Conditioning Evaluation

The following videos show the effect that each learned action produces on the output sequence. Each row starts from an initial frame and, for each learned action, we generate a sequence by repeatedly feeding that action as input. The output corresponding to each action is shown in the corresponding column.
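The evaluation protocol above can be summarized as a small rollout routine: one row per initial frame, one column per learned action, with the same action fed at every step. This is a minimal sketch under assumed names; `StubModel` stands in for the trained generator and is not the paper's actual interface.

```python
import numpy as np

class StubModel:
    """Placeholder for the trained generator, reset to a given initial frame."""
    def __init__(self, initial_frame):
        self.state = np.asarray(initial_frame, dtype=np.float32)

    def generate_next_frame(self, action: int) -> np.ndarray:
        # A real model would decode from its latent state; this stub just
        # shifts pixel intensities deterministically per action.
        self.state = self.state + (action + 1)
        return self.state

def action_conditioning_grid(initial_frames, num_actions, seq_len):
    """Build the row/column grid of sequences described above: each cell
    is a rollout obtained by repeating one discrete action."""
    grid = []
    for frame in initial_frames:
        row = []
        for action in range(num_actions):
            model = StubModel(frame)  # fresh state per cell
            row.append([model.generate_next_frame(action)
                        for _ in range(seq_len)])
        grid.append(row)
    return grid

frames = [np.zeros((8, 8)), np.ones((8, 8))]
grid = action_conditioning_grid(frames, num_actions=7, seq_len=16)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 7 16
```

Resetting the model state for every cell is what makes the columns comparable: all sequences in a row start from the identical frame and differ only in the repeated action.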


CADDY learns actions with a consistent meaning that is independent of the initial frame. Actions 1 and 7 capture leftward movement at different speeds. The model also captures forward and lifting movement (Actions 2 and 6), backward movement (Action 3) and rightward movement (Action 5). Action 4 instead corresponds to simultaneously lifting the arm and moving it to the right.

Atari Breakout

CADDY learns actions with a consistent meaning that is independent of the initial frame. Action 1 corresponds to no movement, while Actions 2 and 3 correspond to left and right movement, respectively. We found that when the player-controlled platform is positioned at the right border, Action 3 moves the platform to the left instead of leaving it in place.


CADDY learns actions with a consistent meaning that is independent of the initial frame. The model captures left (Action 7), right (Action 4), forward (Action 2) and backward (Action 3) movement, as well as no movement (Action 6) and hitting movements (Actions 1 and 5). The hitting actions are associated with lateral movement of the player.

The last row shows sequences generated in an atypical scenario where the player is close to the net. Even in this difficult situation, for which training data was scarce, the model maintains a consistent meaning for the actions and correctly generates the player, unless the player gets too close to the net or performs movements absent from the training data, such as moving backwards while close to the net, an action typically avoided by professional tennis players.

In the following, we show the same evaluation for the baseline methods.





Action Variability Embeddings Evaluation

In the following video, we show the effects that can be achieved by manipulating the action variability embeddings. In particular, given two discrete actions, it is possible to produce videos corresponding to actions whose meaning is intermediate between the two. This makes it possible to build continuous actions on top of the learned discrete ones.
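The intermediate actions above can be obtained by linearly interpolating between the embeddings of two discrete actions before feeding the result to the generator. The sketch below is a hedged illustration: the embedding vectors and function name are made up, not the paper's actual parameters.

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, alpha):
    """Linear interpolation between two action variability embeddings:
    alpha=0 reproduces action A, alpha=1 action B, and intermediate
    alphas yield actions with intermediate meaning."""
    return (1.0 - alpha) * np.asarray(emb_a) + alpha * np.asarray(emb_b)

# Illustrative embeddings for two hypothetical discrete actions.
emb_left = np.array([1.0, 0.0, 0.0])
emb_right = np.array([0.0, 0.0, 1.0])

# Feeding each interpolated embedding to the generator in place of a
# discrete action embedding would produce a continuum of behaviors.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(interpolate_embeddings(emb_left, emb_right, alpha))
```

Sweeping `alpha` continuously turns the learned discrete action set into a continuous control, which is the effect shown in the video.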

Reconstruction Results

The following videos show examples of reconstructed sequences.


While baselines such as SAVP and SAVP+ reconstruct videos where the robot arm movement correlates with that in the original sequence, our method translates the relative movements of the robot arm into the reconstructed sequence with greater accuracy. In addition, in our model the appearance of the robot arm remains consistent over the whole sequence, while in the baselines the quality of the head tends to degrade as the video progresses, especially with MoCoGAN+ and SAVP+.

Atari Breakout

In this dataset, CADDY accurately learns the physics of the player-controlled platform, the bricks and the ball, and correctly predicts the ball position even after multiple bounces. Moreover, the model learns actions that allow the platform position to closely match that of the original sequence. The baselines, on the other hand, exhibit artifacts such as a disappearing or duplicated platform, and the platform position does not match that of the original sequence.


In the Tennis dataset, SAVP, SAVP+ and CADDY produce reconstructions where the movement of the player matches that in the original sequence; however, our method reconstructs the player's position more accurately. Moreover, all baselines exhibit artifacts such as a disappearing player, while our method consistently generates the player.