In the following video we show the effect of each action on the generated video sequence. Each row corresponds to a starting frame and each column to a learned action. Starting from the initial frame, we generate a video for each of the learned actions.
Our method learns a set of actions whose meaning is consistent and independent of the starting frame. The learned actions correspond to the main movement directions. Note that each action is expressed relative to the current orientation of the camera.
As on the Minecraft dataset, our method learns a consistent action representation: the actions reliably capture each of the possible player movements.