Abstract: In this paper, we propose an approach that spatially localizes activities
in video frames, where each person can perform multiple activities at the same
time. Our approach takes into account the temporal scene context as well as the
relations among the actions of the detected persons. While the temporal context is
modeled by a temporal recurrent neural network (RNN), the relations among the
actions are modeled by a graph RNN. Both networks are trained jointly, and the
proposed approach achieves state-of-the-art results on the AVA dataset.
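As a rough illustration of how such a combination might be wired up, the sketch below pairs a temporal GRU over per-frame scene features with a simple message-passing graph RNN over per-person features, fused for multi-label action prediction. This is not the authors' implementation: the layer sizes, the GRU-based message passing, the concatenation-based fusion, and all class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneTemporalRNN(nn.Module):
    """Models temporal scene context with a GRU over per-frame scene features."""
    def __init__(self, feat_dim=1024, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, scene_feats):            # (B, T, feat_dim)
        out, _ = self.gru(scene_feats)         # (B, T, hidden_dim)
        return out[:, -1]                      # context at the keyframe (last step)

class ActionGraphRNN(nn.Module):
    """Relates detected persons by exchanging messages over a fully connected
    graph and updating node states with a GRU cell (a simple graph RNN)."""
    def __init__(self, feat_dim=1024, hidden_dim=512, steps=2):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)
        self.message = nn.Linear(hidden_dim, hidden_dim)
        self.update = nn.GRUCell(hidden_dim, hidden_dim)
        self.steps = steps

    def forward(self, person_feats):           # (N, feat_dim) for N detected persons
        h = torch.tanh(self.encode(person_feats))
        for _ in range(self.steps):
            msgs = self.message(h)
            # mean of messages from all *other* nodes
            agg = (msgs.sum(0, keepdim=True) - msgs) / max(h.size(0) - 1, 1)
            h = self.update(agg, h)
        return h                               # (N, hidden_dim)

class MultiLabelActionModel(nn.Module):
    """Fuses scene context with relational person features and predicts
    independent (multi-label) action scores for each person."""
    def __init__(self, hidden_dim=512, num_actions=80):
        super().__init__()
        self.scene_rnn = SceneTemporalRNN(hidden_dim=hidden_dim)
        self.graph_rnn = ActionGraphRNN(hidden_dim=hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, scene_feats, person_feats):
        ctx = self.scene_rnn(scene_feats)                        # (1, hidden_dim)
        nodes = self.graph_rnn(person_feats)                     # (N, hidden_dim)
        fused = torch.cat([nodes, ctx.expand(nodes.size(0), -1)], dim=1)
        return self.classifier(fused)                            # (N, num_actions) logits

# Toy example: 8 frames of scene features and 3 detected persons.
model = MultiLabelActionModel()
logits = model(torch.randn(1, 8, 1024), torch.randn(3, 1024))
# Multi-label training would use an independent sigmoid/BCE loss per action.
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
```

Training both sub-networks jointly, as the abstract states, amounts to backpropagating a single per-person multi-label loss through the fused classifier into both the temporal and the graph RNN.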