Abstract: The fundamental assumption of reinforcement learning in Markov decision
processes (MDPs) is that the relevant decision process is, in fact, Markov.
However, when MDPs have rich observations, agents typically learn by way of an
abstract state representation, and such representations are not guaranteed to
preserve the Markov property. We introduce a novel set of conditions and prove
that they are sufficient for learning a Markov abstract state representation.
We then describe a practical training procedure that combines inverse model
estimation and temporal contrastive learning to learn an abstraction that
approximately satisfies these conditions. The resulting training objective is
compatible with both online and offline training: it does not require a reward
signal, but agents can capitalize on reward information when available. We
empirically evaluate our approach on a visual gridworld domain and a set of
continuous control benchmarks. Our approach learns representations that capture
the underlying structure of the domain and lead to improved sample efficiency
over state-of-the-art deep reinforcement learning with visual features -- often
matching or exceeding the performance achieved with hand-designed compact state
features.
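
To make the combined training objective concrete, the sketch below shows one way the two terms named in the abstract -- inverse model estimation and temporal contrastive learning -- might be assembled in PyTorch. This is an illustrative assumption rather than the authors' implementation: the `encoder`, the head architectures, the within-batch shuffling used to construct negative (non-consecutive) pairs, and the unweighted sum of the two losses are all hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MarkovAbstractionLoss(nn.Module):
    """Sketch of a reward-free objective: inverse-model + temporal contrastive terms."""

    def __init__(self, encoder: nn.Module, n_actions: int, z_dim: int):
        super().__init__()
        self.encoder = encoder  # hypothetical phi: observation -> abstract state
        # Inverse model: predict the action from a pair of consecutive abstract states.
        self.inverse_head = nn.Sequential(
            nn.Linear(2 * z_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
        # Contrastive discriminator: is (z, z') a genuine consecutive pair?
        self.contrastive_head = nn.Sequential(
            nn.Linear(2 * z_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs, action, next_obs):
        z, z_next = self.encoder(obs), self.encoder(next_obs)

        # Inverse model estimation: cross-entropy on the action actually taken.
        logits = self.inverse_head(torch.cat([z, z_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, action)

        # Temporal contrastive term: shuffle next-states within the batch to
        # create fake (non-consecutive) pairs, then train a binary classifier
        # to distinguish real transitions from fakes.
        fake_next = z_next[torch.randperm(z_next.size(0))]
        pairs = torch.cat([torch.cat([z, z_next], dim=-1),
                           torch.cat([z, fake_next], dim=-1)], dim=0)
        labels = torch.cat([torch.ones(z.size(0)),
                            torch.zeros(z.size(0))]).to(z.device)
        contrastive_loss = F.binary_cross_entropy_with_logits(
            self.contrastive_head(pairs).squeeze(-1), labels)

        # Unweighted sum is an assumption; in practice the terms would likely
        # be weighted, and a reward-prediction term could be added when a
        # reward signal is available, as the abstract notes.
        return inverse_loss + contrastive_loss
```

Note that neither term touches the reward, which is what makes the objective usable offline and without a reward signal; reward information, when present, can be folded in as an additional supervised head on the abstract state.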