Reinforcement Learning as One Big Sequence Modeling Problem




Long-horizon predictions of the Trajectory Transformer compared to those of a feedforward single-step dynamics model.



Summary
We view reinforcement learning as a generic sequence modeling problem and investigate how much of the usual machinery of reinforcement learning algorithms can be replaced with the tools that have found widespread use in large-scale language modeling. The core of our approach is the Trajectory Transformer, trained on sequences of states, actions, and rewards treated as an undifferentiated stream of tokens, and a set of beam-search-based planners.
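
To make the sequence modeling view concrete, the Python sketch below shows one way a trajectory could be flattened into a single token stream by uniformly binning each state, action, and reward dimension. The bin count and uniform discretization here are illustrative assumptions, not the paper's exact scheme.

import numpy as np

def discretize(x, low, high, n_bins=100):
    # Map continuous values to integer tokens via uniform binning (illustrative only).
    x = np.clip(x, low, high)
    return ((x - low) / (high - low) * (n_bins - 1)).astype(np.int64)

def trajectory_to_tokens(states, actions, rewards, bounds, n_bins=100):
    # Interleave discretized states, actions, and rewards into one flat sequence,
    # (s_0, a_0, r_0, s_1, a_1, r_1, ...), so a standard autoregressive
    # Transformer can model all three as ordinary tokens.
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s, *bounds["state"], n_bins))
        tokens.extend(discretize(a, *bounds["action"], n_bins))
        tokens.append(int(discretize(np.array([r]), *bounds["reward"], n_bins)[0]))
    return np.array(tokens)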



Transformers as dynamics models
Predictive dynamics models often have low single-step error but poor long-horizon accuracy because errors compound over the course of a rollout. We show that Transformers are substantially more reliable long-horizon predictors than state-of-the-art single-step models, even in continuous, Markovian domains.
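
The Python sketch below illustrates how such long-horizon predictions are produced: tokens are sampled one at a time, each conditioned on the entire generated prefix rather than only the most recent state. The model callable here (token prefix to next-token logits) is an assumed interface for illustration, not the released codebase's API.

import torch

@torch.no_grad()
def rollout(model, prefix_tokens, horizon_tokens, temperature=1.0):
    # prefix_tokens: (1, T) tensor of already-observed tokens.
    # Each new token is predicted from the full generated history, so the model
    # is never forced to recursively consume a single, possibly erroneous state.
    tokens = prefix_tokens.clone()
    for _ in range(horizon_tokens):
        logits = model(tokens)[:, -1, :]                 # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens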



Attention patterns of the Trajectory Transformer, showing (left) a discovered Markovian strategy and (right) an approach with action smoothing.



Beam search as trajectory optimizer
  • Decoding a Trajectory Transformer with unmodified beam search gives rise to a model-based imitative method that optimizes entire predicted trajectories to match those of an expert policy.
  • Conditioning trajectories on a future desired state alongside previously-encountered states yields a goal-reaching method.


  • Replacing the sequence model's log-probabilities with reward predictions yields a model-based planning method that is surprisingly effective despite lacking many of the ingredients usually required to make planning with learned models work (a sketch of this variant follows the list).

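A minimal sketch of that last variant, assuming two illustrative interfaces (not the released planner's API): model maps a batch of token prefixes to next-token logits, and reward_fn returns the predicted reward contributed by each candidate's newest step. Candidates are expanded by sampling from the sequence model, as in ordinary beam search, but ranked by predicted cumulative reward rather than log-probability.

import torch

@torch.no_grad()
def reward_beam_search(model, reward_fn, prefix, horizon, beam_width=64, n_expand=2):
    beams = prefix.repeat(beam_width, 1)        # (beam_width, T) copies of the prefix
    rewards = torch.zeros(beam_width)
    for _ in range(horizon):
        logits = model(beams)[:, -1, :]         # next-token logits for every beam
        probs = torch.softmax(logits, dim=-1)
        # Expand each beam with a few sampled continuations.
        candidates = beams.repeat_interleave(n_expand, dim=0)
        next_tokens = torch.multinomial(probs, n_expand).reshape(-1, 1)
        candidates = torch.cat([candidates, next_tokens], dim=1)
        # Score candidates by predicted cumulative reward instead of log-probability.
        rewards = rewards.repeat_interleave(n_expand) + reward_fn(candidates)
        keep = torch.topk(rewards, beam_width).indices
        beams, rewards = candidates[keep], rewards[keep]
    return beams[rewards.argmax()]              # highest-reward predicted trajectory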




Related Publication
Chen et al. concurrently proposed another sequence modeling approach to reinforcement learning. At a high level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (e.g., in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (e.g., Atari). We encourage you to check out their work as well.