Offline Reinforcement Learning as One Big Sequence Modeling Problem

Figure (panels: Trajectory Transformer, Single-Step Model): Long-horizon predictions of the Trajectory Transformer compared to those of a feedforward single-step dynamics model.

We view reinforcement learning as a generic sequence modeling problem and investigate how much of the usual machinery of reinforcement learning algorithms can be replaced with the tools that have found widespread use in large-scale language modeling. The core of our approach is the Trajectory Transformer, trained on sequences of states, actions, and rewards, all treated as interchangeable tokens, together with a set of beam-search-based planners.
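To make the "one big sequence" idea concrete, the sketch below interleaves discretized state dimensions, action dimensions, and rewards into a single flat token stream. This is a simplified illustration, not the paper's exact tokenization; the `discretize` helper, the uniform binning, and the bin count are all assumptions made here for clarity.

```python
import numpy as np

def discretize(x, low=-1.0, high=1.0, n_bins=100):
    """Map a continuous scalar to an integer token in [0, n_bins).
    (Hypothetical helper; the paper's scheme may differ.)"""
    x = float(np.clip(x, low, high))
    return int((x - low) / (high - low) * (n_bins - 1))

def flatten_trajectory(states, actions, rewards):
    """Interleave tokens as s_1 ... a_1 ... r_1, s_2 ..., so that the
    Transformer sees states, actions, and rewards as one sequence."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens += [discretize(x) for x in s]   # state dimensions
        tokens += [discretize(x) for x in a]   # action dimensions
        tokens.append(discretize(r))           # scalar reward
    return tokens

# One 2-step trajectory with 2-d states and 1-d actions:
states  = [np.array([0.1, -0.3]), np.array([0.2, -0.1])]
actions = [np.array([0.5]), np.array([-0.5])]
rewards = [0.0, 0.9]
tokens = flatten_trajectory(states, actions, rewards)
# each step contributes 2 + 1 + 1 = 4 tokens
```

The model is then trained autoregressively on this stream, exactly as a language model would be trained on text.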


Transformers as dynamics models
Predictive dynamics models often achieve excellent single-step accuracy but poor long-horizon accuracy due to compounding errors. We show that Transformers are more reliable long-horizon predictors than state-of-the-art single-step models, even in continuous Markovian domains.
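The compounding-error phenomenon is easy to demonstrate numerically. The toy experiment below (an illustrative assumption, not from the paper) rolls out a learned single-step model of a linear system whose parameters are only slightly wrong; the per-step error is tiny, but the open-loop prediction error grows with horizon because each prediction feeds back into the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# True dynamics: a stable linear system s' = A s.
A = np.array([[0.99, 0.05],
              [-0.05, 0.99]])
# A "learned" single-step model with a small parameter error.
A_hat = A + 0.01 * rng.standard_normal((2, 2))

s_true = np.array([1.0, 0.0])
s_pred = s_true.copy()
errors = []
for t in range(100):
    s_true = A @ s_true
    s_pred = A_hat @ s_pred   # predictions feed back into themselves
    errors.append(float(np.linalg.norm(s_true - s_pred)))
```

After 100 steps the rollout error is orders of magnitude larger than the single-step error, even though the model is accurate at every individual step.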

Attention patterns of the Trajectory Transformer, showing (left) a discovered Markovian strategy and (right) an approach with action smoothing.

Beam search as trajectory optimizer
  • Decoding a Trajectory Transformer with unmodified beam search gives rise to a model-based imitative method that optimizes for entire predicted trajectories to match those of an expert policy.
  • Conditioning trajectories on a future desired state alongside previously-encountered states yields a goal-reaching method.


  • Replacing log-probabilities from the sequence model with reward predictions yields a model-based planning method that is surprisingly effective despite lacking many of the components usually required to make planning with learned models work.
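The last variant amounts to running standard beam search while ranking candidates by cumulative predicted reward rather than sequence log-probability. The sketch below illustrates this on a toy problem; `predicted_reward` is a hypothetical stand-in for the Trajectory Transformer's reward predictions, not the paper's implementation.

```python
import heapq

def predicted_reward(prefix, token):
    """Toy stand-in for a learned model's reward prediction
    (hypothetical; the real model is a Transformer over
    discretized trajectory tokens)."""
    base = {0: 0.1, 1: 0.5, 2: 0.3}[token]
    return base * (1.0 if len(prefix) % 2 == 0 else 0.8)

def beam_search(horizon, vocab=(0, 1, 2), beam_width=2):
    """Standard beam search, except candidates are ranked by
    cumulative predicted reward instead of log-probability."""
    beams = [((), 0.0)]  # (token prefix, cumulative reward)
    for _ in range(horizon):
        candidates = []
        for prefix, total in beams:
            for tok in vocab:
                candidates.append(
                    (prefix + (tok,), total + predicted_reward(prefix, tok)))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0]  # highest-reward trajectory found

plan, total = beam_search(horizon=4)
```

Swapping the scoring function is the only change: the decoding loop itself is the unmodified beam search used in language generation.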


Related Publication
Chen et al. concurrently proposed another sequence modeling approach to reinforcement learning. At a high level, ours is more model-based in spirit and theirs is more model-free, which allows us to evaluate Transformers as long-horizon dynamics models (e.g., in the humanoid predictions above) and allows them to evaluate their policies in image-based environments (e.g., Atari). We encourage you to check out their work as well.