
Pre-training generalist agents using offline reinforcement learning – Google AI Blog



Reinforcement learning (RL) algorithms can learn skills to solve decision-making tasks like playing games, enabling robots to pick up objects, or even optimizing microchip designs. However, running RL algorithms in the real world requires expensive active data collection. Pre-training on diverse datasets has proven to enable data-efficient fine-tuning for individual downstream tasks in natural language processing (NLP) and vision problems. In the same way that BERT or GPT-3 models provide general-purpose initialization for NLP, large RL pre-trained models could provide general-purpose initialization for decision-making. So, we ask the question: Can we enable similar pre-training to accelerate RL methods and create a general-purpose “backbone” for efficient RL across various tasks?

In “Offline Q-learning on Diverse Multi-Task Data Both Scales and Generalizes”, to be published at ICLR 2023, we discuss how we scaled offline RL, which can be used to train value functions on previously collected static datasets, to provide such a general pre-training method. We demonstrate that Scaled Q-Learning using a diverse dataset is sufficient to learn representations that facilitate rapid transfer to novel tasks and fast online learning on new variations of a task, improving significantly over existing representation learning approaches and even Transformer-based methods that use much larger models.

Scaled Q-learning: Multi-task pre-training with conservative Q-learning

To provide a general-purpose pre-training approach, offline RL needs to be scalable, allowing us to pre-train on data across different tasks and utilize expressive neural network models to acquire powerful pre-trained backbones that can be specialized to individual downstream tasks. We based our offline RL pre-training method on conservative Q-learning (CQL), a simple offline RL method that combines standard Q-learning updates with an additional regularizer that minimizes the value of unseen actions. With discrete actions, the CQL regularizer is equivalent to a standard cross-entropy loss, which is a simple, one-line modification on standard deep Q-learning. A few crucial design decisions made this possible:

  • Neural network size: We found that multi-game Q-learning required large neural network architectures. While prior methods often used relatively shallow convolutional networks, we found that models as large as a ResNet 101 led to significant improvements over smaller models.
  • Neural network architecture: To learn pre-trained backbones that are useful for new games, our final architecture uses a shared neural network backbone, with separate 1-layer heads outputting the Q-values of each game. This design avoids interference between the games during pre-training, while still providing enough data sharing to learn a single shared representation. Our shared vision backbone also utilized a learned position embedding (akin to Transformer models) to keep track of spatial information in the game.
  • Representational regularization: Recent work has observed that Q-learning tends to suffer from representational collapse issues, where even large neural networks can fail to learn effective representations. To counteract this issue, we leverage our prior work to normalize the last-layer features of the shared part of the Q-network. Additionally, we utilized a categorical distributional RL loss for Q-learning, which is known to provide richer representations that improve downstream task performance.
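
To make the pieces above concrete, here is a minimal sketch in plain Python of how normalized shared features, a per-game 1-layer head, and the discrete-action CQL term could fit together. It is an illustration under stated assumptions, not the implementation from the paper: the function names are hypothetical, L2 normalization stands in for the feature normalization from the prior work (whose exact form is not given here), and a plain squared TD error replaces the categorical distributional loss.

```python
import math


def l2_normalize(features, eps=1e-8):
    """Scale a feature vector to unit L2 norm (assumed form of normalization)."""
    norm = math.sqrt(sum(f * f for f in features))
    return [f / (norm + eps) for f in features]


def head_q_values(features, head_weights, head_bias):
    """One linear (1-layer) head: Q(s, a) = w_a . features + b_a per action a."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(head_weights, head_bias)
    ]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_regularizer(q_values, dataset_action):
    """Discrete-action CQL term: logsumexp_a Q(s, a) - Q(s, a_data).

    This is the cross-entropy between softmax(Q(s, .)) and the action
    that was actually taken in the dataset.
    """
    return log_sum_exp(q_values) - q_values[dataset_action]


def cql_loss(q_values, dataset_action, td_target, alpha=1.0):
    """Squared TD error on the dataset action plus the weighted CQL term.

    `alpha` is a hypothetical trade-off weight; the squared TD error is a
    simplification of the distributional loss described in the text.
    """
    td_error = q_values[dataset_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_regularizer(q_values, dataset_action)


# Hypothetical usage: one shared feature vector, one head for a single game.
shared_features = l2_normalize([3.0, 4.0])
q = head_q_values(shared_features, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
loss = cql_loss(q, dataset_action=1, td_target=1.0)
```

Because the regularizer pushes down the log-sum-exp of all Q-values while pushing up the Q-value of the dataset action, actions that never appear in the data end up with lower values, which is the conservative behavior described above.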

The multi-task Atari benchmark

We evaluate our approach for scalable offline RL on a suite of Atari games, where the goal is to train a single RL agent to play a collection of games using heterogeneous data from low-quality (i.e., suboptimal) players, and then use the resulting network backbone to quickly learn new variations of the pre-training games or completely new games. Training a single policy that can play many different Atari games is difficult enough even with standard online deep RL methods, as each game requires a different strategy and different representations. In the offline setting, some prior works, such as multi-game decision transformers, proposed to dispense with RL entirely and instead utilize conditional imitation learning in an attempt to scale with large neural network architectures, such as transformers. However, in this work, we show that this kind of multi-game pre-training can be done effectively via RL by employing CQL in combination with the few careful design decisions described above.

Scalability on training games

We evaluate the Scaled Q-Learning method’s performance and scalability using two data compositions: (1) near-optimal data, consisting of all the training data appearing in replay buffers of previous RL runs, and (2) low-quality data, consisting of data from the first 20% of the trials in the replay buffer (i.e., only data from highly suboptimal policies). In our results, we compare Scaled Q-Learning with an 80-million parameter model to multi-game decision transformers (DT) with either 40-million or 80-million parameter models, and to a behavioral cloning (imitation learning) baseline (BC). We observe that Scaled Q-Learning is the only approach that improves over the offline data, attaining about 80% of human normalized performance.
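
As a rough illustration of the two quantities in this paragraph, the sketch below shows one way to carve out the low-quality split and to compute a human normalized score. The function names and all numbers are hypothetical; the normalization is the usual Atari convention in which random play maps to 0 and the human reference maps to 1, and we assume the buffer is ordered chronologically.

```python
def first_percent(replay_buffer, percent=20):
    """Return the earliest `percent` of a chronologically ordered buffer."""
    cutoff = len(replay_buffer) * percent // 100
    return replay_buffer[:cutoff]


def human_normalized_score(agent_score, random_score, human_score):
    """0.0 corresponds to random play, 1.0 to the human reference score."""
    return (agent_score - random_score) / (human_score - random_score)


# Hypothetical usage with made-up values.
buffer = list(range(10))            # stand-in for ten logged trials
low_quality = first_percent(buffer)  # the two earliest trials
score = human_normalized_score(agent_score=50.0, random_score=10.0, human_score=60.0)
```

Under this convention, "about 80% of human normalized performance" corresponds to a score of roughly 0.8.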

Further, Scaled Q-Learning not only improves performance but also enjoys favorable scaling properties: just as the performance of pre-trained language and vision models improves as network sizes get bigger, following what is typically referred to as “power-law scaling”, we show that the performance of Scaled Q-Learning enjoys similar scaling properties. While this may be unsurprising, this kind of scaling has been elusive in RL, with performance often deteriorating with larger model sizes. This suggests that Scaled Q-Learning, in combination with the above design choices, better unlocks the ability of offline RL to utilize large models.
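
For readers unfamiliar with the term, a power law relates a quantity y to model size N as y ≈ c · N^k, which is a straight line in log-log space. The sketch below fits c and k by ordinary least squares on the logarithms. It is a generic illustration: the function name is hypothetical, the data points are synthetic, and the post does not specify which exact metric was fitted.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of values ~ c * sizes**k in log-log space.

    Returns (c, k). All inputs must be positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    k = cov / var
    c = math.exp(mean_y - k * mean_x)
    return c, k


# Synthetic example: values generated from an exact power law with k = 0.3.
sizes = [1e6, 1e7, 1e8]
values = [2.0 * s ** 0.3 for s in sizes]
c, k = fit_power_law(sizes, values)
```

A positive fitted exponent k means the metric keeps growing as the model gets larger, while a value near zero or below would indicate the kind of stagnation or deterioration mentioned above.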

Fine-tuning to new games and variations

To evaluate fine-tuning from this offline initialization, we consider two settings: (1) fine-tuning to a new, entirely unseen game with a small amount of offline data from that game, corresponding to 2M transitions of gameplay, and (2) fine-tuning to a new variant of the games with online interaction. We first consider fine-tuning from offline gameplay data. Note that this condition is generally more favorable to imitation-style methods, such as Decision Transformer and behavioral cloning, since the offline data for the new games is of relatively high quality. Nonetheless, we see that in most cases Scaled Q-Learning improves over alternative approaches (80% on average), as well as over dedicated representation learning methods, such as MAE or CPC, which only use the offline data to learn visual representations rather than value functions.

In the online setting, we see even larger improvements from pre-training with Scaled Q-Learning. In this case, representation learning methods like MAE yield minimal improvement during online RL, whereas Scaled Q-Learning can successfully integrate prior knowledge about the pre-training games to significantly improve the final score after 20k online interaction steps.

These results demonstrate that pre-training generalist value function backbones with multi-task offline RL can significantly boost the performance of RL on downstream tasks, in both offline and online mode. Note that these fine-tuning tasks are quite difficult: the various Atari games, and even variants of the same game, differ significantly in appearance and dynamics. For example, the target blocks in Breakout disappear in the variation of the game as shown below, making control difficult. However, the success of Scaled Q-Learning, particularly as compared to visual representation learning techniques, such as MAE and CPC, suggests that the model is in fact learning some representation of the game dynamics, rather than merely providing better visual features.

Fine-tuning with online RL for variants of the games Freeway, Hero, and Breakout. The new variant used in fine-tuning is shown in the bottom row of each figure; the original game seen in pre-training is in the top row. Fine-tuning from Scaled Q-Learning significantly outperforms MAE (a visual representation learning method) and learning from scratch with single-game DQN.

Conclusion and takeaways

We presented Scaled Q-Learning, a pre-training method for scaled offline RL that builds on the CQL algorithm, and demonstrated how it enables efficient offline RL for multi-task training. This work made initial progress towards enabling more practical real-world training of RL agents as an alternative to costly and complex simulation-based pipelines or large-scale experiments. Perhaps in the long run, similar work will lead to generally capable pre-trained RL agents that develop broadly applicable exploration and interaction skills from large-scale offline pre-training. Validating these results on a broader range of more realistic tasks, in domains such as robotics (see some initial results) and NLP, is an important direction for future research. Offline RL pre-training has a lot of potential, and we expect to see many advances in this area in future work.

Acknowledgements

This work was done by Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Special thanks to Sherry Yang, Ofir Nachum, and Kuang-Huei Lee for help with the multi-game decision transformer codebase for evaluation and the multi-game Atari benchmark, and Tom Small for illustrations and animation.
