Computer Sciences Colloquium - Generalization and Safety in Reinforcement Learning and Control
Aviv Tamar
Abstract:
Reinforcement learning (RL) is an area of machine learning concerned with learning decision making and control through trial and error. Motivated by recent impressive results from combining RL with deep neural networks, there is renewed excitement in the field and a promise of autonomous robot control, among other AI domains.
This talk will focus on two challenges at the forefront of RL research: how to ensure the safety of the learned policy with respect to various sources of uncertainty, and how to learn policies that can generalize well to variations in the task and environment.
Our approach to safety is inspired by ideas from mathematical finance and the theory of risk-sensitive decision making. We show that by incorporating risk measures into the RL optimization objective, we can learn policies that guarantee safety against noise and modelling errors. We develop efficient algorithms for learning with such risk-sensitive objectives, and provide theoretical convergence guarantees and error bounds for our methods.
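For concreteness, one representative way to write such a risk-sensitive objective (a sketch of the general idea, not necessarily the exact formulation used in this work) is to maximize expected return subject to a constraint on the conditional value at risk (CVaR) of the random return; the symbols R_theta, alpha and beta below are illustrative:

```latex
% Illustrative risk-sensitive RL objective: maximize expected return
% while constraining the CVaR of the return R_\theta (assumed notation).
\[
  \max_{\theta}\; \mathbb{E}\!\left[R_\theta\right]
  \quad \text{s.t.} \quad
  \mathrm{CVaR}_{\alpha}\!\left(R_\theta\right) \ge \beta,
  \qquad
  \mathrm{CVaR}_{\alpha}(R)
    = \max_{\nu \in \mathbb{R}}
      \Big\{ \nu - \tfrac{1}{\alpha}\,\mathbb{E}\big[(\nu - R)_{+}\big] \Big\},
\]
where $\alpha \in (0,1)$ sets the size of the worst-case tail of returns that the
constraint protects against, and $\beta$ is the required performance on that tail.
```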
To better cope with generalization in deep RL, we introduce the Value Iteration Network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs are suitable for predicting outcomes that involve planning-based reasoning, such as policies for RL and control. Key to our approach is a novel differentiable approximation of the value iteration planning algorithm, which can be represented as a convolutional neural network. This network can be trained end-to-end to learn the parameters of a planning computation that is relevant for the task at hand. We show that by learning such a planning computation, VIN policies generalize better to variations in the task and environment.
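The following is a minimal sketch of the recurrent convolution-and-max structure that approximates value iteration inside a VIN; the class name, number of iterations, action channels and kernel size are hypothetical choices for illustration, not the authors' code:

```python
# Sketch of a VIN-style planning module: K applications of a convolution
# over [reward map, value map] followed by a max over the action channels.
import torch
import torch.nn as nn

class ValueIterationModule(nn.Module):
    def __init__(self, n_actions=8, kernel_size=3, n_iters=20):
        super().__init__()
        # One Q-value channel per action; the 2 input channels are the
        # learned reward map and the current value map.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size,
                                padding=kernel_size // 2, bias=False)
        self.n_iters = n_iters

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W) reward image produced by earlier layers
        value = torch.zeros_like(reward_map)
        for _ in range(self.n_iters):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))
            value, _ = q.max(dim=1, keepdim=True)  # max over action channels
        return value  # approximate value map, fed to a policy head

# Usage sketch: plan over a 16x16 grid-world reward map.
vin = ValueIterationModule()
value_map = vin(torch.randn(1, 1, 16, 16))
```

Because every operation is differentiable, the convolution weights (which encode the transition and reward structure of the planning computation) can be trained end-to-end from the task loss.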
I will also discuss an extension of the VIN idea to continuous control, based on a model-predictive control (MPC) framework. The performance of this method will be demonstrated on learning object manipulation tasks with a PR2 robot.
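As background, the generic receding-horizon MPC loop looks roughly as follows; this sketch only illustrates the MPC idea itself, and the function names, sampling-based optimizer and parameters are assumptions, not the specific controller or learned model discussed in the talk:

```python
# Generic receding-horizon MPC step (illustrative only): sample short
# action sequences, roll them out through a model, keep the best first action.
import numpy as np

def mpc_step(state, dynamics, cost, horizon=10, n_samples=256, action_dim=2):
    """Return the first action of the lowest-cost sampled action sequence."""
    best_cost, best_action = np.inf, np.zeros(action_dim)
    for _ in range(n_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)   # forward model (learned or analytic)
            total += cost(s, a)  # task cost, e.g. distance to the goal
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action  # apply it, observe the next state, then replan
```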