Fairbank, Michael (2014). Value-Gradient Learning. (Unpublished Doctoral thesis, City University London)

Abstract
This thesis presents an Adaptive Dynamic Programming method, Value-Gradient Learning, for solving a control optimisation problem, using a neural network to represent a critic function in a large continuous-valued state space. The algorithm developed, called VGL(λ), requires a learned differentiable model of the environment. VGL(λ) is an extension of Dual Heuristic Programming (DHP) to include a bootstrapping parameter, λ, analogous to that used in the reinforcement learning algorithm TD(λ). Online and batch-mode implementations of the algorithm are provided, and its theoretical relationships to its precursor algorithms, DHP and TD(λ), are described.
A theoretical result is given which shows that to achieve trajectory optimality in a continuous-valued state space, the critic must learn the value-gradient, and this fact affects any critic-learning algorithm. The connection of this result to Pontryagin's Minimum Principle is made clear. Hence it is proven that learning this value-gradient directly will obviate the need for local exploration of the value function, and this motivates value-gradient learning methods in terms of automatic local value exploration and improved learning speed. Empirical results for the algorithm are given for several benchmark problems, and the improved speed, convergence, and ability to work without local value exploration are demonstrated in comparison to its precursor algorithms, TD(λ) and DHP.
A convergence proof for one instance of the VGL(λ) algorithm is given, which is valid for control problems with a greedy policy, and a general nonlinear function approximator to represent the critic. This is a nontrivial accomplishment, since most or all other related algorithms can be made to diverge under similar conditions, and new divergence proofs demonstrating this for certain algorithms are given in the thesis.
Several technical problems must be overcome to make a robust VGL(λ) implementation, and their solutions are described. These include implementing an efficient greedy policy, implementing trajectory clipping correctly, and computing second-order gradients efficiently with a neural network.
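The abstract describes λ as a bootstrapping parameter analogous to the one in TD(λ). As a rough illustration of that role, the sketch below computes the standard scalar λ-return targets of TD(λ) by a backward recursion over one trajectory. It is not the VGL(λ) algorithm itself, which applies a comparable recursion to value-gradients and needs derivatives of the environment model. The function name `lambda_returns` and its arguments are hypothetical names chosen for this example.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Lambda-return targets for one trajectory (scalar TD(lambda) form).

    rewards[t]     -- reward received on step t
    next_values[t] -- critic estimate of the value of the state reached
                      after step t (use 0.0 for a terminal state)
    gamma          -- discount factor
    lam            -- bootstrapping parameter, between 0 and 1
    """
    targets = [0.0] * len(rewards)
    # Beyond the last recorded step there is nothing to roll out,
    # so the recursion starts from the critic's own estimate.
    g_next = next_values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the critic's one-step estimate with the longer return.
        blended = (1.0 - lam) * next_values[t] + lam * g_next
        g_next = rewards[t] + gamma * blended
        targets[t] = g_next
    return targets


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [0.5, 0.5, 0.0]  # last state is terminal
    print(lambda_returns(rewards, next_values, gamma=1.0, lam=0.0))
    print(lambda_returns(rewards, next_values, gamma=1.0, lam=1.0))
```

With `lam=0.0` each target relies entirely on the critic's estimate of the next state (one-step bootstrapping); with `lam=1.0` the targets become the full discounted return along the trajectory.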
Item Type:  Thesis (Doctoral) 

Subjects:  Q Science > QA Mathematics > QA75 Electronic computers. Computer science 
Divisions:  School of Informatics > Department of Computing City University London PhD theses 
URI:  http://openaccess.city.ac.uk/id/eprint/3438 