Reinforcement learning with functional representation

In reinforcement learning, an intelligent, autonomous agent is placed in an environment. It observes the state of the environment and takes actions. After each action, the agent receives a reward from the environment and transitions to a new state. The aim of the agent is to learn a state-to-action mapping from its state-action-reward history such that its accumulated long-term reward is maximized. This state-to-action mapping is usually called a policy.
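
To make the setting concrete, below is a minimal sketch (in plain Java) of the agent-environment interaction loop described above. The Environment and Policy interfaces and their method names are illustrative assumptions for this example only, not part of the released code.

    // A minimal sketch of the agent-environment interaction loop.
    // The Environment and Policy interfaces are hypothetical, for illustration only.
    public class RLLoop {
        interface Environment {
            double[] reset();              // return the initial state
            double[] step(int action);     // apply an action, return the next state
            double lastReward();           // reward received for the last step
            boolean isTerminal();          // has the episode finished?
        }

        interface Policy {
            int selectAction(double[] state);   // the state-to-action mapping
        }

        /** Run one episode and return the accumulated (discounted) reward. */
        static double runEpisode(Environment env, Policy policy, double gamma) {
            double[] state = env.reset();
            double ret = 0.0, discount = 1.0;
            while (!env.isTerminal()) {
                int action = policy.selectAction(state);  // observe state, take action
                state = env.step(action);                 // environment transitions
                ret += discount * env.lastReward();       // accumulate long-term reward
                discount *= gamma;
            }
            return ret;
        }
    }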

In real-world tasks, such as driving vehicles and manipulating robotic hands, the mapping from states to actions is commonly highly complex and rarely linear. We expect the agent to adaptively learn a policy that fits such complex situations. Functional representation, in which a function is represented as a combination of basis functions, is a powerful tool for learning non-linear functions.
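
As a rough illustration of functional representation, the sketch below evaluates a function as a weighted sum of accumulated basis functions, h(s) = sum_i w_i * b_i(s). The BasisFunction interface and the class names are assumptions made for this example, not the interface of the released code.

    import java.util.ArrayList;
    import java.util.List;

    // A functionally represented function: a weighted combination of basis functions.
    public class FunctionalApproximator {
        interface BasisFunction {
            double evaluate(double[] state);
        }

        private final List<BasisFunction> bases = new ArrayList<>();
        private final List<Double> weights = new ArrayList<>();

        /** Learning keeps adding weighted basis functions to the combination. */
        public void addBasis(BasisFunction b, double weight) {
            bases.add(b);
            weights.add(weight);
        }

        /** Every evaluation must invoke all accumulated basis functions. */
        public double evaluate(double[] state) {
            double sum = 0.0;
            for (int i = 0; i < bases.size(); i++) {
                sum += weights.get(i) * bases.get(i).evaluate(state);
            }
            return sum;
        }
    }

Each additional basis function increases the cost of every subsequent call to evaluate, which is exactly the time-cost issue that the napping mechanism below addresses.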

Napping

Qing Da, Yang Yu, and Zhi-Hua Zhou. Napping for functional representation of policy. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS'14), Paris, France, 2014.

A practical difficulty of learning a functionally represented policy is that it involves training and accumulating many basis functions, all of which must be invoked in every evaluation of the policy. Since the policy is evaluated repeatedly during both the training and prediction stages of reinforcement learning, functional representation suffers from a large time cost for computing every constituent basis function. We therefore proposed the napping mechanism to reduce the time cost of using a functionally represented policy in reinforcement learning. The idea is to periodically replace the learned function with a simple approximation function during the learning process: for a given policy formed by a set of models, an approximation model is obtained by mimicking the input-output behavior of the policy, as sketched below.
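
The following is a minimal sketch of one napping step under assumed Function and Regressor interfaces; it is not the released implementation, only an illustration of mimicking the policy's input-output behavior with a single cheap model.

    import java.util.List;

    // Napping sketch: fit one simple model to the input-output behaviour of the
    // current (expensive) functional policy, then use the cheap model in its place.
    // Function and Regressor are hypothetical interfaces for this illustration.
    public class Napping {
        interface Function {                  // the current functional policy
            double evaluate(double[] state);
        }

        interface Regressor extends Function {
            void fit(double[][] inputs, double[] targets);   // supervised regression
        }

        /**
         * Build a cheap approximation of `policy` by mimicking its outputs
         * on a set of sampled states (e.g. states visited recently).
         */
        static Regressor nap(Function policy, List<double[]> sampledStates, Regressor approximator) {
            double[][] inputs = sampledStates.toArray(new double[0][]);
            double[] targets = new double[inputs.length];
            for (int i = 0; i < inputs.length; i++) {
                targets[i] = policy.evaluate(inputs[i]);   // query the expensive policy once per sample
            }
            approximator.fit(inputs, targets);             // one model mimics the whole combination
            return approximator;                           // used in place of the accumulated basis functions
        }
    }

During learning, such a step can be triggered periodically, so that subsequent policy evaluations invoke a single approximation model rather than the whole accumulated set of basis functions.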

The code used in the experiments of the above paper can be downloaded here, where napping is implemented within the Non-Parametric Policy Gradient (NPPG) method.
  • The source files are standalone Java code, except for the WEKA 3.6 package, and do not currently use RL-Glue.
  • Note that we could not obtain the NPPG code from the original authors, as their code appears to have been lost, so we made our own implementation. There is no guarantee that this NPPG code reproduces the performance reported in the original paper (Kersting & Driessens, ICML'08).
  • The code is released under the GNU GPL 2.0 license. For commercial purposes, please contact me.
