Self‐regulating action exploration in reinforcement learning
The basic tenet of a learning process is for an agent to learn for only as much and as long as it is necessary. With reinforcement learning, the learning process is divided between exploration and exploitation. Given the complexity of the problem domain and the randomness of the learning process, th...
Main Authors: | TENG, Teck-Hou; TAN, Ah-hwee; TAN, Yuan-Sin |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2012 |
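Subjects: | Reinforcement learning; Exploration-exploitation dilemma; k-armed bandit; Pursuit-evasion; Self-organizing neural network; Computer Engineering; Databases and Information Systems; OS and Networks |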
Online Access: | https://ink.library.smu.edu.sg/sis_research/5239 https://ink.library.smu.edu.sg/context/sis_research/article/6242/viewcontent/82448677.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-6242 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-6242; 2020-07-23T18:25:02Z; Self‐regulating action exploration in reinforcement learning; TENG, Teck-Hou; TAN, Ah-hwee; TAN, Yuan-Sin; 2012-01-01T08:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/5239; info:doi/10.1016/j.procs.2012.09.110; https://ink.library.smu.edu.sg/context/sis_research/article/6242/viewcontent/82448677.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; Reinforcement learning; Exploration-exploitation dilemma; k-armed bandit; Pursuit-evasion; Self-organizing neural network; Computer Engineering; Databases and Information Systems; OS and Networks |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Reinforcement learning; Exploration-exploitation dilemma; k-armed bandit; Pursuit-evasion; Self-organizing neural network; Computer Engineering; Databases and Information Systems; OS and Networks |
description |
The basic tenet of a learning process is that an agent should learn only as much, and for only as long, as is necessary. In reinforcement learning, the learning process is divided between exploration and exploitation. Given the complexity of the problem domain and the randomness of the learning process, the exact duration of the reinforcement learning process can never be known with certainty. Using an inaccurate number of training iterations leads to either non-convergence or over-training of the learning agent. This work addresses these issues by proposing a technique that self-regulates the exploration rate and the training duration so that learning converges efficiently. The idea originates from the intuition that exploration is only necessary when the success rate is low, which means the exploration rate should be inversely proportional to the success rate. In addition, the change in exploration-exploitation rates alters the duration of the learning process, so the duration adapts to the current status of the learning process. Experimental results from the K-Armed Bandit and Air Combat Maneuver scenario show that optimal action policies can be discovered using the right amount of training iterations. In essence, the proposed method eliminates the guesswork on the amount of exploration needed during reinforcement learning. |
format |
text |
author |
TENG, Teck-Hou; TAN, Ah-hwee; TAN, Yuan-Sin |
author_sort |
TENG, Teck-Hou |
title |
Self‐regulating action exploration in reinforcement learning |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2012 |
url |
https://ink.library.smu.edu.sg/sis_research/5239 https://ink.library.smu.edu.sg/context/sis_research/article/6242/viewcontent/82448677.pdf |
_version_ |
1770575345870700544 |
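
The description in this record states that exploration should vary inversely with the success rate, and that this in turn makes the training duration adaptive. The following is a minimal sketch of that idea on a Bernoulli k-armed bandit, not the authors' published algorithm. Everything specific in it is an assumption made for illustration: the function name `self_regulated_bandit`, the rule `epsilon = 1 - success_rate`, the sliding window used to measure success, and the stopping threshold `epsilon_floor`.

```python
import random
from collections import deque


def self_regulated_bandit(arm_probs, window=50, epsilon_floor=0.05,
                          max_trials=10_000, seed=0):
    """Epsilon-greedy bandit whose exploration rate tracks recent failure.

    Assumption (not from the record): epsilon = 1 - success rate over the
    last `window` trials, and training stops once the window is full and
    epsilon has dropped to `epsilon_floor` or below.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k          # number of pulls per arm
    values = [0.0] * k        # running mean reward per arm
    recent = deque(maxlen=window)  # rewards of the most recent trials
    epsilon = 1.0             # no successes yet, so explore fully
    history = []              # epsilon after each trial
    trials = 0

    for trials in range(1, max_trials + 1):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=values.__getitem__)   # exploit

        # Bernoulli reward: 1 counts as a success, 0 as a failure.
        reward = 1 if rng.random() < arm_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

        # Self-regulation: exploration falls as the success rate rises.
        recent.append(reward)
        success_rate = sum(recent) / len(recent)
        epsilon = 1.0 - success_rate
        history.append(epsilon)

        # Adaptive duration: stop once exploration is no longer needed.
        if len(recent) == window and epsilon <= epsilon_floor:
            break

    return values, history, trials


if __name__ == "__main__":
    values, history, trials = self_regulated_bandit([0.2, 0.5, 0.98])
    print("estimated arm values:", [round(v, 3) for v in values])
    print("trials used:", trials, "final epsilon:", round(history[-1], 3))
```

In this sketch the number of trials is an output rather than an input: the loop ends when the measured success rate makes further exploration unnecessary, with `max_trials` acting only as a safety cap. Note that under this rule the success rate can never exceed the best arm's payout probability, so `epsilon_floor` has to be set above the failure rate of the best arm for the early stop to trigger.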