Safety through feedback in constrained RL

In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach.
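
To make the two mechanisms described in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the feature-vector states, the binary safe/unsafe feedback signal, the mean-feature distance used as the novelty measure, the logistic-regression cost model, and every name in the snippet (label_states_from_trajectory_feedback, is_novel, cost_classifier) are assumptions made purely for illustration.

# Illustrative sketch only (not the paper's code): trajectory-level safety
# feedback is propagated to every state as a noisy label, a classifier is fit
# on those labels as a surrogate cost, and a simple novelty rule decides which
# trajectories to show the evaluator.
import numpy as np
from sklearn.linear_model import LogisticRegression


def label_states_from_trajectory_feedback(trajectories, feedback):
    """Copy each trajectory-level label (1 = unsafe, 0 = safe) onto every state
    in that trajectory. Safe states inside an unsafe trajectory receive a wrong
    (noisy) label, which the downstream classifier has to tolerate."""
    states, labels = [], []
    for traj, y in zip(trajectories, feedback):
        for s in traj:
            states.append(s)
            labels.append(y)
    return np.array(states), np.array(labels)


def is_novel(traj, queried_trajs, threshold=1.0):
    """Query the evaluator only if the trajectory's mean state features are far
    from those of every trajectory already queried (a stand-in novelty measure)."""
    if not queried_trajs:
        return True
    mean_feat = traj.mean(axis=0)
    dists = [np.linalg.norm(mean_feat - q.mean(axis=0)) for q in queried_trajs]
    return min(dists) > threshold


# Toy rollouts: 20 trajectories of 30 two-dimensional states; odd-indexed ones
# are shifted away from the origin and stand in for unsafe behaviour.
rng = np.random.default_rng(0)
trajectories = [
    (3.0 if i % 2 else 0.0) + 0.5 * rng.normal(size=(30, 2)) for i in range(20)
]

queried, feedback = [], []
for traj in trajectories:
    if is_novel(traj, queried):  # ask the evaluator only about novel rollouts
        queried.append(traj)
        # Stand-in for human / system feedback: a rollout is "unsafe" if any
        # of its states strays far from the origin.
        feedback.append(int(np.any(np.linalg.norm(traj, axis=1) > 2.0)))

X, y = label_states_from_trajectory_feedback(queried, feedback)
cost_classifier = LogisticRegression().fit(X, y)  # P(state unsafe) as learned cost
print(f"queried {len(queried)} of {len(trajectories)} trajectories")
print("predicted unsafe fraction over queried states:",
      cost_classifier.predict(X).mean())

In a full pipeline, the classifier's unsafe-probability would then play the role of the per-state cost inside a constrained RL update (for example, a Lagrangian-style method); that part is omitted here.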

Bibliographic Details
Main Authors: CHIRRA, Shashank Reddy, VARAKANTHAM, Pradeep, PARUCHURI, Praveen
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Reinforcement Learning; Cost function; Machine learning; Feedback; Artificial Intelligence and Robotics; Computer Sciences
Online Access:https://ink.library.smu.edu.sg/sis_research/9968
https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10968
record_format dspace
spelling sg-smu-ink.sis_research-10968 2025-01-16T10:07:32Z Safety through feedback in constrained RL CHIRRA, Shashank Reddy VARAKANTHAM, Pradeep PARUCHURI, Praveen In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach. 2024-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9968 https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Reinforcement Learning Cost function Machine learning Feedback Artificial Intelligence and Robotics Computer Sciences
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Reinforcement Learning
Cost function
Machine learning
Feedback
Artificial Intelligence and Robotics
Computer Sciences
spellingShingle Reinforcement Learning
Cost function
Machine learning
Feedback
Artificial Intelligence and Robotics
Computer Sciences
CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
Safety through feedback in constrained RL
description In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach.
format text
author CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
author_facet CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
author_sort CHIRRA, Shashank Reddy
title Safety through feedback in constrained RL
title_short Safety through feedback in constrained RL
title_full Safety through feedback in constrained RL
title_fullStr Safety through feedback in constrained RL
title_full_unstemmed Safety through feedback in constrained RL
title_sort safety through feedback in constrained rl
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9968
https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf
_version_ 1821833222446645248