Safety through feedback in constrained RL

In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach.
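
To make the two mechanisms described in the abstract concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the feature-vector states, the binary safe/unsafe feedback signal, the mean-feature distance used as the novelty measure, the logistic-regression cost model, and every name in the snippet (label_states_from_trajectory_feedback, is_novel, cost_classifier) are assumptions made purely for illustration.

# Illustrative sketch only (not the paper's code): trajectory-level safety
# feedback is propagated to every state as a noisy label, a classifier is fit
# on those labels as a surrogate cost, and a simple novelty rule decides which
# trajectories to show the evaluator.
import numpy as np
from sklearn.linear_model import LogisticRegression


def label_states_from_trajectory_feedback(trajectories, feedback):
    """Copy each trajectory-level label (1 = unsafe, 0 = safe) onto every state
    in that trajectory. Safe states inside an unsafe trajectory receive a wrong
    (noisy) label, which the downstream classifier has to tolerate."""
    states, labels = [], []
    for traj, y in zip(trajectories, feedback):
        for s in traj:
            states.append(s)
            labels.append(y)
    return np.array(states), np.array(labels)


def is_novel(traj, queried_trajs, threshold=1.0):
    """Query the evaluator only if the trajectory's mean state features are far
    from those of every trajectory already queried (a stand-in novelty measure)."""
    if not queried_trajs:
        return True
    mean_feat = traj.mean(axis=0)
    dists = [np.linalg.norm(mean_feat - q.mean(axis=0)) for q in queried_trajs]
    return min(dists) > threshold


# Toy rollouts: 20 trajectories of 30 two-dimensional states; odd-indexed ones
# are shifted away from the origin and stand in for unsafe behaviour.
rng = np.random.default_rng(0)
trajectories = [
    (3.0 if i % 2 else 0.0) + 0.5 * rng.normal(size=(30, 2)) for i in range(20)
]

queried, feedback = [], []
for traj in trajectories:
    if is_novel(traj, queried):  # ask the evaluator only about novel rollouts
        queried.append(traj)
        # Stand-in for human / system feedback: a rollout is "unsafe" if any
        # of its states strays far from the origin.
        feedback.append(int(np.any(np.linalg.norm(traj, axis=1) > 2.0)))

X, y = label_states_from_trajectory_feedback(queried, feedback)
cost_classifier = LogisticRegression().fit(X, y)  # P(state unsafe) as learned cost
print(f"queried {len(queried)} of {len(trajectories)} trajectories")
print("predicted unsafe fraction over queried states:",
      cost_classifier.predict(X).mean())

In a full pipeline, the classifier's unsafe-probability would then play the role of the per-state cost inside a constrained RL update (for example, a Lagrangian-style method); that part is omitted here.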

Bibliographic Details
Main Authors: CHIRRA, Shashank Reddy, VARAKANTHAM, Pradeep, PARUCHURI, Praveen
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: Reinforcement Learning; Cost function; Machine learning; Feedback; Artificial Intelligence and Robotics; Computer Sciences
Online Access:https://ink.library.smu.edu.sg/sis_research/9968
https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10968
record_format dspace
spelling sg-smu-ink.sis_research-10968 2025-01-16T10:07:32Z Safety through feedback in constrained RL CHIRRA, Shashank Reddy VARAKANTHAM, Pradeep PARUCHURI, Praveen In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach. 2024-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9968 https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Reinforcement Learning Cost function Machine learning Feedback Artificial Intelligence and Robotics Computer Sciences
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Reinforcement Learning
Cost function
Machine learning
Feedback
Artificial Intelligence and Robotics
Computer Sciences
spellingShingle Reinforcement Learning
Cost function
Machine learning
Feedback
Artificial Intelligence and Robotics
Computer Sciences
CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
Safety through feedback in constrained RL
description In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel. We showcase the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach.
format text
author CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
author_facet CHIRRA, Shashank Reddy
VARAKANTHAM, Pradeep
PARUCHURI, Praveen
author_sort CHIRRA, Shashank Reddy
title Safety through feedback in constrained RL
title_short Safety through feedback in constrained RL
title_full Safety through feedback in constrained RL
title_fullStr Safety through feedback in constrained RL
title_full_unstemmed Safety through feedback in constrained RL
title_sort safety through feedback in constrained rl
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9968
https://ink.library.smu.edu.sg/context/sis_research/article/10968/viewcontent/2406.19626v22.pdf
_version_ 1821833222446645248