Safe reinforcement learning under temporal logic with reward design and quantum action selection

Mingyu Cai; Shaoping Xiao; Junchao Li; Zhen Kan

doi:10.1038/s41598-023-28582-4

Journal article

Open access

Peer reviewed

Scientific reports, Vol.13(1), 1925

02/02/2023

DOI: 10.1038/s41598-023-28582-4

PMCID: PMC9894922

PMID: 36732441

Files and links (1)

https://doi.org/10.1038/s41598-023-28582-4

Published (Version of record) Open Access

Abstract

This paper proposes an advanced Reinforcement Learning (RL) method that incorporates reward shaping, safety value functions, and a quantum action selection algorithm. The method is model-free and can synthesize a finite policy that maximizes the probability of satisfying a complex task. Although RL is a promising approach, it suffers from unsafe traps and sparse rewards and becomes impractical when applied to real-world problems. To improve safety during training, we introduce the concept of safety values, which results in a model-based adaptive scenario due to online updates of transition probabilities. High-level complex tasks, in turn, are usually formulated in formal languages such as Linear Temporal Logic (LTL). Another novelty of this work is the use of an Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) to represent an LTL formula. The obtained deterministic policy can generalize the tasks over infinite and finite horizons. We design an automaton-based reward, and the theoretical analysis shows that an agent can accomplish task specifications with the maximum probability by following the optimal policy. Furthermore, a reward shaping process is developed to avoid sparse rewards and enforce RL convergence while keeping the optimal policies invariant. In addition, inspired by quantum computing, we propose a quantum action selection algorithm that replaces the existing ε-greedy algorithm for balancing exploration and exploitation during learning. Simulations demonstrate that the proposed framework achieves good performance, dramatically reducing the number of visits to unsafe states while converging to optimal policies.
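
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard ε-greedy action selection. It is not the paper's quantum action selection algorithm, and it is not taken from the paper. The function name `epsilon_greedy`, its parameters, and the list-of-values representation are assumptions made for illustration only.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    Hypothetical helper for illustration: with probability `epsilon`
    a uniformly random action is chosen (exploration); otherwise one
    of the highest-valued actions is chosen (exploitation).
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: break ties uniformly among the maximizing actions.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)
```

With `epsilon` set to 0 the rule is purely greedy, and with `epsilon` set to 1 it is purely random. In practice ε is often decayed over training, but the schedule is a separate design choice that this sketch leaves out.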

Details

Title: Safe reinforcement learning under temporal logic with reward design and quantum action selection
Creators: Mingyu Cai - Lehigh University
Shaoping Xiao - Department of Mechanical Engineering, University of Iowa, 3131 Seamans Center, Iowa City, IA, 52242, USA. shaoping-xiao@uiowa.edu
Junchao Li - Department of Mechanical Engineering, University of Iowa, 3131 Seamans Center, Iowa City, IA, 52242, USA
Zhen Kan - Department of Automation, University of Science and Technology of China, 443 Huangshan Road, Hefei, 230026, Anhui, China
Resource Type: Journal article
Publication Details: Scientific reports, Vol.13(1), 1925
DOI: 10.1038/s41598-023-28582-4
PMID: 36732441
PMCID: PMC9894922
NLM abbreviation: Sci Rep
eISSN: 2045-2322
Grant note: ED#P116S210005 / US Department of Education
Language: English
Date published: 02/02/2023
Academic Unit: Iowa Technology Institute; Mechanical Engineering
Record Identifier: 9984364649902771
