Batch Policy Learning in Markov Decision Processes

Fri, 13 November 2020, 11:00am

Speaker: Zhengling Qi, The George Washington University

Abstract: In this talk, I will discuss the offline policy learning problem in infinite-horizon Markov Decision Processes. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator of the average reward and show that it achieves the statistical efficiency bound. The performance of the estimated policy is measured by the regret: the difference between the optimal average reward within the policy class and the average reward of the estimated policy. We establish a strong finite-sample regret guarantee, demonstrating that the proposed method can efficiently break the curse of horizon. The performance of the method is illustrated by simulation studies.
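As a rough illustration of the doubly robust idea mentioned in the abstract (a minimal sketch under assumed inputs, not the talk's actual estimator: the function name, the plug-in quantities, and the toy data are all hypothetical), a doubly robust average-reward estimator typically augments a preliminary average-reward estimate with a density-ratio-weighted temporal-difference correction, so that it remains consistent if either the density-ratio model or the relative value function model is correct:

```python
import numpy as np

def dr_average_reward(omega, rewards, q_sa, q_next, eta_init):
    """Sketch of a doubly robust estimate of the long-term average reward.

    omega    : assumed plug-in estimates of stationary density ratios w(s_i, a_i)
    rewards  : observed rewards r_i
    q_sa     : plug-in relative value function Q(s_i, a_i)
    q_next   : plug-in Q at the next state under the target policy
    eta_init : a preliminary plug-in estimate of the average reward

    Augments eta_init with a weighted temporal-difference correction; the
    correction vanishes when the plug-in (Q, eta) solves the average-reward
    Bellman equation.
    """
    td_residual = rewards - eta_init + q_next - q_sa
    return eta_init + np.mean(omega * td_residual)

# Toy check (synthetic numbers): with omega = 1 and a Q satisfying the
# average-reward Bellman equation, the TD residual is zero and eta_init
# is returned unchanged.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
eta = 0.5
q_sa = np.array([0.5, -0.5, 0.5, -0.5])
q_next = q_sa - (rewards - eta)  # constructed so the TD residual is zero
omega = np.ones(4)
print(dr_average_reward(omega, rewards, q_sa, q_next, eta))  # 0.5
```

In this sketch the double robustness comes from the additive structure: misspecifying omega only rescales a mean-zero residual, while misspecifying (Q, eta) is corrected by accurate weights.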
