SMS scnews item created by Tiangang Cui at Tue 12 Aug 2025 1827
Type: Seminar
Distribution: World
Expiry: 12 Aug 2026
Calendar1: 22 Aug 2025 1400-1500
CalLoc1: Chemistry Lecture Theatre 1
CalTitle1: Bayesian learning of the optimal action-value function in a Markov decision process
Auth: tcui@ptcui.pc (assumed)

Statistics Seminar

Bayesian learning of the optimal action-value function in a Markov decision process

Singh

The next statistics seminar will be presented by Prof Sumeet Singh from the University of Wollongong.

Title: Bayesian learning of the optimal action-value function in a Markov decision process
Speaker: Prof Sumeet Singh
Time and location: 2-3pm, Friday 22 August 2025, Chemistry Lecture Theatre 1 (F11.01.145) or via Zoom
Abstract:

The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of learning optimal decision-making strategies within it. In particular, a Bayesian framework is used to maintain beliefs about the optimal decisions and about the unknown ingredients of the model, such as the rewards and state dynamics, which must also be learned from the data.
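
As a notational sketch only (the symbols below are illustrative and not taken from the talk): writing theta for the unknown model ingredients (rewards and state dynamics) and Q* for the optimal action-value function, the Bayesian object of interest is the joint posterior

    p(Q*, theta | D)  proportional to  p(D | Q*, theta) p(Q*, theta),

so that beliefs about the optimal decisions and about the model are updated together as the data D accrue.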

We focus on infinite-horizon, undiscounted MDPs with finite state and action spaces and a terminal state. We provide a full Bayesian framework, from modelling to inference to decision-making. For modelling, we introduce a likelihood function with minimal assumptions for learning the optimal action-value function based on Bellman's optimality equations, analyse its properties, and clarify connections to existing work. For deterministic rewards the likelihood is degenerate, and we introduce artificial observation noise to relax it, in a controlled manner, so as to facilitate more efficient Monte Carlo-based inference. For inference, we propose an adaptive sequential Monte Carlo algorithm that both samples from and adjusts the sequence of relaxed posterior distributions. For decision-making, we choose actions using samples from the posterior distribution over the optimal strategies. While this approach is commonly used, we provide new insight showing clearly that it generalises Thompson sampling from multi-armed bandit problems. Finally, we evaluate our framework on the Deep-Sea benchmark problem and demonstrate the exploration benefits of posterior sampling in MDPs.
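
For orientation, here is a minimal sketch of the kind of construction the abstract describes; the Gaussian relaxation, the noise schedule, and all names below are illustrative assumptions rather than the talk's actual formulation. For an undiscounted MDP with a terminal state, Bellman's optimality equations read

    Q*(s,a) = r(s,a) + sum_{s'} P(s'|s,a) max_{a'} Q*(s',a'),   with Q*(terminal, .) = 0,

and one controlled relaxation treats the Bellman residuals as zero-valued observations under artificial Gaussian noise of scale sigma,

    p_sigma(Q)  proportional to  exp( -(1/(2 sigma^2)) sum_{(s,a)} [ Q(s,a) - r(s,a) - sum_{s'} P(s'|s,a) max_{a'} Q(s',a') ]^2 ),

with a sequential Monte Carlo sampler tracking the sequence of posteriors as sigma is decreased towards the degenerate (sigma -> 0) likelihood. The decision-making step then acts greedily with respect to a single posterior draw, which reduces exactly to Thompson sampling when the MDP is a one-step bandit. A hypothetical Python sketch of that action-selection rule (posterior_samples is assumed to be a list of Q tables, arrays of shape (n_states, n_actions), produced by the sampler):

    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(posterior_samples, state):
        # Draw one sample of Q* from the (approximate) posterior ...
        Q = posterior_samples[rng.integers(len(posterior_samples))]
        # ... and act greedily with respect to that single draw
        # (Thompson-style posterior sampling).
        return int(np.argmax(Q[state]))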

Joint work with Jiaqi Guo (University of Cambridge) and Chon Wai Ho (University of Cambridge).

