Stochastic Constrained Contextual Bandits via Lyapunov Optimization Based Estimation to Decision Framework

Hengquan, Guo; Liu, Xin

Session: 2C-1 (Bandits), Monday, Jul 01, 15:00-16:15

Abstract: This paper studies the problem of stochastic constrained contextual bandits (CCB) under general realizability condition where the expected rewards and costs are within general function classes. We propose LOE2D, a Lyapunov Optimization Based Estimation to Decision framework with online regression oracles for learning reward/constraint. LOE2D establishes $O(T^{3/4})$ regret and constraint violation, which can be further refined to $O(\min\{\sqrt{T}/\varepsilon^2,T^{{3}/{4}}\})$ when the Slater condition holds in the underlying offline problem with the Slater ``constant'' $\varepsilon=\Omega(T^{-1/8})$. These results improve \cite{SliSanFos_23} in two aspects: i) our results hold without any prior information while \cite{SliSanFos_23} requires the knowledge of Slater constant to design a proper learning rate; ii) our results hold when $\varepsilon=\Omega(T^{-1/8})$ while \cite{SliSanFos_23} requires a constant margin $\varepsilon=\Omega(1).$ These improvements stem from two novel techniques: violation-adaptive learning in E2D module and multi-step Lyapunov drift analysis in bounding constraint violation. The experiments further justify LOE2D outperforms the baseline algorithm.