This paper extends finite state and action space Markov Decision Process (MDP) models by introducing a new type of measurement for the outcomes of actions. The new measurement allows the decision maker to observe next-state transitions sequentially: the actions are ordered, and the transition outcome of the next action in the sequence is observed only if the current action is not chosen. The sequentially-observed MDP (SO-MDP) shares some properties with a standard MDP: among history-dependent policies, Markovian ones remain optimal. Thanks to the additional measurements, SO-MDP policies have the advantage of yielding better rewards than optimal standard MDP policies. Computing these policies, on the other hand, is more complex, and we present a linear-programming-based synthesis of optimal decision policies for finite-horizon SO-MDPs. A simulation example with multiple autonomous agents demonstrates the SO-MDP model and the proposed policy synthesis method.