In this paper, we investigate the conditions under which dynamic programming yields a solution to simultaneous learning and optimal control of a Markov decision process. First, we introduce a new optimality criterion that allows act-state dependence. This criterion is based on a partial preference ordering induced by an imprecise probability model of the system dynamics, updated with observations of the state and control history. Then, we show that dynamic programming yields the set of all optimal solutions if the imprecise probability model satisfies certain properties. When we model learning of the system dynamics with an imprecise Dirichlet model, these properties turn out to be satisfied.
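As a rough numerical illustration of the two ingredients named above, the Python sketch below computes the probability intervals that an imprecise Dirichlet model assigns to observed transition counts, and performs a single backward-induction step that keeps a *set* of undominated actions rather than a single optimum. This is not the paper's construction: interval dominance stands in here for the maximality criterion induced by the partial preference ordering, and all names (`idm_bounds`, `undominated_actions`) and parameter choices (prior strength `s = 2`, discount `gamma = 0.9`) are hypothetical.

```python
import numpy as np

def idm_bounds(counts, s=2.0):
    """IDM probability intervals for one (state, action) pair:
    lower = n / (N + s), upper = (n + s) / (N + s)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    return counts / (N + s), (counts + s) / (N + s)

def lower_expectation(f, lo, up):
    """Minimise f @ p over {p : lo <= p <= up, sum(p) = 1} by greedily
    assigning the free probability mass to the worst outcomes first."""
    p = lo.copy()
    slack = 1.0 - p.sum()
    for i in np.argsort(f):
        add = min(up[i] - p[i], slack)
        p[i] += add
        slack -= add
    return float(f @ p)

def upper_expectation(f, lo, up):
    """Conjugate upper expectation: max E[f] = -min E[-f]."""
    return -lower_expectation(-f, lo, up)

def undominated_actions(counts_per_action, rewards, V, gamma=0.9, s=2.0):
    """One dynamic-programming step at a fixed state: compute a value
    interval per action, then keep every action whose upper value is not
    strictly below some other action's lower value (interval dominance)."""
    intervals = []
    for counts, r in zip(counts_per_action, rewards):
        lo, up = idm_bounds(counts, s)
        intervals.append((r + gamma * lower_expectation(V, lo, up),
                          r + gamma * upper_expectation(V, lo, up)))
    best_low = max(low for low, _ in intervals)
    keep = [a for a, (_, high) in enumerate(intervals) if high >= best_low]
    return intervals, keep

# Two actions, three successor states, a handful of observed transitions each.
intervals, keep = undominated_actions(
    counts_per_action=[[4, 1, 0], [1, 1, 1]],
    rewards=[0.0, 0.2],
    V=np.array([1.0, 0.0, -1.0]),
)
print(intervals, keep)
```

The greedy mass-shifting in `lower_expectation` solves this small linear program exactly, because the credal set is cut out by box constraints on each probability plus the simplex constraint; as more transitions are observed, the intervals tighten and the set of undominated actions shrinks, which is the learning behaviour the abstract alludes to.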
Financed by the National Centre for Research and Development under grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program
SYNAT, “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.