In this paper, we investigate the conditions under which dynamic programming yields a solution to simultaneous learning and optimal control of a Markov decision process. First, we introduce a new optimality criterion that allows act-state dependence. This criterion is based on a partial preference ordering induced by an imprecise probability model of the dynamics of the system, updated by observations...
Financed by the National Centre for Research and Development, grant No. SP/I/1/77065/10, within the strategic scientific research and experimental development program
SYNAT - “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.