
Elicitation and planning in Markov decision processes with unknown rewards

Abstract: Markov decision processes (MDPs) are models for solving sequential decision problems in which a user interacts with the environment and adapts her policy by taking numerical reward signals into account. Solving an MDP amounts to formulating the user's behavior in the environment as a policy function that specifies which action to choose in each situation. In many real-world decision problems, users have varying preferences; the gain of taking an action in a state therefore differs between users and must be re-estimated for each of them. In this dissertation, we are interested in solving MDPs for users with different preferences.

We use a model named vector-valued MDP (VMDP), whose rewards are vectors. We propose a propagation-search algorithm that assigns a vector-valued function to each policy and identifies each user with a preference vector, over the existing set of preferences, that satisfies the user's priorities. Since the user's preference vector is not known, we present several methods for solving VMDPs while approximating it.

We introduce two algorithms that reduce the number of queries needed to find a user's optimal policy: 1) a propagation-search algorithm, which propagates a set of possibly optimal policies for the given MDP without knowing the user's preferences; 2) an interactive value iteration (IVI) algorithm on VMDPs, namely the Advantage-Based Value Iteration (ABVI) algorithm, which uses clustering and regrouping of advantages. We also demonstrate how the ABVI algorithm behaves for two different types of users: confident and uncertain.

Finally, we study minimax-regret approximation as a method for finding the optimal policy with respect to the limited information available about the user's preferences: each objective in the system is only known to lie between a lower and an upper bound, while the system is unaware of the user's preferences among the objectives. We propose a heuristic minimax-regret approximation method for solving MDPs with unknown rewards that is faster and less complex than the existing methods in the literature.
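The VMDP setting described above can be illustrated with a minimal sketch: once a user's preference vector w is known (or estimated), the d-dimensional rewards can be scalarized by w and the MDP solved with standard value iteration. This toy example is our own illustration, not the author's code; the transition model, reward numbers, and function name are hypothetical, and the elicitation algorithms (propagation-search, ABVI) from the thesis are not reproduced here.

```python
import numpy as np

def scalarized_value_iteration(P, R, w, gamma=0.9, eps=1e-6):
    """Value iteration on a vector-valued MDP after scalarizing the
    d-dimensional rewards with a user preference vector w.

    P: transition tensor, shape (A, S, S) -- P[a, s, s'] = Pr(s' | s, a)
    R: vector rewards,    shape (A, S, d) -- one d-dim reward per (a, s)
    w: preference vector, shape (d,), non-negative, summing to 1
    Returns the converged value function V and a greedy policy.
    """
    r = R @ w                       # scalarized rewards, shape (A, S)
    V = np.zeros(P.shape[1])
    while True:
        Q = r + gamma * (P @ V)     # Bellman backup, shape (A, S)
        V_new = Q.max(axis=0)       # best action value in each state
        if np.abs(V_new - V).max() < eps:
            break
        V = V_new
    return V, Q.argmax(axis=0)

# Tiny 2-state, 2-action, 2-objective example (illustrative numbers):
# each action favors one of the two objectives.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
V, policy = scalarized_value_iteration(P, R, w=np.array([0.7, 0.3]))
```

When w is unknown, as in the thesis, the elicitation problem is precisely to choose preference queries so that this scalarized solution can be recovered with as few interactions as possible.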

Cited literature: 57 references
Submitted on : Friday, February 9, 2018 - 9:04:09 AM
Last modification on : Saturday, February 15, 2020 - 2:04:07 AM
Long-term archiving on: Friday, May 4, 2018 - 1:16:22 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01705061, version 1



Pegah Alizadeh. Elicitation and planning in Markov decision processes with unknown rewards. Computers and Society [cs.CY]. Université Sorbonne Paris Cité, 2016. English. ⟨NNT : 2016USPCD011⟩. ⟨tel-01705061⟩


