Question: A Bandit Model: Suppose there are two projects available for selection in each of three periods:
- Project 1 yields a reward of one unit and always occupies state s.
- Project 2 occupies either state u or state t.
- Project 2 selected in state u yields a reward of 2 and, at the next decision epoch, moves to state t with probability 0.5 or remains in state u with probability 0.5.
- Project 2 selected in state t yields a reward of 0 and moves to state u at the next decision epoch with probability 1.
Assume a terminal reward of 0, and that project 2 does not change state when it is not selected.
Using the backward induction method, determine a strategy that maximizes the expected total reward. This question has bonus points. See the grade distribution below:
- Description of Markov Decision Process (2 points)
- Description of Reward and Transition Probability Matrices (2 points)
- Backward Induction Step 1 and Step 2 (3 points)
- Bonus: Finding the strategy that maximizes the expected total reward using Backward Induction will give you
Step by Step Solution
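One way to check a hand calculation is a minimal Python sketch of backward induction for this model. It rests on an interpretation that the question does not spell out: the system state is taken to be the state of project 2 alone (project 1 always occupies s), decisions are made at epochs 1 to 3, and the terminal reward of 0 applies at epoch 4. All names (`STATES`, `REWARD`, `TRANSITION`, `backward_induction`) are illustrative choices, not part of the original problem.

```python
# Sketch only: the state is project 2's state; project 1 always sits in s.
STATES = ("u", "t")
ACTIONS = (1, 2)   # index of the project selected
HORIZON = 3        # decision epochs 1..3, terminal epoch 4 (assumed)

# Immediate reward r(state, action), taken from the problem statement.
REWARD = {("u", 1): 1.0, ("t", 1): 1.0, ("u", 2): 2.0, ("t", 2): 0.0}

# Transition probabilities p(next_state | state, action).
# Selecting project 1 leaves project 2's state unchanged.
TRANSITION = {
    ("u", 1): {"u": 1.0},
    ("t", 1): {"t": 1.0},
    ("u", 2): {"u": 0.5, "t": 0.5},
    ("t", 2): {"u": 1.0},
}


def backward_induction(horizon=HORIZON):
    """Return (values, policy) indexed by decision epoch.

    values[n][s] is the optimal expected total reward from epoch n in state s.
    policy[n][s] lists every action attaining that value (ties are kept).
    """
    value = {s: 0.0 for s in STATES}          # terminal reward of 0
    values = {horizon + 1: dict(value)}
    policy = {}
    for epoch in range(horizon, 0, -1):       # epochs 3, 2, 1
        new_value, policy[epoch] = {}, {}
        for s in STATES:
            # Reward now plus expected value of the next state, per action.
            q = {
                a: REWARD[s, a]
                + sum(p * value[nxt] for nxt, p in TRANSITION[s, a].items())
                for a in ACTIONS
            }
            best = max(q.values())
            new_value[s] = best
            policy[epoch][s] = [a for a in ACTIONS if q[a] == best]
        value = new_value
        values[epoch] = dict(value)
    return values, policy


if __name__ == "__main__":
    values, policy = backward_induction()
    for epoch in sorted(policy):
        print(epoch, values[epoch], policy[epoch])
```

Under these assumptions the sketch gives first-epoch values of 4.75 in state u and 3.5 in state t. It selects project 2 in both states at epoch 1. At epoch 2 it selects project 2 in state u and is indifferent in state t. At epoch 3 it selects project 2 in state u and project 1 in state t.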
