Problem 3. (50pt) Consider an infinite horizon MDP, characterized by $M = (S, A, r, p, \gamma)$ and...
Question:
Problem 3. (50pt) Consider an infinite horizon MDP, characterized by $M = (S, A, r, p, \gamma)$ and $r : S \times A \to [0,1]$. We would like to evaluate the value of a Markov stationary policy $\pi : S \to \Delta(A)$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decided to use a model-based approach where we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_{k+1}) \in S \times A \times S$ for $k = 0, 1, \ldots$. Let $\hat{p}$ be our estimate of $p$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP is $\hat{M} = (S, A, r, \hat{p}, \gamma)$ and obtain $\hat{v}^{\pi}$.

Prove the simulation lemma bounding the difference between $\hat{v}^{\pi}$ and the true value of the policy, denoted by $v^{\pi}$, by showing that

$$
\left| \hat{v}^{\pi}(s_0) - v^{\pi}(s_0) \right| \le \frac{\gamma}{(1-\gamma)^2} \, \mathbb{E}_{s \sim d^{\pi}_{s_0}} \mathbb{E}_{a \sim \pi(s)} \left\| \hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \right\|_1,
$$

where $s_0$ is the initial state and $d^{\pi}_{s_0}$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $|\hat{v}^{\pi}(s_0) - v^{\pi}(s_0)|$ gets smaller with the smaller model approximation error $\|\hat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)\|_1$. However, the impact of model approximation error gets larger as $\gamma \to 1$, since the approximation error propagates across more stages.
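As a sanity check rather than a proof, the bound can be tested numerically on a small tabular MDP. The sketch below assumes the constant in the bound reads $\gamma/(1-\gamma)^2$ and that the visitation distribution is normalized, i.e. $d^{\pi}_{s_0} = (1-\gamma)\, e_{s_0}^{\top} (I - \gamma P^{\pi})^{-1}$ under the true kernel $p$. It relies on the identity $\hat{v}^{\pi} - v^{\pi} = \gamma (I - \gamma P^{\pi})^{-1} (\hat{P}^{\pi} - P^{\pi}) \hat{v}^{\pi}$, which follows from subtracting the two Bellman equations. The function names `policy_value` and `simulation_gap_and_bound`, and the array layout `p[s, a, s']`, are illustrative choices and not part of the problem.

```python
import numpy as np

def policy_value(p, r, pi, gamma):
    """Solve v = r_pi + gamma * P_pi v exactly for a tabular MDP.

    p:  transition kernel, shape (S, A, S), p[s, a, t] = P(t | s, a)
    r:  rewards in [0, 1], shape (S, A)
    pi: stationary policy, shape (S, A), rows sum to 1
    """
    n_states = r.shape[0]
    r_pi = np.einsum("sa,sa->s", pi, r)      # expected one-step reward under pi
    P_pi = np.einsum("sa,sat->st", pi, p)    # state-to-state kernel under pi
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v, P_pi

def simulation_gap_and_bound(p, p_hat, r, pi, gamma, s0):
    """Return (|v_hat(s0) - v(s0)|, assumed simulation-lemma bound, visitation d)."""
    n_states = r.shape[0]
    v, P_pi = policy_value(p, r, pi, gamma)
    v_hat, _ = policy_value(p_hat, r, pi, gamma)

    # Normalized discounted visitation from s0 under the TRUE kernel:
    # d^T = (1 - gamma) * e_{s0}^T (I - gamma P_pi)^{-1}
    e_s0 = np.eye(n_states)[s0]
    d = (1 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, e_s0)

    # L1 model error per (s, a), then averaged over s ~ d and a ~ pi(s)
    l1_error = np.abs(p_hat - p).sum(axis=2)
    expected_l1 = np.einsum("s,sa,sa->", d, pi, l1_error)

    gap = abs(v_hat[s0] - v[s0])
    bound = gamma / (1 - gamma) ** 2 * expected_l1
    return gap, bound, d

def random_mdp(n_states, n_actions, seed):
    """Draw a random kernel, reward table, and policy (illustrative data only)."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    r = rng.uniform(size=(n_states, n_actions))
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    # A perturbed kernel standing in for the estimate p_hat
    noise = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    p_hat = 0.9 * p + 0.1 * noise
    return p, p_hat, r, pi

if __name__ == "__main__":
    p, p_hat, r, pi = random_mdp(n_states=5, n_actions=3, seed=0)
    gap, bound, _ = simulation_gap_and_bound(p, p_hat, r, pi, gamma=0.9, s0=0)
    print(f"gap = {gap:.4f}, bound = {bound:.4f}")
```

Rerunning with `gamma` closer to 1 shows the bound growing like $1/(1-\gamma)^2$ for a fixed model error, which matches the remark about error propagating across stages.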