Question: Problem 3. (50 pt) Consider an infinite horizon MDP characterized by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ with reward function $r: \mathcal{S} \times \mathcal{A} \to [0, 1]$. We would like to evaluate the value of a Markov stationary policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $P$. Rather than applying a model-free approach, we decided to use a model-based approach where we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_t, a_t, s_{t+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $t = 0, 1, \dots$. Let $\widehat{P}$ be our estimate of $P$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP were $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \widehat{P}, r, \gamma)$ and obtain $\widehat{V}^{\pi}$.
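(Not part of the original statement, but as a concrete illustration of this model-based pipeline, here is a minimal sketch on a small tabular MDP. Everything in it is an assumption made for illustration: the random 5-state MDP, the uniform exploration policy, and the helper names `estimate_kernel` and `policy_evaluation`. The empirical kernel is the usual count-based estimate, and $\widehat{V}^{\pi}$ is computed by iterative policy evaluation on $\widehat{\mathcal{M}}$.)

```python
import numpy as np

# Hypothetical tabular sizes; the problem does not fix |S|, |A|, or gamma.
n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)

# Ground-truth kernel P[s, a] is a distribution over next states (unknown to the agent).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))   # rewards in [0, 1]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)    # policy to evaluate, pi[s] in Delta(A)

def estimate_kernel(P, n_steps=50_000):
    """Follow a uniform (fully stochastic) policy and build the count-based estimate P_hat."""
    counts = np.zeros((n_states, n_actions, n_states))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(n_actions)                       # uniform exploration policy
        s_next = rng.choice(n_states, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next
    totals = counts.sum(axis=2, keepdims=True)
    # Fall back to a uniform distribution for any (s, a) pair that was never visited.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def policy_evaluation(P_model, tol=1e-10):
    """Iteratively apply the Bellman operator for pi under the given transition model."""
    V = np.zeros(n_states)
    while True:
        Q = r + gamma * P_model @ V                       # Q[s, a]
        V_new = (pi * Q).sum(axis=1)                      # average over a ~ pi(.|s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

P_hat = estimate_kernel(P)
V_true, V_hat = policy_evaluation(P), policy_evaluation(P_hat)
print("|V^pi(s0) - V_hat^pi(s0)| =", abs(V_true[0] - V_hat[0]))
```

With more exploration steps the empirical kernel concentrates around $P$ and the printed gap shrinks, which is exactly the behavior quantified by the lemma stated next.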
Prove the simulation lemma bounding the difference between $\widehat{V}^{\pi}$ and the true value of the policy, denoted by $V^{\pi}$, by showing that
$$\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big| \;\le\; \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}} \Big[ \big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_1 \Big],$$
where $s_0$ is the initial state and $d^{\pi}_{s_0}$ is the discounted state visitation distribution under policy $\pi$ (with $a \sim \pi(\cdot \mid s)$ in the expectation). Note that the difference $\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big|$ gets smaller as the model approximation error $\big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_1$ gets smaller. However, the impact of the model approximation error gets larger as $\gamma \approx 1$, since the approximation error propagates across more stages.
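(Not part of the original statement, but for reference, one standard route to this bound is sketched below. It is only a sketch: it uses the convention $P^{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$, reads $(s,a) \sim d^{\pi}_{s_0}$ as $s \sim d^{\pi}_{s_0}$, $a \sim \pi(\cdot \mid s)$, and relies on the identity $e_{s_0}^{\top}(I - \gamma P^{\pi})^{-1} = \tfrac{1}{1-\gamma}\, d^{\pi\,\top}_{s_0}$, which follows from the definition of the discounted visitation distribution.)

```latex
% Proof sketch (a standard decomposition; the details are what the assignment asks you to fill in).
\begin{align*}
\big| V^{\pi}(s_0) - \widehat{V}^{\pi}(s_0) \big|
  &= \Big| \gamma\, e_{s_0}^{\top} (I - \gamma P^{\pi})^{-1}
      \big( P^{\pi} - \widehat{P}^{\pi} \big) \widehat{V}^{\pi} \Big|
      && \text{(subtract the two Bellman equations, rearrange)} \\
  &\le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{s \sim d^{\pi}_{s_0}}
      \Big[ \big| \big( (P^{\pi} - \widehat{P}^{\pi}) \widehat{V}^{\pi} \big)(s) \big| \Big]
      && \text{(triangle inequality, visitation identity)} \\
  &\le \frac{\gamma}{1-\gamma}\,
      \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}}
      \Big[ \big\| P(\cdot \mid s, a) - \widehat{P}(\cdot \mid s, a) \big\|_{1} \Big]
      \, \big\| \widehat{V}^{\pi} \big\|_{\infty}
      && \text{(expand over } a \sim \pi(\cdot \mid s)\text{, H\"older)} \\
  &\le \frac{\gamma}{(1-\gamma)^{2}}\,
      \mathbb{E}_{(s,a) \sim d^{\pi}_{s_0}}
      \Big[ \big\| \widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a) \big\|_{1} \Big].
      && \text{(since } 0 \le \widehat{V}^{\pi} \le \tfrac{1}{1-\gamma}\text{)}
\end{align*}
```

The last step uses $\|\widehat{V}^{\pi}\|_{\infty} \le \tfrac{1}{1-\gamma}$, which holds because $r \in [0,1]$; this horizon factor, combined with the $\tfrac{1}{1-\gamma}$ from the visitation identity, is where the $(1-\gamma)^{-2}$ dependence highlighted in the problem comes from.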
