Q5 Model-Free RL: Cycle (12 Points)
We recommend you work out the solutions to the following questions on a sheet of scratch paper, and then enter your results into the answer boxes.
Consider an MDP with 3 states, A, B, and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences as it interacts with the environment (we do know, however, that taking an action never leaves the agent in the same state). In this problem, instead of first estimating the transition and reward functions, we will estimate the Q function directly using Q-learning.
Assume the discount factor γ is 0.5 and the Q-learning step size α is 0.5.
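For reference, the per-sample update used in standard Q-learning is:

    Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]

Note that samples are processed in order, so later updates see the results of earlier ones.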
Our current Q function, Q(s, a), is as follows:

                       A        B        C
    Clockwise          1.501   -0.451    2.73
    Counterclockwise   3.153   -6.055    2.133
The agent encounters the following samples:

    s    a                  s'   r
    A    Counterclockwise   C    8.0
    C    Counterclockwise   A    0.0
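As a sanity check on your scratch-paper arithmetic, here is a minimal Python sketch of the two tabular Q-learning updates above; the dictionary layout and action-name strings are illustrative choices, not part of the problem.

    # Minimal sketch of the two tabular Q-learning updates for this problem.
    gamma, alpha = 0.5, 0.5
    actions = ('Clockwise', 'Counterclockwise')

    # Current Q-values from the table above.
    Q = {
        ('A', 'Clockwise'): 1.501,  ('A', 'Counterclockwise'): 3.153,
        ('B', 'Clockwise'): -0.451, ('B', 'Counterclockwise'): -6.055,
        ('C', 'Clockwise'): 2.73,   ('C', 'Counterclockwise'): 2.133,
    }

    # Samples as (s, a, s', r), processed in the order given.
    samples = [
        ('A', 'Counterclockwise', 'C', 8.0),
        ('C', 'Counterclockwise', 'A', 0.0),
    ]

    for s, a, s_next, r in samples:
        # Sample-based target: r + gamma * max_a' Q(s', a')
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        # Running-average update with step size alpha
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

    for (s, a), q in sorted(Q.items()):
        print(f"Q({s}, {a}) = {q}")

Because the samples are processed sequentially, the second update's max over Q(A, ·) uses the Q(A, Counterclockwise) value already revised by the first sample.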
Process the samples given above, then fill in the Q-values below after both samples have been accounted for.
Q(A, clockwise) Enter your answer here
Q(A, counterclockwise) Enter your answer here
Q(B, clockwise) Enter your answer here
Q(B, counterclockwise) Enter your answer here
Q(C, clockwise) Enter your answer here
Q(C, counterclockwise) Enter your answer here