In this project, you will work towards constructing an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: Safety and Reliability. A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable. Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.
Safety and Reliability are measured using a letter-grade system as follows:
Grade | Safety | Reliability
---|---|---
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips.
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips.
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips.
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips.
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips.
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips.
To assist evaluating these important metrics, you will need to load visualization code that will be used later on in the project. Run the code cell below to import this code which is required for your analysis.
# Import the visualization code
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
Before starting to work on implementing your driving agent, it's necessary to first understand the world (environment) the Smartcab and driving agent operate in. One of the major components of building a self-learning agent is understanding the characteristics of the agent, including how it operates. To begin, simply run the agent.py agent code exactly as it is -- no need to make any additions whatsoever. Let the resulting simulation run for some time to see the various working components. Note that in the visual simulation (if enabled), the white vehicle is the Smartcab.
In a few sentences, describe what you observe during the simulation when running the default agent.py agent code. Some things you could consider:
Hint: From the /smartcab/ top-level directory (where this notebook is located), run the command 'python smartcab/agent.py'.
Answer:
The Smartcab remains idle throughout the entire simulation. The agent sits stopped at a light while the simulation cycles between red and green lights; the default code never moves the Smartcab.
Because the cab is waiting at a light, rewards are given based on whether idling is the correct action for the current light. There are also other cars and pedestrians moving while the Smartcab remains idle, and the cab is rewarded according to the safety and reliability criteria outlined above.
When the light turns green, the cab is expected to move according to the traffic laws, so the Smartcab receives a negative reward if idling is not the correct response to a green light. It receives a positive reward when the light is red and idling is the correct response according to the rules set in the simulation.
In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work. In the /smartcab/ top-level directory, there are two folders: /logs/ (which will be used later) and /smartcab/. Open the /smartcab/ folder and explore each Python file included, then answer the following question.
- In the agent.py Python file, choose three flags that can be set and explain how they change the simulation.
- In the environment.py Python file, what Environment class function is called when an agent performs an action?
- In the simulator.py Python file, what is the difference between the 'render_text()' function and the 'render()' function?
- In the planner.py Python file, will the 'next_waypoint()' function consider the North-South or East-West direction first?

Answer:
num_dummies
The num_dummies flag changes the number of dummy agents in the environment, where the default is 100. This can change the safety and reliability measurements because there is a positive correlation between the number of dummy agents and the chance of accidents. With a high number, the Smartcab is more prone to getting into an accident, while a low number might not provide enough learning opportunities in the simulation to reinforce the desired outcomes.
enforce_deadline
The enforce_deadline flag sets a timer for the Smartcab to reach the destination. The timer can then be analyzed against different parameters within our simulation to achieve the desired outcomes. If we want to focus on reliability, it is essential that the Smartcab reaches the destination on time.
n_test
The n_test flag sets the number of testing trials to perform, where the default is 0. This can be used to test our Smartcab once it has been trained in different environments so we can analyze the outcomes. We can adjust the reinforcement weights, number of dummy agents, epsilon, alpha, and tolerance to tune the agent and environment to produce an optimized outcome based on the tests.
Within the Environment class, the act() function is called when an agent performs an action. The function first considers the action and evaluates it against the traffic laws, then performs the called action. Once the action is complete, the agent receives a reward for it. This function takes three arguments: self, agent, and action, where valid actions are None, 'forward', 'left', or 'right'.
There are two types of output that make the simulation user friendly: the render() function, which uses the pygame library to create a graphical user interface for the simulation, and the render_text() function, which is used for console output. The render_text() function prints updates about the environment and the rewards at every step, while the render() function presents the same information through a graphical user interface (GUI).
The planner.py file represents the environment as a Cartesian grid with x and y coordinates. Within the file, the next_waypoint() function uses the x and y axes, which can be mapped to the East-West and North-South directions. The function first determines the difference between the current location and the destination, then sets flags indicating whether the x and y coordinates differ. The x coordinate is checked first (line 39), which corresponds to the East-West direction, while the y coordinate is checked afterwards (line 59), which corresponds to the North-South direction.
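A simplified illustration of that ordering is shown below; this is not the actual planner.py code (which returns waypoints relative to the agent's heading), just a sketch of which axis is resolved first.

```python
def axis_check_order(location, destination):
    """Illustrate which axis the planner resolves first (sketch only)."""
    dx = destination[0] - location[0]  # East-West difference, checked first
    dy = destination[1] - location[1]  # North-South difference, checked second
    if dx != 0:
        return 'resolve East-West movement first'
    if dy != 0:
        return 'then resolve North-South movement'
    return 'already at the destination'
```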
The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'left' (turn left), 'right' (turn right), or 'forward' (go forward). For your first implementation, navigate to the 'choose_action()' agent function and make the driving agent randomly choose one of these actions. Note that you have access to several class variables that will help you write this functionality, such as 'self.learning' and 'self.valid_actions'. Once implemented, run the agent file and simulation briefly to confirm that your driving agent is taking a random action each time step.
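A minimal sketch of this random behaviour, using the 'self.valid_actions' attribute mentioned above (only the no-learning case of 'choose_action()' is shown):

```python
import random

def choose_action(self):
    """Return a random valid action while the agent is not learning (sketch)."""
    # self.valid_actions is assumed to hold [None, 'forward', 'left', 'right'].
    return random.choice(self.valid_actions)
```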
To obtain results from the initial simulation, you will need to adjust the following flags:
- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
- 'n_test' - Set this to '10' to perform 10 testing trials.

Optionally, you may disable the visual simulation (which can make the trials go faster) by setting the 'display' flag to False (a sketch of where these flags are set in agent.py follows below).
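As a reference, these flags are typically passed where the environment, agent, and simulator are constructed in agent.py's run() function. The sketch below assumes constructor and method names consistent with the project files named earlier (environment.py, simulator.py); double-check the exact signatures against your own copy of agent.py.

```python
from environment import Environment   # module names as listed in /smartcab/
from simulator import Simulator

def run():
    """Hedged sketch of wiring the flags above into a simulation run."""
    env = Environment(verbose=False)                          # the driving environment
    agent = env.create_agent(LearningAgent, learning=False)   # LearningAgent defined in agent.py; signature assumed
    env.set_primary_agent(agent, enforce_deadline=True)       # capture on-time arrivals

    sim = Simulator(env,
                    update_delay=0.01,   # small delay between steps
                    log_metrics=True,    # write results to /logs/ as a .csv file
                    display=False)       # optionally disable the visual simulation
    sim.run(n_test=10)                   # perform 10 testing trials
```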
Flags that have been set here should be returned to their default
setting when debugging. It is important that you understand what each
flag does and how it affects the simulation!
Once you have successfully completed the initial simulation (there should have been 20 training trials and 10 testing trials), run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with which log file is being loaded! Run the agent.py file (after setting the flags) from the projects/smartcab folder rather than from projects/smartcab/smartcab.
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')
Using the visualization above that was produced from your initial simulation, provide an analysis and make several observations about the driving agent. Be sure that you are making at least one observation about each panel present in the visualization. Some things you could consider:
Answer:
Based on the panel named "10-Trial Rolling Relative Frequency of Bad Actions", the relative frequency of total bad actions neared 0.41, or 41%, after 20 trials. This can be explained by the fact that learning was turned off for this agent, so bad decisions were never negatively reinforced. The relative frequency of accidents was around 0.05, or 5%, after 20 trials; the majority of accidents were major, and the frequency of minor accidents was low.
The rates of reliability are reasonable given that the agent was driving randomly. This can also be seen in the relative frequencies showing no slope across the trials. Without learning, the agent maintains the same model, so the rate of reliability will not change.
Due to the high rate of accidents, this agent received a negative average reward. The agent was penalized throughout the trials because it was unable to update its model through learning. It continued to commit a high frequency of Total Bad Actions, and the rewards suggest it was penalized with correspondingly large magnitude.
The outcomes do not change as the number of trials increases because the agent does not learn. The model the agent uses is constant throughout the trials, and the environment is also constant. With a constant model and environment, we can expect no change in reliability or safety outcomes.
No, the current Smartcab without learning cannot be considered safe or reliable for any passenger. Because the model does not learn, it follows the same rules and methods, which have resulted in an F safety and reliability rating. The agent might be able to improve, but without a learning model it has no way of changing during the simulation.
The second step to creating an optimized Q-Learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state. The condition of 'if state then action' for each state is called a policy, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about!
Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:
- 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
- 'inputs', which is the sensor data from the Smartcab. It includes
  - 'light', the color of the light.
  - 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
  - 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
  - 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
- 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Which features available to the agent are most relevant for learning both safety and efficiency? Why are these features appropriate for modeling the Smartcab in the environment? If you did not choose some features, why are those features not appropriate? Please note that whatever features you eventually choose for your agent's state must be argued for here. That is: your code in agent.py should reflect the features chosen in this answer.
NOTE: You are not allowed to engineer new features for the smartcab.
Answer:
The features most relevant for safety are the sensor inputs: 'light', 'left', 'oncoming', and potentially 'right'. These inform the agent about the current state of the intersection and the other vehicles around it. The 'light' and 'oncoming' features are relevant to safety at intersections and traffic stops, while the 'left' and 'right' features matter most when the vehicle is about to move. The 'right' feature was ultimately not included because, under the environment's traffic laws, traffic approaching from the right never makes one of the Smartcab's moves invalid, even at a red light.
The feature most relevant for efficiency is 'waypoint'. This feature lets the agent steer the Smartcab toward the destination and allows reinforcement learning for efficiency to occur: the model is negatively or positively reinforced relative to the waypoint target. The waypoint, in combination with a deadline, is what lets the agent act in the intended way.
I chose to include all features in my state except 'right' and 'deadline', both because of relevance and to preserve a reasonable state space. The 'deadline' feature has far more possible values than any other feature, so it would inflate the state space disproportionately. By not considering 'deadline', we significantly reduce the state space while keeping the relevant features. The 'right' feature is not relevant under the traffic laws and was therefore not included in the Q-Learning model for the Smartcab.
When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. That is to say, if you expect the driving agent to learn a policy for each state, you would need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the Smartcab:
('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day')
.
How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')
? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!
If a state is defined using the features you've selected from Question 4,
what would be the size of the state space? Given what you know about
the environment and how it is simulated, do you think the driving agent
could learn a policy for each possible state within a reasonable number
of training trials?
Hint: Consider the combinations of features to calculate the total number of states!
Answer:
The number of possible states is the count of all possible combinations of the chosen features. I chose 4 features to balance safety, reliability, and simulation outcomes. Of the four, two are vehicle-intent features ('left' and 'oncoming') with 4 possible values each, 'light' is binary (red or green), and 'waypoint' has 3 possible directional values. The total number of combinations is 4 x 4 x 2 x 3 = 96. If a trial averages 20 steps, then 20 training trials provide roughly 400 steps to learn from, so the 96 possible states amount to about 24% of the training steps seen. As we will see in the upcoming sections, the final Q-Learning model needs around 300 trials before reaching the tolerance threshold, at which point the state space amounts to only about 2% of the training steps (again assuming an average of 20 steps per trial), so learning a policy for each state is reasonable.
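The arithmetic can be checked directly (value counts as argued above):

```python
# Feature value counts assumed from the environment description above.
waypoint_values = 3        # 'forward', 'left', 'right'
light_values = 2           # 'red', 'green'
vehicle_intent_values = 4  # None, 'forward', 'left', 'right'

# 'left' and 'oncoming' each take one of the 4 intent values.
state_space = waypoint_values * light_values * vehicle_intent_values ** 2
print(state_space)  # 96
```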
For your second implementation, navigate to the 'build_state()'
agent function. With the justification you've provided in Question 4, you will now set the 'state'
variable to a tuple of all the features necessary for Q-Learning.
Confirm your driving agent is updating its state by running the agent
file and simulation briefly and note whether the state is displaying. If
the visual simulation is used, confirm that the updated state
corresponds with what is seen in the simulation.
Note: Remember to reset simulation flags to their default setting when making this observation!
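A minimal sketch of what 'build_state()' might return for the features argued in Question 4 (waypoint, light, left, oncoming); the attribute and method names here follow the project description and may differ slightly in your agent.py:

```python
def build_state(self):
    """Build the agent state from the chosen features (a hedged sketch)."""
    waypoint = self.planner.next_waypoint()  # direction toward the destination
    inputs = self.env.sense(self)            # sensor data: light, left, right, oncoming

    # Tuple of the features chosen in Question 4; 'right' and 'deadline' are omitted.
    state = (waypoint, inputs['light'], inputs['left'], inputs['oncoming'])
    return state
```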
The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: for every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such that we can have the agent choose the best action for each state based on the Q-values of each state-action pair possible. For this project, you will be implementing a decaying, $\epsilon$-greedy Q-Learning algorithm with no discount factor. Follow the implementation instructions under each TODO in the agent functions.
Note that the agent attribute self.Q
is a dictionary: This is how the Q-table will be formed. Each state will be a key of the self.Q
dictionary, and each value will then be another dictionary that holds the action and Q-value. Here is an example:
{ 'state-1': {
'action-1' : Qvalue-1,
'action-2' : Qvalue-2,
...
},
'state-2': {
'action-1' : Qvalue-1,
...
},
...
}
Furthermore, note that you are expected to use a decaying $\epsilon$ (exploration) factor. Hence, as the number of trials increases, $\epsilon$ should decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior. Additionally, the agent will be tested on what it has learned after $\epsilon$ has passed a certain threshold (the default threshold is 0.05). For the initial Q-Learning implementation, you will be implementing a linear decaying function for $\epsilon$.
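A minimal sketch of the two pieces described above -- creating Q-table entries for newly visited states and applying the iterative update with no discount factor. The method names mirror the TODOs described for agent.py, but treat this as an outline rather than the definitive implementation:

```python
def createQ(self, state):
    """Add an unseen state to the Q-table with all Q-values set to 0.0."""
    if self.learning and state not in self.Q:
        self.Q[state] = {action: 0.0 for action in self.valid_actions}

def learn(self, state, action, reward):
    """Iterative update with no discount factor: move the Q-value toward
    the received reward by a fraction alpha (the learning rate)."""
    if self.learning:
        old_q = self.Q[state][action]
        self.Q[state][action] = old_q + self.alpha * (reward - old_q)
```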
To obtain results from the initial Q-Learning implementation, you will need to adjust the following flags and setup:
- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'n_test' - Set this to '10' to perform 10 testing trials.
- 'learning' - Set this to 'True' to tell the driving agent to use your Q-Learning implementation.

In addition, use the following decay function for $\epsilon$ (a sketch of where this fits in the agent is shown after this paragraph):

$\epsilon_{t+1} = \epsilon_{t} - 0.05, \quad \textrm{for trial number } t$
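A minimal sketch of that linear decay as it might appear in the agent's reset() routine (attribute names are assumptions consistent with the earlier sketches):

```python
def reset(self, destination=None, testing=False):
    """Prepare for a new trial and decay epsilon linearly (a hedged sketch)."""
    if testing:
        # During testing the agent should purely exploit what it has learned.
        self.epsilon = 0.0
        self.alpha = 0.0
    else:
        # Linear decay: epsilon_{t+1} = epsilon_t - 0.05, floored at 0.
        self.epsilon = max(self.epsilon - 0.05, 0.0)
```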
If you have difficulty getting your implementation to work, try setting the 'verbose'
flag to True
to help debug. Flags that have been set here should be returned to
their default setting when debugging. It is important that you
understand what each flag does and how it affects the simulation!
Once you have successfully completed the initial Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')
Using the visualization above that was produced from your default Q-Learning simulation, provide an analysis and make observations about the driving agent like in Question 3. Note that the simulation should have also produced the Q-table in a text file which can help you make observations about the agent's learning. Some additional things you could consider:
Answer:
The preliminary trials showed results similar to the basic driving agent. This is because the default Q-Learning agent has not undergone any reinforcement learning at the first trial, so its behavior is identical to the basic driving agent's. One similarity is that both types of agents committed more major traffic violations than minor traffic violations, which can be attributed to the dummy agents and the environment.
The number of training trials before the driving agent had a suitable model was around 20. After 20 iterations, the reliability of the Smartcab was greater than 90%, and it had fewer traffic violations and accidents. The decay rate was set to decline linearly at -0.05 per trial with an initial value of 1, which means that after 20 trials the Smartcab stopped performing random moves. This is consistent with the number of training trials it required before testing.
The decay function was set to decline linearly at a rate of -0.05 per trial, so after 20 trials epsilon reaches 0. The epsilon parameter in the panel behaves exactly as that decay function describes, starting at 1 and falling linearly to the threshold. The learning factor, alpha, was held constant.
As expected, the simulation negatively reinforced the Smartcab's model of the environment, which reduced the number of bad actions and therefore led to fewer violations and accidents. With less negative reinforcement occurring, the average reward eventually became positive after the agent had consistently been penalized for bad actions.
The safety rating of the Smartcab stayed the same while the reliability dramatically increased from an F to an A. A rating of F means it met the goal less than 60% of the time, while a rating of A means it met the goal more than 90% of the time. This indicates that the agent needs more training for its safety rating to reach an acceptable level. The relative frequency of violations and accidents decreased compared to the initial driving agent, but the overall safety rating did not change.
The final step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes: in theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act. When improving on your Q-Learning implementation, consider the implications it creates and whether it is logistically sensible to make a particular adjustment.
To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:
- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'learning' - Set this to 'True' to tell the driving agent to use your Q-Learning implementation.
- 'optimized' - Set this to 'True' to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:
- 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
- 'alpha' - Set this to a real number between 0 and 1 to adjust the learning rate of the Q-Learning algorithm.
- 'epsilon' - Set this to a real number between 0 and 1 to adjust the starting exploration factor of the Q-Learning algorithm.
- 'tolerance' - Set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

Furthermore, use a decaying function of your choice for $\epsilon$ (the exploration factor). Note that whichever function you use, it must decay to 'tolerance' at a reasonable rate; the Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for $t$, the number of trials):

$\epsilon = a^t, \quad \epsilon = \frac{1}{t^2}, \quad \epsilon = e^{-at}, \quad \epsilon = \cos(at), \qquad \textrm{for } 0 < a < 1$
You may also use a decaying function for $\alpha$ (the learning rate) if you so choose; however, this is typically less common. If you do so, be sure that it adheres to the inequality $0 \leq \alpha \leq 1$.
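To get a feel for how quickly different candidate schedules reach the testing threshold, the small comparison below (with illustrative constants) counts the trials needed for each to drop below the default tolerance of 0.05:

```python
import math

tolerance = 0.05
a = 0.01  # decay constant; purely illustrative

schedules = {
    'linear: 1 - 0.05*t':    lambda t: 1.0 - 0.05 * t,
    'exponential: e^(-a*t)': lambda t: math.exp(-a * t),
    'inverse square: 1/t^2': lambda t: 1.0 / (t ** 2),
}

for name, eps in schedules.items():
    trials = next(t for t in range(1, 100000) if eps(t) < tolerance)
    print(f'{name:24s} drops below {tolerance} after {trials} trials')
```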
If you have difficulty getting your implementation to work, try setting the 'verbose'
flag to True
to help debug. Flags that have been set here should be returned to
their default setting when debugging. It is important that you
understand what each flag does and how it affects the simulation!
Once you have successfully completed the improved Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')
Using the visualization above that was produced from your improved Q-Learning simulation, provide a final analysis and make observations about the improved driving agent like in Question 6. Questions you should answer:
Answer:
Decay function: $\epsilon = e^{-at}$, where $a$ is alpha (the learning rate) and $t$ is the training trial index.
The threshold (tolerance) was left at its default of 0.05, which meant a total of roughly 300 training trials before testing, given the decay based on a learning rate of 0.01. This provided reasonable computation time and produced favorable outcomes: the model was able to learn the simulation thoroughly before the numerous testing trials.
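As a quick check of this schedule (assuming $\epsilon = e^{-at}$ with $a = 0.01$ and tolerance 0.05):

$$e^{-0.01 \cdot 300} = e^{-3} \approx 0.0498 < 0.05$$

so testing begins after roughly 300 training trials, consistent with the figure quoted above.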
The learning rate, alpha, was set to 0.01 to allow for slower learning. With faster learning the safety rating remained low and required additional training, while the slower rate still maintained the high reliability that a faster learning rate achieved. This meant the Smartcab was able to learn a model that was both safe and reliable.
There was significant improvement over the default Q-Learner, which failed the safety rating but aced reliability. The improved Q-Learner was able to produce a safety rating of A+ while still maintaining an A for reliability; the largest improvement was in safety.
The Q-Learner results indicate good to excellent performance on both safety and reliability. The safety rating was perfect (A+) while the reliability was nearly perfect (A). The model was thoroughly tested over 50 testing runs, and the results show that the Smartcab learned an appropriate policy.
Both the safety and reliability ratings of the Smartcab are satisfactory, and the learned policy produces efficient, safe, and reliable outcomes in the simulation. The safety of the Smartcab during the testing trials was perfect, with an A+ rating, while the reliability reached the goal more than 90% of the time. The reliability could potentially be pushed higher by including additional features, such as the inputs that were originally omitted to keep the state space reasonable.
Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, you can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, you can further define, for each state the Smartcab is occupying, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly" not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward but also know that it is not the correct behavior. This can be used to your advantage for verifying whether the policy your driving agent has learned is the correct one, or if it is a suboptimal policy.
Please summarize what the optimal policy is for the smartcab in the given environment. What would be the best set of instructions possible given what we know about the environment? You can explain with words or a table, but you should thoroughly discuss the optimal policy.
Next, investigate the 'sim_improved-learning.txt'
text file to see the results of your improved Q-Learning algorithm. For each state that has been recorded from the simulation, is the policy
(the action with the highest value) correct for the given state? Are
there any states where the policy is different than what would be
expected from an optimal policy?
Provide a few examples from your recorded Q-table which demonstrate that your smartcab learned the optimal policy. Explain why these entries demonstrate the optimal policy.
Try to find at least one entry where the smartcab did not learn the optimal policy. Discuss why your cab may have not learned the correct policy for the given state.
Be sure to document your state dictionary below; it should be easy for the reader to understand what each state represents.
Answer:
Please summarize what the optimal policy is for the smartcab in the given environment. What would be the best set of instructions possible given what we know about the environment?
Change state to next waypoint unless the following inputs are observed:
Input | Value | Response | Description |
---|---|---|---|
Traffic Light | Green Light | Set Waypoint to Forward | Green light, proceed |
Traffic Light, Waypoint | Red Light, Not Right | Set Waypoint to None | Red light stop, right turn is legal |
Left Car | Intended Travel Right | Set Waypoint to Right | Left car is moving into lane, move right |
Right Car | Intended Travel Left | Set Waypoint to Right | Right car is moving into lane, move right |
Waypoint, Left Car | Right, Forward | Set Waypoint Forward | Next waypoint will cause accident |
Waypoint, Right Car | Left, Forward | Set Waypoint Forward | Next waypoint will cause accident |
The table is split into four columns: Input, Value, Response, and Description. The Input column lists the affected state features, and the Value column gives the value those features must take to elicit the action in the Response column. For example, an input of Traffic Light with a value of Red means we should set our waypoint to None and stop. This rule set is evaluated again before a final response: if the next waypoint is occupied by another agent we move forward, but if the traffic light is red we stay idle. One way to express these rules as a single decision function is sketched after this paragraph.
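The sketch below is my paraphrase of the right-of-way behaviour summarised above (it is not code from the project), written as a single function that returns the ideal action for a state:

```python
def optimal_action(waypoint, light, left, oncoming):
    """Return the ideal action for a state (None means stay idle) -- a sketch
    of the right-of-way rules discussed above."""
    if light == 'red':
        # Only a right turn is legal on red, and only if traffic approaching
        # from the left is not travelling straight through the intersection.
        if waypoint == 'right' and left != 'forward':
            return 'right'
        return None
    # Green light: follow the waypoint, but yield on a left turn while
    # oncoming traffic is moving forward or turning right.
    if waypoint == 'left' and oncoming in ('forward', 'right'):
        return None
    return waypoint
```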
Next, investigate the 'sim_improved-learning.txt'
text file to see the results of your improved Q-Learning algorithm. For each state that has been recorded from the simulation, is the policy
(the action with the highest value) correct for the given state? Are
there any states where the policy is different than what would be
expected from an optimal policy?
The state policies recorded in the Q-table largely follow the rules above. The dummy agents have a policy defined in environment.py that matches the policies described in the table. During the exploration stage of learning, however, actions were chosen randomly until epsilon became sufficiently small. This caused some states to end up with reinforced policies that are not optimal, because the environment positively rewards valid actions even when they are not the optimal ones.
Provide a few examples from your recorded Q-table which demonstrate that your smartcab learned the optimal policy. Explain why these entries demonstrate the optimal policy.
The following examples are in the form: (waypoint, inputs['light'], inputs['left'], inputs['oncoming'])
Example 1
('left', 'red', None, 'right')
-- forward : -0.78
-- left : -1.20
-- right : 0.03
-- None : 0.36
This scenario is where the Smartcab correctly stops at a red light.
Example 2
('right', 'red', 'forward', None)
-- forward : -1.95
-- left : -0.40
-- right : -0.80
-- None : 0.26
This scenario is one where the Smartcab could attempt a legal right turn on a red light. The Smartcab has learned that taking a right turn here is not optimal because it would cause an accident with the car on its left, which intends to travel forward. We can see that the forward and left moves are also correctly negatively reinforced. The optimal policy in this scenario is to stay idle, which is the only positively reinforced action in the model for this state.
Example 3
('forward', 'green', 'forward', 'left')
-- forward : 0.32
-- left : 0.01
-- right : 0.03
-- None : -0.04
This scenario is where the Smartcab correctly moves forward at a green light.
Try to find at least one entry where the smartcab did not learn the optimal policy. Discuss why your cab may have not learned the correct policy for the given state.
Incorrect Policy
('right', 'green', None, 'right')
-- forward : 0.01
-- left : 0.06
-- right : 0.03
-- None : -0.06
The Smartcab is at a green light and the next waypoint is to move right but the reinforced policy is to move left from the agent's learning. During the exploratory phase of the training, the Smartcab may have been reinforced to make a left turn from a random pick. The enviornment still rewards valid yet incorrect actions based on the penalty which is partly determined by the time remaining. This can explain why the action of right was rewarded the most instead of the optimal policy of forward.