Machine Learning Engineer Nanodegree¶

Reinforcement Learning¶

Project: Train a Smartcab to Drive¶


Getting Started¶

In this project, you will work towards constructing an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: Safety and Reliability. A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable. Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.

Safety and Reliability are measured using a letter-grade system as follows:

| Grade | Safety | Reliability |
|-------|--------|-------------|
| A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips. |
| A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips. |
| B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips. |
| C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips. |
| D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips. |
| F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips. |

To assist evaluating these important metrics, you will need to load visualization code that will be used later on in the project. Run the code cell below to import this code which is required for your analysis.

In [2]:
# Import the visualization code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

Understand the World¶

Before starting to work on implementing your driving agent, it's necessary to first understand the world (environment) which the Smartcab and driving agent work in. One of the major components to building a self-learning agent is understanding the characteristics about the agent, which includes how the agent operates. To begin, simply run the agent.py agent code exactly how it is -- no need to make any additions whatsoever. Let the resulting simulation run for some time to see the various working components. Note that in the visual simulation (if enabled), the white vehicle is the Smartcab.

Question 1¶

In a few sentences, describe what you observe during the simulation when running the default agent.py agent code. Some things you could consider:

  • Does the Smartcab move at all during the simulation?
  • What kind of rewards is the driving agent receiving?
  • How does the light changing color affect the rewards?

Hint: From the /smartcab/ top-level directory (where this notebook is located), run the command

'python smartcab/agent.py'

Answer:

  • Does the Smartcab move at all during the simulation?

The Smartcab remains idle for the entire simulation. The agent stays stopped at an intersection while the traffic light cycles between red and green; the default agent.py code never moves the Smartcab.

  • What kind of rewards is the driving agent receiving?

Because the cab is waiting at a light, rewards are given based on whether idling is the appropriate action for the current light. Other vehicles continue to move through the grid while the Smartcab remains idle, and the rewards reflect the safety and reliability criteria outlined above.

  • How does the light changing color affect the rewards?

When the light turns green, the cab is expected to move according to the traffic laws, so the Smartcab receives a negative reward because remaining idle is not the correct response to a green light. When the light is red, remaining idle is the correct response under the simulation's rules and the Smartcab receives a positive reward.

Understand the Code¶

In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work. In the /smartcab/ top-level directory, there are two folders: /logs/ (which will be used later) and /smartcab/. Open the /smartcab/ folder and explore each Python file included, then answer the following question.

Question 2¶

  • In the agent.py Python file, choose three flags that can be set and explain how they change the simulation.
  • In the environment.py Python file, what Environment class function is called when an agent performs an action?
  • In the simulator.py Python file, what is the difference between the 'render_text()' function and the 'render()' function?
  • In the planner.py Python file, will the 'next_waypoint() function consider the North-South or East-West direction first?

Answer:

Agent.py¶

1. num_dummies¶

The num_dummies flag changes the number of dummy agents in the environment, where the default is 100. This affects the safety and reliability measurements because there is a positive correlation between the number of dummy agents and the chance of an accident. With a high number, the Smartcab is more prone to getting into an accident, while a low number might not provide enough learning opportunities in the simulation to reinforce the desired behavior.

2. enforce_deadline¶

The enforce_deadline flag sets a timer for the Smartcab to reach the destination. The outcomes under that deadline can then be analyzed against different parameters within our simulation to achieve the desired results. If we want to focus on reliability, it is essential that the Smartcab reaches the destination on time, so this flag must be enabled.

3. n_test¶

The n_test flag sets the number of testing trials to perform, where the default is 0. This lets us test the Smartcab once it has been trained in different environments and analyze the outcomes. We can then adjust the reinforcement weights, number of dummy agents, epsilon, alpha, and tolerance to tune the agent and environment toward an optimized outcome based on the tests.

Environment.py¶

Within the Environment class, the act() function is called when an agent performs an action. The function first considers the action and evaluates it against the traffic laws, then carries out the action and returns a reward for it. The function takes three arguments: self, agent, and action, where valid actions are None, 'forward', 'left', or 'right'.

Simulator.py¶

There are two types of output that make the simulation user friendly. The render() function uses the pygame library to draw a graphical user interface (GUI) for the simulation, while the render_text() function produces console output. Both report updates to the environment and the rewards at every step: render_text() as text in the console, render() through the GUI.

Planner.py¶

The planner.py file treats the environment as a Cartesian grid with x and y coordinates, which can be read as the East-West and North-South directions. Within the file, the next_waypoint() function first computes the difference between the current location and the destination, then checks whether there is a difference in the x and y coordinates. The x coordinate (East-West) is checked first, around line 39, while the y coordinate (North-South) is checked afterwards, around line 59, so the function considers the East-West direction first.
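For illustration, a simplified sketch of that check order (the function and variable names here are my own, not the actual planner.py code, which also handles grid wrap-around and converts the compass direction into an action relative to the Smartcab's heading):

# Simplified sketch of the East-West-first ordering described above.
# Not the actual planner.py source -- purely illustrative.
def next_waypoint_sketch(location, destination):
    dx = destination[0] - location[0]  # East-West difference, checked first
    dy = destination[1] - location[1]  # North-South difference, checked second
    if dx != 0:
        # resolve the East-West component first
        return 'east' if dx > 0 else 'west'
    if dy != 0:
        # then the North-South component (sign convention depends on the grid)
        return 'south' if dy > 0 else 'north'
    return None  # already at the destination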


Implement a Basic Driving Agent¶

The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'left' (turn left), 'right' (turn right), or 'forward' (go forward). For your first implementation, navigate to the 'choose_action()' agent function and make the driving agent randomly choose one of these actions. Note that you have access to several class variables that will help you write this functionality, such as 'self.learning' and 'self.valid_actions'. Once implemented, run the agent file and simulation briefly to confirm that your driving agent is taking a random action each time step.
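For reference, a minimal sketch of such a random choice (assuming 'self.valid_actions' holds the list of valid actions described above) might look like:

import random

# Minimal sketch for the basic (non-learning) agent: pick uniformly at
# random from the valid actions. Assumes self.valid_actions holds
# [None, 'forward', 'left', 'right'] as described above.
def choose_action(self, state):
    return random.choice(self.valid_actions)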

Basic Agent Simulation Results¶

To obtain results from the initial simulation, you will need to adjust the following flags:

  • 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
  • 'n_test' - Set this to '10' to perform 10 testing trials.

Optionally, you may disable the visual simulation (which can make the trials go faster) by setting the 'display' flag to False. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!

Once you have successfully completed the initial simulation (there should have been 20 training trials and 10 testing trials), run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded! Run the agent.py file after setting the flags from projects/smartcab folder instead of projects/smartcab/smartcab.

In [3]:
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')

Question 3¶

Using the visualization above that was produced from your initial simulation, provide an analysis and make several observations about the driving agent. Be sure that you are making at least one observation about each panel present in the visualization. Some things you could consider:

  • How frequently is the driving agent making bad decisions? How many of those bad decisions cause accidents?
  • Given that the agent is driving randomly, does the rate of reliability make sense?
  • What kind of rewards is the agent receiving for its actions? Do the rewards suggest it has been penalized heavily?
  • As the number of trials increases, does the outcome of results change significantly?
  • Would this Smartcab be considered safe and/or reliable for its passengers? Why or why not?

Answer:

  • How frequently is the driving agent making bad decisions? How many of those bad decisions cause accidents?

Based on the "10-Trial Rolling Relative Frequency of Bad Actions" panel, the relative frequency of total bad actions neared 0.41 (41%) after 20 trials. This is expected because learning was turned off for this agent, so bad decisions were never negatively reinforced. The relative frequency of accidents was around 0.05 (5%) after 20 trials; the majority of these were major accidents, and the frequency of minor accidents was low.

  • Given that the agent is driving randomly, does the rate of reliability make sense?

The rates of reliability are reasonable given that the agent was driving randomly. This can also be seen in the roughly flat relative frequencies across trials: without learning, the agent keeps the same random policy, so the rate of reliability does not change.

  • What kind of rewards is the agent receiving for its actions? Do the rewards suggest it has been penalized heavily?

Due to the high rate of bad actions and accidents, the agent received a negative average reward. It was penalized throughout the trials because it was unable to update its model through learning. Given the consistently high frequency of total bad actions, the rewards suggest the agent was penalized heavily.

  • As the number of trials increases, does the outcome of results change significantly?

The outcomes don't change as the number of trials increases because the agent does not learn. The model the agent uses is constant throughout the trials, and the environment is also constant. With a constant model and environment, we can expect no change in the reliability or safety outcomes.

  • Would this Smartcab be considered safe and/or reliable for its passengers? Why or why not?

No, the current Smartcab without learning cannot be considered safe or reliable for any passenger. Because the agent does not learn, it follows the same random policy throughout, which resulted in an F rating for both safety and reliability. The agent might be able to improve, but without a learning model it has no way of changing its behavior during the simulation.


Inform the Driving Agent¶

The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state. The condition of 'if state then action' for each state is called a policy, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about!

Identify States¶

Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:

  • 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
  • 'inputs', which is the sensor data from the Smartcab. It includes
    • 'light', the color of the light.
    • 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
    • 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
    • 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
  • 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Question 4¶

Which features available to the agent are most relevant for learning both safety and efficiency? Why are these features appropriate for modeling the Smartcab in the environment? If you did not choose some features, why are those features not appropriate? Please note that whatever features you eventually choose for your agent's state must be argued for here. That is: your code in agent.py should reflect the features chosen in this answer.

NOTE: You are not allowed to engineer new features for the smartcab.

Answer:

The features most relevant for safety are the sensor inputs: 'light', 'left', 'right', and 'oncoming'. They inform the agent about the current environment and the other vehicles around it. The 'light' and 'oncoming' features matter most at intersections and traffic stops, while the 'left' and 'right' features matter when the vehicle is moving through cross-traffic. However, the 'right' feature was ultimately not included, because under the environment's traffic laws, traffic approaching from the right never makes one of the Smartcab's actions invalid, even at a red light.

The feature most relevant for efficiency is 'waypoint'. It tells the agent which direction leads toward the destination, so the rewards for following (or ignoring) it are what allow reinforcement learning of efficient behavior. In combination with the enforced deadline, the waypoint is what allows the agent to learn to reach the destination in time.

I chose to include every feature in my state except 'right' and 'deadline', both for relevance and to preserve a reasonable state space. The 'deadline' feature has far more possible values than any other feature, so including it would have a disproportionate effect on the size of the state space. By excluding it we significantly reduce the state space while keeping the features that matter. The 'right' feature is not relevant under the traffic laws, so it was also left out of the Q-Learning state.
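For illustration, a sketch of how 'build_state()' could pack these chosen features into a state tuple (the helper calls shown are assumptions based on the project template, not the exact code):

# Sketch of build_state() with the chosen features: waypoint, light,
# left traffic, and oncoming traffic ('right' and 'deadline' are omitted
# as argued above). The planner/environment calls are assumed from the
# template and may differ slightly in the actual agent.py.
def build_state(self):
    waypoint = self.planner.next_waypoint()  # direction toward the destination
    inputs = self.env.sense(self)            # sensor data: light, left, right, oncoming
    state = (waypoint, inputs['light'], inputs['left'], inputs['oncoming'])
    return state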

Define a State Space¶

When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. That is to say, if you expect the driving agent to learn a policy for each state, you would need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the Smartcab:

('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day').

How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!

Question 5¶

If a state is defined using the features you've selected from Question 4, what would be the size of the state space? Given what you know about the environment and how it is simulated, do you think the driving agent could learn a policy for each possible state within a reasonable number of training trials?
Hint: Consider the combinations of features to calculate the total number of states!

Answer:

The size of the state space is the number of possible combinations of the selected features. I chose 4 features to balance safety, reliability, and simulation outcomes: the two traffic features ('left' and 'oncoming') each have 4 possible values, 'light' is binary (red or green), and 'waypoint' has 3 directional values. The total number of combinations is therefore 4 x 4 x 2 x 3 = 96. If a trial averages roughly 20 steps, 20 training trials provide about 400 learning steps, so the 96 states amount to about 24% of the available training steps, which is reasonable for the agent to learn from. As we will see in the upcoming sections, the final Q-Learning model needs about 300 training trials before reaching the epsilon threshold, at which point the 96 states are only about 2% of the roughly 6,000 training steps.
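As a quick sanity check of that count, enumerating the combinations of the chosen feature values (the value lists below are assumed from the feature descriptions above):

from itertools import product

# Enumerate the state space for (waypoint, light, left, oncoming),
# matching the 3 x 2 x 4 x 4 count above.
waypoints = ['forward', 'left', 'right']        # 3 values
lights = ['red', 'green']                       # 2 values
traffic = [None, 'forward', 'left', 'right']    # 4 values each for left and oncoming

states = list(product(waypoints, lights, traffic, traffic))
print(len(states))  # 96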

Update the Driving Agent State¶

For your second implementation, navigate to the 'build_state()' agent function. With the justification you've provided in Question 4, you will now set the 'state' variable to a tuple of all the features necessary for Q-Learning. Confirm your driving agent is updating its state by running the agent file and simulation briefly and note whether the state is displaying. If the visual simulation is used, confirm that the updated state corresponds with what is seen in the simulation.

Note: Remember to reset simulation flags to their default setting when making this observation!


Implement a Q-Learning Driving Agent¶

The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: For every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such that we can have the agent choose the best action for each state based on the Q-values of each state-action pair possible. For this project, you will be implementing a decaying, ϵ-greedy Q-learning algorithm with no discount factor. Follow the implementation instructions under each TODO in the agent functions.
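For reference, a minimal sketch of the update step with no discount factor might look like the following (the method name and attributes are assumptions about the agent template, not its exact code):

# Sketch of the Q-value update with no discount factor:
#   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * reward
# Assumes self.Q is the nested dictionary described below and
# self.alpha is the learning rate.
def learn(self, state, action, reward):
    if self.learning:
        old_q = self.Q[state][action]
        self.Q[state][action] = (1 - self.alpha) * old_q + self.alpha * reward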

Note that the agent attribute self.Q is a dictionary: This is how the Q-table will be formed. Each state will be a key of the self.Q dictionary, and each value will then be another dictionary that holds the action and Q-value. Here is an example:

{ 'state-1': { 
    'action-1' : Qvalue-1,
    'action-2' : Qvalue-2,
     ...
   },
  'state-2': {
    'action-1' : Qvalue-1,
     ...
   },
   ...
}

Furthermore, note that you are expected to use a decaying ϵ (exploration) factor. Hence, as the number of trials increases, ϵ should decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior. Additionally, the agent will be tested on what it has learned after ϵ has passed a certain threshold (the default threshold is 0.05). For the initial Q-Learning implementation, you will be implementing a linear decaying function for ϵ.
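A minimal sketch of such a linear decay, applied once per training trial (the method name here is illustrative; in agent.py this logic typically belongs in the agent's reset step at the start of each trial):

# Linear epsilon decay applied once per training trial:
#   epsilon_{t+1} = epsilon_t - 0.05
# Sketch only -- method and attribute names are illustrative.
def decay_epsilon_linear(self, testing=False):
    if testing:
        self.epsilon = 0.0   # rely entirely on learned behavior during testing
        self.alpha = 0.0
    else:
        self.epsilon = max(self.epsilon - 0.05, 0.0)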

Q-Learning Simulation Results¶

To obtain results from the initial Q-Learning implementation, you will need to adjust the following flags and setup:

  • 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
  • 'n_test' - Set this to '10' to perform 10 testing trials.
  • 'learning' - Set this to 'True' to tell the driving agent to use your Q-Learning implementation.

In addition, use the following decay function for ϵ:

$$\epsilon_{t+1} = \epsilon_{t} - 0.05, \qquad \textrm{for trial number } t$$

If you have difficulty getting your implementation to work, try setting the 'verbose' flag to True to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!

Once you have successfully completed the initial Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

In [6]:
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')

Question 6¶

Using the visualization above that was produced from your default Q-Learning simulation, provide an analysis and make observations about the driving agent like in Question 3. Note that the simulation should have also produced the Q-table in a text file which can help you make observations about the agent's learning. Some additional things you could consider:

  • Are there any observations that are similar between the basic driving agent and the default Q-Learning agent?
  • Approximately how many training trials did the driving agent require before testing? Does that number make sense given the epsilon-tolerance?
  • Is the decaying function you implemented for ϵ (the exploration factor) accurately represented in the parameters panel?
  • As the number of training trials increased, did the number of bad actions decrease? Did the average reward increase?
  • How does the safety and reliability rating compare to the initial driving agent?

Answer:

  • Are there any observations that are similar between the basic driving agent and the default Q-Learning agent?

The preliminary training trials showed results similar to the basic driving agent. This is because in its first trials the default Q-Learning agent has not yet been reinforced, so it behaves essentially like the random basic agent. One similarity was that both types of agent committed more major traffic violations than minor ones, which may be due to the dummy agents and the environment layout.

  • Approximately how many training trials did the driving agent require before testing? Does that number make sense given the epsilon-tolerance?

Approximately 20 training trials were required before the driving agent had a suitable model and testing began. After 20 iterations the reliability of the Smartcab was greater than 90%, with fewer traffic violations and accidents. Epsilon started at 1 and decayed linearly by 0.05 per trial, so it dropped below the 0.05 tolerance after 20 trials (1 - 20 x 0.05 = 0), at which point the Smartcab stopped performing random moves. This matches the number of training trials observed before testing.

  • Is the decaying function you implemented for ϵ (the exploration factor) accurately represented in the parameters panel?

The decay function was set to decline linearly at a rate of 0.05 per trial, reaching a value of 0 after 20 trials. The parameters panel accurately reflects both of these characteristics, and the learning factor (alpha) is shown held constant.

  • As the number of training trials increased, did the number of bad actions decrease? Did the average reward increase?

As expected, the simulation's negative rewards for bad actions gradually reshaped the Smartcab's model of the environment, reducing the number of bad actions and, with them, the number of violations and accidents. As the penalties became less frequent, the average reward per action turned positive.

  • How does the safety and reliability rating compare to the initial driving agent?

The safety rating of the Smartcab stayed the same (F) while the reliability dramatically increased from an F to an A. An F reliability means the cab failed to reach the destination on time in at least 60% of trips, while an A means it reached it on time in at least 90% of trips. This indicates that the agent needs more training to reach an acceptable safety rating: the relative frequencies of violations and accidents decreased compared to the initial driving agent, but not enough to change the letter grade.


Improve the Q-Learning Driving Agent¶

The third step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes: In theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if ϵ=1 and never decays) will certainly make it learn, but never let it act. When improving on your Q-Learning implementation, consider the implications it creates and whether it is logistically sensible to make a particular adjustment.

Improved Q-Learning Simulation Results¶

To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:

  • 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
  • 'learning' - Set this to 'True' to tell the driving agent to use your Q-Learning implementation.
  • 'optimized' - Set this to 'True' to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

  • 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
  • 'alpha' - Set this to a real number between 0 - 1 to adjust the learning rate of the Q-Learning algorithm.
  • 'epsilon' - Set this to a real number between 0 - 1 to adjust the starting exploration factor of the Q-Learning algorithm.
  • 'tolerance' - set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

Furthermore, use a decaying function of your choice for ϵ (the exploration factor). Note that whichever function you use, it must decay to 'tolerance' at a reasonable rate. The Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for t, the number of trials):

$$\epsilon = a^{t}, \quad \textrm{for } 0 < a < 1$$
$$\epsilon = \frac{1}{t^{2}}$$
$$\epsilon = e^{-at}, \quad \textrm{for } 0 < a < 1$$
$$\epsilon = \cos(at), \quad \textrm{for } 0 < a < 1$$

You may also use a decaying function for α (the learning rate) if you so choose; however, this is typically less common. If you do so, be sure that it adheres to the inequality $0 \le \alpha \le 1$.
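For example, a sketch of the exponential option $\epsilon = e^{-at}$ (the attribute names for the trial counter and decay constant are illustrative, not the exact names in agent.py):

import math

# Sketch of an exponential exploration decay, epsilon = exp(-a * t),
# updated once per training trial. 'self.t' (trial counter) and 'a'
# are illustrative names.
def decay_epsilon_exponential(self, a=0.01):
    self.t += 1
    self.epsilon = math.exp(-a * self.t)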

If you have difficulty getting your implementation to work, try setting the 'verbose' flag to True to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!

Once you have successfully completed the improved Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!

In [6]:
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')

Question 7¶

Using the visualization above that was produced from your improved Q-Learning simulation, provide a final analysis and make observations about the improved driving agent like in Question 6. Questions you should answer:

  • What decaying function was used for epsilon (the exploration factor)?
  • Approximately how many training trials were needed for your agent before beginning testing?
  • What epsilon-tolerance and alpha (learning rate) did you use? Why did you use them?
  • How much improvement was made with this Q-Learner when compared to the default Q-Learner from the previous section?
  • Would you say that the Q-Learner results show that your driving agent successfully learned an appropriate policy?
  • Are you satisfied with the safety and reliability ratings of the Smartcab?

Answer:

  • What decaying function was used for epsilon (the exploration factor)?

Decay function, where a is alpha and t is the training trial index:

$$\epsilon = e^{-at}, \qquad \textrm{for } 0 < a < 1$$

The alpha used was 0.01, to allow for a slower learning rate.

  • Approximately how many training trials were needed for your agent before beginning testing?

The epsilon-tolerance was left at its default of 0.05, which, with a = 0.01, meant roughly 300 training trials before testing. This provided reasonable computation time and produced favorable outcomes: the model was able to thoroughly learn the simulation before the testing trials began.
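That trial count follows directly from the decay function: testing begins once $e^{-0.01t}$ drops below the 0.05 tolerance, i.e.

$$e^{-0.01\,t} \le 0.05 \;\Longrightarrow\; t \ge \frac{\ln(1/0.05)}{0.01} = \frac{\ln 20}{0.01} \approx 300$$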

  • What epsilon-tolerance and alpha (learning rate) did you use? Why did you use them?

The learning rate (alpha) was set to 0.01 and the epsilon-tolerance was kept at the default of 0.05. The slower decay was chosen because the default agent's safety rating was low and required much more training, while the additional trials did not hurt the high reliability already achieved with faster learning. This combination let the Smartcab learn a model that is both safe and reliable.

  • How much improvement was made with this Q-Learner when compared to the default Q-Learner from the previous section?

There was significant improvement made from the default Q-Learner that failed the safety rating but aced the reliability. The improved Q-Learner was able to produce a safety rating of A+ while still maintaining an A for the reliability. The most improvement was made on the safety rating.

  • Would you say that the Q-Learner results show that your driving agent successfully learned an appropriate policy?

The Q-Learner results indicate good to excellent performance on both safety and reliability. The safety rating was perfect (A+) while the reliability was nearly perfect (A). The model was tested over 50 testing runs, and the results show that the Smartcab learned an appropriate policy.

  • Are you satisfied with the safety and reliability ratings of the Smartcab?

Yes, both the safety and reliability ratings of the Smartcab are satisfactory, and the learned policy produces efficient, safe, and reliable outcomes in the simulation. The safety of the Smartcab during the testing trials was perfect, with an A+ rating, while the reliability reached the goal more than 90% of the time. The reliability could potentially be made perfect by including additional inputs, which were originally omitted to keep the state space at a reasonable size.

Define an Optimal Policy¶

Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, you can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, you can further define, for each state the Smartcab is occupying, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly" not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward but also know that it is not the correct behavior. This can be used to your advantage for verifying whether the policy your driving agent has learned is the correct one, or if it is a suboptimal policy.

Question 8¶

  1. Please summarize what the optimal policy is for the smartcab in the given environment. What would be the best set of instructions possible given what we know about the environment? You can explain with words or a table, but you should thoroughly discuss the optimal policy.

  2. Next, investigate the 'sim_improved-learning.txt' text file to see the results of your improved Q-Learning algorithm. For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?

  3. Provide a few examples from your recorded Q-table which demonstrate that your smartcab learned the optimal policy. Explain why these entries demonstrate the optimal policy.

  4. Try to find at least one entry where the smartcab did not learn the optimal policy. Discuss why your cab may have not learned the correct policy for the given state.

Be sure to document your state dictionary below; it should be easy for the reader to understand what each state represents.

Answer:

  • Please summarize what the optimal policy is for the smartcab in the given environment. What would be the best set of instructions possible given what we know about the environment?

    Change state to next waypoint unless the following inputs are observed:

| Input | Value | Response | Description |
|-------|-------|----------|-------------|
| Traffic Light | Green Light | Set Waypoint to Forward | Green light, proceed |
| Traffic Light, Waypoint | Red Light, Not Right | Set Waypoint to None | Red light stop; right turn is legal |
| Left Car Intended Travel | Right | Set Waypoint to Right | Left car is moving into lane, move right |
| Right Car Intended Travel | Left | Set Waypoint to Right | Right car is moving into lane, move right |
| Waypoint, Left Car | Right, Forward | Set Waypoint to Forward | Next waypoint would cause an accident |
| Waypoint, Right Car | Left, Forward | Set Waypoint to Forward | Next waypoint would cause an accident |

The table has four columns: Input, Value, Response, and Description. The Input column lists the affected features, and the Value column gives the value those features must have to elicit the Response. For example, an input of traffic light with a value of red means the waypoint should be set to None (stop). This rule set is evaluated again before a final response: if the next waypoint is occupied by another agent, the cab moves forward, but if the traffic light is red it stays idle.
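For illustration, the rule set above roughly corresponds to the following sketch (a simplified hand-written encoding of U.S. right-of-way logic, not code from the project; the learned agent encodes this behavior implicitly in its Q-table rather than as explicit rules):

# Illustrative encoding of the right-of-way rules summarized above.
# State layout follows the chosen features: (waypoint, light, left, oncoming).
def optimal_action(waypoint, light, left, oncoming):
    if light == 'red':
        # A right turn on red is legal only if the car on the left
        # is not driving straight through the intersection.
        if waypoint == 'right' and left != 'forward':
            return 'right'
        return None  # otherwise stay idle at the red light
    # Green light: follow the waypoint unless a left turn would cross
    # oncoming traffic that is going forward or turning right.
    if waypoint == 'left' and oncoming in ('forward', 'right'):
        return None
    return waypoint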


  • Next, investigate the 'sim_improved-learning.txt' text file to see the results of your improved Q-Learning algorithm. For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?

    The recorded state policies follow the rules above correctly. The dummy agents follow a policy defined in environment.py that matches the policies defined in the table. During the exploration stage of learning, the Q-Learning agent chose actions randomly until epsilon became sufficiently small. This caused some states to end up with reinforced policies that are not optimal, because the environment positively rewards valid actions even when they are not the optimal ones.

  • Provide a few examples from your recorded Q-table which demonstrate that your smartcab learned the optimal policy. Explain why these entries demonstrate the optimal policy.

    The following examples are in the form: (waypoint, inputs['light'], inputs['left'], inputs['oncoming'])

    Example 1

    ('left', 'red', None, 'right') -- forward : -0.78 -- left : -1.20 -- right : 0.03 -- None : 0.36

    This scenario is where the Smartcab correctly stops at a red light.

    Example 2

    ('right', 'red', 'forward', None) -- forward : -1.95 -- left : -0.40 -- right : -0.80 -- None : 0.26

    This scenario is where the Smartcab can attempt a valid right turn on a red light. The Smartcab has learned that in this scenario, taking a right turn is not optimal because it would cause an accident with the car approaching from the left. We can see that the forward and left moves are also correctly negatively reinforced. The optimal policy in this scenario is to stay idle, which is the only positively reinforced action for this state.

    Example 3

    ('forward', 'green', 'forward', 'left') -- forward : 0.32 -- left : 0.01 -- right : 0.03 -- None : -0.04

    This scenario is where the Smartcab correctly moves forward at a green light.

  • Try to find at least one entry where the smartcab did not learn the optimal policy. Discuss why your cab may have not learned the correct policy for the given state.

    Incorrect Policy

    ('right', 'green', None, 'right') -- forward : 0.01 -- left : 0.06 -- right : 0.03 -- None : -0.06

    The Smartcab is at a green light and the next waypoint is to the right, but the learned policy is to turn left (left has the highest Q-value). During the exploratory phase of training, the Smartcab may have been positively reinforced for a randomly chosen left turn, since the environment still gives a small positive reward for valid but incorrect actions (partly determined by the time remaining). This can explain why the left action ended up with the highest Q-value instead of the optimal action of turning right.