Monster Killer
CS 175: Project in AI (in Minecraft)
Video
Project Summary
Is it frustrating to take heavy damage from enemies on a dark night? Wouldn't it be nice to kill monsters smartly, with less time and fewer resources? Finding the best way to beat enemies at the lowest cost in resources and damage taken is not easy for new Minecraft players, but we can use AI to learn an optimal policy for killing monsters.
In the Monster Killer project, we place a Minecraft agent in an open plain with one hostile mob and several available weapons, and let the agent learn to kill the mob with the fewest actions and the smallest health cost.
This project can serve as a guide for players in survival mode. For new players, it shows how to beat monsters efficiently. For advanced players, it helps them defeat more powerful enemies and bosses.
Approach
Since our mission is to let the agent defeat an enemy with fewer actions and less HP loss, our state must contain variables indicating the positions of both the agent and the enemy. One approach is to use the coordinates of the agent and the enemy as the state. We first set up a 20x20 closed terrain in which the agent and the enemy can stand on any block. With 400 blocks in the area and two entities to track, this coordinate-based approach yields far too many states, and learning is extremely slow when the agent has to figure out what to do at every coordinate. We instead decided to use the absolute straight-line distance from the agent to the enemy, discretized to integers, as the state; with far fewer possible states, the training process is much simpler. We originally thought this would be enough, but while training we found that the agent behaves similarly across certain ranges of distances. We therefore reduced the state space again by grouping every five distance units together once the distance exceeds 5. This decision paid off later when we removed the 20x20 space limit, because the number of states stays the same.
We initially expected the agent to choose the best weapon to attack enemies, mostly swords made of different materials, from the wooden sword with low damage but the lowest cost to the diamond sword with high damage but the highest cost. However, if we take both rarity and durability into account, the reward system becomes too complicated for the Q-learning algorithm, because the variability is too hard to measure. In an earlier stage of learning, our agent could not kill a stronger enemy like a Vindicator or a Witch because the reward system was too complicated for it to learn from. Taking too many factors into account can mislead the agent. For example, if the agent uses a diamond sword, which has the highest rarity cost, but the damage dealt is not as high as expected, the reward can be negative; this tells the agent that attacking with the diamond sword is a bad choice even though it did damage the enemy. A reward pattern that is too complicated makes reaching our final goal of beating mobs slow and difficult, and this kind of misleading signal weakens the agent rather than making it smarter. To improve this, we shifted our final goal slightly, from finding the best weapon to finding the best action combination. To keep the project interesting, we still provide two kinds of swords, a wooden sword and a diamond sword, plus two kinds of long-range bow attacks, and let the agent figure out which is the best choice.
For each action the agent takes, a corresponding reward value is calculated. The reward and the next state do not depend only on the most recent action, so we expect the agent to relate the reward of the current action to the previous or next several steps by assigning different weights to the rewards of surrounding rounds. This is achieved mostly by using n, the number of look-ahead steps, as a parameter when updating the Q-table. An obvious example is teaching the agent whether or not to aim. The aim action itself rewards the agent very little, but with n as a factor, the damage dealt after aiming also counts toward the reward for aiming. In this way, the agent can measure the effect of a sequence of actions rather than a single action. In addition, we originally updated the Q-table every time the agent chose an action. This came with a catch: the agent's learning curve depended strongly on the first few actions it chose, since the table was updated immediately after each action. To avoid this, we decided to update the Q-table once after each episode rather than after each action. This not only gives better training results but also makes it easier to visualize and evaluate how the agent's performance improves over episodes, since the update happens at the end.
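As a rough illustration, the sketch below shows how a per-episode, n-step Q-table update of this kind could look. The table layout, hyperparameter values, and names (q_table, ALPHA, GAMMA, N) are assumptions for illustration, not our exact implementation.

```python
from collections import defaultdict

ALPHA, GAMMA, N = 0.1, 0.9, 3          # assumed learning rate, discount factor, look-ahead depth n

# Q-table: state -> {action: value}; states and actions follow the definitions in this report
q_table = defaultdict(lambda: defaultdict(float))

def update_after_episode(trajectory):
    """trajectory is a list of (state, action, reward) tuples collected during one episode.
    The whole episode is replayed at the end, so early actions cannot bias later updates."""
    T = len(trajectory)
    for t in range(T):
        state, action, _ = trajectory[t]
        # n-step return: discounted sum of the next N rewards...
        G = sum(GAMMA ** i * trajectory[t + i][2] for i in range(N) if t + i < T)
        # ...plus the bootstrapped value of the state reached after N steps (if the episode continues)
        if t + N < T:
            next_state = trajectory[t + N][0]
            G += GAMMA ** N * max(q_table[next_state].values(), default=0.0)
        q_table[state][action] += ALPHA * (G - q_table[state][action])
```

With n greater than 1, the reward for an action such as aim absorbs part of the damage dealt a few steps later, which is the effect described above.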
Q-Learning
The Monster Killer project applies reinforcement learning with the Q-learning algorithm. With Q-learning, the agent learns a policy that maximizes reward by choosing the best action at each step. We create a Q-table to store the expected reward of every action in every state, and a function updates the Q-table with the observed rewards during the learning process. The Q-function takes two inputs, state and action, and returns the expected reward for performing that action in the current state. The learning process contains the following steps:
Initialize the Q-table
Loop until learning stops:
Choose an action in the current state based on the current Q-table values
Perform the action and observe the outcome
Update the Q-table
When the agent starts learning, all values in the Q-table are zero and the best action is unknown, so we must balance exploration and exploitation properly. We use an epsilon-greedy policy: when the agent chooses an action, it generates a random number. If the number is greater than epsilon, it exploits the known information by choosing the action with the best expected reward; otherwise, it explores by choosing a random action.
We set epsilon to 0.5 when the agent begins learning. Starting at the 50th round, we decrease epsilon by 0.1 at the end of each episode until it reaches zero, and then continue learning for 50 more episodes before the agent stops. For the other, more specialized enemies we train against, including Vindicators, Witches, and Creepers, the epsilon schedule varies for better performance.
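A minimal sketch of this epsilon-greedy choice and decay schedule is shown below, assuming the default settings just described (epsilon starts at 0.5 and drops by 0.1 per episode after round 50); the function and variable names are illustrative.

```python
import random

EPSILON_START, DECAY_START_ROUND, DECAY_STEP = 0.5, 50, 0.1   # default schedule described above

def choose_action(q_table, state, actions, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore a random action
    return max(actions, key=lambda a: q_table[state][a])   # exploit the best-known action

def decayed_epsilon(round_number):
    """Keep epsilon at 0.5 until round 50, then decrease it by 0.1 per episode down to 0."""
    if round_number < DECAY_START_ROUND:
        return EPSILON_START
    return max(0.0, EPSILON_START - DECAY_STEP * (round_number - DECAY_START_ROUND))
```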
We update the Q-table once after each episode of actions. The updated Q-value for an action in a state is driven by the discounted cumulative reward observed from that state and action.
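One standard way to write this n-step Q-learning update, consistent with the description above (with α as the learning rate, γ as the discount factor, and n as the look-ahead depth; the exact form we used may differ slightly), is:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ \sum_{i=0}^{n-1} \gamma^{i}\, r_{t+i} + \gamma^{n} \max_{a'} Q(s_{t+n}, a') - Q(s_t, a_t) \Big]
$$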
State Definition
We define states with three factors:
Absolute straight-line distance between the agent and the enemy. We discretize the distance into 6 close-range values (0, 1, 2, 3, 4, 5) and 4 long-range values (5-10, 10-15, 15-20, 20+).
Life points left for the agent. The agent may act differently depending on its health status. We discretize life points into 4 string-valued states: “Full” for 15-20, “High” for 10-15, “Medium” for 5-10, and “Low” for 5 or below.
A binary variable for whether the agent is facing the enemy. We expect the agent to learn to turn toward the enemy when it is not facing it.
In total, there are 10 × 4 × 2 = 80 possible states.
Having too many states that are treated the same in practice (for instance, the ranges <5, 5-10, 10-15, 15-20, and >20 are effectively the same when deciding how to attack) would mislead the agent, since each such state keeps its own Q-values for every action. To reduce this misdirection from unnecessary states, we cut the number of states down. Because we provide close-combat weapons, the wooden and diamond swords, and want the agent to figure out the range of its valid attacks, the close-range data has higher variability and therefore needs more states. We define the distance, the first element of the state tuple, as 0, 1, 2, 3, 4, or 5 for close range, and label the long ranges 10, 15, 20, and 25. To simplify the state further, we categorize health points as “Full” (15-20), “High” (10-15), “Medium” (5-10), and “Low” (5 or lower), since we want the agent to learn how to balance damage taken and damage dealt based on its current health. The last element is a binary variable: 0 if the agent is not facing the mob and 1 if it is. We include it in the state expecting the agent to learn to turn toward the enemy when it is not facing it.
In conclusion, there are 80 different states in total. For instance, (0, “Full”, 0) means the straight-line distance between the agent and the enemy is 0, the agent has full health, and the agent is not facing the enemy; (20, “Low”, 1) means the distance is between 15 and 20, the agent has less than a quarter of its health left, and the agent is facing the enemy so its attacks can land. With 10 values for the distance variable, 4 values for health points, and 2 values for the facing flag, there are 80 possible states.
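The sketch below shows one possible way to encode this state tuple in Python; the function name and exact thresholds mirror the description above but are assumptions about the implementation.

```python
def encode_state(distance, health, facing_enemy):
    """Map raw observations to the (distance bin, health category, facing flag) state tuple."""
    # Close range keeps integer resolution (0-5); longer ranges are grouped in bins of 5,
    # labeled by their upper bound (10, 15, 20), with 25 standing in for "20+".
    if distance <= 5:
        dist_bin = int(round(distance))
    elif distance <= 10:
        dist_bin = 10
    elif distance <= 15:
        dist_bin = 15
    elif distance <= 20:
        dist_bin = 20
    else:
        dist_bin = 25

    # Health categories as described above (full health in Minecraft is 20 points).
    if health > 15:
        hp = "Full"
    elif health > 10:
        hp = "High"
    elif health > 5:
        hp = "Medium"
    else:
        hp = "Low"

    return (dist_bin, hp, 1 if facing_enemy else 0)
```

For example, encode_state(17.3, 4, True) returns (20, "Low", 1), matching the second example above.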
Action Definition
We provide the agent with a bow and two kinds of swords. With the bow, the agent can shoot arrows using two different draw times, 0.4 or 0.7 seconds, which give different damage and shooting range. The agent can attack in front of it with the wooden sword or the diamond sword. The agent can also move forward or backward by one distance unit. If the agent is not facing the enemy, it has one more action: aim at the enemy. In total there are 7 actions per state: two bow shots (shoot_0.4 and shoot_0.7), two sword attacks (wood and diamond), and three movement-related actions (move_forward, move_backward, and aim).
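For illustration, each abstract action can be dispatched to Malmo commands roughly as sketched below. The hotbar slot numbers and the specific command sequences are assumptions (they depend on the inventory layout and on which command handlers the mission XML enables), not necessarily our exact implementation.

```python
import time

def perform_action(agent_host, action):
    """Send Malmo commands for one abstract action from the 7-action set."""
    if action in ("shoot_0.4", "shoot_0.7"):
        # Select the bow (slot assumed), draw it for the chosen time, then release.
        agent_host.sendCommand("hotbar.3 1"); agent_host.sendCommand("hotbar.3 0")
        agent_host.sendCommand("use 1")
        time.sleep(0.4 if action == "shoot_0.4" else 0.7)
        agent_host.sendCommand("use 0")
    elif action in ("wood", "diamond"):
        # Select the wooden or diamond sword (slots assumed) and strike once.
        slot = "hotbar.1" if action == "wood" else "hotbar.2"
        agent_host.sendCommand(slot + " 1"); agent_host.sendCommand(slot + " 0")
        agent_host.sendCommand("attack 1"); agent_host.sendCommand("attack 0")
    elif action == "move_forward":
        agent_host.sendCommand("move 1")     # one distance unit toward the enemy
    elif action == "move_backward":
        agent_host.sendCommand("move -1")    # one distance unit away from the enemy
    elif action == "aim":
        pass  # turn to face the enemy; depends on the observed position of the mob
```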
Reward Setting
We previously made a complex calculation involving weapon rarity and weapon power to evaluate the cost and reward of each of six weapons, especially the five swords made of different materials. However, to reduce the vagueness the rewards introduced for the agent, simplify the learning process, and make it easier to improve learning performance, we simplified the weapon reward settings.
We define the reward of each action with the following aspects:
If the agent chooses the forward, backward, or aim movement, it receives a fixed reward that depends on the type of enemy it is fighting. For example, when facing a zombie, it receives +0.3 for moving forward and -0.3 for moving backward, because we want to encourage the agent to beat the zombie quickly instead of hiding inefficiently.
Damage dealt to the enemy. This is calculated as the enemy's health deduction times a constant, because we want to magnify the positive reward for attacking the enemy successfully. If the enemy dies from its own explosion or from fall damage, that also counts as damage dealt by the agent. Let the damage reward be Y = cx, where c is a constant that may vary between enemies and x is the enemy's health deduction. For instance, if the agent attacks with the diamond sword and reduces the zombie's health by 7 points, the reward for that action is 3 x 7 = +21.
Life point loss of the agent. If the agent loses life points from the enemy's attack, the reward depends heavily on the agent's current life status in the state and is calculated according to the table below.
Health Points Left in State    Reward H (x = life points lost)
Full                           H = -3x
High                           H = -4x
Medium                         H = -5x
Low                            H = -6x
We would like the agent to balance whether it should take risks to attack the enemy. If the agent's health is categorized as High or Full, it is acceptable to take some risk by moving forward to attack; so if the agent takes damage from the enemy but lands its own attack, the reward is not necessarily negative. However, if the agent's current HP is Low or Medium, losing more health is very expensive, because the agent should also learn how to escape from the enemy's attacks.
Time cost. The time-cost reward is a linear function, since taking more time should cost more. We define T = -(0.5 + 0.1z), where z is the number of actions already taken in the episode, to punish longer action sequences.
The reward for each action is computed after the agent performs it, by combining the components above.
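A minimal sketch of this per-action reward calculation is shown below, assuming the ±0.3 movement rewards and the damage constant c = 3 used in the examples above; the reward of 0 for aim and the function name are illustrative assumptions.

```python
MOVE_REWARD = {"move_forward": 0.3, "move_backward": -0.3, "aim": 0.0}  # example values; aim reward assumed
HP_PENALTY = {"Full": 3, "High": 4, "Medium": 5, "Low": 6}              # multipliers from the table above
DAMAGE_SCALE = 3                                                         # the constant c in Y = cx

def compute_reward(action, damage_dealt, life_lost, hp_category, actions_so_far):
    """Combine the movement, damage-dealt, damage-taken, and time-cost components."""
    reward = MOVE_REWARD.get(action, 0.0)                 # fixed movement/aim reward
    reward += DAMAGE_SCALE * damage_dealt                 # Y = c * x, reward for hurting the enemy
    reward -= HP_PENALTY[hp_category] * life_lost         # H = -3x .. -6x depending on current health
    reward -= 0.5 + 0.1 * actions_so_far                  # T = -(0.5 + 0.1z), action-count penalty
    return reward
```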
Evaluation
We evaluate our project in four aspects: number of actions, attack accuracy (the number of attacks that hit the enemy and earn a positive reward, divided by the total number of attacks), average reward per action, and life points remaining at the end of each round.
We generally expect the number of actions to decrease, though the exact standard differs by enemy. The attack accuracy is expected to increase, ideally reaching 100%. The life remaining is also expected to increase and become stable after enough rounds of learning; ideally the agent finishes at full health, which is 20 for most fights, although again the standard differs by enemy. The last metric is the average reward per action, which should increase with the number of training rounds; the exact value depends on the enemy's health and other special features. We also expect all these numbers to stabilize with more rounds of learning, once epsilon has decayed to 0.
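As one way to quantify these trends, the sketch below computes attack accuracy for a round and the Pearson correlation between a per-round metric and the round index (the correlations quoted later, such as -0.37 between number of actions and round for the zombie, could be computed this way); the use of numpy is an assumption about our tooling.

```python
import numpy as np

def attack_accuracy(attack_rewards):
    """Fraction of attacks in one round that hit the enemy, i.e. earned a positive reward."""
    if not attack_rewards:
        return 0.0
    return sum(r > 0 for r in attack_rewards) / len(attack_rewards)

def trend_correlation(metric_per_round):
    """Pearson correlation between a per-round metric and the round index."""
    rounds = np.arange(len(metric_per_round))
    return float(np.corrcoef(rounds, metric_per_round)[0, 1])
```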
We evaluate the performance of the AI with several kinds of enemies: zombie, witch, vindicator, and creeper.
Killing a Zombie is the baseline for this project, since zombies have no special features. In this case the agent is placed 5 blocks away from the zombie, without any armor or weapon enhancements. We set the reward for forward movement to 0.5, to encourage the agent to try close-combat weapons and make its own decision on the best weapon, and the reward for backward movement to -0.5, because we do not want the agent to hide from zombies. For health point loss, we give a heavy punishment, since the agent should learn how to beat the zombie without any HP loss.
To increase the variability of the data, we set epsilon to 0.5 for the first 30 rounds of learning and decreased it by 0.02 per round afterward; in other words, after 55 rounds the data should be relatively stable. We trained the agent for 100 rounds in total. As the graphs suggest, the agent made clear progress through learning. The number of actions trends downward: the Pearson correlation between the number of actions and the round number is -0.37, showing that the two are negatively related and that the agent takes fewer actions as it learns. The attack accuracy shows a clear increasing trend, meaning the agent makes fewer invalid actions and earns more rewards. The scatter of life remaining shows high variability in the first 40 rounds, when epsilon is still high and the agent does not have a high probability of survival. After 40 rounds, however, the points concentrate at full health, demonstrating that the agent can kill the zombie without any HP loss. The average reward per action also shows an increasing pattern; after 60 rounds the data is relatively stable and dense in the range of 3-5, and it is strongly and positively correlated with the round number, with a correlation of 0.71. Overall, we believe this is a successful base case for our project: the agent can kill the zombie in fewer than 9 actions with no HP loss and high attack accuracy.
Here are the evaluation graphs for killing the Zombie:
After training the agent to kill zombies successfully, we turned to enemies with more special features. Here we train the agent to attack a witch. This enemy can make ranged attacks, deal damage over time (DOT) to the agent, and recover its own health. In this case, we place the witch 12 blocks away from the agent, give the agent diamond armor, and use no weapon enhancement. We set the reward for forward movement to 0.3 and for backward movement to -0.3. We set epsilon to 0.5 for the first 60 rounds of learning and decreased it by 0.025 per round after 60 rounds. We trained the agent for 120 rounds in total.
These evaluations show that the agent again makes good progress. From the change in the number of actions, we can see that after epsilon reaches 0, the points cluster in the right corner of the plot, in the range of 10 to 20. For witches, we consider 10-20 actions reasonable and efficient, because witches can recover their health and our weapons deal between 4 and 9 damage. For attack accuracy, the variability is high in the first 70 rounds and the points are concentrated in the low range of 0-0.3, while in rounds 80 to 140 the accuracy is concentrated in the higher range of 0.6-0.9. For the health remaining, after 60 rounds the density is high at full health, meaning the agent can beat the witch with high probability and no HP loss. The reward per action also increases with the number of rounds, with a correlation of 0.38 between them, indicating a positive relationship. In addition, the scatter shows that after 100 rounds the variability is relatively small: only a few points remain in the lower range, while most are distributed in the higher range of 5-10. Based on these four aspects, we are confident that our agent can beat a witch with no HP loss, high attack accuracy, and a reasonable number of actions.
Here are the evaluation graphs for killing the Witch:
Our next target is the vindicator, a monster with high damage and high movement speed. In this case, we place the vindicator 5 blocks away along the x-axis and 2 blocks away along the z-axis from the agent, give the agent full armor, and use no enhancement. We set the rewards for forward and backward movement both to 1. We set epsilon to 0.45 for the first 150 rounds of learning and decreased it by 0.01 per round after 150 rounds. We trained the agent for 171 rounds in total.
The vindicator is quite strong and moves very fast, and in our base setup we were not able to kill it, so we made some improvements to our reward system and adjusted the movement speed of our agent. We believe the result is acceptable based on our evaluation. The number of actions shows an overall decreasing trend, with a Pearson correlation of -0.469; apart from outliers, the data settles into the range of 20 to 30. The accuracy shows some increase, though less clearly. Its variance stays fairly consistent, but the scatter plot shows that after 100 rounds the accuracy is always higher than 0.35, meaning the agent makes fewer mistakes while attacking after enough rounds of learning. We hope to make this more stable in the future, but given that the vindicator moves very fast and its movements vary unpredictably between rounds, some variability is acceptable. For the life remaining, the agent has a hard time beating the vindicator in the first 75 rounds and its survival rate is very low; after epsilon reaches 0 (around rounds 150-175), the probability of survival increases substantially. The average reward per action is also positively related to the number of rounds: the data lies in the range of -1 to 3 after 130 rounds of learning, compared with the lower range of -4 to 0 in the first 30 rounds. In conclusion, we believe the agent makes progress while learning. It can defeat the vindicator with relatively low HP cost and a reasonable number of actions. Although we wish it could do better against our efficiency goal, the result is still acceptable.
Here are the evaluation graphs for killing the Vindicator:
Finally, we train the agent to attack creepers. The creeper is a special monster because it does not attack directly; instead it explodes when it gets close to the agent, dealing high damage. In this case, we place the creeper 5 blocks away along the x-axis and 3 blocks away along the z-axis from the agent, with no armor and no enhancement. We set the reward for forward movement to -1 and for backward movement to 2. We set epsilon to 0.45 for the first 50 rounds of learning and decreased it by 0.05 per round after 50 rounds. We trained the agent for 75 rounds in total.
The creeper poses a special challenge: it attacks only once per round, and the explosion kills both the creeper and the agent if they are close to each other. This makes the damage-taken calculation especially hard, since with n = 1 it affects only a single action. After adjusting our algorithm, the agent is able to kill the creeper consistently and avoid taking much damage. According to the graphs, the number of actions does not change much in most cases. The reason is that the creeper's lifetime is nearly constant in most rounds, because the explosion timer is fixed: the agent either dies with the creeper within that fixed time or kills it before the explosion. This also shows in the Number of Actions graph versus the round number, where the variability of the data is almost constant. Looking more closely, however, after 50 rounds of training the number of actions no longer exceeds 10, which means the agent has learned to kill the creeper before the explosion. The life remaining graph is consistent with this: after 50 rounds the survival rate is quite stable, with more than 12.5 health points left. The agent learns to move backward when the distance is too small and keeps its HP loss under 8 points whether or not it kills the creeper. In the attack accuracy chart, we again see an overall increasing trend, with the agent making every attack count; the Pearson correlation between accuracy and round number is 0.44. In the reward per action graph, several points drop below -5, mainly in the worst case where the creeper kills the agent at close range. We increased the punishment for death, since staying alive and escaping attacks is especially important when fighting a mob like the creeper. With more rounds of training, the agent rarely dies or takes heavy damage, and the average reward rarely drops below 0, indicating that the agent's actions earn positive rewards.
Resources Used
Microsoft provides documentation for the Malmo classes and XML schemas. The class documentation describes the library functions of the Malmo project, and the XML schema documentation describes the world and agent settings available in the mission specification XML. Reading the Malmo source code is also valuable; it gives detailed information about library functions and mission specifications.
The Minecraft Wiki was very helpful for our research on different weapons. It provides data on durability, attack damage, and lifetime damage inflicted, which we used to define the reward system more smoothly. The rarity data we collected manually using MCEdit.
For the AI/ML resources, we used assignment2.py as a reference, since we are also applying a Q-learning algorithm.