⛏ MineRL Competition [Part 2]: A2C Navigation

In this second post, we switch from a value-based method (DQN [1]) to a basic policy gradient method (A2C [2]). We then move on to the NavigateDense task, this time in a non-flat world and with a larger, richer action space. A policy gradient method lets us exploit the action space fully, including the continuous camera action, and combine sub-actions in more complex ways. Again, this post is not intended to explain how the algorithms work; if you want to know more about A2C and other policy gradient methods, I think this is a pretty good and clear post.

In this post we use a slightly modified NavigateDense environment, in which we again make the goal position deterministic. This choice was motivated by the fact that the agent we designed was not able to reach a randomly placed goal consistently. The same limitation is visible in the baselines provided by Preferred Networks, where PPO (another policy gradient method) reaches a suboptimal mean score. Note that the maximum score possible in this environment is around 164 (100 for reaching the goal, 64 from the dense reward). Our A2C agent reaches a score of around 60 (PPO reaches 80).
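The exact modification is not shown here, but a simple way to approximate a deterministic goal is to fix the environment seed before resetting. The snippet below is only a sketch under that assumption: the environment id follows the standard MineRL naming, and whether seeding fully fixes the world and goal placement depends on the environment version.

```python
# Sketch only: fixing the seed so that world generation (and ideally the goal
# placement) is reproducible across resets. This approximates, but is not
# necessarily identical to, the modified environment used in this post.
import gym
import minerl  # noqa: F401  (importing minerl registers its environments with gym)

env = gym.make("MineRLNavigateDense-v0")
env.seed(42)       # standard gym seeding call; its effect depends on the env version
obs = env.reset()  # the observation dict contains the POV image and the compass angle
```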

In my opinion, a simple policy like the one we use lacks the temporal dependence needed to solve the task with a randomly placed goal: once the agent reaches the position pointed to by the compass, it needs to start exploring the surroundings (8 blocks away). This requires a sort of “memory” of which parts of the space have already been explored, instead of exploring completely at random (which sometimes leads to the goal anyway). To be more precise, if we consider the state as is (so, without the frame stacking trick), the process is non-Markovian, i.e. the future state depends not only on the current state but also on past states.
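For reference, the Markov property requires the next state to depend only on the current state and action:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$

Treating a single frame plus the compass angle as the state breaks this equality, because which parts of the surroundings have already been explored can only be inferred from past observations.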

This temporal dependence could be tackled, or at least alleviated, with a recurrent model or with frame stacking (which I did not use). In this post we do not analyze this aspect; instead, we want to:

  • Integrate the camera (POV) input into the policy.
  • Integrate additional actions: jump and sprint.

Policy description

Since we are using a policy gradient method, instead of estimating a value function as in DQN (i.e. the expected future return of each action), we directly model a probability distribution for each sub-action. The input from which we build these distributions is still the state; what changes is the output of the network.

In this particular setting, we also consider the agent's POV view, in addition to the previously used compass observation. To do this, we introduce a CNN inside the policy to process the visual observation (we use the RGB frames directly, without rescaling).

The output of the network, on the other hand, is quite different (a sketch of a possible network follows the list below):

  • Discrete actions are modelled as categorical (multinomial) distributions.
  • Continuous actions are modelled as beta distributions (more complex than a Gaussian, but they account for the bounded support of the action).
  • Given that A2C is an actor-critic method, we also need to estimate the value function of the current state.
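A minimal PyTorch sketch of a policy with this structure is shown below. The layer sizes, the number of discrete sub-actions and the exact Beta parametrisation are illustrative assumptions, not the architecture actually used in this post.

```python
# Illustrative sketch of the policy/value network described above (PyTorch).
# Layer sizes, the number of discrete sub-actions and the Beta parametrisation
# of the camera are assumptions for the example, not the exact architecture used.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Beta


class A2CPolicy(nn.Module):
    def __init__(self, n_discrete_actions=4):
        super().__init__()
        # CNN for the 64x64 RGB POV observation (channels-first).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Shared trunk: CNN features concatenated with the compass angle.
        self.fc = nn.Sequential(nn.Linear(64 * 4 * 4 + 1, 256), nn.ReLU())

        # One categorical head per binary sub-action (forward, jump, sprint, ...).
        self.discrete_heads = nn.ModuleList(
            [nn.Linear(256, 2) for _ in range(n_discrete_actions)]
        )
        # Beta parameters (alpha, beta) for the two camera components (pitch, yaw).
        self.camera_head = nn.Linear(256, 4)
        # Critic: scalar state-value estimate.
        self.value_head = nn.Linear(256, 1)

    def forward(self, pov, compass):
        # pov: (B, 3, 64, 64) float in [0, 1]; compass: (B, 1) normalised angle.
        h = self.fc(torch.cat([self.cnn(pov), compass], dim=1))

        discrete_dists = [Categorical(logits=head(h)) for head in self.discrete_heads]
        # softplus + 1 keeps alpha, beta > 1, giving a unimodal Beta on (0, 1);
        # the sample is later rescaled to the camera range (e.g. degrees).
        ab = F.softplus(self.camera_head(h)) + 1.0
        camera_dist = Beta(ab[:, :2], ab[:, 2:])

        value = self.value_head(h).squeeze(-1)
        return discrete_dists, camera_dist, value
```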

Given the unconventional and composite action space of this competition, we need to address one last issue: how do we compute the probability of a composite action from the probabilities of its sub-actions? I did not find any interesting work on this specific setting, which I think would be an interesting research question. For this solution, we assume independence between sub-actions (which is not true, but simplifies things): this allows us to compute the total probability simply as the product of the sub-action probabilities.
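Under this independence assumption, the log-probability of a composite action is simply the sum of the sub-action log-probabilities, and this joint log-probability is what enters the policy-gradient part of the A2C loss. The sketch below assumes the network interface from the previous snippet; the coefficients and the way returns are computed are illustrative.

```python
# Sketch of how the independence assumption is used in practice:
# log pi(a | s) = sum_i log pi_i(a_i | s), so the joint log-probability entering
# the A2C loss is the sum of the sub-action log-probabilities.
# (Coefficients and the advantage estimation are illustrative assumptions.)
import torch


def a2c_loss(discrete_dists, camera_dist, discrete_actions, camera_action,
             value, returns, entropy_coef=0.01, value_coef=0.5):
    # Joint log-probability = sum of independent sub-action log-probabilities.
    log_prob = sum(d.log_prob(a) for d, a in zip(discrete_dists, discrete_actions))
    log_prob = log_prob + camera_dist.log_prob(camera_action).sum(dim=-1)

    advantage = returns - value                       # how much better than predicted
    policy_loss = -(log_prob * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()              # critic regression term
    entropy = (sum(d.entropy() for d in discrete_dists).mean()
               + camera_dist.entropy().sum(dim=-1).mean())  # exploration bonus

    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

When acting in the environment, the sampled Beta values in (0, 1) still have to be rescaled to the camera's degree range and packed, together with the discrete samples, into the MineRL action dictionary.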

Results

In the next plots, we present the results of A2C applied to this task (remember, the goal is deterministic). As we can see, the agent learns to consistently reach the goal in a complex world, using combinations of actions. We saved the policy at the point of highest performance reached during training.

[Figure: training plots, performance during training]

[Figure: test performance, replay of an episode]

Conclusions

In this simplified setting, the agent is able to reach the goal in the majority of runs. Inspecting the trajectories generated by this policy, we can see that actions are chosen in a very reactive fashion, caused by the absence of temporal dependence in the policy itself. This also makes some of the action sequences very repetitive (sometimes the agent gets stuck).

As we discussed earlier, to solve the non-deterministic task, the agent needs to have a form of memory, which will be analyzed later on.

The next post will cover a technique more relevant to the competition: behavioral cloning.

Bibliography

  1. Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529. 

  2. Mnih, Volodymyr, et al. “Asynchronous methods for deep reinforcement learning.” International conference on machine learning (ICML). 2016. 

Written on June 18, 2019