⛏ MineRL Competition [Part 1]: Domain Analysis and First Simple Solution

MineRL is a recently launched NeurIPS 2019 competition whose goal is to develop sample-efficient DRL algorithms for a very complex and hierarchical problem. I have always had a passion for Minecraft, since it is a quite complete game which (in humans) stimulates strategy, exploration and building creativity. MineRL was built on top of Malmo, which hosted other competitions in the past (e.g. this one). This time, I also want to thank the organizers for the excellent job of building a simple, ready-to-use package (Malmo, instead, was quite complex to set up and use).

In recent years we have seen a lot of impressive DRL achievements (to cite the well-known ones, AlphaStar and OpenAI Five), but most of them share the common issue of sample complexity: to reach a satisfactory result, a huge amount of experience needs to be gathered. This competition targets exactly this issue: can we design sample-efficient algorithms that learn to act in complex environments with the aid of a dataset of human demonstrations? For this reason, I think the main topics to be covered in this competition are:

  • Inverse RL to exploit the dataset
  • Transfer between tasks (each “higher-level” task contains part of the lower ones)
  • Hierarchical controllers: choosing sequences of actions (options)
  • Model-based RL: using the demonstrations to learn a model of the environment

These are the first things that come to my mind, but we will see what happens next! In this post, we approach the problem in one of its simpler configurations, first analyzing the corresponding dataset and then applying a simple DQN [1] to the environment.

Domain Analysis

In this part, we will only use the MineRLNavigateDense-v0 environment, which is the easiest one. It requires navigating to a given block (a diamond), placed on the surface at a distance of 64 blocks. It is dense, i.e. the reward is shaped to point exactly towards the block. We will exploit this property by using low values of gamma, since an almost greedy controller works well in this dense-reward setting. The dataset follows the same task division, so we will focus on the corresponding subset.

When we deal with offline datasets, it is always better to explore them a little, both to understand the problem better and to cross-check that everything works fine. The basic operation on the dataset is to call data.sarsd_iter(), which creates a generator over sequences belonging to the same episode, returning observations, actions, rewards and termination flags. An example of the images returned by the dataset is shown below. The observation also contains a compass angle, pointing to a position in the proximity of the goal. In the following plot, we also show the reward per timestep within the same episode, whose value is slightly lower than 1.0 (the reward for moving exactly one block towards the goal).

Sequence of observations Sequence of compass and reward

We can notice how the compass gets more “unstable” near the end, since the angle changes more rapidly as the distance decreases. We can also notice the final spike in the return, which is the +100 reward obtained by reaching the goal block. If you want to read more about the environment or the dataset, see the MineRL documentation.
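Concretely, loading and iterating the demonstrations looks roughly like the sketch below (based on the 2019 `minerl` package; the download step, data directory handling and observation key names may differ between package versions):

```python
import minerl

# Assumes the demonstrations were already downloaded into "data/", e.g. with
# minerl.data.download(...) (the exact call depends on the package version).
data = minerl.data.make("MineRLNavigateDense-v0", data_dir="data")

# sarsd_iter yields sequences from one episode at a time as
# (state, action, reward, next_state, done) batches.
for obs, action, reward, next_obs, done in data.sarsd_iter():
    # In the 2019 release the observation contains the "pov" image
    # and a compass angle (key names may vary between versions).
    print(obs["pov"].shape, reward[:5], done[:5])
    break
```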

Since we are interested in applying DQN, which is constrained to discrete actions, we need to look at the action space in detail. In MineRL, actions are described as a dictionary of sub-actions, each controllable independently. Together, these sub-actions form the macro action performed by the agent at each timestep. Among them, only one is continuous: the camera movement, expressed as a tuple of pitch and yaw deltas (see the picture below), ranging from -180 to +180 degrees per tick.

Camera actions
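To make this concrete, a single macro action is a dictionary along these lines (the sub-action names follow the Navigate action space, but the exact set and encodings depend on the environment version; the values below are purely illustrative):

```python
# One macro action: every sub-action is specified at every timestep.
# Binary sub-actions are 0/1; the camera is a continuous (pitch, yaw) pair
# in degrees per tick.
action = {
    "forward": 1,
    "back": 0,
    "left": 0,
    "right": 0,
    "jump": 1,
    "sneak": 0,
    "sprint": 1,
    "attack": 0,
    "camera": [0.0, 5.0],   # [pitch, yaw]
    "place": "none",        # enum sub-action available in the Navigate tasks
}
```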

Since our simple approach uses DQN, we need to understand how to discretize the continuous camera components. Instead of relying on expert guesses, we can look directly at the data we have.

Action distribution Camera distribution, notice the logarithmic scale.

Given this distribution, we chose to approximate the camera actions in each axis using

as possible values. For all the other sub-actions, the values are either 0 or 1 (meaning pressed or not), so we can plot their means to understand how often they are used.

Action means

We can notice that forward, sprint and jump are the most used actions in this task, which is reasonable given that this is a purely exploratory task: the player only needs to move around and find the goal block.

One thing that could be useful in the future is the correlation between actions: given the complex combinations of sub-actions, it could be interesting to aggregate them using a measure of correlation. Just to give a first look, see the heatmap below. We can again notice a high correlation between forward, sprint and jump. This also means that, in further approaches, we will need to account for a conditional dependence between actions.

Action heatmap
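Both the means above and this correlation heatmap can be computed with a few lines of pandas once the binary sub-actions have been gathered from the dataset. Here is a sketch, where `collected` is a hypothetical dict (filled while iterating the dataset) mapping each sub-action name to its 0/1 values over the transitions:

```python
import pandas as pd
import matplotlib.pyplot as plt

binary_keys = ["forward", "back", "left", "right",
               "jump", "sneak", "sprint", "attack"]

# `collected` is assumed to be a dict mapping each binary sub-action name
# to a 1-D array of 0/1 values gathered over the dataset transitions.
df = pd.DataFrame({k: collected[k] for k in binary_keys})

print(df.mean())    # fraction of timesteps in which each sub-action is pressed
corr = df.corr()    # Pearson correlation between sub-actions

plt.imshow(corr, cmap="viridis")
plt.xticks(range(len(binary_keys)), binary_keys, rotation=90)
plt.yticks(range(len(binary_keys)), binary_keys)
plt.colorbar()
plt.show()
```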

Super Trivial Solution

In this post, the goal is obviously not to directly tackle the main task of the competition, i.e. sample efficiency through the usage of an offline dataset, but rather to study basic components we might need in the future. Here, for instance, we study a DQN solution for the navigation task. This can give us insights into:

  • Using observations from the complex observation space
  • Taking complex and composable actions
  • Querying the dataset for transitions
  • Network architectures and relative pre-training
  • Getting an idea of the complexity of the task without using the dataset

In the navigation task, we want to reach a goal position (a diamond block on the ground) which is 64 blocks away from the player’s starting position. The agent can observe the angle of a compass pointing to a block near the goal, randomly chosen within a radius of 8 blocks. It is therefore essential to use the compass to reach this spot, while reaching the goal block itself requires using the camera observation. To study some of the issues above, we simplified the navigation task even further, creating an environment which tries to isolate specific features. In this post, we impose:

  • Flat world: the world is all on the same level, so we can avoid using some actions (i.e. jump and camera pitch).
  • Deterministic goal: the compass points exactly to the goal position. This allows us to use only the compass angle as the state representation, avoiding a complex vision component for now.
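A rough sketch of how the DQN's discrete action index can be mapped back to a full MineRL macro action in this simplified setting is given below (the yaw values are illustrative, not the exact discretization used in the experiments):

```python
import itertools

# Illustrative discretization: a few yaw deltas combined with forward on/off.
YAW_VALUES = [-10.0, 0.0, 10.0]     # degrees per tick (assumed values)
DISCRETE_ACTIONS = list(itertools.product(YAW_VALUES, [0, 1]))

def to_minerl_action(index, env):
    """Convert a discrete action index into a full MineRL action dictionary."""
    yaw, forward = DISCRETE_ACTIONS[index]
    action = env.action_space.noop()   # every sub-action set to "do nothing"
    action["camera"] = [0.0, yaw]      # flat world: pitch is kept at 0
    action["forward"] = forward
    return action
```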

Results

In the following figures, we can see the training process (a single run) and the resulting policy. In this case, we only considered 2 actions: camera_yaw and forward. All the code will be available on GitHub shortly.

Training plots Performance during training

Test performances Replay of the episode with 2D map
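For reference, the Q-network involved in this simplified setting can be tiny, since the state is essentially the compass angle. Below is a sketch in PyTorch; the architecture, layer sizes and number of discrete actions are illustrative, not the exact ones used in these runs:

```python
import torch
import torch.nn as nn

class CompassQNet(nn.Module):
    """Tiny Q-network mapping the compass angle to one Q-value per discrete action."""

    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, compass_angle):
        # compass_angle: tensor of shape (batch, 1), e.g. normalized to [-1, 1]
        return self.net(compass_angle)

# Example: Q-values for a batch of one observation, with 6 discrete actions.
q_values = CompassQNet(n_actions=6)(torch.zeros(1, 1))
```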

Conclusion

While DQN works in this simple setting with just 2 discretized actions, several tests showed that it was impossible to directly map this approach to the full action setting, probably because the way actions are composed in DQN does not scale well with the dimensionality of the action space (possible idea: pointer networks?). In fact, all the recent baselines being released by the organizers rely on discretizations and serializations of the action space (i.e. actions are taken one at a time, and some are fixed). Instead, I want to try a different path: modelling all these actions together. In the next post, I will present a solution for the (almost) standard NavigateDense task using A2C (a policy gradient method).

Bibliography

  1. Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529. 

Written on June 18, 2019