Summary
Game-theoretic population learning enables strategic exploration in games, with convergence guarantees to solution concepts such as Nash Equilibrium (NE). However, applying such methods to real-world games that require approximate best-response solvers (such as deep RL) does not scale, due to the difficulty of iteratively training best-response agents.
We propose Neural Population Learning (NeuPL), a general and efficient population learning framework that learns and represents diverse policies in symmetric zero-sum games within a single conditional network via “self-play”.
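The idea of a single conditional network standing in for a whole population can be pictured with a toy example. The sketch below is not the NeuPL architecture: it assumes a linear softmax policy, a single observation vector instead of an observation history, and a conditioning vector `sigma` that is a mixture over 8 population members. All names and dimensions are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_policy(theta, obs, sigma):
    """Action distribution from ONE shared weight matrix, conditioned on sigma.

    theta: list of rows, one per action, each of length len(obs) + len(sigma).
    obs:   observation features (a stand-in for the observation history).
    sigma: conditioning vector selecting which population member to act as.
    """
    features = list(obs) + list(sigma)
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(logits)


rng = random.Random(0)
num_actions, obs_dim, population_size = 3, 4, 8

# Shared parameters: the same theta is used for every member of the population.
theta = [
    [rng.gauss(0.0, 1.0) for _ in range(obs_dim + population_size)]
    for _ in range(num_actions)
]

obs = [0.5, -1.0, 0.0, 2.0]
sigma_a = [1.0] + [0.0] * 7       # one hypothetical conditioning vector
sigma_b = [0.5, 0.5] + [0.0] * 6  # a different one

probs_a = conditional_policy(theta, obs, sigma_a)
probs_b = conditional_policy(theta, obs, sigma_b)
```

With the same parameters and the same observation, changing only `sigma` changes the action distribution, which is the sense in which one network can represent several distinct policies.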
Demo
Running-with-Scissors
In this section we visualize gameplay between 8 distinct policies, represented and executed by a single conditional network, in a game of running-with-scissors^{1}.
To do well in this game, players must learn to infer opponent behaviour from their limited field of view (a 4x4 square in front of the player). For example, if rocks are missing from their usual location, then the opponent must have picked up rocks!
Training Progression through Time
We show the training progression of a neural population of policies through time, starting from a fixed initial policy that always picks up all the rocks. Starting from this initial policy, a single conditional network \(\Pi_\theta(\cdot \mid o_{<}, \sigma)\) discovered and represented a set of 8 distinct policies, each best-responding to combinations of the others.
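One way to picture "best-responding to combinations of others" is as a table of opponent mixtures, one row per policy. The sketch below is an illustration under stated assumptions, not the procedure used to produce the results here: it assumes each policy trains against a uniform mixture over all earlier policies, and the helper names `uniform_interaction_graph` and `sample_opponent` are hypothetical.

```python
import random


def uniform_interaction_graph(n):
    """Build an n x n table where row i is a mixture over policies 0..i-1.

    Row 0 is all zeros: the initial policy is fixed and responds to nobody.
    Assumption: later policies weight all earlier policies equally.
    """
    graph = []
    for i in range(n):
        if i == 0:
            graph.append([0.0] * n)
        else:
            graph.append([1.0 / i if j < i else 0.0 for j in range(n)])
    return graph


def sample_opponent(graph, i, rng):
    """Sample the index of an opponent for policy i (requires i >= 1)."""
    return rng.choices(range(len(graph)), weights=graph[i], k=1)[0]


graph = uniform_interaction_graph(8)
rng = random.Random(0)
# Opponents that policy 3 would train against in this toy setup.
opponents = [sample_opponent(graph, 3, rng) for _ in range(100)]
```

Each row of `graph` could serve as the conditioning vector \(\sigma\) for the corresponding policy; other choices of mixture would give different rows without changing the overall shape of the table.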
Visualisation of Learned Policies
If you click on a cell in the payoff matrix, an example episode between the pair of policies would be shown. You can step through the episode and observe how the two players’ inventories change over time as well as their respective, partial view of the environment.
2-vs-2 MuJoCo Football
We investigate NeuPL in the physically simulated multi-agent environment of 2-vs-2 MuJoCo Football^{2}, using the 3 DoF BoxHead walkers.
Training Progression through Time
Over time, a sequence of best responses emerged, with policy \(\Pi_\theta(\cdot \mid o_{<}, \sigma_2)\) mastering rapid long-range shots and \(\Pi_\theta(\cdot \mid o_{<}, \sigma_3)\) developing coordinated team play. Check out the gameplay on the right-hand side (blue (3) vs red (2)) in action!
Example Games
We note that beyond policy (3), it becomes difficult to tell the policies apart, as they are all highly skilled and perform coordinated team play. This is not surprising: MuJoCo Football is a fully-observed environment that affords prominent transitive skill dimensions but comparatively muted strategic cycles. In this case, NeuPL automatically reduces to a learning regime similar to that of self-play, which is optimal in purely transitive games^{3}.
Match-Ups
blue (2) vs red (1)
blue (3) vs red (2)
blue (4) vs red (3)
blue (5) vs red (4)
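The contrast between transitive skill dimensions and strategic cycles mentioned above can be illustrated with two toy payoff matrices. This is a minimal sketch and not part of NeuPL: it assumes the formulation in the cited work^{3}, where a purely transitive game has payoffs of the form f(i) - f(j). The ratings and helper names are hypothetical.

```python
def transitive_payoff(ratings):
    """Antisymmetric payoff matrix with entry [i][j] = rating_i - rating_j."""
    return [[ri - rj for rj in ratings] for ri in ratings]


# Rock, paper, scissors (rows and columns in that order):
# +1 for a win, -1 for a loss, 0 for a draw. This game is purely cyclic.
RPS_PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def each_beats_all_earlier(payoff):
    """True if every policy i has positive payoff against every earlier j < i."""
    n = len(payoff)
    return all(payoff[i][j] > 0 for i in range(n) for j in range(i))


# Hypothetical, strictly increasing skill ratings for four policies.
skill_ladder = transitive_payoff([0.0, 1.0, 2.5, 4.0])

ladder_is_ordered = each_beats_all_earlier(skill_ladder)  # transitive case
rps_is_ordered = each_beats_all_earlier(RPS_PAYOFF)       # cyclic case
```

In the transitive matrix every policy beats all earlier ones, so training only against the latest policy loses nothing. In rock-paper-scissors the third strategy beats the second but loses to the first, which is the situation where responding to a mixture over the population matters.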
Citation
@inproceedings{
liu2022neupl,
title={Neu{PL}: Neural Population Learning},
author={Siqi Liu and Luke Marris and Daniel Hennes and Josh Merel and Nicolas Heess and Thore Graepel},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=MIX3fJkl_1}
}
References:

1. Vezhnevets, A., Wu, Y., Eckstein, M., Leblond, R., & Leibo, J. Z. (2020). OPtions as REsponses: Grounding behavioural hierarchies in multi-agent reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research 119:9733-9742. Available from https://proceedings.mlr.press/v119/vezhnevets20a.html.

2. Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., & Graepel, T. (2018, September). Emergent Coordination Through Competition. In International Conference on Learning Representations.

3. Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., & Graepel, T. (2019, May). Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning (pp. 434-443). PMLR.