Exploration and preference satisfaction trade-off
in reward-free learning
arXiv, 2021 | poster

Motivations: We interact meaningfully with our environment even in the absence of a reward signal. We do so by learning preferred modes of behaviour that lead to predictable states, e.g., repeatedly going to the same sushi place! In this paper, we pursue the notion that this learnt behaviour can be a consequence of reward-free preference learning that ensures an appropriate trade-off between exploration and preference satisfaction.

Pepper: To this end, we present Pepper, a preference-learning mechanism that accumulates preferences using conjugate priors, given a model-based Bayesian agent. These conjugate priors augment the planning objective (here, the expected free energy) to learn preferences over states (or outcomes) across time. Importantly, Pepper enables the agent to learn preferences that encourage adaptive behaviour at test time.
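As a minimal sketch of what conjugate-prior preference accumulation can look like (not the paper's exact implementation; the class and parameter names here are illustrative), a Dirichlet prior over discrete states is updated with pseudo-counts of visited states, and its expected log-probabilities can then serve as learnt preferences inside a planning objective:

```python
import numpy as np

class DirichletPreferences:
    """Illustrative conjugate-prior preference accumulator.

    A Dirichlet distribution over discrete states; visiting a state
    adds a pseudo-count, so frequently visited states accrue higher
    (more confident) preference values.
    """

    def __init__(self, n_states, alpha0=1.0):
        # alpha0: uniform initial concentration (uninformative prior)
        self.alpha = np.full(n_states, alpha0)

    def update(self, state):
        # Conjugate update: one pseudo-count per observed state visit
        self.alpha[state] += 1.0

    def log_preferences(self):
        # Log of the posterior mean; such a vector could augment a
        # planning objective (e.g., the expected free energy)
        return np.log(self.alpha / self.alpha.sum())

prefs = DirichletPreferences(n_states=4)
for _ in range(10):
    prefs.update(2)          # the agent repeatedly visits state 2
lp = prefs.log_preferences() # state 2 is now the preferred state
```

Repeated visits make the preference for state 2 both stronger and more precise, which is the mechanism behind the confident preferences observed in non-volatile environments.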

Numerical analysis: We illustrate this in the OpenAI Gym FrozenLake and the 3D mini-world environments, with and without volatility.

In a non-volatile environment, Pepper agents learn confident (i.e., precise) preferences and act to satisfy them:

In a volatile setting, perpetual preference uncertainty maintains exploratory behaviour:

Our experiments suggest that learnable (reward-free) preferences entail a trade-off between exploration and preference satisfaction.

Exploration and preference satisfaction trade-off

Pepper agents Bayes-optimally trade off exploration and preference satisfaction. We measured this using the Hausdorff distance (see figure below): a high Hausdorff distance denotes increased exploration, while a low distance entails prior preference satisfaction. We see a U-shaped association between volatility in the environment and preference satisfaction! See the paper for more details.
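The symmetric Hausdorff distance between two point sets can be computed as below. This is a generic sketch, not the paper's evaluation code; one might compare, say, the set of visited states against the set of preferred states, where a larger value indicates trajectories straying further from the preferences (exploration):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets.

    A, B: arrays of shape (n, d) and (m, d), one point per row.
    Returns the greatest distance from any point in one set to the
    nearest point in the other set.
    """
    # Pairwise Euclidean distances via broadcasting: D[i, j] = |A_i - B_j|
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Directed distances in both directions, then take the maximum
    return max(D.min(axis=1).max(), D.min(axis=0).max())

visited   = np.array([[0.0, 0.0], [1.0, 1.0]])
preferred = np.array([[3.0, 4.0]])
d = hausdorff(visited, preferred)  # large d: agent roams far from preferences
```

SciPy users could instead combine two calls to `scipy.spatial.distance.directed_hausdorff`; the pure-NumPy version above keeps the computation explicit.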

Takeaway: Pepper offers a straightforward framework for designing adaptive agents when reward functions cannot be predefined, as is often the case in real environments.



We thank Fatima Sajid for reviewing the manuscript. NS acknowledges funding from the Medical Research Council, UK (MR/S502522/1). PT is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems (grant reference EP/L015897/1). KJF is funded by the Wellcome Trust (Refs: 203147/Z/16/Z and 205103/Z/16/Z).
The website template was borrowed from Michaël Gharbi.