Uchida Lab Connects Machine Learning Theory to Biological Brains


In our new study published in Nature Neuroscience, we found evidence that rodent brains use a specific form of learning called temporal difference (TD) learning. TD learning has been widely used in both animal learning models and artificial intelligence. In both settings, an agent learns from unexpected rewards to repeat the actions that lead to them. Our results help bring together findings from animal psychology, machine learning, and neuroscience to provide a mechanistic account of how learning takes place in biological brains.
It is well known that animals learn from unexpected rewards, and this knowledge has become the basis for learning algorithms used in computers, including modern artificial intelligence. In these algorithms, learning is triggered not by the reward itself but by the surprise (or “prediction error”) associated with it. One such algorithm, temporal difference (TD) learning, is attractive because it can explain how a specific action or cue becomes associated with a reward that arrives later in time. Imagine you are a hungry dog. You hear a bell ring and then receive a tasty treat. After the bell and the treat have been paired repeatedly, you gradually learn to predict the treat from the sound of the bell alone. How can the brain link the reward specifically to the sound when the two events are separated in time? TD learning algorithms solve this problem with a particular form of prediction error called the TD error. Over repeated experiences, the TD error gradually “moves” backward to earlier and earlier time points, eventually bridging the temporally separated events.
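To make the idea of a backward-moving TD error concrete, here is a minimal Python sketch of tabular TD(0) learning for a single cue-reward pairing. It is an illustration under simplifying assumptions, not the model or data from the study: time after cue onset is split into discrete steps (step 0 is the cue, the last step is the reward), each step is treated as its own state, the moment before the cue has a fixed value of 0 because the cue arrives unpredictably, and the parameter values are arbitrary. The TD error at step t is computed as δ(t) = r(t) + γ·V(t) − V(t−1), and each error nudges the prediction V(t−1) made one step earlier. The helper names `run_td_trials` and `peak_time` and the parameters `delay`, `alpha` (learning rate), and `gamma` (discount factor) are hypothetical choices for this example.

```python
# A minimal, illustrative sketch of tabular TD(0) learning on a cue -> reward trial.
# Assumptions (hypothetical, not taken from the study): time is discretized into
# steps 0..delay after cue onset, each step is its own state, a reward of size 1
# arrives at step `delay`, and the pre-cue state has a fixed value of 0 because
# cue onset is unpredictable. Helper names and parameter values are made up.


def run_td_trials(n_trials, delay=4, alpha=0.5, gamma=1.0, reward=1.0):
    """Return one TD-error trace (one value per time step) for each trial."""
    # values[k] estimates future reward from step k; values[delay] stays 0
    # because no further reward follows the reward itself.
    values = [0.0] * (delay + 1)
    traces = []
    for _ in range(n_trials):
        errors = []
        for k in range(delay + 1):
            r = reward if k == delay else 0.0
            # The value predicted one step earlier (pre-cue value fixed at 0).
            prev_value = values[k - 1] if k > 0 else 0.0
            # TD error: reward now plus discounted value of the current state,
            # minus the value that was predicted at the previous step.
            delta = r + gamma * values[k] - prev_value
            errors.append(delta)
            if k > 0:
                # Only the previous step's prediction is updated, so learned
                # value can travel back by at most one step per trial.
                values[k - 1] += alpha * delta
        traces.append(errors)
    return traces


def peak_time(errors):
    """Time step with the largest TD error (the earliest step wins ties)."""
    return max(range(len(errors)), key=lambda k: errors[k])


if __name__ == "__main__":
    traces = run_td_trials(n_trials=10)
    for trial, errors in enumerate(traces, start=1):
        print(trial, peak_time(errors), [round(e, 3) for e in errors])
```

With the default settings, the peak of the TD error sits at the reward (step 4) on the first trial, then moves back through steps 3, 2, and 1 before settling on the cue (step 0) by the seventh trial, while the error at the time of reward shrinks toward zero. Because each error updates only the prediction made one step earlier, the learned value has to propagate backward step by step, which produces a gradual shift rather than an instant jump from reward to cue.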
The TD learning model was originally inspired by animal learning that depends on surprise. However, its specific algorithm is not intuitive, and it was not known whether prediction error, or surprise, actually “moves” gradually in the brain. Researchers have proposed that dopamine signals may reinforce rewarding actions (or reward predictions) in a manner similar to TD learning. To support this kind of learning, dopamine signals would have to shift gradually backward in time, for example from the moment you receive a tasty food reward to the moment you hear the bell. However, such gradual shifts in dopamine activity had not been observed before our study.
Our investigation began when the paper’s first author, postdoc Ryunosuke Amo, examined the activity of dopamine neurons in mice on their first day of learning to associate an odor cue with a water reward. He found a dynamic, backward-shifting pattern of dopamine activity that had been predicted theoretically but never documented. We were shocked by the clear similarity between the neural activity and the theory and decided to pursue this finding in detail.
We recorded the activity of dopamine neurons at both the population and single-neuron levels, using an optical fiber (photometry) and two-photon microscopy, respectively. Our data set included naive mice that were being trained to associate an odor cue with a water reward for the first time, as well as another group of mice that had been trained repeatedly to associate odor cues with reward. In the naive animals, we found that dopamine activity shifted gradually from the time of the water reward to the time of the odor cue, passing through intermediate time points. In the well-trained animals, the temporal shift was harder to detect, although a careful analysis of individual animals still revealed it. This difficulty is probably due to the more complex dopamine activity patterns of well-trained animals, which tend to generalize across odors (any novel odor might potentially predict reward), and this would explain why previous studies did not detect the temporal shift.
While TD learning has been incorporated into various learning theories, whether brains use similar learning mechanisms is still debated. Our observations connect theory and brain, providing stronger evidence that the biological brain implements TD learning using dopamine signals as TD errors. This link between a learning algorithm and dopamine should advance our understanding of both how the brain learns and how dopamine functions. For example, we believe more scientists can now move on to searching for the actual circuit mechanisms through which TD learning might be implemented in the brain.
Another contribution of this work is an algorithmic framework for understanding how dopamine facilitates learning. Drugs that modulate dopamine are widely used to treat various conditions that affect learning, such as ADHD, although their mechanism of action is not fully understood. We hope this algorithmic framework will help predict the roles of dopamine in both normal and abnormal behavior.
by Mitsuko Watabe-Uchida and Diana Crow