It is a dilemma as old as time. Friday night has arrived and you are trying to choose a restaurant for dinner. Should you visit your favorite watering hole or try a new establishment, in hopes of discovering something better? Potentially, but that curiosity comes with a risk: if you explore the new option, the food could be worse. On the other hand, if you stick with what you know works well, you won’t risk a bad meal.
Curiosity drives artificial intelligence to explore the world, now in a seemingly limitless range of use cases: autonomous navigation, robotic decision-making, optimization of health outcomes, and more. Machines, in some cases, use “reinforcement learning” to achieve a goal, where an AI agent learns iteratively by being rewarded for good behavior and punished for bad. Much like the dilemma humans face when selecting a restaurant, these agents also struggle to balance the time spent discovering better actions (exploration) with the time spent taking actions that led to high rewards in the past (exploitation). Too much curiosity can distract the agent from making good decisions, while too little means it will never discover them.
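The restaurant dilemma is often formalized as a multi-armed bandit problem. As a minimal sketch (purely illustrative, not the method from this research), the classic epsilon-greedy strategy explores a random option with small probability and otherwise exploits the best option seen so far:

```python
import random

# Toy multi-armed bandit: each "restaurant" pays out a noisy reward.
# Epsilon-greedy trades exploration for exploitation: with probability
# eps, try a random option; otherwise pick the best one found so far.

def epsilon_greedy(true_means, eps=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(len(true_means))   # explore
        else:
            arm = estimates.index(max(estimates))  # exploit
        reward = true_means[arm] + rng.gauss(0, 1)
        counts[arm] += 1
        # Incremental average keeps a running estimate per arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total

estimates, total = epsilon_greedy([1.0, 2.0, 5.0])
```

With too much exploration (large `eps`) the agent wastes pulls on bad arms; with too little, it can lock onto a mediocre arm before ever sampling the best one.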
In the quest to create AI agents with just the right dose of curiosity, researchers at MIT’s Improbable AI Laboratory and MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) created an algorithm that overcomes the problem of AI being too “curious” and getting distracted from the task at hand. The algorithm automatically increases curiosity when it is needed and suppresses it when the agent receives enough supervision from the environment to know what to do.
When tested on more than 60 video games, the algorithm succeeded at both hard and easy exploration tasks, where previous algorithms had been able to handle only a hard or only an easy domain, not both. With this method, AI agents use less data to learn the decision-making rules that maximize reward.
“If you get the exploration-exploitation trade-off right, you can learn the right decision-making rules faster, and anything less will require lots of data, which could mean suboptimal medical treatments, lower profits for websites, and robots that don’t learn to do the right thing,” says Pulkit Agrawal, an assistant professor of electrical engineering and computer science (EECS) at MIT, director of the Improbable AI Lab, and a CSAIL affiliate who oversaw the research. “Imagine a website trying to figure out the design or layout of its content that will maximize sales. If it doesn’t handle exploration-exploitation well, converging on the right website design or layout will take a long time, which means lost profits. Or in a healthcare setting, as with Covid-19, there may be a sequence of decisions that need to be made to treat a patient, and if you want to use decision-making algorithms, they need to learn quickly and efficiently; you don’t want a suboptimal solution when treating a large number of patients. We hope this work will be applied to real-world problems of that nature.”
It is difficult to capture the nuances of the psychological underpinnings of curiosity; the neural correlates of challenge-seeking behavior remain poorly understood. Attempts to categorize the behavior have spanned studies of our drives, our sensitivities to deprivation, and our social and stress tolerances.
Reinforcement learning strips this process of its emotional trappings and pares it down to the bare essentials, but the technical side is complicated. Essentially, the agent should be curious only when there isn’t enough supervision available to guide it toward trying different things; when supervision is present, it should dial curiosity down.
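One common way to make an agent “curious” is to add an intrinsic novelty bonus to the environment’s extrinsic reward. The sketch below illustrates the idea described above with a hypothetical heuristic that damps the curiosity weight when extrinsic reward is dense; it is not the authors’ actual algorithm, and all names and values are invented for illustration:

```python
def combined_reward(extrinsic, intrinsic, beta):
    """Total reward the agent optimizes: the task reward plus a weighted
    curiosity bonus (e.g., a novelty score for the visited state)."""
    return extrinsic + beta * intrinsic

def adapt_beta(beta, recent_extrinsic, threshold=0.1, step=0.01):
    """Illustrative heuristic (not the paper's method): when the
    environment already provides frequent reward, damp curiosity;
    when reward is sparse, let curiosity grow."""
    if recent_extrinsic > threshold:
        return max(0.0, beta - step)
    return min(1.0, beta + step)

beta = 0.5
# Dense supervision: frequent extrinsic reward drives the curiosity
# weight down over successive updates.
for _ in range(10):
    beta = adapt_beta(beta, recent_extrinsic=1.0)
```

In a sparse-reward game the same loop would push `beta` up instead, so the agent keeps seeking novel states until real reward starts arriving.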
Given that many games feature small agents running around fantastical environments in search of rewards and performing long sequences of actions to achieve some goal, they seemed a logical test bed for the researchers’ algorithm. In experiments, the researchers divided games like “Mario Kart” and “Montezuma’s Revenge” into two different buckets: one where supervision was sparse, meaning the agent had less guidance (the “hard” exploration games), and one where supervision was denser (the “easy” exploration games).
Suppose in “Mario Kart,” for example, all reward signals are removed: the agent doesn’t know when an enemy eliminates it, gets no reward for collecting a coin or jumping over pipes, and is only told at the end how well it did. This is a case of sparse supervision, and algorithms that encourage curiosity work very well in this scenario.
But now suppose the agent is given dense supervision: a reward for jumping over pipes, collecting coins, and eliminating enemies. Here, an algorithm without curiosity works very well because it is rewarded frequently. If you instead take an algorithm that also uses curiosity, it learns slowly, because the curious agent may try to run fast in different ways, dance, or visit every part of the game screen: things that are interesting but don’t help it succeed in the game. The team’s algorithm, however, consistently performed well regardless of the environment it was in.
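The two supervision regimes above can be sketched as hypothetical reward functions, one sparse (feedback only at the end) and one dense (frequent feedback); the event names and reward values here are invented for illustration:

```python
# Hypothetical event log for one episode of a platformer-style game.
events = ["coin", "pipe_jump", "coin", "enemy_defeated", "level_complete"]

def sparse_reward(events):
    """'Hard' exploration setting: the agent is only told at the end
    whether it finished the level -- no intermediate feedback."""
    final = 1.0 if events and events[-1] == "level_complete" else 0.0
    return [0.0] * (len(events) - 1) + [final]

def dense_reward(events):
    """'Easy' exploration setting: frequent feedback for coins,
    pipe jumps, and defeated enemies."""
    values = {"coin": 0.1, "pipe_jump": 0.2,
              "enemy_defeated": 0.5, "level_complete": 1.0}
    return [values.get(e, 0.0) for e in events]
```

Under `sparse_reward`, a curiosity bonus is nearly the only learning signal the agent gets before the episode ends; under `dense_reward`, that same bonus can pull it away from the frequent task rewards.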
Future work could involve a return to the question that has delighted and tormented psychologists for years: an appropriate metric for curiosity, since no one really knows the right way to mathematically define it.
“Getting consistently good performance on a new problem is extremely challenging, so by improving exploration algorithms, we can save you the effort of fine-tuning an algorithm for your problems of interest,” says Zhang-Wei Hong, an EECS PhD student, CSAIL affiliate, and co-lead author along with Eric Chen ’20, MEng ’21 of a new paper on the work. “We need curiosity to solve extremely challenging problems, but on some problems it can hurt performance. We propose an algorithm that removes the burden of tuning the balance of exploration and exploitation. What previously took, for example, a week to successfully solve, the new algorithm can get satisfactory results on in a few hours.”
“One of the biggest challenges for AI and cognitive science today is how to balance exploration and exploitation: information seeking versus reward seeking. Kids do this without a problem, but it’s computationally challenging,” says Alison Gopnik, a professor of psychology and affiliated professor of philosophy at the University of California, Berkeley, who was not involved in the project. “This paper uses impressive new techniques to accomplish this automatically, designing an agent that can systematically balance curiosity about the world and the desire for reward. [thus taking] another step in making AI agents (almost) as smart as children.”
“Intrinsic rewards like curiosity are critical in guiding agents to discover useful and diverse behaviors, but this shouldn’t come at the cost of doing well on the assigned task. This is a major problem in AI, and the paper provides a way to balance that tradeoff,” adds Deepak Pathak, an assistant professor at Carnegie Mellon University, who was also not involved in the work. “It would be interesting to see how these methods go beyond games to real-world robotic agents.”
Chen, Hong, and Agrawal co-wrote the paper with Joni Pajarinen, assistant professor at Aalto University and research leader in the Intelligent Autonomous Systems Group at TU Darmstadt. The research was supported, in part, by the MIT-IBM Watson AI Lab, the DARPA Machine Common Sense Program, the Army Research Office, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator. The paper will be presented at Neural Information Processing Systems (NeurIPS) 2022.