Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems Goyal, Anirudh, Lamb, Alex, Gampa, Phanideep, Beaudoin, Philippe, Levine, Sergey, Blundell, Charles, Bengio, Yoshua, and Mozer, Michael 2020
Modeling a structured, dynamic environment like a video game requires keeping track of the objects and their states (declarative knowledge) as well as predicting how objects behave (procedural knowledge). Black-box models with a monolithic hidden state often lack systematicity: they fail to apply procedural knowledge consistently and uniformly. For example, in a video game, correct prediction of one enemy’s trajectory does not ensure correct prediction of another’s. We address this issue via an architecture that factorizes declarative and procedural knowledge and that imposes modularity within each form of knowledge. The architecture consists of active modules called object files that maintain the state of a single object and invoke passive external knowledge sources called schemata that prescribe state updates. To use a video game as an illustration, two enemies of the same type will share schemata but will each have their own object file to encode their distinct state (e.g., health, position). We propose to use attention to control the determination of which object files to update, the selection of schemata, and the propagation of information between object files. The resulting architecture is a drop-in replacement conforming to the same input-output interface as normal recurrent networks (e.g., LSTM, GRU) yet achieves substantially better generalization on environments that have factorized declarative and procedural knowledge, including a challenging intuitive physics benchmark
BanditRank: Learning to Rank Using Contextual Bandits Gampa, Phanideep, and Fujita, Sumio arXiv e-prints 2019
We propose an extensible deep learning method that uses reinforcement learning to train neural networks for offline ranking in information retrieval (IR). We call our method BanditRank as it treats ranking as a contextual bandit problem. In the domain of learning to rank for IR, current deep learning models are trained on objective functions different from the measures they are evaluated on. Since most evaluation measures are discrete quantities, they cannot be leveraged by directly using gradient descent algorithms without an approximation. BanditRank bridges this gap by directly optimizing a task-specific measure, such as mean average precision (MAP), using gradient descent. Specifically, a contextual bandit whose action is to rank input documents is trained using a policy gradient algorithm to directly maximize the reward. The reward can be a single measure, such as MAP, or a combination of several measures. The notion of ranking is also inherent in BanditRank, similar to the current \textitlistwise approaches. To evaluate the effectiveness of BanditRank, we conducted a series of experiments on datasets related to three different tasks, i.e., web search, community, and factoid question answering. We found that it performs better than state-of-the-art methods when applied on the question answering datasets. On the web search dataset, we found that BanditRank performed better than four strong listwise baselines including LambdaMART, AdaRank, ListNet and Coordinate Ascent.
A Tractable Algorithm For Finite-Horizon Continuous Reinforcement Learning Gampa, Phanideep, Satwik Kondamudi, Sairam, and Kailasam, Lakshmanan International Conference on Intelligent Autonomous Systems (ICoIAS), Singapore 2019
We consider the finite horizon continuous reinforcement learning problem. Our contribution is three-fold. First,we give a tractable algorithm based on optimistic value iteration for the problem. Next,we give a lower bound on regret of order Ω(T^2/3) for any algorithm that discretizes the state space, improving the previous regret bound of Ω(T^1/2) of Ortner and Ryabko \citecontrl for the same problem. Next,under the assumption that the rewards and transitions are Hölder Continuous we show that the upper bound on the discretization error is const.Ln^-αT. Finally,we give some simple experiments to validate our propositions.