I am a senior research scientist at DeepMind.

Previously I was a research scientist at Lyric Labs (Analog Devices), and before that, a PhD student at the Operations Research Center at MIT.

## Research Interests

I am interested in machine learning and artificial intelligence, deep learning, reinforcement learning and model-based RL, probabilistic modeling (and probabilistic programming), and variational methods for inference, among other topics.

## Publications and Working Papers

Imagination-Augmented Agents for Deep Reinforcement Learning, with Sebastien Racaniere, David P. Reichert, Lars Buesing, Arthur Guez, Adria Puigdomenech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, David Silver, and Daan Wierstra
[abstract]
[paper].

**Abstract:**

We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep reinforcement learning combining model-free and model-based aspects. In contrast to most existing model-based reinforcement learning and planning methods, which prescribe how a model should be used to arrive at a policy, I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. I2As show improved data efficiency, performance, and robustness tomodel misspecification compared to several baselines.

Learning model-based planning from scratch, with Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sebastien Racaniere, David Reichert, Daan Wierstra, and Peter Battaglia
[abstract]
[paper].

**Abstract:**

Conventional wisdom holds that model-based planning is a powerful approach to sequential decision-making. It is often very challenging in practice, however, because while a model can be used to evaluate a plan, it does not prescribe how to construct a plan. Here we introduce the “Imagination-based Planner”, the first model-based, sequential decision-making agent that can learn to construct, evaluate, and execute plans. Before any action, it can perform a variable number of imagination steps, which involve proposing an imagined action and evaluating it with its model-based imagination. All imagined actions and outcomes are aggregated, iteratively, into a “plan context” which conditions future real and imagined actions. The agent can even decide how to imagine: testing out alternative imagined actions, chaining sequences of actions together, or building a more complex “imagination tree” by navigating flexibly among the previously imagined states using a learned policy. And our agent can learn to plan economically, jointly optimizing for external rewards and computational costs associated with using its imagination. We show that our architecture can learn to solve a challenging continuous control problem, and also learn elaborate planning strategies in a discrete maze-solving task. Our work opens a new direction toward learning the components of a model-based planning system and how to use them.

Visual Interaction Networks, with Nicholas Watters, Andrea Tacchetti, Razvan Pascanu, Peter Battaglia and Daniel Zoran,
[abstract]
[paper].

**Abstract:**

From just a glance, humans can make rich predictions about the future state of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains and require direct measurements of the underlying states. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions and dynamics, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. Our results demonstrate that the perceptual module and the object-based dynamics predictor module can induce factored latent representations that support accurate dynamical predictions. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.

**Abstract:**

Choosing the parameters of a probability distribution in order to minimize an expected loss is the central problem in many machine learning applications, including inference in latent variable models or policy search in reinforcement learning. In most cases however, the exact gradient of the expected loss is not available as a closed-form expression. This has led to the development of flexible gradient estimators, with a focus on the conflicting objectives of being both general-purpose and low-variance. If the loss is non-differentiable, as is the case in reinforcement learning, or if the distribution is discrete, as for probabilistic models with discrete latent variables, we have to resort to score-function (SF) gradient estimators. Naive SF estimators have high variance and therefore require sophisticated variance reduction techniques, such as baseline models, to render them effective in practice. Here we show that under certain symmetry and parametric assumptions on the distribution, one can derive unbiased stochastic gradient estimators based on finite differences (FD) of the loss function. These estimators do not require learning baseline models and potentially have less variance. Furthermore, we highlight connections of the FD estimators to simultaneous perturbation sensitivity analysis (SPSA), as well as weak derivative and “straight-through” gradient estimators.

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models, with Ali Eslami, Nicolas Heess, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey Hinton,* NIPS 2016*
[abstract]
[paper].

**Abstract:**

We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects – counting, locating and classifying the elements of a scene – without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network at unprecedented speed. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.

Gradient Estimation Using Stochastic Computation Grapsh, with John Schulman, Nicolas Heess, and Pieter Abbeel,* NIPS 2015*
[abstract]
[paper].

**Abstract:**

In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs—directed acyclic graphs that include both deterministic functions and conditional probability distributions—and describe how to easily and automatically derive an unbiased estimator of the loss function’s gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in variety of prior work, along with variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.

Deep Reinforcement Learning in Large Discrete Action Spaces, with Gabriel Dulac-Arnold, Richard Evans, Hado Van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Thomas Degris, and Ben Coppin,
[abstract]
[paper].

**Abstract:**

Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even often impossible to apply. An ability to generalize over the set of actions as well as sub-linear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm’s abilities on a series of tasks having up to one million actions.

**Abstract:**

We present a new algorithm for approximate inference in probabilistic programs, based on a stochastic gradient for variational programs. This method is efficient without restrictions on the probabilistic program; it is particularly practical for distributions which are not analytically tractable, including highly structured distributions that arise in probabilistic programs. We show how to automatically derive mean-field probabilistic programs and optimize them, and demonstrate that our perspective improves inference efficiency over other algorithms.

**Abstract:**

We introduce Dimple, a fully open-source API for probabilistic modeling. Dimple allows the user to specify probabilistic models in the form of graphical models, Bayesian networks, or factor graphs, and performs inference (by automatically deriving an inference engine from a variety of algorithms) on the model. Dimple also serves as a compiler for GP5, a hardware accelerator for inference.

Random Decision Networks: Correlation Decay and Decentralized Optimization, with D. Gamarnik, * accepted to Mathematics of Operations Research * (preliminary version best submission to the 2008 Conference on Stochastic Networks).
[abstract]
[paper]
[online complement]

**Abstract:**

We consider a network of cooperating agents in which every agent has to make a decision from a finite set. The agents aim to maximize a global function f that can be decomposed as a sum of local cost functions, each one indicating interaction between two agents. The network is endowed with a probabilistic structure in which local cost functions are sampled from a distribution.

Our aim is to be able to characterize the efficiency of a decentralized solution generated on the basis of local information. More precisely, we suppose that each agent knows the structure of the network and the cost functions within a distance at most r, and takes a decision with this knowledge at hand, without coordination (i.e. without knowing the decisions of other agents). Under this constraint, how suboptimal is a decentralized solution of radius r, and can it be computed efficiently?

We look at these questions in light of density evolution and the correlation decay phenomenon and, by introducing a cavity-type recursion, find sufficient conditions for correlation decay relating the local costs distribution and the degree of the graph considered.

As an example, we show the existence of a polynomial-time approximation scheme for the maximum independent set problem with maximum degree 3 and exponential random weights.

Quantifying Statistical Interdependence by Message Passing on Graphs: Algorithms and Application to Neural Signals: Part I, with J.Dauwels, F.Vialatte, T. Musha and A.Cichocki, * published in Neural Computation.*
[abstract]
[paper]

**Abstract:**

We present a novel approach to quantify the statistical interdependence of two time series, referred to as “stochastic event synchrony” (SES). As a first step, one extracts “events” from the two given time series. Next, one tries to align events from one time series with events from the other. The better the alignment, the more similar the two time series are considered to be. More precisely, the similarity is quantiﬁed by the following parameters: time delay, variance of the timing jitter, fraction of “non-coincident” events, and average similarity of the aligned events.

The pairwise alignment and SES parameters are determined by statistical inference. In particular, we alternate the following two steps: (i) one estimates the SES parameters from a given pairwise alignment; (ii) with the resulting estimates, one reﬁnes the pairwise alignment. The SES parameters are computed by maximum a posteriori (MAP) estimation (Step 1), and the pairwise alignment is obtained by applying the max-product algorithm (Step 2). SES may be applied to generic one-dimensional and multi-dimensional point processes. This paper (Part I) deals with one-dimensional point processes, the extension to multi-dimensional point processes is considered in a companion paper (Part II).

The proposed measures are compared to classical (dis)similarity measures for one-dimensional point processes such as the Victor-Purpura distance metric, the van Rossum distance metric, the Schreiber similarity measure, the Hunter-Milton similarity measure, and the event synchronization measure proposed by Quiroga. By analyzing surrogate data, it is demonstrated that SES is able quantify both timing precision and event reliability more robustly than classical measures; in fact, most classical measure are not able to satisfactorily distinguish both aspects of synchrony. Moreover, before one can apply classical measures, one usually needs to determine potential oﬀsets (delays or lags) between the point processes. SES deals with oﬀsets in a natural and direct way, and therefore, the SES similarity measures are robust to oﬀsets. As an illustration, neuronal spike data generated by the Morris-Lecar neuron model is considered.

Quantifying Statistical Interdependence by Message Passing on Graphs: Algorithms and Application to Neural Signals, Part II: multidimensional point processes, with J.Dauwels, F.Vialatte, T. Musha and A.Cichocki, * published in Neural Computation. *
[abstract]
[paper]

**Abstract:**

Stochastic event synchrony is a technique to quantify the similarity of pairs of signals. First, “events” are extracted from the two given time series. Next, one tries to align events from one time series with events from the other. The better the alignment, the more similar the two time series are considered to be. In Part I, one-dimensional events are considered, this paper (Paper II) concerns multi-dimensional events. Although the basic idea is similar, the extension to multi-dimensional point processes involves a hard combinatorial problem, and therefore, it is non-trivial.

The similarity is again quantiﬁed by the following parameters: time delay, variance of the timing jitter, fraction of “non-coincident” events, and average similarity of the aligned events. This paper mostly considers point processes in time-frequency domain, extracted from EEG signals. The average event similarity is in that case described by two parameters: the average frequency oﬀset between events in the time-frequency plane and the variance of the frequency oﬀset (“frequency jitter”). SES then consists of ﬁve parameters in total. Those parameters quantify the synchrony of oscillatory events, and provide an alternative to classical synchrony measures that quantify amplitude or phase synchrony.

Also in the multi-dimensional case, the problem of jointly computing the pairwise alignment and SES parameters is cast as a statistical inference problem. This problem is solved by coordinate descent, more speciﬁcally, by alternating the following two steps: (i) one estimates the SES parameters from a given pairwise alignment; (ii) with the resulting estimates, one reﬁnes the pairwise alignment. The SES parameters are computed by maximum a posteriori (MAP) estimation (Step 1), in analogy to the one-dimensional case. The pairwise alignment (Step 2) can no longer be obtained through dynamic programming, since the state space becomes too large. Instead it is determined by applying the max-product algorithm on a cyclic graphical model. The resulting iterative algorithm is guaranteed to converge to the optimal alignment under very mild conditions.

The method is ﬁrst applied to surrogate data in order to test its robustness and reliability. Next it is applied to detect anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients. Numerical results suggest that SES is signiﬁcantly more sensitive to perturbations in EEG synchrony than a large variety of classical synchrony measures.

Quantifying Statistical Interdependence by Message Passing on Graphs: Algorithms and Application to Neural Signals, Part III: N>2 point processes, with J.Dauwels, F.Vialatte, T. Musha and A.Cichocki, * published in Neural Computation. *
[abstract]
[paper]

**Abstract:**

Stochastic event synchrony (SES) is a technique to quantify the similarity of pairs of signals. In this paper (Part III), SES is extended from pairs of signals to collections of signals. As in Part I and II, first “events” are extracted from the given time series. Next, one tries to align events from one time series with events from the other. The better the alignment, the more similar the collection of time series is considered to be. As in Part II, this paper deals with multi-dimensional events. Although the basic idea is similar to the pairwise case, the extension to collection of point processes involves an NP-hard combinatorial problem, and therefore, it is far from trivial.

The problem of jointly computing the alignment and SES parameters is again cast as a statistical inference problem. This problem is solved by coordinate descent, more specifically, by alternating the following two steps: (i) one estimates the SES parameters from a given alignment; (ii) with the resulting estimates, one refines the alignment. The SES parameters are computed by maximum a posteriori (MAP) estimation (Step 1), in analogy to the pairwise case. The alignment (Step 2) is solved as an integer program.

In order to test the robustness and reliability of the proposed multivariate SES method, it is first applied to surrogate data. Next it is applied to detect anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients. The multivari- ate approach helps to further improve the diagnosis and enables a more detailed analysis.

State-relevance Weights for the Linear Programming Approach to Dynamic Programming, with D. Pucci de Farias, * in preparation * (preliminary version at CDC 2008).
[abstract]
[paper]

**Abstract:**

The linear programming approach to approximate dynamic programming returns a solution that depends on the cost vector used, a result which contrasts with the case of LP for exact dynamic programming. In the general case of linear combination of features approximating the cost-to-go function, we introduce a special operator on the space of cost vectors which we prove to have a fixed point. Solving the linear program with this cost vector yields a solution with a performance guarantee bound proportional to the strength of the architecture. We introduce a technique based on homotopy methods to find this fixed point in an efficient and flexible manner.

In the case of features defined on a partition of the state-space, we give a simpler method to find this fixed point, as well as stronger bounds associated with it. Assuming efficient constraint sampling, this yields the first polynomial-time approximate dynamic programming scheme with bounded performance loss.

Finally, we investigate the effects of combining the solutions to different cost vectors.

To Wave or Not to Wave? Order Release Policies for Warehouses with an Automated Sorter, with J. Gallien, * Mathematics of Operations Research*
[abstract]
[paper].

**Abstract:**

Wave-based release policies are prevalent in warehouses with an automated sorter, and take different forms depending on how much waves overlap and whether the sorter is split for operating purposes. Waveless release is emerging as an alternative policy adopted by an increasing number of firms. While that new policy presents several advantages relative to waves, it also involves the possibility of gridlock at the sorter. In collaboration with a large US online retailer and using an extensive dataset of detailed flow information, we first develop a model with validated predictive accuracy for its warehouses operating under a waveless release policy. We then use that model to compute operational guidelines for dynamically controlling the main parameter of its waveless policy, with the goal of maximizing throughput while keeping the risk of gridlock under a specified threshold. Secondly, we leverage that model and dataset to perform through simulation a performance comparison of wave-based and waveless policies in this context. Our waveless policy yields larger or equal throughput than the best performing wave-based policy with a lower gridlock probability in all scenarios considered. Waveless release policies thus appear to merit very serious consideration by practitioners. Facilities using a non-overlapping wave policy should also consider overlapping waves or a split sorter policy.