My impressions from NeurIPS 2018 conference

My team has a great tradition: everyone who goes to a major conference must write a report about it and share their impressions with the team. In that report we usually list the most interesting papers to be studied later in our ML reading group meetings. So the list below is effectively my study plan for the next 6 months.

Day 0: Sun Dec 2: Expo

This year, the conference started a day early with the Expo session, centered around the vendors and sponsors. Very nice move - most of the attendees had enough time to register (and grab some swag) before the main session started. That was a problem last year, when I had to stand in line for registration for almost 2 hours.

Day 1: Mon Dec 3: Tutorials

This was the most interesting day for me :) I went to 3 tutorials:

Scalable Bayesian Inference

David Dunson gave a nice talk on large-scale Bayesian ML. He covered all the familiar topics, like variational inference and MCMC, but with some nice modern techniques, e.g. using the Wasserstein distance to aggregate posteriors in a distributed setting, or speeding up MCMC by replacing expensive transition kernels with approximations. In the second part of the tutorial he also talked a lot about (distributed) Bayesian learning with a large number of features.

Negative Dependence, Stable Polynomials, and All That

That was the coolest tutorial of the whole conference. Just like with Optimal Transport last year, it was a discovery of a new area of math that can be applied to machine learning (new to me, that is). The theory of negative dependence allows modeling of non-i.i.d. data, and the tutorial introduced an amazing toolbox based on Determinantal Point Processes. I am actively studying the area now and plan to give a little tutorial on it to the team soon.

Statistical Learning Theory: a Hitchhiker’s Guide

That was a nice outline of what’s going on in the SLT world, starting with the basics (VC dimension, PAC learning, etc.) and going all the way to PAC-Bayes and the work on applying SLT to deep learning and other models with extremely large hypothesis spaces. All in all, I found the tutorial a bit hard to follow – maybe I should re-watch the video and read more literature on the subject. PAC-Bayes seems like the most interesting topic to me.

Day 2: Tue Dec 4: Main session

Lots of interesting talks. I went to Track 2 in the morning and to Track 3 after lunch. Here are the most interesting bits:

Test of Time Award

The Test of Time award went to The Tradeoffs of Large Scale Learning by Leon Bottou and Olivier Bousquet.

Spectral Filtering for General Linear Dynamical Systems

A neat combination of two well-known techniques. The authors apply spectral filtering to learn the latent-state LDS in polynomial time.

Neural Ordinary Differential Equations

This work got one of the Best Paper awards. The authors use a black-box ODE solver to represent continuous-depth DNN layers.
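
To make the continuous-depth idea concrete, here is my own toy sketch (not the authors' code): instead of stacking residual blocks h_{t+1} = h_t + f(h_t), the hidden state follows a learned ODE dh/dt = f(h, t) that we integrate with a solver. I use a fixed-step Euler integrator below just for illustration; the paper relies on an adaptive black-box solver with the adjoint method for gradients.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Learned dynamics f(h, t) of the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h, t):
        # Concatenate time so the dynamics can depend on t.
        t_col = torch.full((h.shape[0], 1), t)
        return self.net(torch.cat([h, t_col], dim=1))

def odeint_euler(func, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step Euler integration; the paper uses an adaptive black-box solver instead."""
    h, dt = h0, (t1 - t0) / steps
    for i in range(steps):
        h = h + dt * func(h, t0 + i * dt)
    return h

func = ODEFunc(dim=8)
h0 = torch.randn(4, 8)          # batch of 4 hidden states
h1 = odeint_euler(func, h0)     # "output" of the continuous-depth layer
```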

How Does Batch Normalization Help Optimization?

Nice study of BatchNorm’s effects. The authors show that, contrary to popular belief, BatchNorm does not reduce internal covariate shift – instead, it appears to help by smoothing the optimization landscape.
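
For reference, BatchNorm itself is just per-feature normalization over the mini-batch followed by a learned affine map; the interesting part of the paper is why this helps, not what it computes. A minimal training-mode sketch (my own, without running statistics):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the batch,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 16)
out = batch_norm(x, gamma=torch.ones(16), beta=torch.zeros(16))
```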

Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements

(poster). A topic that is near and dear to my heart: combining Count-Min and FA sketches in the distributed setting.
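For context, here is a minimal Count-Min sketch with a small candidate set for tracking heavy hitters – the classic building block that Topkapi improves on and parallelizes. This is my own illustration, not the paper's algorithm.

```python
import random

class CountMinTopK:
    """Count-Min sketch plus a small candidate set for approximate top-k frequent items."""
    def __init__(self, width=2048, depth=4, k=10, seed=0):
        rng = random.Random(seed)
        self.width, self.k = width, k
        self.tables = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.top = {}  # item -> estimated count

    def _cells(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item):
        for row, col in zip(self.tables, self._cells(item)):
            row[col] += 1
        # Count-Min estimate is the minimum over the rows (biased upward by collisions).
        est = min(row[col] for row, col in zip(self.tables, self._cells(item)))
        self.top[item] = est
        if len(self.top) > self.k:
            # Evict the candidate with the smallest estimate.
            del self.top[min(self.top, key=self.top.get)]

    def topk(self):
        return sorted(self.top.items(), key=lambda kv: -kv[1])

cms = CountMinTopK(k=3)
for word in "a b a c a b d a b".split():
    cms.add(word)
print(cms.topk())  # [('a', 4), ('b', 3), ...] approximately
```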

Day 3: Wed Dec 5: Main session

I started with Track 3 (optimization) and then moved to Track 2. Here are some highlights of the day:

Leveraged volume sampling for linear regression

The authors review different sampling schemes to learn a linear model, and propose a novel sampling scheme that is both unbiased and has a good tail bound.

Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes

Best Paper award. Nice technique for distribution learning, applied to learn a mixture of Gaussians. Looks interesting for distributed learning.

Minimax Statistical Learning with Wasserstein distances

The authors use Optimal Transport ideas in the ERM setting. Looks like pretty deep theoretical work. Need to do my homework to really understand the paper.

P.S. Formulating ML optimization tasks as minimax problems seems to be a general trend at the conference.

Stochastic Cubic Regularization for Fast Nonconvex Optimization

The paper proposes a stochastic variant of a classic cubic-regularized Newton method.

Isolating Sources of Disentanglement in Variational Autoencoders

The goal of disentanglement is to learn a representation that has a distinct meaning in each dimension. The authors use a novel technique to learn such a disentangled representation in a VAE. Really nice talk.
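
If I remember the paper correctly, the key step is decomposing the aggregate KL term of the ELBO into three pieces, with the total-correlation term being the one that actually encourages disentanglement (writing this from memory, so treat it as a sketch):

$$\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\big] \;=\; I_q(x;z) \;+\; \mathrm{KL}\Big(q(z)\,\Big\|\,\textstyle\prod_j q(z_j)\Big) \;+\; \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)$$

The three terms are the index-code mutual information, the total correlation, and the dimension-wise KL, respectively.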

GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration

Gaussian Processes in PyTorch - on the GPU.
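
The canonical exact-GP model in GPyTorch looks roughly like this (written from memory, so treat the exact class names as an assumption):

```python
import torch
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        # GP prior: mean and covariance evaluated at the inputs.
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.linspace(0, 1, 100)
train_y = torch.sin(6 * train_x) + 0.1 * torch.randn(100)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
# Training maximizes gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model);
# the paper's point is that the underlying solves reduce to batched
# matrix-matrix products, which map well onto GPUs.
```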

Day 4: Thu Dec 6: Main session

I started in Track 1 and moved to Track 3 after lunch. I was lucky to hit yet another Best Paper oral.

Learning with SGD and Random Features

Nice study of how the hyperparameters affect the behavior of minibatch SGD with random features.
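
As a reminder of what “random features” means here, a random Fourier feature map approximating the RBF kernel takes only a few lines (my own sketch, not the paper's exact setup):

```python
import numpy as np

def random_fourier_features(X, num_features=256, gamma=1.0, seed=0):
    """Map X so that the linear kernel on the features approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

X = np.random.randn(500, 10)
Z = random_fourier_features(X)  # now fit a linear model on Z with (minibatch) SGD
```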

KONG: Kernels for ordered-neighborhood graphs

Novel graph kernels for graphs with ordered neighbors.

Optimal Algorithms for Non-Smooth Distributed Optimization in Networks

Best Paper award. One of the authors is from MSR. The authors propose an efficient smoothing technique for large-scale distributed optimization of non-smooth convex functions.

Automatic differentiation in ML: Where we are and where we should be going

The authors present an experimental differentiable programming language, Myia. It is based on Python and backed by TVM.

Wasserstein Distributionally Robust Kalman Filtering

A study of an MSE estimation problem for a nonconvex Wasserstein ambiguity set of Gaussians. Need to wrap my brain around it.

Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters

The authors develop a new accelerated primal-dual SGD for problems with linear equality constraints and use it to compute approximate Wasserstein barycenters in the distributed setting.

A Bayesian Nonparametric View on Count-Min Sketch

(poster) Very nice Bayesian probabilistic interpretation of the classical Count-Min sketch data structure. Good theory; one of the authors is Michael Mitzenmacher (author of the “Probability and Computing” book).

Day 5: Fri Dec 7: Workshops

I started the day at the Bayesian Deep Learning workshop, then stopped by the Compact DNN Representation and GAN workshops, and stayed at MLSys in the afternoon (where I also manned our ML.NET poster).

Bayesian Deep Learning workshop

Estimating uncertainty for model-based reinforcement learning

An invited talk by Harri Valpola. Pretty interesting, but unfortunately I cannot find any materials for it online.

A Unifying View of Bayesian Continual Learning

The authors combine prior-based and likelihood-based approaches to Bayesian continual learning.

Control as Inference: a Connection Between Reinforcement Learning and Graphical Models

Invited talk by Sergey Levine; unfortunately, I did not attend it.

There is a related tutorial available online, though.

Compact DNN Representation workshop

NN Compression in the wild

Invited talk. A pretty interesting review of NN compression techniques from Tim Genewein, most of them from a Bayesian perspective. No materials available so far.

Linear Backprop in non-linear networks

The idea of using linear backprop for non-linear (e.g. convolution/maxpool) layers sounds quite strange, but the author shows that it actually works as a regularizer – and can significantly speed up training. I wonder if it can be applied to even more complex layers (like capsules).
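
Here is a tiny PyTorch sketch of how I understood the idea (my interpretation, not the author's code): keep the non-linear forward pass but pretend the layer is linear, i.e. use an identity Jacobian, on the backward pass.

```python
import torch

class LinearBackpropReLU(torch.autograd.Function):
    """ReLU on the forward pass, but a linear (identity) Jacobian on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient straight through, as if the layer were linear.
        return grad_output

x = torch.randn(4, 8, requires_grad=True)
y = LinearBackpropReLU.apply(x).sum()
y.backward()
print(x.grad)  # all ones: the ReLU mask is ignored on the backward pass
```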

Smooth Games Optimization and Machine Learning Workshop

Smooth Games in Machine Learning Beyond GANs

Invited talk by Niao He. She shows how a multitude of ML tasks can be reduced to saddle point and minimax optimization problems. Very cool talk; no materials available online.
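
A toy example of why these problems are tricky: plain simultaneous gradient descent-ascent on the bilinear saddle point min_x max_y x·y spirals away from the solution instead of converging (my own illustration, not from the talk).

```python
# Simultaneous gradient descent-ascent on f(x, y) = x * y.
x, y, lr = 1.0, 1.0, 0.1
for _ in range(100):
    gx, gy = y, x                       # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy     # descend in x, ascend in y
print(x, y)  # slowly spirals away from the saddle point (0, 0)
```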

MLSys: Workshop on Systems for ML and Open Source Software

Machine Learning at Netflix

A keynote centered around functional programming and strongly typed languages – mostly Scala typeclasses (in turn borrowed from Haskell). No materials available, but there are many variants of this talk at other venues, dating from 2014 onward. Search for “Aish Fenton Machine Learning at Netflix”.

Parallel training of linear models

The authors train GLMs in the distributed setting using SDCA and apply a bunch of the usual tricks to speed up training – at the expense of the convergence rate. They then fix the convergence problem with a novel data partitioning scheme.

Demo: TVM

An end-to-end compiler and IR for deep learning from UW. Looks very promising and relevant as a backend for ML.NET/ONNX/Torch.NET. Lots of materials available on their site.

Demo: MLFlow

An infrastructure (Python API + servers) for a reproducible ML lifecycle, from Matei Zaharia and Databricks. The project is open source and can be hosted in-house, but of course Databricks will sell its own hosted version.
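
The tracking API is pleasantly small; a typical run looks something like this (standard MLflow usage rather than anything specific to the demo):

```python
import mlflow

# Each run is recorded (by default under ./mlruns) with its parameters and metrics.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("num_trees", 100)
    # ... train the model ...
    mlflow.log_metric("rmse", 0.87)
```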

A Case for Serverless Machine Learning

The author forces ML tasks onto the AWS Lambda framework and reinvents Apache REEF along the way. Don’t do that, people.

Day 6: Sat Dec 8: Workshops

I mostly stayed in the Integration of Deep Learning Theories workshop, and went to ML for Systems to see Jeff Dean later in the evening.

Integration of Deep Learning Theories workshop

On the Margin Theory of Feedforward Neural Networks

Contributed talk; the authors explain how overparameterization can improve generalization in NNs, using a kernel-based representation of an infinite-neuron NN layer.

Towards a Definition of Disentangled Representations

Really nice invited talk connecting abstract algebra and group theory with ML. No materials available yet.

Neural Rendering Model: Joint Generation and Prediction for Semi-Supervised Learning

NRM is a generative model with inference computations that mimic CNNs. That is, they reverse the CNN and build the corresponding generative model out of it. The result is an effective framework for semi-supervised learning that combines generation and prediction in end-to-end optimization. Lots of theory; I need to wrap my brain around it.

ML for Systems workshop

Machine Learning for Systems

Keynote by Jeff Dean. Not much different from his last year’s talk.