If you are passionate about data, Finsbury Square in London in May is a great place to be. This year, PyData London 2016 did not disappoint: lots of ideas shared, jokes told, and valuable connections made. In this (and my next) post I want to take a step back and recap the main topics I followed over the 3 days.
What is PyData?
PyData is the largest global community uniting users of Python for data analysis / processing / hacking / wrangling. It consists of numerous meetup groups, which gather annually for conferences in the major centres (London, Paris, Berlin, Chicago, Washington, Cologne, to name a few). The goal behind this movement is to bring Python language users together so that they can learn from each other by sharing ideas and experiences. PyData conferences are organized by NumFOCUS, a non-profit that "promotes the use of accessible and reproducible computing in science and technology".
The London conference ran over three days (6-8 May), with the first day dedicated to tutorial sessions and the other two to the conference itself. The talks were highly data-science-oriented, with a pinch of other Python applications here and there. The conference presentations showed a huge variety of Python applications for data analysis in numerous fields, ranging from classical statistics in finance and the social sciences to hardcore machine learning models in computer vision. Due to the large audience and the volume of material, the tutorial sessions were quite similar to regular conference talks, only more interactive and code-oriented. There was a fair amount of overlap between topics, which I suppose is a good indicator of today's hot topics in the data science community.
You can divide the schedule into roughly four main tracks: Bayesian and Frequentist Statistics, Machine Learning, Python, and Other Highlights. Going through each one of the events in the schedule (I managed to attend 22 out of 60!) would bore even me, so instead I will walk through the recurrent topics, with a focus only on the core points.
Bayesian and Frequentist Statistics
Even though there were plenty of talks from both schools, there was no 'Bayesians vs Frequentists' war. On the contrary, quite a few speakers were encouraging taking the best from the two worlds or using them hand in hand to extract the most insight from your data.
The use of Python for building Bayesian models is driven by the recent development of modules designed for this purpose; PyMC (2 and 3) and PyStan were the most frequently cited ones. The capabilities of PyMC were best demonstrated in P. Coyle's tutorial, where he estimated a logistic regression for predicting an individual’s income using three different Python modules (Frequentist estimation was shown using Statsmodels and scikit-learn). Besides being a Python tutorial, it was also a great comparison of the two schools of statistics, with a focus on the differences in their approaches to the same problem. Multiple speakers throughout the conference recognized PyMC as a convenient tool for quickly designing and testing models.
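To make the Frequentist side of that comparison concrete, here is a minimal sketch of a maximum-likelihood logistic regression fitted by plain gradient ascent. It is not the code from the tutorial: it uses only the standard library rather than Statsmodels or scikit-learn, and the toy data and the names `fit_logistic`, `years` and `high_income` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by maximising the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of the log-likelihood
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical toy data: centred years of education vs. a high-income flag.
# The classes overlap on purpose, so the maximum-likelihood estimate is finite.
years = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
high_income = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(years, high_income)
p_low = sigmoid(b0 + b1 * -4)
p_high = sigmoid(b0 + b1 * 5)
```

A Bayesian treatment of the same model would put priors on `b0` and `b1` and report a posterior distribution instead of a single point estimate.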
PyStan was broadly covered by J. Sedar who found it to be more stable and robust compared to PyMC, albeit more cumbersome to work with. He recommended the use of PyMC in the initial stages of your project and moving to PyStan when the model structure is already known. His talk was also the only one to deal with hierarchical models, i.e. models which take into account the hierarchical structure (e.g. students grouped into schools, schools grouped into cities, and cities grouped into countries) of the data and make predictions at different levels.
The topic of survival analysis (prediction of the time until an event occurs) was thoroughly discussed from both perspectives: "Bayesianism and Survival Analysis" introduced estimation using Bayesian techniques, while "Survival Analysis in Python and R" covered Frequentist capabilities in both languages and also touched on non-parametric approaches such as random forests. Both talks did a good job of explaining the math under the hood, giving examples of good practices, and introducing the toolkit.
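For readers new to the topic, the sketch below shows what a basic survival estimate looks like. It uses the Kaplan-Meier estimator, which is my choice of a standard textbook method and not necessarily what either talk presented; the durations and censoring flags are invented.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until the subject left the study.
    observed:  True if the event was seen, False if the record is censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: months until churn, with some customers still active.
durations = [2, 3, 3, 5, 8, 8, 9, 12]
observed = [True, True, False, True, True, False, True, False]

curve = kaplan_meier(durations, observed)
```

Censored records never count as events, but they stay in the at-risk set until the time they drop out, which is what separates this from simply averaging the observed durations.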
As usual, traditional (generalized) linear models were a popular topic. A simple and beautiful example of a structural model for decomposing UK housing prices, based on a geographical dataset, was given by P. Bracke. The idea is that a property price can be decomposed into structure and land prices, and the latter can be modeled using linear regression. I. Ozsvald and G. Weaver modeled allergies using logistic regression because of its interpretability. My favourite application of classical statistics was "Detecting Novel Anomalies in Twitter", where Poisson regression was used to identify whether an increase in the number of tweets on a given topic signals the birth of a new trending topic. The approach was to use a Poisson model for the number of tweets and to label an observed count as an anomaly if its predicted probability under the model falls below a certain threshold.
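The thresholding step can be sketched in a few lines. In the talk the expected tweet rate came from a Poisson regression; here it is simply passed in as a number, and the function names and the threshold value are my own assumptions.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k

def is_anomaly(observed_count, expected_rate, threshold=0.001):
    """Flag a count that would be very unlikely under the expected rate."""
    return poisson_tail(observed_count, expected_rate) < threshold

# Hypothetical numbers: a topic that normally gets about 10 tweets per minute.
usual_minute = is_anomaly(10, 10.0)
busy_minute = is_anomaly(30, 10.0)
```

With an expected rate of 10, seeing 10 tweets is unremarkable, while 30 falls far into the tail and gets flagged.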
Finally for this 'track', I would like to mention the presentation by A. Nielsen. The focus of her talk was on missing observations in time series data, which is something that many data scientists have to deal with on a daily basis. Even though there is still no definitive answer to this problem, she went through widely used methods for dealing with irregular time series (imputation, resampling, sample reduction, etc.) and the directions of current research. A substantial part of her talk was dedicated to extracting seasonality patterns and testing for Granger causality when some data is missing.
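As a small illustration of imputation, the sketch below fills gaps in a regularly sampled series in two common ways: carrying the last value forward and interpolating linearly between known neighbours. These are generic techniques rather than code from the talk, and the series is invented.

```python
def forward_fill(values):
    """Replace each missing value (None) with the last observed one."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def interpolate_linear(values):
    """Fill interior gaps on a straight line between the nearest known points.

    Leading and trailing gaps are left as None because they have only one neighbour.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

# Hypothetical hourly readings with three missing observations.
series = [1.0, None, None, 4.0, 5.0, None, 7.0]

filled_forward = forward_fill(series)
filled_linear = interpolate_linear(series)
```

Forward filling keeps the series flat across a gap, while interpolation assumes a steady trend; which assumption is less harmful depends on the data.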
Machine Learning
At least half of the talks in this category used some flavour of neural networks (NNs). Even though most of the presentations and tutorials were quite technical and required in-depth prior knowledge of NNs, there were three presentations introducing NNs, from the basics to advanced techniques:
- "A gentle introduction to NN" was targeted at everyone who wants to explore NNs for the first time and introduced the basic concepts, such as regression, classification, error, loss function, gradient descent, etc.
- "Deep Learning Tutorial – Advanced Techniques" targeted those who already have some experience of training NNs and want to learn new tricks: convolutional NNs were introduced, Theano and Lasagne were compared, and some tips were given (training using mini-batches, data standardization, debugging, pipeline design, active learning).
- "Introduction to Deep Learning and NLP" explained how to combine NNs with text vectorization (Word2Vec) for text analysis. Vectorization with NNs was also cleverly used by D. Rusu, who employed this technique to construct a distance measure between financial assets that would explain their correlation. This method, when trained on Wikipedia articles, was found to perform better than graph-based distance measures.
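To give a feel for how word vectors turn into a distance, here is a minimal sketch using cosine distance. The post does not record which distance D. Rusu actually used, so cosine is an assumption on my part, and the three-dimensional vectors and asset names are invented (real Word2Vec vectors typically have hundreds of dimensions).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """0 for vectors pointing the same way, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical embeddings, as if learned from text describing each asset.
embeddings = {
    "oil_major_a": [0.9, 0.1, 0.2],
    "oil_major_b": [0.8, 0.2, 0.1],
    "retail_bank": [0.1, 0.9, 0.3],
}

d_same_sector = cosine_distance(embeddings["oil_major_a"], embeddings["oil_major_b"])
d_cross_sector = cosine_distance(embeddings["oil_major_a"], embeddings["retail_bank"])
```

Assets described in similar contexts end up with vectors pointing in similar directions, so their distance is small.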
In addition, multiple presentations demonstrated how to use NNs in real-life applications. D. K. Slater walked us through the process of building an arcade bot using TensorFlow. After initially trying multilayer perceptrons, convolutional NNs, and even randomized strategies, Q-learning (whose loss function takes into account discounted gains from a player’s future actions) was found to give the best performance at playing Pong. A sophisticated example of text mining was given by R. Pio Monti, who used deep Boltzmann machines to let users (neuroscientists) run keyword searches over a rich corpus of academic literature when designing neuroscience experiments. Two talks by C. Giles and E. Bell covered NN pipelines for image recognition in the fashion industry, discussed the challenges of making convolutional NNs correctly identify clothing and product descriptions, and looked at how pictures are represented inside the network, which gave some hints about its performance and learned connections.
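The role of discounted future gains in Q-learning is easiest to see in the tabular version of the update rule, sketched below. The talk used a neural network in TensorFlow to approximate the Q-values; this table-based version, together with the state names, rewards and hyperparameters, is a simplified stand-in of my own.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * best future value."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next  # discounted gain from future actions
    q[(state, action)] += alpha * (target - q[(state, action)])

actions = ["up", "down"]
q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0

# Hypothetical transitions: moving up from s1 scores a point and ends the rally,
# moving up from s0 only leads to s1.
q_update(q, "s1", "up", 1.0, "end", actions)
q_update(q, "s0", "up", 0.0, "s1", actions)

value_s1 = q[("s1", "up")]
value_s0 = q[("s0", "up")]
```

After these two updates the move from s0 already has a positive value even though it earned no immediate reward, because the discounted value of s1 flows back to it.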
My personal favourite talk in the machine learning 'track' was "Assessing the Quality of a Clustering" by Christian Hennig. He gave an in-depth overview of the most widely used clustering methods (classical algorithms, such as K-means and hierarchical clustering, as well as more sophisticated methods, e.g. Gaussian mixture models, spectral clustering, and density-based methods) and, most importantly, discussed clustering quality statistics, which are usually based on between-cluster separation, within-cluster homogeneity, stability, etc. However, one of the conclusions was that each clustering problem is unique, so the analyst’s experience is crucial.
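One widely used statistic that combines separation and homogeneity is the silhouette score, sketched below. I picked it as a familiar example; the post does not record which specific statistics the talk recommended. The points and labels are invented, and the sketch assumes at least two clusters and Python 3.8+ for `math.dist`.

```python
import math
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette over all points; close to 1 means tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        dists_by_cluster = {}
        for j, other in enumerate(points):
            if i != j:
                dists_by_cluster.setdefault(labels[j], []).append(math.dist(p, other))
        own = dists_by_cluster.get(labels[i])
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = mean(own)  # within-cluster homogeneity
        b = min(mean(d) for lab, d in dists_by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))  # b measures between-cluster separation
    return mean(scores)

# Hypothetical data: two obvious groups, labelled once sensibly and once badly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
```

The sensible labelling scores close to 1 and the scrambled one scores much lower, but a single number like this cannot replace judgement about what a "good" cluster means for the problem at hand.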
Classification was another problem addressed in multiple talks. Even though variations of linear models (e.g. logistic regression) were often used, support vector machines (SVMs) were reported to show the best or second-best (after NNs) performance and were popular in medicine-related talks. Citizen scientists F. Kelly and G. Weaver trained SVMs with radial kernels on emails and text scraped from online forums for people suffering from Alzheimer's disease, with the goal of diagnosing the illness. T. V. Griffiths showed how he uses scikit-learn and SVMs in his PhD research to better understand the link between genetics and schizophrenia. S. Greiss demonstrated how to use cross-validation and SVMs to achieve 85% accuracy when classifying stars based on their spectra; 99% accuracy, however, was achieved only with a convolutional NN implementation in Keras.
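The cross-validation part of that workflow can be sketched without any libraries. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM, which is a deliberate simplification and not what any of the speakers used; the one-dimensional data and all function names are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_val_accuracy(X, y, fit, predict, k=5):
    """Average held-out accuracy over k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Stand-in classifier (not an SVM): assign the class with the nearest mean.
def fit_centroids(X, y):
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict_centroid(centroids, x):
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Hypothetical one-feature data, interleaved so every fold contains both classes.
X = [1.0, 9.0, 1.2, 8.8, 0.9, 9.1, 1.1, 9.2, 0.8, 8.9]
y = ["dwarf", "giant"] * 5

accuracy = cross_val_accuracy(X, y, fit_centroids, predict_centroid, k=5)
```

Real data should be shuffled (or split in a stratified way) before folding; that step is left out here to keep the result deterministic.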
That's it for this blog post. I plan to post about the other two 'tracks' soon.
UPDATE 31 MAY 2016: Part II of this blog is now published.