Here goes the second and final part of my blog post on the PyData London 2016 conference. In the first part I covered the main points from the talks on statistics and machine learning. Now we will dive into presentations on Python itself and other highlights.
The conference was attended and organised by many active open-source committers, and here's the news they had to share:
Bandicoot, a recent side project at MIT for mining mobile data, was presented by L. Rocher, one of the authors of the toolkit. Bandicoot aims to make researchers' lives easier by providing utilities for reproducibility, management of data quality (metrics on missing data, wrong data types, etc.), data visualisation, anonymisation, machine learning, and even feature extraction.
What's New in High-Performance Python? This question was answered by G. Markall. After explaining the key steps in optimising a Python program (profiling, understanding, and action) and demonstrating how to use profiling tools (the built-in Python profiler, kernprof, and VTune) in the first part of the talk, he reviewed new features in the Numba library (parallel CUDA and multicore functionality, JIT classes, support for CFFI, the @generated_jit decorator, and a couple more).
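The first of those steps, profiling, can be done entirely with the standard library. A minimal sketch (the `slow_sum` function is invented for illustration):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately naive: builds a throwaway list on every iteration,
    # which is the kind of hotspot a profiler makes obvious.
    total = 0
    for i in range(n):
        total += sum([i] * 10)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(10_000)
profiler.disable()

# Summarise the profile: which calls consumed the most cumulative time?
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Once the profile shows where the time goes, tools like Numba's JIT decorators are a natural next step for the hot loops.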
Aspiring data visualisation gurus attended the tutorial on Bokeh, an interactive web dashboard library comparable to R’s Shiny, which is gaining more and more attention. Compatibility with Python, R, and Julia makes it a really attractive choice for a single dashboarding tool in your pocket.
Bqplot is a lightweight visualisation library for the Jupyter notebook, recently open-sourced by Bloomberg and presented during the conference by S. Corlay (one of the authors). Points worth mentioning include support for interactive plots, a variety of visualisation types, and an API which enables custom mouse interactions with your graph. Oh, and it’s based on the grammar of graphics (just like R’s ggplot).
Good Ol' Python
Of course, more established Python features were also revisited. One of the conference’s starting points was a Beginner Bootcamp, which set the scene for anyone unfamiliar with data analysis in Python or even programming in general. Another must-see talk for beginners was Pandas from the Inside, where dozens of examples showed why you should start using it. It was also relevant for Python veterans who want to understand how NumPy is used under the hood and how to speed up DataFrame operations. A talk about the nuts and bolts of Python was given by S. Holden (GitHub), who explained the difference between an iterator and an iterable, how they are constructed behind the scenes, and gave lots of tips and tricks for how to use them properly in your code.
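The iterator/iterable distinction can be shown in a few lines (class names are my own, not from the talk):

```python
# An iterable implements __iter__ and can be traversed repeatedly;
# an iterator also implements __next__ and is consumed as it goes.

class Countdown:
    """Iterable: each call to __iter__ hands out a fresh iterator."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)

class CountdownIterator:
    """Iterator: keeps its own position, raises StopIteration when done."""
    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self  # iterators are their own iterables

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

nums = Countdown(3)
first_pass = list(nums)   # [3, 2, 1]
second_pass = list(nums)  # the iterable can be traversed again: [3, 2, 1]

it = iter(nums)
consumed = list(it)       # [3, 2, 1]
exhausted = list(it)      # the iterator is spent: []
```

This is exactly the pattern `for` loops rely on: the loop calls `iter()` on the iterable, then `next()` on the resulting iterator until `StopIteration`.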
Gensim is a well-known collection of methods for unsupervised semantic modeling in Python, an introduction to which was given by its Community Manager Lev Konstantinovskiy. One of the reasons why this module is appealing is the distributed implementation of Latent Semantic Analysis and Latent Dirichlet Allocation algorithms, which can boost your NLP pipelines.
Another hot topic in the world of Python is Jupyter. I doubt anyone expected 3 out of 5 lightning talks to be about its integration with GitHub (and 3 different solutions were found). If the standard JSON format is not good enough for your notebooks, take a look at the talk on nbconvert by T. Kluyver and M. R. Kelley. This tool can be used to convert Jupyter notebooks into pretty much anything you might need: HTML, LaTeX documents, presentations, executable Python scripts, and more. JupyterHub: Deploying Jupyter Notebooks for Students and Researchers was useful for those who want to put their notebooks on steroids.
Finally, there were talks only loosely related to Python itself. For example, the conference was open to other languages as well. Julia, despite still being a fairly young project, was a topic frequently discussed at the lunch table, and there were two talks dedicated to it. S. Byrne gave an overview of Julia’s functionality for number crunching and integrating it with other languages (Python, C, and R), while the tutorial by M. Sherrington focused on how to call it from Python and vice versa. He was also there for a book signing of his latest text on Julia.
Fans of graph analysis weren’t forgotten either. Probably the most interesting use case related to this technique was presented by B. Chamberlain, who explained how to query and modify a social graph in real time for market research purposes (slides). Similarly, a tool (NetworkL) for real-time and highly memory-efficient shortest-path re-computation was presented by its author M. Bonaventura (video). Those who missed the tutorial day at GraphConnect 2016 (see the post by Christos) had another opportunity to build themselves a graph-based meetup recommendation engine using Neo4j and Python in the tutorial by M. Needham (video).
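For readers new to these techniques, the shortest-path queries such tools answer can be illustrated with a plain breadth-first search (the toy graph and names are made up; NetworkL’s own API and its incremental re-computation tricks are more sophisticated):

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search for a shortest path in an unweighted graph.

    graph: dict mapping each node to a list of neighbours.
    Returns the path as a list of nodes, or None if target is unreachable.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

social = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}
path = shortest_path(social, "alice", "dave")
print(path)
```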
Since most of the speakers were people who actively practice data science in a business environment, data pipelines and their management were a recurring topic. Luigi, a collection of tools in Python for pipeline management, was thoroughly covered. M. Bonzanini (slides) gave some solid arguments for why data scientists should stop writing “master” scripts and leave the scheduling of subtasks to Luigi instead, as well as some good practices (e.g. unit testing and breaking down the code into smaller components) and anti-patterns (e.g. bunch-of-scripts pipelines). P. Owlett took over and showed how Luigi makes their life at Deliveroo easier by translating a complicated BI reporting system into a DAG (slides).
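The core idea behind such a DAG of tasks is simply topological ordering: each task declares what it depends on, and a scheduler runs dependencies first. A minimal pure-Python sketch (the task names are invented and this is not Luigi’s actual API, which uses `Task` classes with a `requires()` method):

```python
# Each task names the tasks it depends on; a depth-first topological
# sort yields a valid execution order for the DAG.
# Note: no cycle detection here -- a real scheduler would need it.
pipeline = {
    "extract_orders": [],
    "extract_riders": [],
    "join_tables": ["extract_orders", "extract_riders"],
    "weekly_report": ["join_tables"],
}

def execution_order(dag):
    order, done = [], set()

    def visit(task):
        if task in done:
            return
        for dependency in dag[task]:
            visit(dependency)
        done.add(task)
        order.append(task)

    for task in dag:
        visit(task)
    return order

order = execution_order(pipeline)
print(order)
```

A scheduler like Luigi adds the parts this sketch omits: checking which targets already exist, retrying failures, and running independent branches in parallel.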
The use of Python for processing Big Data was discussed by three speakers. U. Zink showed how to use Hadoop and Apache Zeppelin notebooks in all stages of an example pipeline: data ingestion, processing, analysis, and visualisation (video). The talk PySpark in Practice gave a concentrated shot of what Apache Spark is and how to use it to move and manipulate data (configuration, performance tuning, unit testing, machine learning, streaming, and much more). Finally, the talk by A. Zaidi (video) was more specific and shared simple but nonetheless useful lessons from his experience with Spark (the use of **kwargs and dictionaries, condition checking, etc.).
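That last tip, using dictionaries and `**kwargs` for configuration, is easy to demonstrate without Spark at all. A minimal sketch (the parameter names below are illustrative, not a real Spark submission API):

```python
# Keeping job parameters in a dictionary and accepting overrides via
# **kwargs means every job shares one source of defaults.
DEFAULTS = {
    "master": "local[4]",
    "executor_memory": "2g",
    "shuffle_partitions": 200,
}

def build_config(**overrides):
    """Merge per-job overrides into the default configuration."""
    config = dict(DEFAULTS)
    config.update(overrides)
    return config

job_config = build_config(executor_memory="8g")
print(job_config["executor_memory"])  # overridden value: "8g"
print(job_config["master"])           # untouched default: "local[4]"
```

The same pattern scales to passing a whole dictionary with `build_config(**job_specific_settings)`, which keeps per-environment settings in plain data rather than scattered through code.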
These three days were a welcome injection of knowledge and cutting-edge ideas. It would take at least a couple of months to go through all the GitHub repositories, slides, and videos to fully digest everything that I have taken from the event.
To recap, PyData is a global community uniting users of Python for everything data-related. The recent get-together, PyData London 2016, was a huge success and the largest conference so far. The hottest topics were applying Bayesian and frequentist statistics to understand the relationships between real-life phenomena and extract insight from data about them; training machine learning algorithms, such as neural networks and SVMs, to outperform humans at simple tasks; and, of course, the awesomeness of Python.