There are a number of things worth noting if you find yourself in Brussels for the first time. What definitely makes an impression is that Belgians love pralines, you can find waffles everywhere and, inexplicably, as soon as you arrive in the city you develop a strong craving for seafood, especially mussels (maybe because it rhymes with Brussels?). In any case, Brussels was chosen to host this year’s Apache Spark Summit and I was fortunate to attend the event.
The summit is one of those conferences where one really has trouble choosing between all the available talks (unless you are Dr Manhattan and can multiply yourself at will). In any case, a number of things caught my attention and I felt they were worth blogging about.
AMPLab: Mission accomplished; time for RISELab
The conference kicked off with a series of keynotes. Amongst the speakers was Ion Stoica, Executive Chairman of Databricks and Professor of Computer Science at UC Berkeley, who announced the end of AMPLab's mission; a mission that lasted for five years (2011-16). AMPLab's mission was simple: "To make sense of Big Data", and that it did. With a staff of 8 faculty members and over 60 students, it developed the next generation of the open source data analytics stack, now used by both industry and academia. The time has come to set a new mission; one to be brought about by the RISELab.
Whereas AMPLab's objective was to "get us from batch data to advanced analytics", RISELab intends to help us "get to real-time decisions from live data" while at the same time providing strong security. Real-time processing is becoming more and more popular these days, so it comes as no surprise that a large part of the conference revolved around stream processing. Two main challenges were presented: reducing the higher latency inherent in the micro-batching approach compared to record-at-a-time systems (e.g. Apache Flink); and allowing for full data encryption, authentication and computation verification inside a hardware enclave in order to reduce the threat exposure, without compromising computational efficiency. For each challenge a respective system is already being developed, Drizzle and Opaque, and some early results were presented. You can find more details here.
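To get a feel for the latency point, here is a back-of-the-envelope sketch in plain Python (nothing Spark- or Flink-specific; the one-second batch interval and arrival pattern are made up for illustration). A record arriving mid-interval sits in the buffer until its batch closes, so micro-batching adds, on average, about half the batch interval of latency that a record-at-a-time engine would not incur:

```python
import math

BATCH_INTERVAL = 1.0  # hypothetical batch interval, in seconds

def microbatch_latency(arrival_time, interval=BATCH_INTERVAL):
    """Added latency for one record: it waits until its batch closes."""
    batch_close = math.ceil(arrival_time / interval) * interval
    return batch_close - arrival_time

# 1000 records arriving uniformly over one second
arrivals = [i / 1000 for i in range(1, 1001)]
avg = sum(microbatch_latency(t) for t in arrivals) / len(arrivals)
print(f"average added latency: {avg:.3f}s")  # roughly half the interval
```

This is, of course, only the buffering component; actual end-to-end latency also depends on scheduling and processing time per batch.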
In any case, the takeaway message I got from Ion's talk is that Spark is entering the stream-processing scene with ambitious intentions, and this can only affect the wider community in a positive way, especially in relation to the development of robust, online Machine Learning algorithms.
I am not going to go as far as to claim that the talks I chose to present in this blog lie at the pinnacle of all talks at the conference. However, the ones below did make an impression on me and I felt like sharing them.
Data-Aware Spark (Research)
This talk by Zoltan Zvara drew my attention. In a nutshell, it was a research talk about cases where Spark's default hash partitioning does not achieve a uniform data distribution within the cluster, and how that could be improved. As he explained, a highly skewed key distribution can lead to some partitions (unknown in advance) containing far too many records on the reducer side, and consequently to slow-running tasks. Solving this issue requires knowing the data distribution in advance, but it is not always known. Zoltan's aim, as he puts it, is to make Spark 'data-aware': to let Spark analyse data on the fly and decide how it should be partitioned so as not to compromise the platform's efficiency, with as little user guidance as possible.
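To see the problem concretely, here is a minimal sketch in plain Python (not Spark; the partition count, key names and record counts are invented). With plain hash partitioning, every occurrence of a hot key lands on the same partition, which is exactly the reducer-side pile-up described above:

```python
from collections import Counter

NUM_PARTITIONS = 4

def hash_partition(key, num_partitions=NUM_PARTITIONS):
    # Mimics the spirit of Spark's default HashPartitioner:
    # the key's hash modulo the number of partitions
    return hash(key) % num_partitions

# A skewed workload: one "hot" key accounts for 90% of the records
records = ["hot_key"] * 9_000 + [f"key_{i}" for i in range(1_000)]

sizes = Counter(hash_partition(k) for k in records)
print(sorted(sizes.values(), reverse=True))  # one partition dwarfs the rest
```

The partition that happens to receive `hot_key` holds at least 9,000 of the 10,000 records, and the task processing it becomes the straggler.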
If I understood the whole thing right, the approach breaks down into two main parts. The first is scalable sampling, from the perspective of both the Driver and the Executors, relying on various quantifiable metrics such as how strongly reducer run-time correlates with the computational complexity of a given key, along with shuffling and repartitioning information. This is followed by a decision to repartition the data, which in turn leads to the construction of a new hash function.
Though the new hash function is more complex than the default hashCode, adding a small overhead (1%-8%), it achieves a more uniform partitioning which can cut reducer-stage runtime in half. Furthermore, on the user-guided side, Zoltan is also working on execution and repartitioning visualisations aimed at helping developers identify possible issues and bottlenecks in certain workloads. Overall, a very interesting talk with an obvious impact and useful recommendations for some fundamental improvements.
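My loose reading of the sample-then-repartition idea can be sketched in plain Python. To be clear, the threshold, sample size and round-robin spreading of heavy keys below are my own simplifications, not Zoltan's actual construction; spreading a single key across partitions would in practice also require an extra merge step downstream:

```python
import itertools
import random
from collections import Counter

NUM_PARTITIONS = 4

def build_partitioner(sample, num_partitions, heavy_fraction=0.2):
    """Detect heavy keys in a sample, then spread their records round-robin."""
    freq = Counter(sample)
    heavy = {k for k, c in freq.items() if c > heavy_fraction * len(sample)}
    rr = itertools.count()
    def partition(key):
        if key in heavy:
            # Per-record spreading: breaks the "one key, one partition"
            # contract, so results for that key must be merged afterwards
            return next(rr) % num_partitions
        return hash(key) % num_partitions
    return partition

records = ["hot_key"] * 9_000 + [f"key_{i}" for i in range(1_000)]
partition = build_partitioner(random.sample(records, 500), NUM_PARTITIONS)
sizes = Counter(partition(k) for k in records)
print(sorted(sizes.values()))  # now roughly uniform across partitions
```

Instead of one partition holding ~9,000 records, each of the four now holds roughly 2,500, which is the kind of uniformity that halves reducer-stage runtime in Zoltan's measurements.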
Vegas: The Missing Matplotlib for Spark (Data Science)
For those of you who love Scala but have to resort to Python every time you want to visualise something, worry not, because this duo from Netflix has you covered. Aish Fenton and Sudeep Das presented a number of techniques for visualising large-scale machine learning data in Spark. Focusing on a case-based approach developed and employed at Netflix to keep up with the growing demands after the company entered the global market, the duo illustrated how such visualisations help one better understand and refine learning models, such as those behind Netflix's famous recommender systems.
At the same time, as is always the case with visualisations, the results become more amenable to human interpretation, which also allows for the detection of anomalies and inconsistencies. And all that in Scala, as Vegas wraps around Vega-Lite while providing a more familiar, type-checked syntax for use within Scala.
You can check their GitHub repo for more details.
Better Together: Fast Data with Apache Spark and Apache Ignite (Spark Ecosystem)
Last but not least was a talk by Christos Erotocritou from GridGain Systems, who presented another Apache project that goes hand in hand with Spark: Apache Ignite. Christos's argument was that Ignite can help improve Spark's performance in real-time applications. Effectively, the integration centres on IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs which allows RDD state to be shared across Spark workers.
A key aspect of the talk was how IgniteRDD allows for faster SQL queries by leveraging indexes. Here's my understanding of it. As is well known, Spark does not support SQL indexes, since it is not a data management system. Moreover, because the data are only processed by Spark and not owned by it, changes to them cannot be monitored, so maintaining indices would, at least in some contexts, be meaningless. But when a data source does support indexing, Spark can take advantage of it through IgniteRDD.
IgniteRDD supports SQL in-memory indexing, bringing significant improvements in wait times, especially when running many queries within the same Spark application. Of course, this increases the amount of memory required; in short, a classic example of a space-time tradeoff. To deal with the added space requirements, as well as other issues related to heap-memory limitations, Ignite allows data to be stored in off-heap memory, which avoids garbage-collection overhead; a further advantage, as long as optimised, alternative memory-management procedures are in place.
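As a toy illustration of that space-time tradeoff (plain Python, nothing Ignite-specific; the table and query are invented), an in-memory index spends extra space to turn a full scan into a direct lookup:

```python
from collections import defaultdict

# An invented table: 100,000 rows over 100 distinct cities
rows = [{"id": i, "city": f"city_{i % 100}"} for i in range(100_000)]

def scan(rows, city):
    # Without an index, every query walks the whole data set
    return [r for r in rows if r["city"] == city]

# The space cost: an extra structure with one bucket per distinct value
index = defaultdict(list)
for r in rows:
    index[r["city"]].append(r)

def indexed_lookup(city):
    # The time gain: a query becomes a single dictionary access
    return index[city]

assert scan(rows, "city_7") == indexed_lookup("city_7")
```

The index roughly doubles the memory held for this table but removes the per-query scan; the more queries you run against the same data, the better the trade pays off, which matches the "many queries within the same application" scenario above.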
The Ignite project has its own growing community in London, where it organises meetups regularly. Have a look here in case you are interested.
So, that's all for now. I will be back soon with more details from Day #2. In the meantime, I leave you with a video from the Vegas talk with some interesting graphs: