The conference’s momentum did not let up on the second day, and neither did my enthusiasm. I found myself in the keynotes hall as early as 8:45 in the morning in anticipation of the talks. Frankly though, the talk I was most looking forward to was the one by Andy Steinback on “The Potential of GPU-driven High-Performance Data Analytics in Spark”. Andy is a Senior Director at NVIDIA and his team is responsible for developing the deep learning market for industrial applications, so I was very interested to see what he would talk about and whether he would reveal anything about what we should expect from NVIDIA in the near future.
NVIDIA & Spark: Two names we will be hearing more and more together
The talk focussed on how Spark has emerged as the leading platform for developing in-memory distributed analytics workloads, which can be attributed, to some extent, to the high-level abstractions it offers in a plethora of programming languages (Java, Scala, Python, R, Matlab). This was one of the project’s primary design goals: to make it easy for application developers to build complex analytics workflows quickly on diverse operating systems, thus maximising productivity. These workloads are computation- and memory-intensive, and there is room to accelerate both aspects through hardware. GPUs are one such approach, since they provide high memory bandwidth and high compute throughput. NVIDIA is, of course, interested in the integration of GPUs into the Spark ecosystem. The challenge, however, is to achieve such acceleration in a seamless way, without compromising productivity.
The more specific objective is to help scale AI and high-performance analytics so as to address the growing need for processing-hungry and data-intensive deep learning approaches. When the Big Data era was emerging, data-intensiveness was dealt with through in-memory solutions. Nowadays, however, as deep learning is becoming the new computing model, the growing demand for more compute-intensive procedures cannot be met without investing in performance upgrades, and this is where GPU integration comes into play.
The range of applications is huge: medicine, computer vision, robotics and self-driving cars. More importantly, the real game changer here is the gain in prediction accuracy of models that can be produced tractably, which is the result of being able to process zettabytes of data quickly.
The integration of GPUs in the Apache Spark ecosystem can be achieved in three ways. The first is to accelerate Spark libraries and operations without altering the existing interfaces or the underlying programming model. The second is to generate native GPU code from abstract, high-level Spark code. The third is to integrate Spark with an external GPU-accelerated system. Each of these approaches has its pros and cons.
One way or another we are getting closer to entering a new computation era with immense potential. Have a look at this blog post if you want to find out more.
Time Series Analysis with Spark in the Automotive R&D Process
This talk by Miha Pelko and Tilmann Piffi of NorCom IT AG drew my attention. The duo discussed how the car industry has also entered the Big Data era due to the increased number of sensors being used for automation and information collection purposes, especially in the field of advanced driver assistance systems (ADAS). Most of the challenges in processing these data revolve around time-series analysis: identifying pre-defined failure events (combinations of conditions on the data coming from various sensors) in the data logged from devices during a drive, and subjecting these data to statistical causality analysis.
What Miha and Tilmann argued was that most car manufacturers are facing the same or similar challenges in this field. Thus, they developed a language built on top of the Spark API to help with the specification of failure conditions and their analysis in a seamless way, effectively increasing productivity. Additional features include optimisation of the storage schemas, which leads to faster computations.
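To make the idea of a failure event as "a combination of conditions on the data coming from various sensors" more concrete, here is a minimal sketch in plain Python. It is not the language presented in the talk and it does not use Spark; the signal names, thresholds and the `Condition` and `find_events` helpers are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]  # one time step: signal name -> value


@dataclass
class Condition:
    """A named predicate over a single time step of sensor readings."""
    name: str
    predicate: Callable[[Sample], bool]


def find_events(samples: List[Sample],
                conditions: List[Condition],
                min_duration: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) index spans where ALL conditions hold
    for at least `min_duration` consecutive samples."""
    events = []
    start = None
    for i, sample in enumerate(samples):
        if all(c.predicate(sample) for c in conditions):
            if start is None:
                start = i          # a candidate event begins here
        else:
            if start is not None and i - start >= min_duration:
                events.append((start, i - 1))
            start = None
    # Close an event that is still open at the end of the log.
    if start is not None and len(samples) - start >= min_duration:
        events.append((start, len(samples) - 1))
    return events


# Hypothetical drive log and a hypothetical "late braking" failure event.
log = [
    {"speed_kmh": 80, "brake_pressure": 0.0, "radar_distance_m": 40},
    {"speed_kmh": 82, "brake_pressure": 0.0, "radar_distance_m": 9},
    {"speed_kmh": 83, "brake_pressure": 0.0, "radar_distance_m": 7},
    {"speed_kmh": 60, "brake_pressure": 0.8, "radar_distance_m": 6},
    {"speed_kmh": 85, "brake_pressure": 0.0, "radar_distance_m": 50},
]
late_braking = [
    Condition("fast", lambda s: s["speed_kmh"] > 70),
    Condition("close", lambda s: s["radar_distance_m"] < 10),
    Condition("no_brake", lambda s: s["brake_pressure"] < 0.1),
]
events = find_events(log, late_braking, min_duration=2)
print(events)
```

Keeping the conditions as data, separate from the scanning logic, is what lets the same specification be reused across many drives.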
An interesting aspect of the talk was the inter-communication between devices in a car, which is based on synchronised state machines (SSM). The problem is that state machines are inherently sequential, as Markov chains are, so handling parallel analysis becomes an interesting yet complex issue. Part of the talk showed how one can overcome this difficulty through the use of an algorithm specifically developed for this purpose.
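The speakers' algorithm is not reproduced here. As a rough illustration of why the problem is solvable at all, the sketch below uses a generic, well-known trick instead: each chunk of the log is summarised independently as a map from every possible start state to the resulting end state, and the summaries are then composed in order. The three-state machine and its symbols are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Tuple

State = str

# Hypothetical device state machine: (state, symbol) -> next state.
# "fault" is latched: only "reset" leaves it.
TRANSITIONS: Dict[Tuple[State, str], State] = {
    ("ok", "err"): "warn",     ("ok", "clear"): "ok",       ("ok", "reset"): "ok",
    ("warn", "err"): "fault",  ("warn", "clear"): "ok",     ("warn", "reset"): "ok",
    ("fault", "err"): "fault", ("fault", "clear"): "fault", ("fault", "reset"): "ok",
}
STATES: List[State] = ["ok", "warn", "fault"]


def run(state: State, symbols: List[str]) -> State:
    """Plain sequential execution of the state machine."""
    for sym in symbols:
        state = TRANSITIONS[(state, sym)]
    return state


def summarise(chunk: List[str]) -> Dict[State, State]:
    """For every possible start state, record where the chunk ends up.
    This needs no knowledge of what happened before the chunk."""
    return {s: run(s, chunk) for s in STATES}


def compose(first: Dict[State, State], second: Dict[State, State]) -> Dict[State, State]:
    """Summary of 'first chunk followed by second chunk'."""
    return {s: second[first[s]] for s in STATES}


events = ["err", "clear", "err", "err", "clear", "reset", "err"]
chunks = [events[0:3], events[3:5], events[5:7]]

# Each summarise() call is independent, so in a distributed setting the
# chunks could be handled by different workers; map() stands in for that here.
summaries = list(map(summarise, chunks))
total = reduce(compose, summaries)

final_parallel = total["ok"]
final_sequential = run("ok", events)
print(final_parallel, final_sequential)
```

The cost is that each chunk is run once per possible start state, which is only practical when the number of states is small.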
Online Learning with Structured Streaming (Data Science)
So, as I mentioned in the first part of this blog, streaming was definitely a strong aspect of the conference. Given that, I couldn't miss the one talk on the new Streaming API in Spark 2.0. The talk was given by Ram Sriharsha, the Product Manager for open source efforts at Databricks, and covered how the API can be used for training and updating online models in a continuous fashion. There is an increasing demand for this kind of model development, which has the added value of "keeping up to date with the latest data updates", thus allowing for more effective predictions. However, developing such models had many pain points if one was to rely on DStreams.
One such pain point is processing event-time data and dealing with late data "arrivals". The DStream API exposes the underlying micro-batch architecture and is thus unable to incorporate event-time processing. The other is an issue that we usually come across when a platform claims to be something like a panacea, an all-in-one solution. One could argue that such claims and expectations often lead to radical advancements, but sometimes they also cultivate over-expectations that end in disappointment. I must say that the latter has not been the case with Spark so far; quite the contrary. But it is true that we assume Spark to be an interactive platform through which one can interoperate streaming with batch. However, there are specific data structures used for each case: RDDs for batch processing and DStreams for stream processing. Though the two structures are similar, they are still different, and it comes down to the developer to translate one into the other. Finally, data consistency and fault tolerance are also difficult to handle when data are constantly being updated.
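To illustrate the event-time pain point, the following plain-Python sketch groups records into windows by the time they were generated rather than the time they arrived, and uses a simple watermark rule to decide when a late record can no longer be accepted. It is a simplification for illustration only, not Spark code; the `WINDOW` and `MAX_DELAY` values and the `window_counts` helper are hypothetical.

```python
from collections import defaultdict

WINDOW = 10     # tumbling window length in seconds (hypothetical)
MAX_DELAY = 5   # tolerated lag behind the newest event time seen (hypothetical)


def window_counts(records):
    """records: iterable of (event_time, key) in ARRIVAL order.

    Returns ({(window_start, key): count}, dropped_records).
    A record is dropped if its window closed before the current watermark.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in records:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - MAX_DELAY
        window_start = (event_time // WINDOW) * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append((event_time, key))  # too late: window is finalised
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Event time 8 arrives after 12 (late but tolerated);
# event time 9 arrives after 27 (too late, dropped).
arrivals = [(1, "a"), (4, "a"), (12, "a"), (8, "a"), (27, "a"), (9, "a")]
counts, dropped = window_counts(arrivals)
print(counts, dropped)
```

With a micro-batch view alone, the record with event time 8 would be counted in whichever batch it happened to arrive in, rather than in the 0 to 10 second window it belongs to.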
Enter Structured Streaming. Its purpose is to allow the end user to not have to reason at all about the backend implementation. In other words, the user should not have to worry about any of the problems listed above, whether that is translating between DStreams and RDDs or ensuring consistency and fault tolerance. All of these should somehow be magically handled so that the developer can concentrate on the basics of the model implementation. So how does it work?
Essentially, a stream can be understood as an infinitely growing table. One can define an SQL query on such a table and get a result table out of it. Of course, not every query needs to be applied to the whole of the original table. Depending on user needs, it may be that only a portion of the table needs to be read, and possibly updated or filtered, to produce a result table. The result tables produced by such queries can then be integrated into the final output. Effectively, the process is broken down into three levels: input, result (intermediate) and output tables. The output tables are the ones that will be written to an external system. The Structured Streaming API is essentially the same API as batch and supports both static, bounded data and streaming, unbounded data, which means that it is not necessary to master a new API!
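The input, result and output tables described above can be mimicked in a few lines of plain Python. This is only a conceptual sketch, not the Structured Streaming API: the `RunningCountQuery` class, the `batch_count` helper and the `device` and `status` columns are hypothetical. The query is roughly "count error rows per device".

```python
from collections import Counter


class RunningCountQuery:
    """Keeps a result table (error count per device) up to date
    as rows are appended to an unbounded input table."""

    def __init__(self):
        self.result_table = Counter()
        self.rows_seen = 0

    def process(self, new_rows):
        """Consume only the rows appended since the last trigger.
        Returns the result rows that changed, i.e. the output for this trigger."""
        changed = Counter(
            row["device"] for row in new_rows if row["status"] == "error"
        )
        self.result_table.update(changed)   # incremental update, no rescan
        self.rows_seen += len(new_rows)
        return {device: self.result_table[device] for device in changed}


def batch_count(rows):
    """The same query over a static, bounded table."""
    query = RunningCountQuery()
    query.process(rows)
    return dict(query.result_table)


batch1 = [{"device": "a", "status": "error"}, {"device": "b", "status": "ok"}]
batch2 = [{"device": "a", "status": "error"}, {"device": "b", "status": "error"}]

query = RunningCountQuery()
out1 = query.process(batch1)
out2 = query.process(batch2)
print(out1, out2, dict(query.result_table))
```

Running the query once over all rows gives the same result table as feeding the rows in as they arrive, which is the property that lets one API serve both bounded and unbounded data.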
Ram went on to explain how the end result is a robust pipeline for updating models that is scalable, fast and fault tolerant, but also very productive, since it eliminates the need to learn a new API or worry about the numerous complexities that come with online model development.
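For readers unfamiliar with online learning, the sketch below shows the basic loop in plain Python: the model is nudged by each incoming micro-batch instead of being retrained from scratch. It is not the pipeline shown in the talk and uses no Spark API; the `OnlineLinearModel` class, the learning rate and the synthetic `micro_batches` stream are assumptions made for illustration.

```python
class OnlineLinearModel:
    """Fits y ~= w * x + b with one stochastic gradient step per example."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def update(self, batch):
        """Update the model in place from one micro-batch of (x, y) pairs."""
        for x, y in batch:
            error = self.predict(x) - y
            self.w -= self.learning_rate * error * x
            self.b -= self.learning_rate * error


def micro_batches(n_batches, batch_size=4):
    """Hypothetical noiseless stream following y = 2x + 1, with x in [0, 0.9]."""
    for i in range(n_batches):
        xs = [((i * batch_size + j) % 10) / 10.0 for j in range(batch_size)]
        yield [(x, 2.0 * x + 1.0) for x in xs]


model = OnlineLinearModel()
for batch in micro_batches(500):
    model.update(batch)   # the model is usable for predictions after every batch

print(round(model.w, 2), round(model.b, 2))
```

On this synthetic stream the parameters settle close to the generating values of 2 and 1; with real, noisy data the learning rate would need more care.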
Extending Spark with Java Agents
The last talk I attended was given by Jaroslav Bachorik, a member of the engineering team at Unravel. Jaroslav wanted to make a simple point: that optimising the computation process in Spark is a very complicated issue and that some assistance in the form of guidance by the system is imperative. This guidance can come from utilising Java agents.
In short, the main objective is to run a program along with the Java agents, allow the agents to collect metrics and follow the processing order, and then get recommendations on how to enhance or modify the core Spark functionality in reliable and efficient ways. An example of such a recommendation could be to cache some RDDs that are currently not being cached, or to prefer caching a certain RDD over another.
Note that this can only be achieved after an initial run, once the agents have logged information, effectively collecting custom metrics. It is also worth mentioning that this approach allows for some auto-tuning of the Spark configuration on the fly.
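The agents themselves are Java code attached to the JVM, which is not shown here. The sketch below only illustrates, in plain Python, how metrics collected during an initial run might be turned into a caching recommendation. The log format, the `recommend_caching` helper and the threshold are hypothetical and are not Unravel's implementation.

```python
from collections import Counter


def recommend_caching(computation_log, cached, min_recomputations=2):
    """computation_log: list of (rdd_name, compute_seconds) recorded on a first run.

    Returns the names of uncached RDDs that were computed at least
    `min_recomputations` times, ordered by total compute time (highest first).
    """
    times_computed = Counter()
    total_seconds = Counter()
    for rdd_name, seconds in computation_log:
        times_computed[rdd_name] += 1
        total_seconds[rdd_name] += seconds
    candidates = [
        name for name, n in times_computed.items()
        if n >= min_recomputations and name not in cached
    ]
    return sorted(candidates, key=lambda name: total_seconds[name], reverse=True)


# Hypothetical metrics from an initial run.
log = [
    ("parsed_logs", 12.0), ("joined", 30.0),
    ("parsed_logs", 11.5), ("joined", 29.0), ("joined", 31.0),
    ("lookup", 0.4), ("lookup", 0.5),
    ("report", 2.0),
]
recommendations = recommend_caching(log, cached={"lookup"})
print(recommendations)
```

Ordering by total compute time is one simple way to express "prefer to cache a certain RDD over another" when memory is limited.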
...the summit was very interesting and very, very promising. Many useful advancements were presented and we left the conference with many things to look forward to.
That's all from Brussels. Hope you found the blogs interesting (like and share if you did). I will be writing again soon on some more technical topics related to graph analytics so... stay tuned.