Big Data and event processing pipeline
Paul JULIEN, Mewen LE RESTE and Guillaume VERNERET are three developer interns at the vente-privee Epitech Innovation Lab. This internship is part of their curriculum at Epitech Paris. They worked together on a web platform that aims to facilitate testing within vente-privee teams.
The term big data is used more and more by enterprises, and many of them want to build infrastructures for an efficient data environment. When you don't know exactly what you are doing, it is very difficult to find technologies that really match your needs. The world of big data can often seem opaque; we will try to clarify it in this article.
What is Big Data?
The term big data has been in use since 1998, thanks to John Mashey, a chief scientist at Silicon Graphics, and his paper "Big Data … and the Next Wave of InfraStress". Big data is associated with three Vs: volume, variety and velocity. Volume is the quantity of data, measured from gigabytes up to yottabytes (10²⁴ bytes). Velocity is how quickly the data can be processed; today we talk about real time. Lastly, variety refers to the structure: data can be unstructured, but it can also be photo, audio or video data. There are many different types of data.
To summarize in a few figures: every minute, Facebook users send more than 31 million messages and watch more than 2 million videos, and there are more than 3 billion Google searches every day. By 2020, the accumulated volume of big data is expected to reach 44 trillion GB (according to Newgenapps and Domo).
I have tons of data, what can I do with it?
Enterprises are increasingly interested in big data problems because there is real economic interest behind them. For example, a web media company can run analytics on the click stream of its application, add ad targeting and make predictions about the average number of users over a period. Topics such as machine learning or data science are essential today. Here is a schema of a simplified event processing pipeline:
The goal is to optimize the processing of the data (events) at each stage (see the sketch after this list):
· Broker: Raw data storage.
· Stream: Data processing.
· Database: Processed data storage.
· Visualizer: Data visualization.
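As a minimal illustration of these four stages, here is a short Java sketch; the interfaces and the wiring are purely hypothetical (they do not come from any framework) and only show how an event flows from the broker to the visualizer:

```java
import java.util.List;

// Hypothetical interfaces illustrating the four stages of the pipeline;
// they only describe the data flow, not a real library.
interface Broker {           // Raw data storage (e.g. a Kafka topic)
    List<String> pullRawEvents();
}

interface StreamProcessor {  // Data processing (filtering, aggregation...)
    List<String> aggregate(List<String> rawEvents);
}

interface Database {         // Processed data storage
    void store(List<String> aggregatedEvents);
}

interface Visualizer {       // Data visualization (dashboards, charts...)
    void display(Database database);
}

class EventPipeline {
    // Wires the stages together: broker -> stream -> database -> visualizer.
    static void run(Broker broker, StreamProcessor stream,
                    Database database, Visualizer visualizer) {
        List<String> raw = broker.pullRawEvents();
        List<String> aggregated = stream.aggregate(raw);
        database.store(aggregated);
        visualizer.display(database);
    }
}
```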
The universe of big data answers questions that many people have asked themselves before. Nevertheless, it also raises some problems. First, the storage of your data must be optimized to reduce costs. Then, data privacy is essential: as we can see, major companies like Facebook or Google are beset by privacy problems (see the Cambridge Analytica affair). The problem we are going to study is related to technology choice.
How to choose the right technologies?
Big data technologies are numerous, and it is sometimes hard to find your way around. A major actor in big data is the Apache Software Foundation, whose best-known technologies include Kafka, Hadoop and Spark. Each excels in its field: for example, Hadoop allows optimized storage and processing with MapReduce, a programming model introduced by Google, while Kafka is used as a message broker and is very powerful when you want to distribute a lot of data across different streams.
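To give an idea of what the broker role looks like in practice, here is a minimal Java sketch of a Kafka producer pushing a raw event into a topic; the broker address ("localhost:9092"), the topic name ("events") and the event payload are placeholders chosen for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        // Minimal producer configuration; "localhost:9092" and "events"
        // are placeholder values for this sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Push one raw event (for example a click) into the "events" topic.
            producer.send(new ProducerRecord<>("events", "user-42", "click:homepage"));
        }
    }
}
```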
For stream processing, there are a lot of technologies, and studying existing use cases makes it simpler to find your way. Let's take the example of data stream processing: today there are many technologies in this field, and all of them have pros and cons. Some companies, like Google (Cloud Dataflow, whose SDK became Apache Beam) or Spotify (Scio), built their own technology to solve their problem. In the case of vente-privee, a lot of data is collected by an API which pushes it into Kafka. The challenge was to aggregate the data both in real time and from history, with the same technology, while keeping the implementation as simple as possible. Here is the simplified use case we built.
Quick Use Case
The case is very simple: we have a lot of data (events) in Kafka, and we need to generate KPIs (Key Performance Indicators) and push the aggregated data into a database. We now need to find the right technologies, starting with the stream processing. For that, there are some technologies at our disposal:
These are frameworks that facilitate data aggregation; most of them use the Java language.
We made a quick benchmark to simplify our choice. The requirements were simple:
· Documentation: Quality of the support.
· Compatibility: Whether it is compatible with the technologies already in place.
· Getting Started: Time to have a proof of concept.
· Deployment: Runnable and deployable with Docker.
· Optimization: Quality of the technology when you go deeper.
The methodology for evaluating the criteria was simple: we tested each point, and if one turned out to be negative and blocking, we switched to the next technology.
In this case, Beam is the most suitable. In detail, it integrates many big data technologies, such as Kafka reading, Cassandra input/output and many others. Its way of managing data flows with a system of pipelines is very optimized. Its execution model is also very handy: you can run pipelines directly, or with specific runners such as Spark or Flink. Which brings us to the big positive point of Apache Beam: scalability. When you have a large amount of data to process, you need to run multiple aggregators at the same time, so you need a scalable technology such as Beam to avoid bottlenecks. There are some negative points, though nothing blocking, such as the limited support for time series database outputs; in our case we simply use the Scylla database (a C++ Cassandra overlay) as a time series store, with its official data visualizer, Apache Zeppelin. In addition, Apache Beam is a young technology, so the community is still small, and it is quite complicated to understand how it works if you are not comfortable with the Java language.
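To give an idea of what such a pipeline looks like, here is a hedged Java sketch of a Beam pipeline that reads events from Kafka, counts them per one-minute window and formats the result; the broker address, the topic name and the absence of a final database sink are assumptions made for the example (in our case the aggregated data would then be written to Scylla):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class EventCountPipeline {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // Read raw events from the Kafka broker; "localhost:9092" and
            // "events" are placeholder values for this sketch.
            .apply(KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("events")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            // Keep only the event payloads (the Kafka record values).
            .apply(Values.<String>create())
            // Group the events into fixed one-minute windows.
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
            // Count each distinct event per window: a very simple KPI.
            .apply(Count.<String>perElement())
            // Format the aggregated result as "event,count"; in a real pipeline
            // this is where the data would be written to the database.
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + "," + kv.getValue()));

        pipeline.run().waitUntilFinish();
    }
}
```

Run as is, this would use Beam's direct runner; the same code can be submitted to a Spark or Flink runner without changes, which is what makes the scalability point above possible.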
After choosing a set of technologies, you will need to implement, deploy and test them. A big data environment grows very quickly and it is important to keep control of it. Make sure all the technologies can be tested and deployed (GitLab CI is very convenient for that). Many developers work on big data, but the community is still quite small, so it is also crucial to contribute to it.
Here are some links to go further:
https://www.mongodb.com/big-data-explained
https://data-artisans.com/what-is-stream-processing
Authors: Paul JULIEN, Guillaume VERNERET, Mewen LE RESTE