Adopt a cloud solutions for your Big Data Projects
I was particularly interested in the sessions presented by Aurélie Vache (Atchik), Didier Girard (Sfeir) and Alain Regnier (Alto Labs) dealing with Big Data implementation using the Google Cloud Platform(GCP), and especially how to facilitate the implementation of Big Data projects using the services of the google cloud stack.
The Google Cloud Platform offers an integrated end to end Big Data solution.
It lets you capture, process, store and analyse your data within a single platform. These services are serverless which frees you from the need to build, manage and operate complex and costly infrastructures. This is obviously a major benefit in terms of implementation cost for your Big Data projects.
We can particularly thank Didier Girard for his excellent session where he described the challenges that developers face when dealing with Big Data and how to solve them using the GCP.
According to him, the volume of data being generated by today’s applications is increasing at dizzying rates. The real challenge is not to store a big amount of data to analyse them at a future date, but to be able to consume a large volume of events and extract meaningful informations in real time. He then presented the typical GCP architecture used to build such a solution and did a live demo of an application consuming roughly 100000 events per second.
Cloud Pub/Sub is a real-time messaging service that allows you to send and receive data between independent applications. A messaging queue supports many to many communications, meaning that a Pub/Sub can be configured to support multiple publishers and subscribers.
It is highly scalable, as the service can handle 10000 messages per second by default and up to several millions on demand. That makes it a major asset to capture big amount of real time events originating from various sources, such as Web analytics or IOT devices. It is also secure as all data on the wire is encrypted.
Cloud Storage is an object storage solution in the cloud similar to the very well known Amazon S3. It can be used for live data serving, data analytics or data archiving. Objects stored within Cloud Storage are organized in buckets that can be controlled and secured independently.
The service also provides a Restful API that allows developers to programmatically access the storage and dynamically upload data to be processed, such as batch files.
Cloud DataFlow is a data processing service for both batch and real-time data streaming. It allows to set up processing pipelines for integrating, preparing and analyzing large data sets. Partly based on Google frameworks MillWheel and FlumeJava, it is designed for large scale data ingestion and low latency processing. Cloud Dataflow can consume data in publish-and-subscribe mode from Google Cloud Pub/Sub or, in batch mode from any database or file system (such as Cloud Storage). It fully supports SQL queries via Google BigQuery which makes it the natural solution to interface Google Pub/Sub and BigQuery.
BigQuery is a fully managed data warehouse for large-scale data analytics. It offers users the ability to manage data using SQL-like queries against very large data sets and get results with an excellent response time. BigQuery uses columnar storage and Tree architecture to dispatch queries and aggregate results across multiple machines. As a result, BigQuery is able to run queries on datasets containing terabytes of data within a few seconds. BigQuery also provides a REST API that lets you control and query your data using your favorite client language and libraries. Alternatively, you can access it through command-line tools or google online console.
In conclusion, I would say that the Google Cloud Platform is definitely a solution to consider when implementing your Big Data projects.
Its serverless and scalable approach represent a major asset to meet today’s requirements of applications generating an unprecedented amount of data from diverse sources. The GCP services are also very well documented, and an infrastructure can be put in place fairly quickly. Lastly, the major benefit is probably the pricing, as the cost associated to the use of these services is very minimal and can be finely controlled through the google online console (e.g. BigQuery pricing).