Nóra Ambróz | January 22, 2020
5 Minutes Read
Both Google Cloud Dataflow and Apache Spark are big data tools that can handle real-time, large-scale data processing. At their core, both use a directed acyclic graph (DAG) model to run jobs in parallel. But while Spark is a cluster-computing framework designed to be fast and fault-tolerant, Dataflow is a fully managed, cloud-based processing service for batch and streaming data. In many cases both are viable alternatives, but each has its own well-defined strengths and weaknesses.
Let’s make a Dataflow vs. Spark comparison to see the differences in models, resource management, analytic tools and streaming capabilities.
Dataflow vs. Spark: Programming Models
Spark traces its roots back to the MapReduce model, which enabled massive scalability across clusters. The engine handles various data sources such as Hive, Avro, Parquet, ORC, JSON, or JDBC. Its central concept is the Resilient Distributed Dataset (RDD), a read-only multiset of elements. RDDs can be partitioned across the nodes of a cluster, and operations on them can run in parallel.
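To make the idea of a partitioned, read-only collection more concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the helper names `partition`, `map_partitions` and `reduce_partitions` are made up for this illustration, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(items, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_partitions(partitions, func):
    """Apply func to every element, using one worker thread per partition.

    New lists are returned; the input partitions are never modified,
    which mirrors the read-only nature of an RDD.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


def reduce_partitions(partitions, func, initial):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(func, part, initial) for part in partitions]
    return reduce(func, partials, initial)


numbers = list(range(1, 11))
parts = partition(numbers, 3)                      # three "nodes"
squared = map_partitions(parts, lambda x: x * x)   # transformation per partition
total = reduce_partitions(squared, lambda a, b: a + b, 0)
print(parts)   # [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
print(total)   # 385
```

A real RDD also tracks how it was derived so that lost partitions can be recomputed; this sketch leaves that fault-tolerance aspect out.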
Another option is to load the input into a DataFrame, a distributed collection organized into named columns. DataFrames are similar to relational database tables, so much so that you can even run Spark SQL queries on them. Alternatively, you can use an extension of the DataFrame API that introduces Datasets, which provide type safety for object-oriented programming. The Spark API is available for R, Python, Java, and Scala.
In terms of API and engine, Google Cloud Dataflow is closely analogous to Apache Spark. Dataflow’s programming model is Apache Beam, which offers a unified solution for streaming and batch data. Beam is built around pipelines, which you can define using the Python, Java or Go SDKs. Dataflow then adds the Java- and Python-compatible distributed processing backend that executes the pipeline. Because Beam is runtime-agnostic, you can also swap to an Apache Apex, Flink or Spark execution environment.
A pipeline encapsulates every step of a data processing job, from ingestion through transformations to the final output. The pipeline operations, called PTransforms, process distributed datasets called PCollections. The SDK provides these abstractions in a unified fashion for bounded (batch) and unbounded (streaming) data.
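The pipeline idea can be sketched with a toy model in plain Python. This is not the Beam SDK: the `Collection` and `Transform` classes and the `map_elements`, `flat_map` and `keep_if` helpers are hypothetical names invented for this example. The only thing borrowed from Beam's Python SDK is the convention of chaining transforms with the `|` operator.

```python
class Transform:
    """A labelled step that turns one collection into another (toy PTransform)."""

    def __init__(self, label, apply):
        self.label = label
        self.apply = apply


class Collection:
    """An immutable collection of elements (toy PCollection)."""

    def __init__(self, elements, applied=()):
        self.elements = tuple(elements)
        self.applied = tuple(applied)  # labels of the steps that produced it

    def __or__(self, transform):
        # Applying a transform never mutates the input; it yields a new collection.
        return Collection(transform.apply(self.elements),
                          self.applied + (transform.label,))


def map_elements(label, fn):
    return Transform(label, lambda elements: [fn(e) for e in elements])


def flat_map(label, fn):
    return Transform(label, lambda elements: [out for e in elements for out in fn(e)])


def keep_if(label, predicate):
    return Transform(label, lambda elements: [e for e in elements if predicate(e)])


lines = Collection(["Spark and Beam", "Beam and Dataflow"])
words = (lines
         | flat_map("SplitWords", str.split)
         | map_elements("Lowercase", str.lower)
         | keep_if("DropStopWords", lambda w: w != "and"))

print(words.elements)  # ('spark', 'beam', 'beam', 'dataflow')
print(words.applied)   # ('SplitWords', 'Lowercase', 'DropStopWords')
```

One simplification to keep in mind: the toy evaluates each step immediately, whereas a Beam pipeline first describes the whole graph of steps and hands it to a runner such as Dataflow for execution.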
When an analytics engine can handle real-time data processing, the results can reach the users faster. Unlike with periodically processed batches, there is no need to wait for the entire task to finish. Gaining insights quickly and interactively can make a difference in many areas.
When you set Spark against Dataflow in streaming, they are almost evenly matched. As of version 2.4.4, Apache Spark offers Spark Streaming for Java, Scala and Python. This extension of the core Spark system allows you to use the same language-integrated API for streams and batches. Dataflow with Apache Beam also has a unified interface that lets you reuse the same code for batch and stream data. Besides simplicity, this allows you to run ad-hoc batch queries against your streams or reuse real-time analytics on historical data.
Dataflow’s Streaming Engine also makes it possible to update a live streaming job on the fly, without stopping it to redeploy. Given that the environment itself is highly reliable, downtime can be kept to a minimum.
Stream processing usually relies on windows, which group the unbounded data into bounded collections. One of the most popular windowing strategies is to group the elements by the timestamp of their arrival. Spark’s Streaming API uses the Discretized Stream (DStream) abstraction, which periodically generates new RDDs to form a continuous sequence of them. The DStream accepts a function that is used to generate an RDD after each fixed time interval.
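Grouping by arrival time can be illustrated with a few lines of plain Python. This sketch does not use the Spark API: `micro_batches` is a hypothetical helper, each list of values stands in for one RDD, and the integer arrival times in seconds are made-up sample data.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (arrival_time, value) pairs into fixed-length batches.

    Returns a list of (batch_start, values) tuples ordered by batch_start.
    Intervals in which nothing arrived are skipped, which is a
    simplification made for this sketch.
    """
    batches = defaultdict(list)
    for arrival_time, value in events:
        # Every event lands in the interval that covers its arrival time.
        batch_start = (arrival_time // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return sorted(batches.items())


events = [(0, "a"), (1, "b"), (3, "c"), (4, "d"), (9, "e")]
for batch_start, values in micro_batches(events, batch_interval=2):
    print(batch_start, values)
# 0 ['a', 'b']
# 2 ['c']
# 4 ['d']
# 8 ['e']
```

Note that the batch an element ends up in depends only on when it arrived, not on when the underlying event actually happened.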
Besides arrival time, Dataflow allows true event-time-based processing for each of its windowing strategies. Tumbling (in Beam terms, fixed) windows use non-overlapping time intervals. Hopping (sliding) windows can overlap; for example, they can collect the data from the last five minutes every ten seconds. Session windows use a gap duration and keys: when the time between two arrivals with a certain key is larger than the gap, a new window starts. If this isn’t enough, there is also an option to create custom windows. For further control, a watermark can indicate when you expect all the data to have arrived. Combined with triggers, it lets you decide when to emit the results.
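The three built-in strategies differ mainly in how a timestamp is mapped to one or more windows. The following plain-Python sketch shows that mapping; it is not the Beam API, the function names are invented for this example, and timestamps are plain numbers in seconds. Watermarks and triggers are left out.

```python
def fixed_window(timestamp, size):
    """Return the single [start, end) fixed window containing timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window that contains timestamp.

    A new window of length `size` starts every `period` seconds.
    """
    windows = []
    start = timestamp - timestamp % period  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps, gap):
    """Merge the timestamps of ONE key into sessions.

    A new session starts when the distance to the previous timestamp is
    larger than `gap`. Treating a distance of exactly `gap` as the same
    session is a choice made for this sketch.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return [(s[0], s[-1] + gap) for s in sessions]


print(fixed_window(7, size=5))                     # (5, 10)
print(sliding_windows(7, size=6, period=2))        # [(2, 8), (4, 10), (6, 12)]
print(len(sliding_windows(25, size=300, period=10)))  # 30
print(session_windows([1, 2, 3, 10, 11, 30], gap=5))  # [(1, 8), (10, 16), (30, 35)]
```

With the five-minute window emitted every ten seconds from the text above, each element belongs to 30 overlapping windows, which is why sliding windows tend to cost more than fixed ones.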
For analytic tools, Spark brings SQL queries, real-time stream and graph analysis, as well as machine learning to the table. Spark SQL works in unison with the DataFrame API. DataFrames have named columns like a relational database table, so analysts can execute dynamic queries on them using the familiar SQL syntax. The system comes with built-in optimization, columnar storage, caching and code generation to make matters faster and cheaper. The Spark Core engine provides in-memory analysis for raw, streamed, unstructured input data through the Streaming API. GraphX extends the core features with graph-parallel computation for analyzing graph-structured data. Finally, MLlib is a machine learning library filled with ready-to-use classification, clustering, and regression algorithms.
Dataflow is deeply integrated with Google Cloud Platform’s other services, and relies on them to provide insights. SQL queries are available through the BigQuery Web UI using the ZetaSQL syntax. BigQuery is also a fully managed service, so no hardware allocation is necessary. For example, you can join a snapshot of a BigQuery dataset with a stream from a Pub/Sub subscription, then write the result back to BigQuery for dashboarding. Other services, such as AutoML Tables or Google AI Platform, enable machine learning.
Even though their models bear a resemblance, Spark and Dataflow differ greatly in resource management. With Apache Spark, the first step is usually to deploy a cluster of nodes, then submit a job. After this comes the manual fine-tuning of resources, scaling clusters up or tearing them down. For this purpose, Spark supports pluggable cluster managers. The selection includes Kubernetes, Hadoop YARN, Mesos, or the built-in Spark Standalone option. Each manager coordinates master and worker nodes, and also provides solutions for security, high availability, scheduling and monitoring. Spark has the facilities to share cluster resources between running jobs, and to reallocate resources with simple deployment scripts.
Dataflow, on the other hand, is a fully managed service under Google Cloud Platform (GCP). The built-in load balancer works with horizontal autoscaling to add workers to the environment or remove them as demand requires. The automated, dynamic management reduces the need for DevOps work and minimizes the need for optimization. For cost control, you can set the minimum and maximum number of Compute Engine workers and their machine type, among other options.
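How minimum and maximum worker limits bound horizontal autoscaling can be shown with a small, purely illustrative calculation. The heuristic below (workers needed = input rate divided by per-worker throughput) is an assumption made for this sketch and is not Dataflow's actual autoscaling algorithm; `target_workers` and its parameters are hypothetical names.

```python
import math


def target_workers(input_rate, per_worker_rate, min_workers, max_workers):
    """Pick a worker count for the current load, clamped to the allowed range.

    input_rate      -- elements arriving per second (assumed to be measured)
    per_worker_rate -- elements one worker can process per second (assumed)
    """
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    needed = math.ceil(input_rate / per_worker_rate)
    # The limits act as a cost ceiling and a capacity floor.
    return max(min_workers, min(max_workers, needed))


print(target_workers(950, 100, min_workers=1, max_workers=20))   # 10
print(target_workers(5000, 100, min_workers=1, max_workers=20))  # 20 (capped)
print(target_workers(0, 100, min_workers=2, max_workers=20))     # 2 (floor)
```

The cap is what keeps a traffic spike from turning into an unexpectedly large bill, at the price of a growing backlog while the spike lasts.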
Whether your project would benefit from a built-in load balancer with autoscaling can be the deciding factor between the two options.
Dataflow vs. Spark: Minor Factors
Compared to the key differences between Dataflow and Spark, the next factors are not make-or-break. Still, they can tip the scale in some cases, so let’s not forget about them.
In this article I compared Dataflow vs. Spark based on their programming model, streaming facilities, analytic tools and resource management.
With Apache Spark, we went through some features of the Core engine including RDDs, then touched on DataFrames, Datasets, Spark SQL and the Streaming API. There was also an overview of Apache Beam, the data processing model behind Dataflow.
It turned out that both tools have options to easily swap between batches and streams. Spark featured basic possibilities to group and collect stream data into RDDs. However, Beam featured more exhaustive windowing options, complete with watermarks and triggers.
Spark’s main analytic tools included Spark SQL queries, GraphX and MLlib. In the same area, Dataflow relied on other GCP services such as BigQuery and AutoML Tables.
The greatest difference lay in resource management. Deploying and managing a Spark cluster requires some DevOps effort. In contrast, Dataflow is a fully managed, no-ops service with an automated load balancer and cost control.
The comparison showed that Google Cloud Dataflow and Apache Spark are usually good alternatives to each other, but with their differences in mind it is hopefully easier now to find the right solution for your project.