Data Engineering on
Google Cloud Platform

Why should you enroll in this course?

Nowadays, everyone wants to make data-driven decisions. The challenge is to provide meaningful, relevant, and up-to-date data to the decision makers. In this course, you will learn how to enable data-driven decision making by collecting, transforming, and visualizing data using Google’s revolutionary big data tools.

It's for you if

  • you do not enjoy setting up Hadoop clusters on bare machines;
  • you want to know how to crunch terabytes of data easily using Google Cloud Platform’s practically infinite resources;
  • you want to work with machine learning (ML) without a PhD;
  • you like to focus on real work instead of setting up the infrastructure;
  • you want to know how to create end-to-end data pipelines from the sources to the visualization;
  • you want real project examples and hands-on exercises.

You are the perfect audience if you are

  • an experienced developer responsible for managing big data transformations, including:
    • extracting, loading, transforming, cleaning, and validating data;
    • designing pipelines and architectures for data processing;
    • creating and maintaining ML and statistical models;
    • querying datasets, visualizing query results, and creating reports.

We will teach you to

  • design and build data-processing systems on Google Cloud Platform;
  • process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow;
  • derive business insights from extremely large datasets using Google BigQuery;
  • train and evaluate ML models, and make predictions with them, using TensorFlow and Cloud ML Engine;
  • leverage unstructured data using Spark and ML APIs on Cloud Dataproc;
  • extract instant insights from streaming data.

Our trainers

TRAINERS GO HERE

Type of course

We offer an onsite instructor-led class. It’s a combination of presentations, demos, and hands-on labs.

Duration

4 days (from 9 am to 6 pm).

Locations

Budapest
Munich
Singapore
Virtual

Dates

date
date
date
date

Pricing

xxx EUR
xxx EUR
xxx EUR
xxx EUR

Equipment

A personal laptop is required for all workshops; laptops will not be provided.

Prerequisites

  • Basic proficiency with a common query language such as SQL;
  • Experience with data modeling and extract, transform, load (ETL) activities, using a common programming language such as Python or Java;
  • Familiarity with ML and/or statistics;
  • Completed Google Cloud Fundamentals - Big Data and Machine Learning course OR equivalent experience.

Course outline

Module 01

Google Cloud Dataproc overview
  • Creating and managing clusters.
  • Leveraging custom machine types and preemptible worker nodes.
  • Scaling and deleting clusters.
  • Lab: Creating Hadoop Clusters with Google Cloud Dataproc.

Module 02

Running Dataproc jobs
  • Running Pig and Hive jobs.
  • Separating storage and compute.
  • Lab: Running Hadoop and Spark jobs with Dataproc.
  • Lab: Submit and monitor jobs.

Module 03

Integrating Dataproc with Google Cloud Platform
  • Customizing clusters with initialization actions.
  • Supporting BigQuery.
  • Lab: Leveraging Google Cloud Platform Services.

Module 04

Making sense of unstructured data with Google’s Machine Learning APIs
  • Examining Google’s Machine Learning APIs.
  • Exploring common ML use cases.
  • Invoking ML APIs.
  • Lab: Adding ML capabilities to big data analysis.

Module 05

Serverless data analysis with BigQuery
  • What is BigQuery?
  • Queries and functions.
  • Lab: Writing queries in BigQuery.
  • Loading data into BigQuery.
  • Exporting data from BigQuery.
  • Lab: Loading and exporting data.
  • Discovering nested and repeated fields.
  • Querying multiple tables.
  • Lab: Complex queries.
  • Performance and pricing.

Module 06

Serverless, autoscaling data pipelines with Dataflow
  • The Beam programming model.
  • Data pipelines in Beam Python.
  • Data pipelines in Beam Java.
  • Lab: Writing a Dataflow pipeline.
  • Scalable big data processing using Beam.
  • Lab: MapReduce in Dataflow.
  • Incorporating additional data.
  • Lab: Side inputs.
  • Handling stream data.
  • GCP Reference architecture.
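The "MapReduce in Dataflow" lab builds a word count. The Beam stages it exercises (a ParDo/FlatMap, a GroupByKey, and a combine step) can be sketched in plain Python, with no Beam dependency, roughly like this; in a real pipeline these would be `beam.FlatMap`, `beam.GroupByKey`, and a combiner running on Dataflow:

```python
from collections import defaultdict

def run_wordcount(lines):
    """Word count sketched as the three Beam stages the lab uses."""
    # ParDo / FlatMap: split each line into lowercase words.
    words = [w.lower() for line in lines for w in line.split()]
    # GroupByKey: collect identical words together.
    grouped = defaultdict(list)
    for w in words:
        grouped[w].append(1)
    # Combine: sum the per-key groups.
    return {w: sum(ones) for w, ones in grouped.items()}

counts = run_wordcount(["the quick fox", "the lazy dog"])
print(counts["the"])  # 2
```

The point of the lab is that the same three-stage shape scales from this toy to terabytes once Dataflow runs each stage in parallel.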

Module 07

Getting started with machine learning
  • What is machine learning (ML).
  • Effective ML: concepts, types.
  • ML datasets: generalization.
  • Lab: Explore and create ML datasets.
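The generalization topic above comes down to one rule: never evaluate a model on the data it was trained on. A minimal, deterministic hold-out split (a sketch, not the lab's actual code) looks like:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out a test set
    that the model never sees during training."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

Fixing the seed matters: it makes the split reproducible, so metrics are comparable across training runs.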

Module 08

Building ML models with TensorFlow
  • Getting started with TensorFlow.
  • Lab: Using tf.learn.
  • TensorFlow graphs and loops + lab.
  • Lab: Using low-level TensorFlow + early stopping.
  • Monitoring ML training.
  • Lab: Charts and graphs of TensorFlow training.
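The early-stopping lab hinges on one idea: stop training when validation loss stops improving. Decoupled from TensorFlow, the control loop is roughly this (the patience threshold here is an illustrative choice, not a prescribed value):

```python
def early_stop(val_losses, patience=3):
    """Return the epoch at which training should stop: the first
    epoch where the best validation loss seen so far has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Validation loss bottoms out at epoch 2, so with patience=3
# training stops at epoch 5.
print(early_stop([0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44]))  # 5
```

The same logic is what the monitoring charts in the lab make visible: a validation curve that flattens or rises while the training curve keeps falling.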

Module 09

Scaling ML models with CloudML
  • Why Cloud ML?
  • Packaging up a TensorFlow model.
  • End-to-end training.
  • Lab: Run an ML model locally and in the cloud.

Module 10

Feature engineering
  • Creating good features.
  • Transforming inputs.
  • Developing synthetic features.
  • Preprocessing with Cloud ML.
  • Lab: Feature engineering.

Module 11

Architecture of streaming analytics pipelines
  • Streaming data processing: Challenges.
  • Handling variable data volumes.
  • Dealing with unordered/late data.
  • Lab: Designing streaming pipeline.

Module 12

Ingesting variable volumes
  • What is Cloud Pub/Sub?
  • How it works: Topics and subscriptions.
  • Lab: Simulator.
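The topics-and-subscriptions model is easy to misread, so here is the essential behavior as a toy in-memory sketch (this assumes nothing about the Pub/Sub client library): every subscription attached to a topic receives its own copy of each published message.

```python
from collections import defaultdict, deque

class Topic:
    """Toy model of Pub/Sub fan-out: each subscription
    gets an independent copy of every published message."""
    def __init__(self):
        self.subscriptions = defaultdict(deque)

    def subscribe(self, name):
        self.subscriptions[name]  # create the empty queue

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, name):
        return self.subscriptions[name].popleft()

topic = Topic()
topic.subscribe("dashboard")
topic.subscribe("archiver")
topic.publish("sensor-reading-1")
print(topic.pull("dashboard"), topic.pull("archiver"))
# sensor-reading-1 sensor-reading-1
```

This fan-out is why one stream of sensor data can feed a dashboard and an archive pipeline at the same time without the publishers knowing about either.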

Module 13

Implementing streaming pipelines
  • Challenges in stream processing.
  • Handle late data: watermarks, triggers, accumulation.
  • Lab: Stream data processing pipeline for live traffic data.
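The watermark idea behind the late-data bullet can be shown without Beam. The sketch below uses a toy watermark that simply trails the maximum event time seen so far (real watermarks are heuristic estimates); an element whose event time falls behind the watermark beyond the allowed lateness is treated as late:

```python
def window_sums(arrivals, window=60, allowed_lateness=0):
    """Toy event-time windowing with a simple watermark.

    `arrivals` is a list of (event_time, value) in arrival order.
    The watermark trails the max event time seen so far; an element
    is late if its event time falls behind the watermark by more
    than `allowed_lateness`.
    """
    sums, late = {}, []
    watermark = float("-inf")
    for event_time, value in arrivals:
        if event_time < watermark - allowed_lateness:
            late.append((event_time, value))  # dropped as too late
            continue
        start = (event_time // window) * window  # window start
        sums[start] = sums.get(start, 0) + value
        watermark = max(watermark, event_time)
    return sums, late

# The reading with event time 10 arrives after the watermark has
# advanced to 130, so it is classified late and excluded.
sums, late = window_sums([(5, 1), (70, 2), (130, 3), (10, 4)])
print(sums)   # {0: 1, 60: 2, 120: 3}
print(late)   # [(10, 4)]
```

Beam's triggers and accumulation modes exist precisely to give you choices here: re-fire a window when late data arrives, accumulate it, or discard it, instead of the hard drop this toy performs.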

Module 14

Streaming analytics and dashboards
  • Streaming analytics: from data to decisions.
  • Querying streaming data with BigQuery.
  • What is Google Data Studio?
  • Lab: Build a real-time dashboard to visualize processed data.

Module 15

High throughput and low latency with Bigtable
  • What is Cloud Bigtable?
  • Designing Bigtable schema.
  • Ingesting into Bigtable.
  • Lab: Streaming into Bigtable.
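Bigtable schema design is largely row-key design: rows sort lexicographically by key, so time-series keys are often written as `entity#reversed_timestamp` to make the newest rows for an entity sort first. A hedged sketch of that convention (the exact key layout is an illustrative choice, not prescribed by Bigtable):

```python
MAX_TS = 10**10  # any constant larger than every timestamp used

def row_key(sensor_id, timestamp):
    """Build a Bigtable-style row key. Keys sort lexicographically,
    so the reversed, zero-padded timestamp makes newer readings
    sort before older ones for the same sensor."""
    reversed_ts = MAX_TS - timestamp
    return f"{sensor_id}#{reversed_ts:010d}"

keys = sorted(row_key("sensor-1", ts) for ts in [1000, 3000, 2000])
print(keys[0])  # key for the newest reading (ts=3000) sorts first
```

Zero-padding is essential: without it, `9999` would sort after `10000` as a string, breaking the newest-first ordering.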