In this module, the student will learn how to ingest and process streaming data at scale with Event Hubs and Spark Structured Streaming in Azure Databricks. The student will learn the key features and uses of Structured Streaming, implement sliding windows to aggregate over chunks of data, and apply watermarking to remove stale data. Finally, the student will connect to Event Hubs to read and write streams.
In this module, the student will be able to:
- Know the key features and uses of Structured Streaming
- Stream data from a file and write it out to a distributed file system
- Use sliding windows to aggregate over chunks of data rather than all data
- Apply watermarking to remove stale data
- Connect to Event Hubs to read and write streams
- Module 11 - Create a stream processing solution with Event Hubs and Azure Databricks
- Lab details
- Concepts
- Event Hubs and Spark Structured Streaming
- Streaming concepts
- Lab setup and pre-requisites
Apache Spark Structured Streaming is a fast, scalable, and fault-tolerant stream processing API. You can use it to perform analytics on your streaming data in near real time.
With Structured Streaming, you can use SQL queries to process streaming data in the same way that you would process static data. The API continuously increments and updates the final data.
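As a minimal PySpark sketch of that pattern (not taken from the lab notebooks; the path, schema, and view name are illustrative assumptions), a streaming DataFrame can be registered as a view and queried with ordinary Spark SQL, and Spark incrementally updates the result as new data arrives:

```python
# Minimal sketch only; the path, schema, and view name are illustrative.
stream_df = (spark.readStream
    .schema("device STRING, action STRING, time TIMESTAMP")  # streaming sources need an explicit schema
    .json("/mnt/demo/streaming-json/"))

# Register the stream as a view and query it like a static table.
stream_df.createOrReplaceTempView("device_events")
action_counts = spark.sql(
    "SELECT action, COUNT(*) AS total FROM device_events GROUP BY action")

# action_counts is itself a streaming DataFrame; in a Databricks notebook,
# display(action_counts) shows the result updating as new data arrives.
```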
Azure Event Hubs is a scalable real-time data ingestion service that can process millions of events per second. It can receive large amounts of data from multiple sources and stream the prepared data to Azure Data Lake or Azure Blob storage.
Azure Event Hubs can be integrated with Spark Structured Streaming to perform processing of messages in near real time. You can query and analyze the processed data as it comes by using a Structured Streaming query and Spark SQL.
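As a hedged sketch of what that integration looks like in PySpark, assuming the azure-eventhubs-spark connector library is attached to the cluster and using a placeholder for the connection string you copy later in this lab:

```python
# Sketch only: the connection string placeholder is an assumption for illustration.
connection_string = "<your-event-hubs-connection-string>"

ehConf = {
    # The connector expects the connection string to be encrypted before use.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

raw_stream = (spark.readStream
    .format("eventhubs")
    .options(**ehConf)
    .load())

# The payload arrives in the binary body column; cast it to a string so it can be
# parsed and analyzed with Structured Streaming queries and Spark SQL.
messages = raw_stream.selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")
```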
Stream processing is where you continuously incorporate new data into Data Lake storage and compute results. The streaming data comes in faster than it can be consumed with traditional batch processing techniques. A stream of data is treated as a table to which data is continuously appended. Examples of such data include bank card transactions, Internet of Things (IoT) device data, and video game play events.
A streaming system consists of:
- Input sources such as Kafka, Azure Event Hubs, IoT Hub, files on a distributed system, or TCP/IP sockets
- Stream processing using Structured Streaming, writing to sinks such as foreach sinks, memory sinks, and so on
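Put together, a streaming pipeline wires an input source through a Structured Streaming transformation into a sink. A minimal PySpark sketch, with illustrative paths and schema, might look like this:

```python
# Input source: JSON files landing in a directory on a distributed file system.
events = (spark.readStream
    .schema("id STRING, eventType STRING, time TIMESTAMP")
    .json("/mnt/demo/incoming-events/"))

# Stream processing: keep only error events.
errors = events.filter("eventType = 'error'")

# Output sink: append the results to files, with a checkpoint for fault tolerance.
query = (errors.writeStream
    .format("parquet")
    .option("checkpointLocation", "/mnt/demo/checkpoints/errors")
    .outputMode("append")
    .start("/mnt/demo/output/errors"))
```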
- You have successfully completed Module 0 to create your lab environment.
To complete this lab, you will need to create an event hub.
- In the Azure portal, select + Create a resource. Enter event hubs into the Search the Marketplace box, select Event Hubs from the results, and then select Create.
- In the Create Namespace pane, enter the following information:
- Subscription: Select the subscription group you're using for this module.
- Resource group: Choose your module resource group.
- Namespace name: Enter a unique name, such as databricksdemoeventhubsxxx, where xxx are your initials. Uniqueness will be indicated by a green check mark.
- Location: Select the location you're using for this module.
- Pricing tier: Select Basic.
- Select Review + create, then select Create.
- After your Event Hubs namespace is provisioned, browse to it and add a new event hub by selecting the + Event Hub button on the toolbar.
- On the Create Event Hub pane, enter:
- Name: Enter databricks-demo-eventhub.
- Partition Count: Enter 2.
- Select Create.
- On the left-hand menu in your Event Hubs namespace, select Shared access policies under Settings, then select the RootManageSharedAccessKey policy.
- Copy the connection string for the primary key by selecting the copy button.
- Save the copied primary key to Notepad.exe or another text editor for later reference.
- If you do not currently have your Azure Databricks workspace open: in the Azure portal, navigate to your deployed Azure Databricks workspace and select Launch Workspace.
- In the left pane, select Compute. If you have an existing cluster, ensure that it is running (start it if necessary). If you don't have an existing cluster, create one that uses the latest runtime and Scala 2.12 with spot instances enabled.
- When your cluster is running, in the left pane, select Workspace > Users, and select your username (the entry with the house icon).
- In the pane that appears, select the arrow next to your name, and select Import.
- In the Import Notebooks dialog box, select the URL and paste in the following URL:
https://github.com/MicrosoftLearning/DP-203-Data-Engineer/raw/master/Allfiles/microsoft-learning-paths-databricks-notebooks/data-engineering/DBC/10-Structured-Streaming.dbc
- Select Import.
- Select the 10-Structured-Streaming folder that appears.
Open the 1.Structured-Streaming-Concepts notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within.
Within the notebook, you will:
- Stream data from a file and write it out to a distributed file system
- List active streams
- Stop active streams
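These operations map roughly to the following PySpark sketch (the paths, schema, and options are illustrative assumptions; the notebook's own cells are authoritative):

```python
# Stream data from a directory of files and write it back out to the file system.
stream_df = (spark.readStream
    .schema("time TIMESTAMP, action STRING")
    .option("maxFilesPerTrigger", 1)          # one file per micro-batch keeps the demo steady
    .json("/mnt/demo/streaming-input/"))

query = (stream_df.writeStream
    .format("parquet")
    .option("checkpointLocation", "/mnt/demo/streaming-output/_checkpoint")
    .outputMode("append")
    .start("/mnt/demo/streaming-output/"))

# List the active streams on the cluster, then stop them when you are done.
for s in spark.streams.active:
    print(s.id, s.name)
    s.stop()
```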
After you've completed the notebook, return to this screen, and continue to the next step.
In your Azure Databricks workspace, open the 10-Structured-Streaming folder that you imported within your user folder.
Open the 2.Time-Windows notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within.
Within the notebook, you will:
- Use sliding windows to aggregate over chunks of data rather than all data
- Apply watermarking to throw away stale data that you do not have space to keep
- Plot live graphs using display
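In rough PySpark terms (a sketch with assumed column names, not the notebook's exact code), those steps combine a watermark with a sliding window aggregation:

```python
from pyspark.sql.functions import window, col

windowed_counts = (stream_df                                  # a streaming DataFrame, as in the earlier sketch
    .withWatermark("time", "2 hours")                         # drop state for data more than 2 hours late
    .groupBy(window(col("time"), "1 hour", "15 minutes"),     # 1-hour windows sliding every 15 minutes
             col("action"))
    .count())

# In Databricks, calling display() on a streaming DataFrame renders a live,
# continuously updating chart of the windowed aggregation.
display(windowed_counts)
```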
After you've completed the notebook, return to this screen, and continue to the next step.
In your Azure Databricks workspace, open the 10-Structured-Streaming folder that you imported within your user folder.
Open the 3.Streaming-With-Event-Hubs-Demo notebook. Make sure you attach your cluster to the notebook before following the instructions and running the cells within.
Within the notebook, you will:
- Connect to Event Hubs and write a stream to your event hub
- Read a stream from your event hub
- Define a schema for the JSON payload and parse the data to display it within a table
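As a hedged end-to-end sketch of that round trip (reusing the ehConf configuration from the earlier Event Hubs example; source_df, the checkpoint path, and the payload fields are assumptions for illustration):

```python
from pyspark.sql.functions import to_json, from_json, struct, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Write a stream to the event hub: the connector expects the payload in a body column.
write_query = (source_df
    .select(to_json(struct([col(c) for c in source_df.columns])).alias("body"))
    .writeStream
    .format("eventhubs")
    .options(**ehConf)
    .option("checkpointLocation", "/mnt/demo/checkpoints/eh-write")
    .start())

# Define a schema for the JSON payload, read the stream back, and parse it into columns.
payload_schema = StructType([
    StructField("device", StringType()),
    StructField("action", StringType()),
    StructField("time", TimestampType()),
])

parsed = (spark.readStream
    .format("eventhubs")
    .options(**ehConf)
    .load()
    .select(from_json(col("body").cast("string"), payload_schema).alias("event"))
    .select("event.*"))

display(parsed)  # shows the parsed fields as a live table
```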