The Azure Databricks repository is a set of blog posts, offered as an Advent of 2020 present to readers for easier onboarding to Azure Databricks!
Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
- Dec 08: Connect to Azure Blob storage using Notebooks in Azure Databricks
Yesterday we introduced the Databricks CLI and showed how to upload a file from "anywhere" to Databricks. Today we will look at how to use Azure Blob Storage for storing files and how to access the data from Azure Databricks notebooks.
We need to leave Azure Databricks and go to the Azure portal. Search for Storage accounts.
Create a new storage account by clicking "+ Add". Select the subscription, resource group, storage account name, location, account type and replication.
Continue by setting up networking, data protection and advanced settings, then create the storage account. Once the storage account is ready, we will create the storage container itself. Note that general-purpose v2 storage accounts support the latest Azure Storage features and all the functionality of general-purpose v1 and Blob Storage accounts. General-purpose v2 accounts offer the lowest per-gigabyte capacity prices for Azure Storage and support the following Azure Storage services:
- Blobs (all types: Block, Append, Page)
- Data Lake Gen2
- Files
- Disks
- Queues
- Tables
Once the account is ready to be used, select it and choose "Container".
A container is blob storage for unstructured data and works well with Azure Databricks DBFS. In the Container section, select "+ Container" to add a new container and give it a name.
Once the container is created, click on the container to get additional details.
Your data will be stored in this container and later used with Azure Databricks notebooks. You can also access the storage using Microsoft Azure Storage Explorer. It is much more intuitive and offers easier management, folder creation and handling of binary files.
You can upload a file using the Microsoft Azure Storage Explorer tool or directly in the portal. In an organisation, files and data will usually be copied here automatically by other Azure services. Upload the file that is available in the GitHub repository (data/Day9_MLBPlayers.csv; the data file is licensed under GNU) to the blob storage container in any way you like. I used Storage Explorer and simply dragged and dropped the file into the container.
Before we go back to Azure Databricks, we need to set the access policy for this container. Select "Access Policy".
We need to create a Shared Access Signature (SAS), which grants time-limited, permission-scoped access to the storage account without sharing the account key. Click Access policy in the left menu and, once the new page has loaded, select "+ Add Policy" under Shared access policies and give the policy a name, permissions and a validity period:
Click OK to confirm and click Save (save icon). Go back to the storage account and select Shared Access Signature on the left.
Under Allowed resource types, it is mandatory to select Container, but you can select all of them. Set the start and expiry date (1 month in my case). Click "Generate SAS and connection string" and copy the strings you need into a text editor; the connection string and the SAS token should be enough.
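Before pasting the generated SAS token into a notebook, it can help to split it into its query parameters and check what it actually allows. The following is a minimal Python sketch using only the standard library. The token is a made-up placeholder in the shape of an account SAS (the `sig` value is not a real signature), and the helper name `describe_sas` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Placeholder token shaped like an account SAS; the signature is fake.
EXAMPLE_SAS = (
    "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rl"
    "&se=2021-01-08T06:15:32Z&st=2020-12-08T22:15:32Z"
    "&spr=https&sig=PLACEHOLDER"
)

def describe_sas(token):
    """Split a SAS token into its fields and parse the validity window."""
    # parse_qs returns a list per key; each SAS field appears once.
    fields = {k: v[0] for k, v in parse_qs(token.lstrip("?")).items()}
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # timestamps in the token are UTC
    return {
        "services": fields["ss"],
        "resource_types": fields["srt"],
        "permissions": fields["sp"],
        "starts": datetime.strptime(fields["st"], fmt).replace(tzinfo=timezone.utc),
        "expires": datetime.strptime(fields["se"], fmt).replace(tzinfo=timezone.utc),
    }

info = describe_sas(EXAMPLE_SAS)
has_container_access = "c" in info["resource_types"]
print(info["permissions"], info["expires"].isoformat(), has_container_access)
```

In an account SAS, `srt` lists the allowed resource types (`s` for service, `c` for container, `o` for object), so the last check confirms that Container was selected when the token was generated.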
Once this is done, let's continue with Azure Databricks notebooks.
Start up a cluster and create a new notebook (as discussed on Day 4 and Day 7). The notebook is available on GitHub.
And the code is:
    %scala
    val containerName = "dbpystorecontainer"
    val storageAccountName = "dbpystorage"
    val sas = "?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-12-09T06:15:32Z&st=2020-12-08T22:15:32Z&spr=https&sig=S%2B0nzHXioi85aW%2FpBdtUdR9vd20SRKTzhNwNlcJJDqc%3D"
    // Configuration key under which the SAS token is passed to the mount
    val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
Then mount the storage with the mount function, passing the SAS token under the configuration key built above:
    %scala
    dbutils.fs.mount(
      source = "wasbs://dbpystorecontainer@dbpystorage.blob.core.windows.net/Day9_MLBPlayers.csv",
      mountPoint = "/mnt/storage1",
      extraConfigs = Map(config -> sas))
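For readers who prefer Python cells, the same pieces can be assembled with plain string formatting. This is a sketch rather than the notebook's own code: the helper names are hypothetical, `dbutils` is passed in as a parameter because it only exists inside a Databricks notebook, and the keyword arguments `mount_point` and `extra_configs` are the ones the Python version of `dbutils.fs.mount` accepts, as far as I am aware.

```python
def sas_config_key(container, account):
    """Configuration key under which the SAS token for a container is passed."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"

def wasbs_url(container, account, path=""):
    """Build a wasbs:// URL for a container, optionally with a path inside it."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

def mount_if_needed(dbutils, container, account, sas, mount_point, path=""):
    """Mount the container unless something is already mounted at mount_point.

    Returns True if a new mount was created, False if one already existed.
    """
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(
        source=wasbs_url(container, account, path),
        mount_point=mount_point,
        extra_configs={sas_config_key(container, account): sas},
    )
    return True

# Inside a Databricks Python cell (sas holds the token copied from the portal):
# mount_if_needed(dbutils, "dbpystorecontainer", "dbpystorage", sas, "/mnt/storage1")
```

Checking `dbutils.fs.mounts()` first avoids the error raised when a cell is re-run and the mount point is already taken.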
When you run the following Scala command, it creates a DataFrame called mydf1:
    %scala
    val mydf1 = spark.read
      .option("header", "true")       // first row holds the column names
      .option("inferSchema", "true")  // let Spark detect the column types
      .csv("/mnt/storage1")
    display(mydf1)
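The same read can also be written in a Python cell. The sketch below wraps it in a function that takes the `spark` session as an argument, so nothing runs until it is called. The function name `load_csv` is hypothetical, and the options mirror the Scala command above.

```python
def load_csv(spark, path="/mnt/storage1"):
    """Read a CSV with a header row and let Spark infer the column types."""
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
    )

# Inside a Databricks Python cell, where `spark` and `display` are predefined:
# mydf1 = load_csv(spark)
# display(mydf1)
```

Note that `inferSchema` makes Spark take an extra pass over the data to work out the types, which is fine for a small file like this one.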
Now we can start exploring the dataset, and I am using the R language for that.
This was a long but important topic. Now you know how to access and store data.
Tomorrow we will look at how to start using notebooks, and from now on we will focus more on analytics and less on infrastructure.
The complete set of code and notebooks will be available in the GitHub repository.
Happy Coding and Stay Healthy!