---
title: Copy data to/from Azure Blob Storage
description: 'Learn how to copy blob data in Azure Data Factory. Use our sample: How to copy data to and from Azure Blob Storage and Azure SQL Database.'
services: data-factory
documentationcenter: ''
author: linda33wj
manager: shwang
ms.assetid: bec8160f-5e07-47e4-8ee1-ebb14cfb805d
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 01/05/2018
ms.author: jingwang
robots: noindex
---
[!NOTE]
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Azure Blob Storage connector in V2.
This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob Storage. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
You can copy data from any supported source data store to Azure Blob Storage, or from Azure Blob Storage to any supported sink data store. The following tables list the data stores supported as sources and sinks by the copy activity. For example, you can move data from a SQL Server database or a database in Azure SQL Database to Azure Blob storage. And, you can copy data from Azure Blob storage to Azure Synapse Analytics or an Azure Cosmos DB collection.
[!INCLUDE updated-for-az]
You can copy data from Azure Blob Storage to the following data stores:
[!INCLUDE data-factory-supported-sink]
You can copy data from the following data stores to Azure Blob Storage:
[!INCLUDE data-factory-supported-sources]
[!IMPORTANT]
Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob storage. The activity supports reading from block, append, or page blobs, but supports writing to only block blobs. Azure Premium Storage is not supported as a sink because it is backed by page blobs.
Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need to delete source data after a successful copy, create a custom activity to delete the data and use the activity in the pipeline. For an example, see the Delete blob or folder sample on GitHub.
You can create a pipeline with a copy activity that moves data to/from an Azure Blob Storage by using different tools/APIs.
The easiest way to create a pipeline is to use the Copy Wizard. This article has a walkthrough for creating a pipeline to copy data from an Azure Blob Storage location to another Azure Blob Storage location. For a tutorial on creating a pipeline to copy data from an Azure Blob Storage to Azure SQL Database, see Tutorial: Create a pipeline using Copy Wizard.
You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create a data factory. A data factory may contain one or more pipelines.
- Create linked services to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to Azure SQL Database, you create two linked services to link your Azure storage account and Azure SQL Database to your data factory. For linked service properties that are specific to Azure Blob Storage, see linked service properties section.
- Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in Azure SQL Database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see dataset properties section.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Blob Storage, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Blob Storage.
There are two types of linked services you can use to link Azure Storage to an Azure data factory: the AzureStorage linked service and the AzureStorageSas linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage account. The Azure Storage SAS (shared access signature) linked service, on the other hand, provides the data factory with restricted, time-bound access to the Azure Storage account. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.
[!INCLUDE data-factory-azure-storage-linked-services]
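For quick reference, a minimal AzureStorageSas linked service definition generally takes the following shape. This is a sketch: the linked service name and the `<storage-sas-uri>` value are placeholders, and the linked service details above remain the authoritative description of the supported properties.

```json
{
  "name": "StorageSasLinkedService",
  "properties": {
    "type": "AzureStorageSas",
    "typeProperties": {
      "sasUri": "<storage-sas-uri>"
    }
  }
}
```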
To specify a dataset to represent input or output data in an Azure Blob Storage, you set the type property of the dataset to: AzureBlob. Set the linkedServiceName property of the dataset to the name of the Azure Storage or Azure Storage SAS linked service. The type properties of the dataset specify the blob container and the folder in the blob storage.
For a full list of JSON sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
Data Factory supports the following CLS-compliant .NET-based type values for providing type information in "structure" for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store.
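For example, a structure section that declares typed columns for a blob dataset might look like the following sketch; the column names here are illustrative only.

```json
"structure": [
  { "name": "userid", "type": "Int64" },
  { "name": "name", "type": "String" },
  { "name": "lastlogindate", "type": "Datetime" }
]
```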
The typeProperties section is different for each type of dataset and provides information about the location, format, and so on, of the data in the data store. The typeProperties section for a dataset of type AzureBlob has the following properties:
Property | Description | Required |
---|---|---|
folderPath | Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\ | Yes |
fileName | Name of the blob. fileName is optional and case-sensitive.<br/><br/>If you specify a file name, the activity (including Copy) works on the specific blob.<br/><br/>When fileName is not specified, Copy includes all blobs in the folderPath for the input dataset.<br/><br/>When fileName is not specified for an output dataset and preserveHierarchy is not specified in the activity sink, the name of the generated file is in the following format: `Data.<Guid>.txt` (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
partitionedBy | partitionedBy is an optional property. You can use it to specify a dynamic folderPath and fileName for time-series data. For example, folderPath can be parameterized for every hour of data. See the Using partitionedBy property section for details and examples. | No |
format | The following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, ParquetFormat. Set the type property under format to one of these values. For more information, see the Text Format, Json Format, Avro Format, Orc Format, and Parquet Format sections.<br/><br/>If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. | No |
compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, BZip2, and ZipDeflate. Supported levels are: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
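Putting these properties together, a hypothetical typeProperties section for an AzureBlob dataset that reads a GZip-compressed delimited text file might look like the following sketch; the container, folder, and file names are placeholders.

```json
"typeProperties": {
  "folderPath": "mycontainer/myfolder/",
  "fileName": "myfile.csv.gz",
  "format": {
    "type": "TextFormat",
    "columnDelimiter": ","
  },
  "compression": {
    "type": "GZip",
    "level": "Optimal"
  }
}
```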
As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data with the partitionedBy property, Data Factory functions, and the system variables.
For more information on time series datasets, scheduling, and slices, see Creating Datasets and Scheduling & Execution articles.
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
```
In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (YYYYMMDDHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
```
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties.
For a full list of sections and properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output datasets, and policies are available for all types of activities, whereas properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. If you are moving data from Azure Blob Storage, you set the source type in the copy activity to BlobSource. Similarly, if you are moving data to Azure Blob Storage, you set the sink type in the copy activity to BlobSink. This section provides a list of properties supported by BlobSource and BlobSink.
BlobSource supports the following properties in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
recursive | Indicates whether the data is read recursively from the sub folders or only from the specified folder. | True (default value), False | No |
BlobSink supports the following properties in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
copyBehavior | Defines the copy behavior when the source is BlobSource or FileSystem. | PreserveHierarchy: Preserves the file hierarchy in the target folder. The relative path of a source file to the source folder is identical to the relative path of the target file to the target folder.<br/><br/>FlattenHierarchy: All files from the source folder are placed in the first level of the target folder. The target files have auto-generated names.<br/><br/>MergeFiles: Merges all files from the source folder into one file. If the file/blob name is specified, the merged file name is the specified name; otherwise, the file name is auto-generated. | No |
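As a sketch, the typeProperties of a copy activity for a blob-to-blob copy that reads subfolders recursively and merges the output into a single file might look like the following; the property values are illustrative.

```json
"typeProperties": {
  "source": {
    "type": "BlobSource",
    "recursive": true
  },
  "sink": {
    "type": "BlobSink",
    "copyBehavior": "MergeFiles"
  }
}
```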
BlobSource also supports these two properties for backward compatibility.
- treatEmptyAsNull: Specifies whether to treat a null or empty string as a null value.
- skipHeaderLineCount: Specifies how many lines need to be skipped. It is applicable only when the input dataset uses TextFormat.
Similarly, BlobSink supports the following property for backward compatibility.
- blobWriterAddHeader: Specifies whether to add a header of column definitions while writing to an output dataset.
Datasets now support the following properties that implement the same functionality: treatEmptyAsNull, skipLineCount, firstRowAsHeader.
The following table provides guidance on using the new dataset properties in place of these blob source/sink properties.
Copy Activity property | Dataset property |
---|---|
skipHeaderLineCount on BlobSource | skipLineCount and firstRowAsHeader. Lines are skipped first and then the first row is read as a header. |
treatEmptyAsNull on BlobSource | treatEmptyAsNull on input dataset |
blobWriterAddHeader on BlobSink | firstRowAsHeader on output dataset |
See Specifying TextFormat section for detailed information on these properties.
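For instance, a TextFormat definition on a dataset that uses these newer properties in place of the legacy copy activity properties might look like the following sketch; the specific values are illustrative.

```json
"format": {
  "type": "TextFormat",
  "columnDelimiter": ",",
  "skipLineCount": 2,
  "firstRowAsHeader": true,
  "treatEmptyAsNull": true
}
```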
This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values.
recursive | copyBehavior | Resulting behavior |
---|---|---|
true | preserveHierarchy | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target folder Folder1 is created with the same structure as the source:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5 |
true | flattenHierarchy | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target Folder1 is created with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File5 |
true | mergeFiles | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target Folder1 is created with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1 + File2 + File3 + File4 + File5 contents are merged into one file with an auto-generated file name |
false | preserveHierarchy | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. |
false | flattenHierarchy | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;auto-generated name for File2<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. |
false | mergeFiles | For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File2<br/>&nbsp;&nbsp;&nbsp;&nbsp;Subfolder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File3<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File4<br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/>&nbsp;&nbsp;&nbsp;&nbsp;File1 + File2 contents are merged into one file with an auto-generated file name. auto-generated name for File1<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. |
Let's look at how to quickly copy data to/from Azure Blob storage. In this walkthrough, both the source and destination data stores are of type Azure Blob Storage. The pipeline in this walkthrough copies data from one folder to another folder in the same blob container. This walkthrough is intentionally simple to show you the settings and properties you use when Blob Storage is a source or sink.
- Create a general-purpose Azure storage account if you don't have one already; see the Create a storage account article for steps to create one. You use the blob storage as both the source and destination data store in this walkthrough.
- Create a blob container named adfblobconnector in the storage account.
- Create a folder named input in the adfblobconnector container.
- Create a file named emp.txt with the following content, and upload it to the input folder by using tools such as Azure Storage Explorer:

      John, Doe
      Jane, Doe
- Sign in to the Azure portal.
- Click Create a resource from the top-left corner, click Intelligence + analytics, and click Data Factory.
- In the New data factory pane:
- Enter ADFBlobConnectorDF for the name. The name of the Azure data factory must be globally unique. If you receive the error *Data factory name "ADFBlobConnectorDF" is not available*, change the name of the data factory (for example, yournameADFBlobConnectorDF) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
- Select your Azure subscription.
- For Resource Group, select Use existing to select an existing resource group (or) select Create new to enter a name for a resource group.
- Select a location for the data factory.
- Select Pin to dashboard check box at the bottom of the blade.
- Click Create.
- After the creation is complete, you see the Data Factory blade for your data factory.
- On the Data Factory home page, click the Copy data tile to launch Copy Data Wizard in a separate tab.
[!NOTE] If you see that the web browser is stuck at "Authorizing...", disable/uncheck Block third-party cookies and site data setting (or) keep it enabled and create an exception for login.microsoftonline.com and then try launching the wizard again.
- In the Properties page:
- Enter CopyPipeline for Task name. The task name is the name of the pipeline in your data factory.
- Enter a description for the task (optional).
- For Task cadence or Task schedule, keep the Run regularly on schedule option. If you want to run this task only once instead of running it repeatedly on a schedule, select Run once now. If you select the Run once now option, a one-time pipeline is created.
- Keep the settings for Recurring pattern. This task runs daily between the start and end times you specify in the next step.
- Change the Start date time to 04/21/2017.
- Change the End date time to 04/25/2017. You may want to type the date instead of browsing through the calendar.
- Click Next.
- On the Source data store page, click the Azure Blob Storage tile. You use this page to specify the source data store for the copy task. You can use an existing data store linked service (or) specify a new data store. To use an existing linked service, you would select FROM EXISTING LINKED SERVICES and select the right linked service.
- On the Specify the Azure Blob storage account page:
- Keep the auto-generated name for Connection name. The connection name is the name of the linked service of type: Azure Storage.
- Confirm that From Azure subscriptions option is selected for Account selection method.
- Select your Azure subscription or keep Select all for Azure subscription.
- Select an Azure storage account from the list of Azure storage accounts available in the selected subscription. You can also choose to enter storage account settings manually by selecting Enter manually option for the Account selection method.
- Click Next.
- On the Choose the input file or folder page:
- Confirm that the file or folder is set to adfblobconnector/input. If the files are in sub folders, for example, 2017/04/01, 2017/04/02, and so on, enter adfblobconnector/input/{year}/{month}/{day} for file or folder. When you press TAB out of the text box, you see three drop-down lists to select formats for year (yyyy), month (MM), and day (dd).
- Do not select the Copy file recursively option. This option recursively traverses folders for files to be copied to the destination.
- Do not select the Binary copy option. This option performs a binary copy of the source file to the destination; leave it cleared for this walkthrough so that you can see more options on the next pages.
- Confirm that the Compression type is set to None. Select a value for this option if your source files are compressed in one of the supported formats.
- Click Next.
- On the File format settings page, you see the delimiters and the schema that are auto-detected by the wizard by parsing the file.
- Confirm the following options:
  a. The file format is set to Text format. You can see all the supported formats in the drop-down list. For example: JSON, Avro, ORC, Parquet.
  b. The column delimiter is set to Comma (,). You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
  c. The row delimiter is set to Carriage Return + Line feed (\r\n). You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter.
  d. The skip line count is set to 0. If you want a few lines to be skipped at the top of the file, enter the number here.
  e. The first data row contains column names option is not set. If the source files contain column names in the first row, select this option.
  f. The treat empty column value as null option is set.
- Expand Advanced settings to see the advanced options available.
- At the bottom of the page, see the preview of data from the emp.txt file.
- Click SCHEMA tab at the bottom to see the schema that the copy wizard inferred by looking at the data in the source file.
- Click Next after you review the delimiters and preview data.
- On the Destination data store page, select Azure Blob Storage, and click Next. You are using Azure Blob Storage as both the source and destination data store in this walkthrough.
- On the Specify the Azure Blob storage account page:
- Enter AzureStorageLinkedService for the Connection name field.
- Confirm that From Azure subscriptions option is selected for Account selection method.
- Select your Azure subscription.
- Select your Azure storage account.
- Click Next.
- On the Choose the output file or folder page:
- Specify the folder path as adfblobconnector/output/{year}/{month}/{day}, and then press TAB.
- For the year, select yyyy.
- For the month, confirm that it is set to MM.
- For the day, confirm that it is set to dd.
- Confirm that the compression type is set to None.
- Confirm that the copy behavior is set to Merge files. If the output file with the same name already exists, the new content is added to the same file at the end.
- Click Next.
- On the File format settings page, review the settings, and click Next. One of the additional options here is to add a header to the output file. If you select that option, a header row is added with names of the columns from the schema of the source. You can rename the default column names when viewing the schema for the source. For example, you could change the first column to First Name and second column to Last Name. Then, the output file is generated with a header with these names as column names.
- On the Performance settings page, confirm that cloud units and parallel copies are set to Auto, and click Next. For details about these settings, see Copy activity performance and tuning guide.
- On the Summary page, review all settings (task properties, settings for source and destination, and copy settings), and click Next.
- Review information in the Summary page, and click Finish. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard).
- Click the Click here to monitor copy pipeline link on the Deployment page.
- You should see the Monitor and Manage application in a separate tab.
- Change the start time at the top to 04/19/2017 and the end time to 04/27/2017, and then click Apply.
- You should see five activity windows in the ACTIVITY WINDOWS list. The WindowStart times should cover all days from the pipeline start to the pipeline end times.
- Click Refresh button for the ACTIVITY WINDOWS list a few times until you see the status of all the activity windows is set to Ready.
- Now, verify that the output files are generated in the output folder of the adfblobconnector container. You should see the following folder structure in the output folder:

      2017/04/21
      2017/04/22
      2017/04/23
      2017/04/24
      2017/04/25

For detailed information about monitoring and managing data factories, see the Monitor and manage Data Factory pipeline article.
Now, switch back to the tab with the Data Factory home page. Notice that there are two linked services, two datasets, and one pipeline in your data factory now.
Click Author and deploy to launch Data Factory Editor.
You should see the following Data Factory entities in your data factory:
- Two linked services. One for the source and the other one for the destination. Both the linked services refer to the same Azure Storage account in this walkthrough.
- Two datasets. An input dataset and an output dataset. In this walkthrough, both use the same blob container but refer to different folders (input and output).
- A pipeline. The pipeline contains a copy activity that uses a blob source and a blob sink to copy data from an Azure blob location to another Azure blob location.
The following sections provide more information about these entities.
You should see two linked services, one for the source and the other for the destination. In this walkthrough, both definitions look the same except for the names. The type of the linked service is set to AzureStorage. The most important property of the linked service definition is connectionString, which Data Factory uses to connect to your Azure Storage account at runtime. Ignore the hubName property in the definition.
```json
{
"name": "Source-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}
```

```json
{
"name": "Destination-BlobStorage-z4y",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********"
}
}
}
```
For more information about Azure Storage linked service, see Linked service properties section.
There are two datasets: an input dataset and an output dataset. The type of the dataset is set to AzureBlob for both.
The input dataset points to the input folder of the adfblobconnector blob container. The external property is set to true for this dataset as the data is not produced by the pipeline with the copy activity that takes this dataset as an input.
The output dataset points to the output folder of the same blob container. The output dataset also uses the year, month, and day of the SliceStart system variable to dynamically evaluate the path for the output file. For a list of functions and system variables supported by Data Factory, see Data Factory functions and system variables. The external property is set to false (default value) because this dataset is produced by the pipeline.
For more information about properties supported by Azure Blob dataset, see Dataset properties section.
```json
{
"name": "InputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Source-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/input/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": true,
"policy": {}
}
}
```

```json
{
"name": "OutputDataset-z4y",
"properties": {
"structure": [
{ "name": "Prop_0", "type": "String" },
{ "name": "Prop_1", "type": "String" }
],
"type": "AzureBlob",
"linkedServiceName": "Destination-BlobStorage-z4y",
"typeProperties": {
"folderPath": "adfblobconnector/output/{year}/{month}/{day}",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
},
"partitionedBy": [
{ "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
]
},
"availability": {
"frequency": "Day",
"interval": 1
},
"external": false,
"policy": {}
}
}
```
The pipeline has just one activity. The type of the activity is set to Copy. In the type properties for the activity, there are two sections, one for the source and the other for the sink. The source type is set to BlobSource because the activity copies data from blob storage. The sink type is set to BlobSink because the activity copies data to blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output.
For more information about properties supported by BlobSource and BlobSink, see Copy activity properties section.
```json
{
"name": "CopyPipeline",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"sink": {
"type": "BlobSink",
"copyBehavior": "MergeFiles",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "InputDataset-z4y"
}
],
"outputs": [
{
"name": "OutputDataset-z4y"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1
},
"name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y"
}
],
"start": "2017-04-21T22:34:00Z",
"end": "2017-04-25T05:00:00Z",
"isPaused": false,
"pipelineMode": "Scheduled"
}
}
```
The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. They show how to copy data to and from Azure Blob Storage and Azure SQL Database. However, data can be copied directly from any of the supported sources to any of the supported sinks by using the Copy Activity in Azure Data Factory.
The following sample shows:
- A linked service of type AzureSqlDatabase.
- A linked service of type AzureStorage.
- An input dataset of type AzureBlob.
- An output dataset of type AzureSqlTable.
- A pipeline with a Copy activity that uses BlobSource and SqlSink.
The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties used in these samples are described in sections following the samples.
Azure SQL linked service:
```json
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
```
Azure Storage linked service:
```json
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
```
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the former, you specify the connection string that includes the account key; for the latter, you specify the shared access signature (SAS) URI. See the Linked services section for details.
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day parts of the start time, and the file name uses the hour part of the start time. The "external": "true" setting informs Data Factory that the table is external to the data factory and is not produced by an activity in the data factory.
```json
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/",
"fileName": "{Hour}.csv",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
```
Azure SQL output dataset:
The sample copies data to a table named “MyTable” in Azure SQL Database. Create the table in your SQL database with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
```json
{
"name": "AzureSqlOutput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
A copy activity in a pipeline with Blob source and SQL sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to SqlSink.
```json
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoSQL",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
```
The following sample shows:
- A linked service of type AzureSqlDatabase.
- A linked service of type AzureStorage.
- An input dataset of type AzureSqlTable.
- An output dataset of type AzureBlob.
- A pipeline with Copy activity that uses SqlSource and BlobSink.
The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples.
Azure SQL linked service:
```json
{
"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
```
Azure Storage linked service:
```json
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
```
Azure Data Factory supports two types of Azure Storage linked services: AzureStorage and AzureStorageSas. For the former, you specify the connection string that includes the account key; for the latter, you specify the shared access signature (SAS) URI. See the Linked services section for details.
Azure SQL input dataset:
The sample assumes you have created a table "MyTable" in Azure SQL Database and that it contains a column called "timestampcolumn" for time-series data.
Setting "external": "true" informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.
```json
{
"name": "AzureSqlInput",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": "AzureSqlLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
```
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time.
```json
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/",
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } }
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
A copy activity in a pipeline with SQL source and Blob sink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlSource and sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.
```json
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSQLInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
```
[!NOTE]
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.