---
title: Move data from an HTTP source - Azure
description: Learn how to move data from an on-premises or cloud HTTP source by using Azure Data Factory.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: shwang
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 05/22/2018
ms.author: jingwang
robots: noindex
---
Note
This article applies to version 1 of Data Factory. If you're using the current version of the Azure Data Factory service, see HTTP connector in V2.
This article outlines how to use Copy Activity in Azure Data Factory to move data from an on-premises or cloud HTTP endpoint to a supported sink data store. This article builds on Move data by using Copy Activity, which presents a general overview of data movement by using Copy Activity. The article also lists the data stores that Copy Activity supports as sources and sinks.
Data Factory currently supports only moving data from an HTTP source to other data stores. It doesn't support moving data from other data stores to an HTTP destination.
You can use this HTTP connector to retrieve data from both a cloud and an on-premises HTTP/S endpoint by using the HTTP GET or POST methods. The following authentication types are supported: Anonymous, Basic, Digest, Windows, and ClientCertificate. Note the difference between this connector and the Web table connector. The Web table connector extracts table content from an HTML webpage.
When you copy data from an on-premises HTTP endpoint, you must install Data Management Gateway in the on-premises environment or in an Azure VM. To learn about Data Management Gateway and for step-by-step instructions on how to set up the gateway, see Moving data between on-premises locations and the cloud.
You can create a pipeline that has a copy activity to move data from an HTTP source by using different tools or APIs:

- The easiest way to create a pipeline is to use the Copy Data wizard. For a quick walkthrough of creating a pipeline by using the Copy Data wizard, see Tutorial: Create a pipeline by using the Copy wizard.
- You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, an Azure Resource Manager template, the .NET API, or the REST API. For step-by-step instructions on how to create a pipeline that has a copy activity, see the Copy Activity tutorial. For JSON samples that copy data from an HTTP source to Azure Blob storage, see JSON examples.
The following table describes JSON elements that are specific to the HTTP linked service:
Property | Description | Required |
---|---|---|
type | The type property must be set to Http. | Yes |
url | The base URL to the web server. | Yes |
authenticationType | Specifies the authentication type. Allowed values are Anonymous, Basic, Digest, Windows, and ClientCertificate. Refer to later sections in this article for more properties and JSON samples for these authentication types. | Yes |
enableServerCertificateValidation | Specifies whether to enable server TLS/SSL certificate validation if the source is an HTTPS web server. When your HTTPS server uses a self-signed certificate, set this to false. | No (the default is true) |
gatewayName | The name of the Data Management Gateway instance to use to connect to an on-premises HTTP source. | Yes, if you are copying data from an on-premises HTTP source |
encryptedCredential | The encrypted credential for accessing the HTTP endpoint. The value is autogenerated when you configure the authentication information in the Copy wizard or by using the ClickOnce dialog box. | No (applies only when you copy data from an on-premises HTTP server) |
For details about setting credentials for an on-premises HTTP connector data source, see Move data between on-premises sources and the cloud by using Data Management Gateway.
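For illustration, here's a minimal sketch of how the generic properties in the preceding table might be combined for an on-premises HTTPS endpoint that's reached through a gateway and doesn't require credentials. The URL and gateway name are placeholders, not values from this article:

```json
{
    "name": "OnPremHttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "https://<your on-premises server>/",
            "enableServerCertificateValidation": false,
            "gatewayName": "<gateway name>"
        }
    }
}
```

Set enableServerCertificateValidation to false only when the server uses a self-signed certificate, as noted in the preceding table.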
To use Basic, Digest, or Windows authentication, set authenticationType to Basic, Digest, or Windows. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties:
Property | Description | Required |
---|---|---|
userName | The user name to use to access the HTTP endpoint. | Yes |
password | The password for the user (username). | Yes |
Example: Using Basic, Digest, or Windows authentication
```json
{
    "name": "HttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "basic",
            "url": "https://en.wikipedia.org/wiki/",
            "userName": "user name",
            "password": "password"
        }
    }
}
```
To use client certificate authentication, set authenticationType to ClientCertificate. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties:
Property | Description | Required |
---|---|---|
embeddedCertData | The Base64-encoded binary data of the PFX file. | Specify either embeddedCertData or certThumbprint |
certThumbprint | The thumbprint of the certificate that's installed in your gateway machine's certificate store. Applies only when you copy data from an on-premises HTTP source. | Specify either embeddedCertData or certThumbprint |
password | The password that's associated with the certificate. | No |
If you use certThumbprint for authentication and the certificate is installed in the personal store of the local computer, grant read permissions to the gateway service:
- Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
- Expand Certificates > Personal, and then select Certificates.
- Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
- On the Security tab, add the user account under which the Data Management Gateway Host Service is running, with read access to the certificate.
Example: Using a client certificate
This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is installed on the machine that has Data Management Gateway installed.
```json
{
    "name": "HttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "https://en.wikipedia.org/wiki/",
            "certThumbprint": "thumbprint of certificate",
            "gatewayName": "gateway name"
        }
    }
}
```
Example: Using a client certificate in a file
This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate file on the machine that has Data Management Gateway installed.
```json
{
    "name": "HttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "https://en.wikipedia.org/wiki/",
            "embeddedCertData": "Base64-encoded cert data",
            "password": "password of cert"
        }
    }
}
```
Some sections of a dataset JSON file, such as structure, availability, and policy, are similar for all dataset types (Azure SQL Database, Azure Blob storage, Azure Table storage).
For a full list of sections and properties that are available for defining datasets, see Creating datasets.
The typeProperties section is different for each type of dataset. The typeProperties section provides information about the location of the data in the data store. The typeProperties section for a dataset of the Http type has the following properties:
Property | Description | Required |
---|---|---|
type | The type of the dataset must be set to Http. | Yes |
relativeUrl | A relative URL to the resource that contains the data. When the path isn't specified, only the URL that's specified in the linked service definition is used. To construct a dynamic URL, you can use Data Factory functions and system variables. Example: relativeUrl: $$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart). | No |
requestMethod | The HTTP method. Allowed values are GET and POST. | No (default is GET) |
additionalHeaders | Additional HTTP request headers. | No |
requestBody | The body for the HTTP request. | No |
format | If you want to retrieve the data from an HTTP endpoint as-is without parsing it, skip the format setting. If you want to parse the HTTP response content during copy, the following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. For more information, see Text format, JSON format, Avro format, Orc format, and Parquet format. | No |
compression | Specify the type and level of compression for the data. Supported types: GZip, Deflate, BZip2, and ZipDeflate. Supported levels: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
Example: Using the GET (default) method
```json
{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "XXX/test.xml",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```
Example: Using the POST method
```json
{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "/XXX/test.xml",
            "requestMethod": "Post",
            "requestBody": "body for POST HTTP request"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```
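The format and compression properties described in the dataset table can be combined with either request method. The following sketch is illustrative only; it assumes a hypothetical CSV resource delivered as a GZip archive, and the dataset name and relative URL are placeholders:

```json
{
    "name": "HttpSourceCompressedDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "<path to a gzipped CSV resource>",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```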
Properties like name, description, input and output tables, and policy are available for all types of activities.
For a full list of sections and properties that are available for defining activities, see Creating pipelines.
Properties that are available in the typeProperties section of the activity vary with each activity type. For a copy activity, properties vary depending on the types of sources and sinks.
Currently, when the source in Copy Activity is of the HttpSource type, the following properties are supported:
Property | Description | Required |
---|---|---|
httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. It's the timeout to get a response, not the timeout to read response data. | No (default value: 00:01:40) |
For more information, see File and compression formats in Azure Data Factory.
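As a hedged illustration of the httpRequestTimeout property, the following fragment shows how the source section of a copy activity's typeProperties might set a three-minute timeout; the value is an example, not a recommendation:

```json
"typeProperties": {
    "source": {
        "type": "HttpSource",
        "httpRequestTimeout": "00:03:00"
    },
    "sink": {
        "type": "BlobSink"
    }
}
```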
The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. The examples show how to copy data from an HTTP source to Azure Blob storage. However, data can be copied directly from any of the sources to any of the sinks that are supported by using Copy Activity in Azure Data Factory.
Example: Copy data from an HTTP source to Azure Blob storage
The Data Factory solution for this sample contains the following Data Factory entities:
- A linked service of type HTTP.
- A linked service of type AzureStorage.
- An input dataset of type Http.
- An output dataset of type AzureBlob.
- A pipeline that has a copy activity that uses HttpSource and BlobSink.
The sample copies data from an HTTP source to an Azure blob every hour. The JSON properties used in these samples are described in sections that follow the samples.
This example uses the HTTP linked service with anonymous authentication. See HTTP linked service for different types of authentication you can use.
```json
{
    "name": "HttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "https://en.wikipedia.org/wiki/"
        }
    }
}
```
```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}
```
Setting external to true informs the Data Factory service that the dataset is external to the data factory and isn't produced by an activity in the data factory.
```json
{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```
Data is written to a new blob every hour (frequency: hour, interval: 1).
```json
{
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "adfgetstarted/Movies"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```
The pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is set to HttpSource and the sink type is set to BlobSink.
For the list of properties that HttpSource supports, see HttpSource.
```json
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline with a copy activity",
        "activities": [
            {
                "name": "HttpSourceToAzureBlob",
                "description": "Copy from an HTTP source to an Azure blob",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "HttpSourceDataInput"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOutput"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "HttpSource"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}
```
Note
To map columns from a source dataset to columns from a sink dataset, see Mapping dataset columns in Azure Data Factory.
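As a hedged sketch of what such a mapping can look like in a version 1 copy activity, the fragment below adds a TabularTranslator with placeholder column names to the typeProperties section; see the linked article for the full syntax and options:

```json
"typeProperties": {
    "source": {
        "type": "HttpSource"
    },
    "sink": {
        "type": "BlobSink"
    },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "SourceColumnA: SinkColumnA, SourceColumnB: SinkColumnB"
    }
}
```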
To learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the Copy Activity performance and tuning guide.