title: Move data from an HTTP source - Azure
description: Learn how to move data from an on-premises or cloud HTTP source by using Azure Data Factory.
services: data-factory
documentationcenter:
author: linda33wj
manager: shwang
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 05/22/2018
ms.author: jingwang
robots: noindex

Move data from an HTTP source by using Azure Data Factory

Note

This article applies to version 1 of Data Factory. If you're using the current version of the Azure Data Factory service, see HTTP connector in V2.

This article outlines how to use Copy Activity in Azure Data Factory to move data from an on-premises or cloud HTTP endpoint to a supported sink data store. This article builds on Move data by using Copy Activity, which presents a general overview of data movement by using Copy Activity. The article also lists the data stores that Copy Activity supports as sources and sinks.

Data Factory currently supports only moving data from an HTTP source to other data stores. It doesn't support moving data from other data stores to an HTTP destination.

Supported scenarios and authentication types

You can use this HTTP connector to retrieve data from both a cloud and an on-premises HTTP/S endpoint by using the HTTP GET or POST methods. The following authentication types are supported: Anonymous, Basic, Digest, Windows, and ClientCertificate. Note the difference between this connector and the Web table connector. The Web table connector extracts table content from an HTML webpage.

When you copy data from an on-premises HTTP endpoint, you must install Data Management Gateway in the on-premises environment or in an Azure VM. To learn about Data Management Gateway and for step-by-step instructions on how to set up the gateway, see Moving data between on-premises locations and the cloud.

Get started

You can create a pipeline that has a copy activity to move data from an HTTP source by using different tools or APIs:

  • The easiest way to create a pipeline is to use the Copy Data wizard. For a quick walkthrough of creating a pipeline by using the Copy Data wizard, see Tutorial: Create a pipeline by using the Copy wizard.

  • You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, an Azure Resource Manager template, the .NET API, or the REST API. For step-by-step instructions on how to create a pipeline that has a copy activity, see the Copy Activity tutorial. For JSON samples that copy data from an HTTP source to Azure Blob storage, see JSON examples.

Linked service properties

The following table describes JSON elements that are specific to the HTTP linked service:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property must be set to Http. | Yes |
| url | The base URL to the web server. | Yes |
| authenticationType | Specifies the authentication type. Allowed values are Anonymous, Basic, Digest, Windows, and ClientCertificate. Refer to later sections in this article for more properties and JSON samples for these authentication types. | Yes |
| enableServerCertificateValidation | Specifies whether to enable server TLS/SSL certificate validation if the source is an HTTPS web server. When your HTTPS server uses a self-signed certificate, set this to false. | No (the default is true) |
| gatewayName | The name of the Data Management Gateway instance to use to connect to an on-premises HTTP source. | Yes, if you are copying data from an on-premises HTTP source |
| encryptedCredential | The encrypted credential for accessing the HTTP endpoint. The value is autogenerated when you configure the authentication information in the Copy wizard or by using the ClickOnce dialog box. | No (applies only when you copy data from an on-premises HTTP server) |

For details about setting credentials for an on-premises HTTP connector data source, see Move data between on-premises sources and the cloud by using Data Management Gateway.
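
As an illustration of the optional properties in the preceding table, the following sketch shows a linked service for an HTTPS server that presents a self-signed certificate. The linked service name and the URL are placeholders assumed for this example; only the property names come from the table.

```json
{
    "name": "HttpSelfSignedLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "https://<your server>/",
            "enableServerCertificateValidation": false
        }
    }
}
```

If enableServerCertificateValidation is omitted, it defaults to true and the server certificate is validated.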

Using Basic, Digest, or Windows authentication

Set authenticationType to Basic, Digest, or Windows. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties:

| Property | Description | Required |
| --- | --- | --- |
| userName | The user name to use to access the HTTP endpoint. | Yes |
| password | The password for the user (username). | Yes |

Example: Using Basic, Digest, or Windows authentication

{
    "name": "HttpLinkedService",
    "properties":
    {
        "type": "Http",
        "typeProperties":
        {
            "authenticationType": "Basic",
            "url": "https://en.wikipedia.org/wiki/",
            "userName": "user name",
            "password": "password"
        }
    }
}
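
For an on-premises endpoint, the same properties can be combined with gatewayName from the linked service properties table. The following is a minimal sketch that assumes Windows authentication. The linked service name, server URL, and the domain\user form of the user name are illustrative placeholders, not values taken from this article.

```json
{
    "name": "OnPremHttpLinkedService",
    "properties": {
        "type": "Http",
        "typeProperties": {
            "authenticationType": "Windows",
            "url": "http://<on-premises server>/",
            "userName": "<domain>\\<user name>",
            "password": "<password>",
            "gatewayName": "<gateway name>"
        }
    }
}
```

Note that the backslash in the user name must be escaped as \\ inside a JSON string.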

Using ClientCertificate authentication

To use client certificate authentication, set authenticationType to ClientCertificate. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties:

Property Description Required
| embeddedCertData | The Base64-encoded contents of binary data of the PFX file. | Specify either embeddedCertData or certThumbprint |
| certThumbprint | The thumbprint of the certificate that was installed on your gateway machine's cert store. Applies only when you copy data from an on-premises HTTP source. | Specify either embeddedCertData or certThumbprint |
| password | The password that's associated with the certificate. | No |

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local computer, grant read permissions to the gateway service:

  1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
  2. Expand Certificates > Personal, and then select Certificates.
  3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
  4. On the Security tab, add the user account under which the Data Management Gateway Host Service is running, with read access to the certificate.

Example: Using a client certificate

This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is installed on the machine that has Data Management Gateway installed.

{
    "name": "HttpLinkedService",
    "properties":
    {
        "type": "Http",
        "typeProperties":
        {
            "authenticationType": "ClientCertificate",
            "url": "https://en.wikipedia.org/wiki/",
            "certThumbprint": "thumbprint of certificate",
            "gatewayName": "gateway name"
        }
    }
}

Example: Using a client certificate in a file

This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate file on the machine that has Data Management Gateway installed.

{
    "name": "HttpLinkedService",
    "properties":
    {
        "type": "Http",
        "typeProperties":
        {
            "authenticationType": "ClientCertificate",
            "url": "https://en.wikipedia.org/wiki/",
            "embeddedCertData": "Base64-encoded cert data",
            "password": "password of cert"
        }
    }
}

Dataset properties

Some sections of a dataset JSON file, such as structure, availability, and policy, are similar for all dataset types (Azure SQL Database, Azure Blob storage, Azure Table storage).

For a full list of sections and properties that are available for defining datasets, see Creating datasets.

The typeProperties section is different for each type of dataset. The typeProperties section provides information about the location of the data in the data store. The typeProperties section for a dataset of the Http type has the following properties:

| Property | Description | Required |
| --- | --- | --- |
| type | The type of the dataset must be set to Http. | Yes |
| relativeUrl | A relative URL to the resource that contains the data. When the path isn't specified, only the URL that's specified in the linked service definition is used. To construct a dynamic URL, you can use Data Factory functions and system variables. Example: relativeUrl: $$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart). | No |
| requestMethod | The HTTP method. Allowed values are GET and POST. | No (default is GET) |
| additionalHeaders | Additional HTTP request headers. | No |
| requestBody | The body for the HTTP request. | No |
| format | If you want to retrieve the data from an HTTP endpoint as-is without parsing it, skip the format setting. If you want to parse the HTTP response content during copy, the following format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. For more information, see Text format, JSON format, Avro format, Orc format, and Parquet format. | No |
| compression | Specify the type and level of compression for the data. Supported types: GZip, Deflate, BZip2, and ZipDeflate. Supported levels: Optimal and Fastest. For more information, see File and compression formats in Azure Data Factory. | No |
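
To show how the format and compression properties fit into a dataset definition, here is a minimal sketch for a GZip-compressed, comma-delimited text file. The dataset name and relative URL are hypothetical, and the columnDelimiter setting assumes the TextFormat options described in File and compression formats in Azure Data Factory.

```json
{
    "name": "HttpCompressedCsvInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "reports/daily.csv.gz",
            "requestMethod": "GET",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```

Omitting both format and compression would instead copy the response as-is, without parsing.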

Example: Using the GET (default) method

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "XXX/test.xml",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Example: Using the POST method

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "/XXX/test.xml",
            "requestMethod": "POST",
            "requestBody": "body for POST HTTP request"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval":  1
        }
    }
}
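
A POST request often needs a content-type header alongside its body. The following sketch combines requestMethod, additionalHeaders, and requestBody for a JSON payload. The dataset name, relative URL, and body are hypothetical, and whether a given server accepts this request shape is an assumption. The header string follows the "Name: value\n" pattern used in the GET example.

```json
{
    "name": "HttpPostJsonInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "/api/query",
            "requestMethod": "POST",
            "additionalHeaders": "Content-Type: application/json\n",
            "requestBody": "{\"month\": \"2018-05\"}"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```

Because requestBody is a JSON string, double quotation marks inside the payload must be escaped as \".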

Copy Activity properties

Properties like name, description, input and output tables, and policy are available for all types of activities.

For a full list of sections and properties that are available for defining activities, see Creating pipelines.

Properties that are available in the typeProperties section of the activity vary with each activity type. For a copy activity, properties vary depending on the types of sources and sinks.

Currently, when the source in Copy Activity is of the HttpSource type, the following properties are supported:

| Property | Description | Required |
| --- | --- | --- |
| httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. It's the timeout to get a response, not the timeout to read response data. | No (default value: 00:01:40) |
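
As a sketch of where this property goes, the following typeProperties section of a copy activity raises the timeout to five minutes for a slow endpoint. The five-minute value is an arbitrary choice for illustration, and BlobSink is taken from the sample pipeline in this article.

```json
{
    "typeProperties": {
        "source": {
            "type": "HttpSource",
            "httpRequestTimeout": "00:05:00"
        },
        "sink": {
            "type": "BlobSink"
        }
    }
}
```

The value uses the hh:mm:ss TimeSpan form, the same form as the default 00:01:40 (100 seconds).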

Supported file and compression formats

For more information, see File and compression formats in Azure Data Factory.

JSON examples

The following examples provide sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. The examples show how to copy data from an HTTP source to Azure Blob storage. However, data can be copied directly from any of the sources to any of the sinks that are supported by using Copy Activity in Azure Data Factory.

Example: Copy data from an HTTP source to Azure Blob storage

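The Data Factory solution for this sample contains the following Data Factory entities:

  • A linked service of type Http
  • A linked service of type AzureStorage
  • An input dataset of type Http
  • An output dataset of type AzureBlob
  • A pipeline with a copy activity that uses HttpSource and BlobSink

The sample copies data from an HTTP source to an Azure blob every hour. The HTTP-specific JSON properties used in these samples are described in the earlier sections of this article.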

HTTP linked service

This example uses the HTTP linked service with anonymous authentication. See HTTP linked service for different types of authentication you can use.

{
    "name": "HttpLinkedService",
    "properties":
    {
        "type": "Http",
        "typeProperties":
        {
            "authenticationType": "Anonymous",
            "url" : "https://en.wikipedia.org/wiki/"
        }
    }
}

Azure storage linked service

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
    }
  }
}

HTTP input dataset

Setting external to true informs the Data Factory service that the dataset is external to the data factory and isn't produced by an activity in the data factory.

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "Http",
        "linkedServiceName": "HttpLinkedService",
        "typeProperties": {
            "relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Azure blob output dataset

Data is written to a new blob every hour (frequency: Hour, interval: 1).

{
    "name": "AzureBlobOutput",
    "properties":
    {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties":
        {
            "folderPath": "adfgetstarted/Movies"
        },
        "availability":
        {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

Pipeline that uses a copy activity

The pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the source type is set to HttpSource and the sink type is set to BlobSink.

For the list of properties that HttpSource supports, see HttpSource.

{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline with a copy activity",
        "activities": [
            {
                "name": "HttpSourceToAzureBlob",
                "description": "Copy from an HTTP source to an Azure blob",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "HttpSourceDataInput"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOutput"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "HttpSource"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}

Note

To map columns from a source dataset to columns from a sink dataset, see Mapping dataset columns in Azure Data Factory.

Performance and tuning

To learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the Copy Activity performance and tuning guide.