---
title: Move data from Web Table using Azure Data Factory
description: Learn about how to move data from a table in a Web page using Azure Data Factory.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: shwang
ms.assetid: f54a26a4-baa4-4255-9791-5a8f935898e2
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 01/05/2018
ms.author: jingwang
robots: noindex
---
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Web table connector in V2.
This article outlines how to use the Copy Activity in Azure Data Factory to move data from a table in a Web page to a supported sink data store. This article builds on the data movement activities article that presents a general overview of data movement with copy activity and the list of data stores supported as sources/sinks.
Data Factory currently supports only moving data from a Web table to other data stores; moving data from other data stores to a Web table destination is not supported.
Important
This Web connector currently supports only extracting table content from an HTML page. To retrieve data from an HTTP/HTTPS endpoint, use the HTTP connector instead.
To use this Web table connector, you need to set up a Self-hosted Integration Runtime (also known as Data Management Gateway) and configure the gatewayName property in the sink linked service. For example, to copy from a Web table to Azure Blob storage, configure the Azure Storage linked service as follows:
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>",
            "gatewayName": "<gateway name>"
        }
    }
}
You can create a pipeline with a copy activity that moves data from a Web table by using different tools/APIs.
- The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
- You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create linked services to link input and output data stores to your data factory.
- Create datasets to represent input and output data for the copy operation.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from a web table, see JSON example: Copy data from Web table to Azure Blob section of this article.
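If you choose the REST API route mentioned above, each Data Factory entity is created or updated by sending its JSON definition to the Azure Resource Manager endpoint. The following Python sketch illustrates that call for the Web linked service defined later in this article. It is only an outline: the resource path and api-version shown are assumptions about the Data Factory version 1 REST API, and the subscription, resource group, factory name, and token acquisition are placeholders you need to supply.

```python
import json
import requests

# Placeholders -- replace with your own values.
subscription_id = "<subscription id>"
resource_group = "<resource group>"
factory_name = "<data factory name>"
# An Azure AD access token for https://management.azure.com, for example from
# `az account get-access-token` or the azure-identity library.
bearer_token = "<access token>"

# Assumed Data Factory v1 resource path and API version; verify against the REST API reference.
url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription_id}"
    f"/resourcegroups/{resource_group}"
    "/providers/Microsoft.DataFactory"
    f"/datafactories/{factory_name}"
    "/linkedservices/WebLinkedService"
    "?api-version=2015-10-01"
)

# The request body is the same JSON definition shown in this article,
# saved locally as WebLinkedService.json.
with open("WebLinkedService.json") as f:
    definition = json.load(f)

response = requests.put(
    url,
    headers={"Authorization": f"Bearer {bearer_token}"},
    json=definition,
)
response.raise_for_status()
print(response.status_code, response.reason)
```

Datasets and the pipeline are deployed with the same pattern against their respective resource paths; check the Data Factory REST API reference for the exact segment names.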
The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Web table:
The following table provides descriptions for JSON elements specific to the Web linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: Web | Yes |
url | URL to the Web source | Yes |
authenticationType | Type of authentication used to connect to the Web table source. Currently, only Anonymous is supported. | Yes |
{
    "name": "web",
    "properties": {
        "type": "Web",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "https://en.wikipedia.org/wiki/"
        }
    }
}
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type WebTable has the following properties:
Property | Description | Required |
---|---|---|
type | The type of the dataset. It must be set to WebTable. | Yes |
path | A relative URL to the resource that contains the table. | No. When path is not specified, only the URL specified in the linked service definition is used. |
index | The index of the table in the resource. See the Get index of a table in an HTML page section for steps to get the index of a table in an HTML page. | Yes |
Example:
{
    "name": "WebTableInput",
    "properties": {
        "type": "WebTable",
        "linkedServiceName": "WebLinkedService",
        "typeProperties": {
            "index": 1,
            "path": "AFI's_100_Years...100_Movies"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
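In this dataset, the path is a relative URL that is resolved against the url in the Web linked service, so the page that is actually read is the AFI's 100 Years...100 Movies page shown in the Excel walkthrough later in this article. The following standalone Python sketch (not part of Data Factory, just an illustration) shows how the two values combine:

```python
from urllib.parse import urljoin

# Values from the Web linked service and the WebTable dataset above.
linked_service_url = "https://en.wikipedia.org/wiki/"
dataset_path = "AFI%27s_100_Years...100_Movies"

# The dataset path is resolved relative to the linked service URL.
page_url = urljoin(linked_service_url, dataset_path)
print(page_url)  # https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies
```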
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
Currently, when the source in copy activity is of type WebSource, no additional properties are supported.
The following sample shows:
- A linked service of type Web.
- A linked service of type AzureStorage.
- An input dataset of type WebTable.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses WebSource and BlobSink.
The sample copies data from a Web table to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.
The following sample shows how to copy data from a Web table to an Azure blob. However, data can be copied directly to any of the sinks stated in the Data Movement Activities article by using the Copy Activity in Azure Data Factory.
Web linked service
This example uses the Web linked service with anonymous authentication. See the Web linked service section for different types of authentication you can use.
{
    "name": "WebLinkedService",
    "properties": {
        "type": "Web",
        "typeProperties": {
            "authenticationType": "Anonymous",
            "url": "https://en.wikipedia.org/wiki/"
        }
    }
}
Azure Storage linked service
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>",
            "gatewayName": "<gateway name>"
        }
    }
}
WebTable input dataset
Setting external to true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
Note
See Get index of a table in an HTML page section for steps to getting index of a table in an HTML page.
{
    "name": "WebTableInput",
    "properties": {
        "type": "WebTable",
        "linkedServiceName": "WebLinkedService",
        "typeProperties": {
            "index": 1,
            "path": "AFI's_100_Years...100_Movies"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
Azure Blob output dataset
Data is written to a new blob every hour (frequency: hour, interval: 1).
{
    "name": "AzureBlobOutput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "adfgetstarted/Movies"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to WebSource and sink type is set to BlobSink.
See WebSource type properties for the list of properties supported by the WebSource.
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2014-06-01T18:00:00",
        "end": "2014-06-01T19:00:00",
        "description": "pipeline with copy activity",
        "activities": [
            {
                "name": "WebTableToAzureBlob",
                "description": "Copy from a Web table to an Azure blob",
                "type": "Copy",
                "inputs": [
                    {
                        "name": "WebTableInput"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobOutput"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "WebSource"
                    },
                    "sink": {
                        "type": "BlobSink"
                    }
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "OldestFirst",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}
To get the index of a table in an HTML page, you can use Excel 2016:

1. Launch Excel 2016 and switch to the Data tab.
2. Click New Query on the toolbar, point to From Other Sources, and click From Web.
3. In the From Web dialog box, enter the URL that you would use in the linked service JSON (for example: https://en.wikipedia.org/wiki/) along with the path you would specify for the dataset (for example: AFI%27s_100_Years...100_Movies), and click OK. The URL used in this example is https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies.
4. If you see the Access Web content dialog box, select the right URL and authentication, and click Connect.
5. Click a table item in the tree view to see content from the table, and then click the Edit button at the bottom.
6. In the Query Editor window, click the Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.
If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See the Connect to a web page article for details. The steps are similar if you are using Power BI Desktop.
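If you would rather check the index programmatically than through Excel or Power Query, you can parse the page outside of Data Factory and list the tables it contains, for example with pandas. This is only a rough cross-check: pandas enumerates the HTML table elements in document order, which may not always line up with the index that the connector or Power Query assigns, so confirm the value with the Excel steps above. The sketch assumes pandas and an HTML parser such as lxml are installed.

```python
import pandas as pd

# The same page used in the linked service + dataset example in this article.
url = "https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies"

# read_html returns one DataFrame per HTML table, in document order.
tables = pd.read_html(url)

# Print each table's position and its first few column names to identify
# which position corresponds to the table you want to copy.
for i, table in enumerate(tables):
    print(i, list(table.columns)[:5])
```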
Note
To map columns from source dataset to columns from sink dataset, see Mapping dataset columns in Azure Data Factory.
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.