---
title: 'Data Transformation: Process & transform data'
description: Learn how to transform data or process data in Azure Data Factory using Hadoop, Azure Machine Learning Studio (classic), or Azure Data Lake Analytics.
services: data-factory
documentationcenter: ''
author: dcstwh
ms.author: weetok
manager: jroth
ms.reviewer: maghan
ms.service: data-factory
ms.workload: data-services
ms.topic: conceptual
ms.date: 01/10/2018
---

# Transform data in Azure Data Factory version 1

> [!div class="op_single_selector"]
> * Overview

> [!NOTE]
> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see data transformation activities in Data Factory.

This article explains the data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights. A transformation activity executes in a computing environment such as an Azure HDInsight cluster or Azure Batch. This article also provides links to articles with detailed information on each transformation activity.

Data Factory supports the following data transformation activities that can be added to pipelines either individually or chained with another activity.

> [!NOTE]
> For a walkthrough with step-by-step instructions, see the Create a pipeline with Hive transformation article.

## HDInsight Hive activity

The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the Hive Activity article for details about this activity.
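As an illustration, a Hive activity in a version 1 pipeline JSON might look like the following minimal sketch. The linked service names, script path, and dataset names (`MyHDInsightLinkedService`, `MyStorageLinkedService`, `AzureBlobInput`, `AzureBlobOutput`) are placeholders, not values from this article:

```json
{
    "name": "RunSampleHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/transformdata.hql",
        "scriptLinkedService": "MyStorageLinkedService",
        "defines": {
            "inputTable": "AzureBlobInput",
            "outputTable": "AzureBlobOutput"
        }
    },
    "inputs": [ { "name": "AzureBlobInput" } ],
    "outputs": [ { "name": "AzureBlobOutput" } ],
    "scheduler": { "frequency": "Day", "interval": 1 }
}
```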

## HDInsight Pig activity

The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the Pig Activity article for details about this activity.
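The shape is similar to the Hive activity, with a Pig script instead of a Hive script. A sketch, again with placeholder names and paths:

```json
{
    "name": "RunSamplePigScript",
    "type": "HDInsightPig",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/transformdata.pig",
        "scriptLinkedService": "MyStorageLinkedService"
    },
    "inputs": [ { "name": "AzureBlobInput" } ],
    "outputs": [ { "name": "AzureBlobOutput" } ]
}
```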

## HDInsight MapReduce activity

The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the MapReduce Activity article for details about this activity.
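A MapReduce activity points at a JAR file and a main class rather than a script. As a rough sketch (the class name, JAR path, linked service names, and arguments are placeholders):

```json
{
    "name": "RunMapReduceProgram",
    "type": "HDInsightMapReduce",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "className": "com.example.WordCount",
        "jarFilePath": "adfjars/wordcount.jar",
        "jarLinkedService": "MyStorageLinkedService",
        "arguments": [ "wasb://input/", "wasb://output/" ]
    },
    "inputs": [ { "name": "AzureBlobInput" } ],
    "outputs": [ { "name": "AzureBlobOutput" } ]
}
```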

## HDInsight Streaming activity

The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the HDInsight Streaming Activity article for details about this activity.
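A streaming activity supplies mapper and reducer executables instead of a compiled JAR. A minimal sketch; the executables, storage account, and paths shown are placeholder values:

```json
{
    "name": "RunStreamingJob",
    "type": "HDInsightStreaming",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "mapper": "cat.exe",
        "reducer": "wc.exe",
        "input": "wasb://adfsample@mystorageaccount.blob.core.windows.net/data/input.txt",
        "output": "wasb://adfsample@mystorageaccount.blob.core.windows.net/data/output.txt",
        "filePaths": [
            "adfsample/apps/cat.exe",
            "adfsample/apps/wc.exe"
        ],
        "fileLinkedService": "MyStorageLinkedService"
    }
}
```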

## HDInsight Spark activity

The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. For details, see Invoke Spark programs from Azure Data Factory.
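A Spark activity references the root folder and entry file of the Spark program in the storage associated with the cluster. A rough sketch, where `rootPath`, `entryFilePath`, and the linked service name are placeholders:

```json
{
    "name": "RunSparkProgram",
    "type": "HDInsightSpark",
    "linkedServiceName": "MyHDInsightLinkedService",
    "typeProperties": {
        "rootPath": "adfspark/pyFiles",
        "entryFilePath": "transform.py",
        "getDebugInfo": "Always"
    },
    "outputs": [ { "name": "AzureBlobOutput" } ]
}
```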

## Azure Machine Learning Studio (classic) activities

Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning Studio (classic) web service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can invoke a Studio (classic) web service to make predictions on the data in batch.

Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new input datasets. After you are done with retraining, you want to update the scoring web service with the retrained machine learning model. You can use the Update Resource Activity to update the web service with the newly trained model.

See Use Azure Machine Learning Studio (classic) activities for details about these Studio (classic) activities.
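As an illustration, a Batch Execution activity maps the web service's input and outputs to Data Factory datasets. A hedged sketch; the linked service and dataset names are placeholders:

```json
{
    "name": "MLBatchScoringActivity",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": "MyAzureMLLinkedService",
    "typeProperties": {
        "webServiceInput": "ScoringInputBlob",
        "webServiceOutputs": {
            "output1": "ScoringResultBlob"
        }
    },
    "inputs": [ { "name": "ScoringInputBlob" } ],
    "outputs": [ { "name": "ScoringResultBlob" } ]
}
```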

## Stored procedure activity

You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics, or a SQL Server database in your enterprise or on an Azure VM. See the Stored Procedure Activity article for details.
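A stored procedure activity names the procedure and passes parameters; Data Factory system variables such as `SliceStart` can parameterize each run. A sketch with a hypothetical procedure `usp_transform` and output dataset name:

```json
{
    "name": "RunStoredProcedure",
    "type": "SqlServerStoredProcedure",
    "typeProperties": {
        "storedProcedureName": "usp_transform",
        "storedProcedureParameters": {
            "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
        }
    },
    "outputs": [ { "name": "SprocOutputDataset" } ]
}
```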

## Data Lake Analytics U-SQL activity

The Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See the Data Lake Analytics U-SQL Activity article for details.
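A U-SQL activity points at the script file and can set job options such as the degree of parallelism. A sketch; the script path, linked service names, and parameter values are placeholders:

```json
{
    "name": "RunUSQLScript",
    "type": "DataLakeAnalyticsU-SQL",
    "linkedServiceName": "MyDataLakeAnalyticsLinkedService",
    "typeProperties": {
        "scriptPath": "scripts/SearchLogProcessing.txt",
        "scriptLinkedService": "MyStorageLinkedService",
        "degreeOfParallelism": 3,
        "parameters": {
            "in": "/datalake/input/SearchLog.tsv",
            "out": "/datalake/output/Result.tsv"
        }
    }
}
```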

## .NET custom activity

If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. See the Use custom activities article for details.

You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script using Azure Data Factory.
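To sketch the shape of a custom activity definition: the activity names the assembly, the entry-point class, and the zipped package uploaded to blob storage. All names below are placeholders:

```json
{
    "name": "MyCustomActivity",
    "type": "DotNetActivity",
    "linkedServiceName": "AzureBatchLinkedService",
    "typeProperties": {
        "assemblyName": "MyDotNetActivity.dll",
        "entryPoint": "MyNamespace.MyDotNetActivity",
        "packageLinkedService": "MyStorageLinkedService",
        "packageFile": "customactivitycontainer/MyDotNetActivity.zip"
    },
    "inputs": [ { "name": "InputDataset" } ],
    "outputs": [ { "name": "OutputDataset" } ]
}
```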

## Compute environments

You create a linked service for the compute environment and then use the linked service when you define a transformation activity. Data Factory supports two types of compute environments:

1. **On-demand**: In this case, the computing environment is fully managed by Data Factory. It is automatically created by the Data Factory service before a job is submitted to process data, and removed when the job is completed. You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions. A minimal linked-service sketch follows this list.
2. **Bring your own**: In this case, you register your own computing environment (for example, an HDInsight cluster) as a linked service in Data Factory. You manage the computing environment, and the Data Factory service uses it to execute the activities.
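As a sketch of the on-demand case, an on-demand HDInsight linked service might be defined as follows; the version, cluster size, time-to-live, and storage linked service name are placeholder values:

```json
{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "version": "3.5",
            "clusterSize": 1,
            "timeToLive": "00:05:00",
            "osType": "Linux",
            "linkedServiceName": "MyStorageLinkedService"
        }
    }
}
```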

See the Compute Linked Services article to learn about the compute services supported by Data Factory.

## Summary

Azure Data Factory supports the following data transformation activities and their compute environments. The transformation activities can be added to pipelines either individually or chained with another activity.

| Data transformation activity | Compute environment |
|:--- |:--- |
| Hive | HDInsight [Hadoop] |
| Pig | HDInsight [Hadoop] |
| MapReduce | HDInsight [Hadoop] |
| Hadoop Streaming | HDInsight [Hadoop] |
| Azure Machine Learning Studio (classic) activities: Batch Execution and Update Resource | Azure VM |
| Stored Procedure | Azure SQL, Azure Synapse Analytics, or SQL Server |
| Data Lake Analytics U-SQL | Azure Data Lake Analytics |
| DotNet | HDInsight [Hadoop] or Azure Batch |