
Azure Data Factory

The cloud is our future, data is our fortune, and Azure Data Factory is a great tool for getting users' data to the cloud more quickly in real-world scenarios.


Introduction 

The availability and accessibility of large volumes of data are among the biggest concerns in almost every industry. But how does this change when a business is transitioning to the cloud?

Fortunately, Microsoft Azure offers a service called Azure Data Factory, which lets customers create workflows that ingest data from both on-premises and cloud data stores and transform it using compute services such as Hadoop. The outputs can then be published to an on-premises or cloud data store for business intelligence (BI) applications to consume.

What is Azure Data Factory?

Azure Data Factory (ADF) is a cloud-based data integration service. It allows users to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

ADF itself does not store any data. It lets users create data-driven workflows that orchestrate the movement of data between supported data stores and the processing of that data by compute services in other regions or in an on-premises environment. It also lets users monitor and manage workflows through both programmatic and UI mechanisms.
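For illustration only, the sketch below shows how a data factory might be created programmatically with the azure-mgmt-datafactory Python SDK; the subscription ID, resource group, and factory name are placeholder assumptions, not values from this article.

```python
# A minimal sketch, assuming the azure-mgmt-datafactory and azure-identity
# packages are installed. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "adf-demo-rg"               # placeholder
factory_name = "adf-demo-factory"            # placeholder

# The management client talks to Azure Resource Manager; the factory itself
# stores no data, only the pipeline and dataset metadata.
client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

factory = client.factories.create_or_update(
    resource_group,
    factory_name,
    Factory(location="eastus"),
)
print(factory.provisioning_state)
```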

How Does Data Factory Work?

The Data Factory service lets users create data pipelines that move and transform data and then run those pipelines on a specified schedule (hourly, daily, weekly, and so on). The data consumed and produced by these workflows is time-sliced, and a pipeline can run in scheduled mode (for example, once a day) or in one-time mode.
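As a rough, self-contained illustration of the time-sliced model (conceptual Python, not ADF's own code), each scheduled run is handed one window of data bounded by a start and end time:

```python
# Conceptual sketch of time slicing: each run of a pipeline processes one
# slice of data, bounded by a window start and a window end.
from datetime import datetime, timedelta

def time_slices(start: datetime, end: datetime, frequency: timedelta):
    """Yield (slice_start, slice_end) windows covering [start, end)."""
    current = start
    while current < end:
        yield current, min(current + frequency, end)
        current += frequency

# A daily schedule produces one slice per day; an hourly one produces 24.
for window_start, window_end in time_slices(
    datetime(2017, 1, 1), datetime(2017, 1, 4), timedelta(days=1)
):
    print(f"run pipeline for slice {window_start:%Y-%m-%d} .. {window_end:%Y-%m-%d}")
```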

The pipelines (data-driven workflows) in ADF typically perform the following three steps:

Connect and Collect: Connect to all the required data sources and processing services, such as SaaS services, file shares, FTP, and web services. Then use the Copy activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis (a code sketch of this copy step follows the list).

Transform and Enrich: Once the data is in the centralized cloud store, it is transformed using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.

Publish: Deliver the transformed data from the cloud to on-premises targets such as SQL Server, or to BI/analytics tools and other applications, for consumption.
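The sketch below illustrates the copy step with the azure-mgmt-datafactory Python SDK; the storage account, resource group, factory, dataset, and pipeline names are all hypothetical, and the exact model constructors can vary between SDK versions.

```python
# A minimal sketch of the "Connect and Collect" step. All names are
# placeholders; model names and signatures may differ across SDK versions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource, SecureString,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, df = "adf-demo-rg", "adf-demo-factory"   # placeholders

# 1. Linked service: tells ADF how to reach the storage account.
ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))
client.linked_services.create_or_update(rg, df, "DemoStorage", ls)

# 2. Datasets: named views of the source and destination blob folders.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="DemoStorage")
src = DatasetResource(properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path="raw/input"))
dst = DatasetResource(properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path="staging/output"))
client.datasets.create_or_update(rg, df, "RawInput", src)
client.datasets.create_or_update(rg, df, "StagedOutput", dst)

# 3. Pipeline with a single Copy activity that relocates the data.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawInput")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagedOutput")],
    source=BlobSource(),
    sink=BlobSink(),
)
client.pipelines.create_or_update(rg, df, "IngestPipeline", PipelineResource(activities=[copy]))
run = client.pipelines.create_run(rg, df, "IngestPipeline")
print(run.run_id)
```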

How the 4 Components Work Together

The diagram below shows the relationships among the Dataset, Activity, Pipeline, and Linked Service components:
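As a textual complement to the diagram, the following conceptual sketch (plain Python, not the ADF SDK) captures the same relationships: a linked service holds connection information, a dataset names data reachable through a linked service, an activity consumes and produces datasets, and a pipeline groups activities into a single unit of work. The names used are illustrative only.

```python
# Conceptual sketch of the four ADF components and how they relate.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LinkedService:          # connection info for a data store or compute service
    name: str
    connection_string: str

@dataclass
class Dataset:                # a named view of data inside a linked service
    name: str
    linked_service: LinkedService
    path: str

@dataclass
class Activity:               # one processing step, e.g. a Copy or Hive activity
    name: str
    inputs: List[Dataset]
    outputs: List[Dataset]

@dataclass
class Pipeline:               # a logical grouping of activities
    name: str
    activities: List[Activity] = field(default_factory=list)

# Example wiring: one store, two datasets, one copy activity, one pipeline.
store = LinkedService("DemoStorage", "DefaultEndpointsProtocol=https;...")
raw = Dataset("RawInput", store, "raw/input")
staged = Dataset("StagedOutput", store, "staging/output")
ingest = Pipeline("IngestPipeline", [Activity("Copy", [raw], [staged])])
```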

Globally or Regionally?

Currently, users can create data factories in the West US, East US, and North Europe regions. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or to process data with those compute services.



