Explore the power of data pipelines in Microsoft Fabric. Learn how these pipelines streamline data transfer and transformation, using activities, parameters, and runs for efficient data management. Discover a code-minimal approach to handling data with flexibility and ease.
Data pipelines are essential tools that automate the process of transferring and transforming data. They extract data from various sources and load it into analytical stores like lakehouses or data warehouses, often modifying the data along the way. If you have worked with Azure Data Factory, Microsoft Fabric's data pipelines will feel instantly recognizable: they use the same structure of linked activities to carry out diverse data processing tasks and control logic. These pipelines can be run interactively through Microsoft Fabric's interface or scheduled to run automatically.
Microsoft Fabric offers a powerful tool for data movement and processing tasks: pipelines. These pipelines allow you to define and orchestrate a series of activities, from data transfer to complex transformations, all with minimal coding required. Here’s a quick overview of the key concepts:
Activities: These are the core of your pipeline, representing the tasks executed. There are two main types: data transformation activities, which move or reshape data (for example, the Copy Data activity or a Notebook activity), and control flow activities, which implement logic such as loops and conditional branching.
Parameters: Pipelines can be customized with parameters, enhancing their flexibility and reusability. Parameters allow you to specify certain values each time the pipeline runs, like choosing a folder for data storage.
Pipeline Runs: Every execution of a pipeline results in a pipeline run. These runs can be scheduled or initiated on-demand, with each having a unique ID for tracking and reviewing purposes.
In essence, Microsoft Fabric's pipelines provide a streamlined, code-minimal approach to managing and transforming data, made adaptable through parameters and orchestrated through user-friendly activities.
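To make the parameter and run concepts concrete, here is a minimal sketch of starting a pipeline run on demand and passing a parameter value from Python. It assumes the Fabric REST API's job scheduler endpoint for running items on demand; the workspace ID, pipeline item ID, access token, and the folder_name parameter are placeholders, and the exact jobType value and payload shape should be verified against the current API documentation.

import requests

# Placeholders - substitute your own workspace ID, pipeline item ID, and Microsoft Entra access token.
workspace_id = "<workspace-id>"
pipeline_id = "<pipeline-item-id>"
token = "<access-token>"

# Assumed Fabric job scheduler endpoint for on-demand pipeline runs.
url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{pipeline_id}/jobs/instances?jobType=Pipeline"
)

# Pass a value for a hypothetical 'folder_name' pipeline parameter for this run.
body = {"executionData": {"parameters": {"folder_name": "new_data"}}}

response = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
response.raise_for_status()

# The run is accepted asynchronously; the Location header points at the run instance,
# which carries the unique run ID used for tracking and review.
print(response.status_code, response.headers.get("Location"))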
In this example, the pipeline's Copy Data activity ingests the source file from the following URL, using basic authentication without a username or password:

https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv

The notebook that processes the ingested data sets the name of the target table:

table_name = "sales"
The following code loads the data from the sales.csv file that was ingested by the Copy Data activity, applies some transformation logic, and saves the transformed data as a table, appending the data if the table already exists.
from pyspark.sql.functions import *
# Read the new sales data
df = spark.read.format("csv").option("header","true").load("Files/new_data/*.csv")
# Add month and year columns
df = df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate")))
# Derive FirstName and LastName columns
df = df.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1))
# Filter and reorder columns
df = df["SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month", "FirstName", "LastName", "EmailAddress", "Item", "Quantity", "UnitPrice", "TaxAmount"]
# Load the data into a table
df.write.format("delta").mode("append").saveAsTable(table_name)
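As a quick check once the pipeline run completes, you can read the table back in another notebook cell and summarize it. This is just a sketch that reuses the table_name value set earlier; display is the Fabric notebook helper for rendering a DataFrame.

# Read back the Delta table the notebook appended to and preview a few rows
sales_df = spark.read.table(table_name)
display(sales_df.limit(10))

# Summarize by the derived Year and Month columns to confirm the transformation
summary_df = sales_df.groupBy("Year", "Month").count().orderBy("Year", "Month")
display(summary_df)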