In this article, discover the key features and advantages of using Apache Spark in Microsoft Fabric. Explore its versatile language support, Spark settings, and rich library ecosystem. Learn how Microsoft Fabric seamlessly integrates Spark for interactive analysis and automated data processing, making it a powerful tool for data professionals.
Apache Spark has emerged as a leading open-source framework for large-scale data processing, especially in the realm of “big data”. Its adaptability across various platforms, including Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Microsoft Fabric, has contributed to its popularity.
In Microsoft Fabric, Spark is integrated directly into the platform to support data ingestion, processing, and analysis in a lakehouse environment. The core techniques and code used in Spark apply across all of its implementations; what sets Microsoft Fabric apart is how tightly Spark is integrated with the platform's other data services.
You can explore Spark architecture in more depth in this blog.
Language Support: Spark supports several languages, including Java, Scala, SparkR, Spark SQL, and PySpark. The latter two are the most commonly used for data engineering and analytics tasks.
Spark Settings in Microsoft Fabric: Each workspace in Microsoft Fabric is assigned a Spark cluster, with settings manageable under the Data Engineering/Science section of the workspace settings. Key settings include the node family (the type of virtual machines used for the cluster nodes), the runtime version (the version of Spark and its subcomponents to run), and Spark properties (name-value pairs that fine-tune Spark behavior).
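Many Spark properties can also be overridden for the current session from within a notebook. A minimal sketch, using a standard Spark property with an illustrative value:

# Override a Spark property for the current session only;
# workspace-level defaults are left unchanged
spark.conf.set("spark.sql.shuffle.partitions", "50")

# Read the setting back to confirm the override
print(spark.conf.get("spark.sql.shuffle.partitions"))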
Libraries: Spark benefits from a rich ecosystem of open-source libraries, especially in Python, ensuring support for a wide array of tasks. Microsoft Fabric’s Spark clusters come pre-loaded with common libraries, and additional libraries can be managed on the library management page.
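A library can also typically be installed for the current notebook session with inline pip. A minimal sketch (the package shown is just an example):

# Install a package into the current notebook session only
%pip install altair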
Microsoft Fabric offers two primary ways to work with Spark code: through notebooks and Spark job definitions.
Notebooks for Interactive Analysis: Notebooks combine code, markdown, and live output in a single interactive document, making them well suited to exploratory analysis, experimentation, and sharing results.
Spark Job Definitions for Automated Processes: Spark job definitions run a script on demand or on a schedule, making them well suited to unattended, repeatable data processing; a minimal script sketch follows below.
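The following is a minimal sketch of the kind of PySpark script a Spark job definition might run; the input path, column name, and output table are hypothetical:

from pyspark.sql import SparkSession

# Create (or reuse) the Spark session; in Fabric this is typically pre-configured
spark = SparkSession.builder.getOrCreate()

# Hypothetical input file in the attached lakehouse
df = spark.read.load("Files/data/orders.csv", format="csv", header=True)

# Example transformation: keep completed orders only (hypothetical column)
completed_df = df.where(df["Status"] == "Completed")

# Persist the result as a managed Delta table
completed_df.write.format("delta").mode("overwrite").saveAsTable("completed_orders")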
In summary, Microsoft Fabric provides flexible options for both interactive data analysis and automated data processing with Spark, catering to different needs and workflows.
Working with data in Apache Spark often involves using dataframes, an essential part of the Spark SQL library. Unlike the low-level resilient distributed dataset (RDD), dataframes offer a more intuitive and structured approach to data manipulation, especially for those familiar with the popular Pandas library in Python.
%%pyspark
# Load a CSV file into a dataframe, inferring column names from the header row
df = spark.read.load('Files/data/products.csv',
    format='csv',
    header=True
)
display(df.limit(10))
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Define an explicit schema for the product data
productSchema = StructType([
    StructField("ProductID", IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Category", StringType()),
    StructField("ListPrice", FloatType())
])

# Load a CSV file that has no header row, applying the explicit schema
df = spark.read.load('Files/data/products-data.csv',
    format='csv',
    schema=productSchema,
    header=False)
display(df.limit(10))
# Select a subset of columns
pricelist_df = df.select("ProductID", "ListPrice")

# Select and filter: keep only mountain and road bikes
bikes_df = df.select("ProductName", "Category", "ListPrice").where((df["Category"]=="Mountain Bikes") | (df["Category"]=="Road Bikes"))
display(bikes_df)
# Group and count products by category
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
display(counts_df)
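Beyond count, the aggregate functions imported from pyspark.sql.functions enable richer summaries. A minimal sketch (the alias is just an illustration):

# Calculate the average list price per category, using the avg function imported earlier
avg_prices_df = df.groupBy("Category").agg(avg("ListPrice").alias("AveragePrice"))
display(avg_prices_df)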
# Save the filtered dataframe in Parquet format, overwriting any existing data
bikes_df.write.mode("overwrite").parquet('Files/product_data/bikes.parquet')
# Save the data partitioned by Category, creating one subfolder per category value
bikes_df.write.partitionBy("Category").mode("overwrite").parquet("Files/bike_data")
# Read back a single partition; the Category column is omitted from the result
# because its value is encoded in the folder path
road_bikes_df = spark.read.parquet('Files/bike_data/Category=Road Bikes')
# Register the dataframe as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("products_view")
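Once registered, the view can be queried with Spark SQL in the same session. A minimal sketch:

# Query the temporary view; here, the five most expensive products
top_products_df = spark.sql("SELECT ProductName, ListPrice FROM products_view ORDER BY ListPrice DESC LIMIT 5")
display(top_products_df)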
# Persist the dataframe as a managed Delta table in the lakehouse
df.write.format("delta").saveAsTable("products")
# Query the saved table with Spark SQL
bikes_df = spark.sql("SELECT ProductID, ProductName, ListPrice FROM products WHERE Category IN ('Mountain Bikes', 'Road Bikes')")
display(bikes_df)
Within notebooks, the %%sql magic command can be used to directly run SQL code. This is particularly useful for querying and displaying data from the catalog in a straightforward and readable format.
%%sql
SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category
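For custom charts, Spark query results can be converted to a pandas dataframe with toPandas before plotting. A minimal sketch of that bridge, reusing the query from above (the variable name is just an illustration):

# Run the aggregation in Spark, then convert the results to pandas for plotting
chart_data = spark.sql("SELECT Category, COUNT(ProductID) AS ProductCount FROM products GROUP BY Category ORDER BY Category").toPandas()

The example below instead generates random data for demonstration, so the conversion step above is optional: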
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# Generate random data for demonstration
categories = ['Category A', 'Category B', 'Category C', 'Category D', 'Category E']
product_counts = np.random.randint(1, 100, size=len(categories))
# Create a DataFrame from the generated data
data = pd.DataFrame({'Category': categories, 'ProductCount': product_counts})
# Create a figure for the chart
fig = plt.figure(figsize=(12, 8))
# Create a bar plot of product counts by category
plt.bar(x=data['Category'], height=data['ProductCount'], color='orange')
# Customize the chart
plt.title('Product Counts by Category')
plt.xlabel('Category')
plt.ylabel('Products')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=70)
# Show the plot
plt.show()