Azure

Azure Databricks

Azure Databricks is a fully-managed, Apache Spark based cloud-based platform, that can be used for Big Data processing and Machine Learning.

Based on Apache Spark, a distributed cluster computing framework.
Run and process a dataset on many computers simultaneously.
Databricks provides all the computing power.
Integrates with other Azure Storage services.

Top Features of Azure Databricks

Leverage spark for Streaming, ML, Graph API, and SQL/DataFrames
Multiple Languages: Scala, Python, Java, R, SQL
Integration with Azure Active Directory (Azure AD)
Integration with Azure Services e.g. Azure Data Factory, Azure Storage
Fulfills multiple Industry Security Compliances, e.g. PCIDSS, FedRAMP etc.

Key Components of Azure Databricks

Workspace - Apache Spark interactive workspace for exploration and visualization
Cluster - Apache Spark Cluster that can be created in seconds and autoscale and share across users
Notebook - Apache Spark Notebooks that can be used to read, write, query, explore, and visualize datasets

ETL in Azure Databricks

Combine Azure Databricks with Azure Data Factory
Benefits of Azure Data Factory
- Connect to 90+ other Data Sources
- Create Pipelines
- Schedule Pipelines as Jobs
- Visual Interface
Azure Databricks helps to Transform, Clean, and Join Disparate Data

Machine Learning in Azure Databricks

Databricks Runtime ML
Integrates with commonly used Open-Source Libraries
MLflow for end-to-end ML Lifecycle
Combine with Azure Machine Learning and Azure DevOps

Understanding Streaming in Spark

Structured Streaming
- Built on top of Spark SQL Engine
- Handles continuously streaming data
- Improvement from Apache Spark Streaming
Leverages DataFrame API
Can be queried with any SQL query
Use Cases - Real-time scenarios like:
- Sensors,
- IoT,
- social networks etc.

References