Azure Databricks
Azure Databricks is a fully-managed, Apache Spark based cloud-based platform, that can be used for Big Data processing and Machine Learning.
- Based on Apache Spark, a distributed cluster computing framework.
- Run and process a dataset on many computers simultaneously.
- Databricks provides all the computing power.
- Integrates with other Azure Storage services.
Top Features of Azure Databricks
- Leverage spark for Streaming, ML, Graph API, and SQL/DataFrames
- Multiple Languages: Scala, Python, Java, R, SQL
- Integration with Azure Active Directory (Azure AD)
- Integration with Azure Services e.g. Azure Data Factory, Azure Storage
- Fulfills multiple Industry Security Compliances, e.g. PCIDSS, FedRAMP etc.
Key Components of Azure Databricks
- Workspace - Apache Spark interactive workspace for exploration and visualization
- Cluster - Apache Spark Cluster that can be created in seconds and autoscale and share across users
- Notebook - Apache Spark Notebooks that can be used to read, write, query, explore, and visualize datasets
ETL in Azure Databricks
- Combine Azure Databricks with Azure Data Factory
- Benefits of Azure Data Factory
- Connect to 90+ other Data Sources
- Create Pipelines
- Schedule Pipelines as Jobs
- Visual Interface
- Azure Databricks helps to Transform, Clean, and Join Disparate Data
Machine Learning in Azure Databricks
- Databricks Runtime ML
- Integrates with commonly used Open-Source Libraries
- MLflow for end-to-end ML Lifecycle
- Combine with Azure Machine Learning and Azure DevOps
Understanding Streaming in Spark
- Structured Streaming
- Built on top of Spark SQL Engine
- Handles continuously streaming data
- Improvement from Apache Spark Streaming
- Leverages DataFrame API
- Can be queried with any SQL query
- Use Cases - Real-time scenarios like:
- Sensors,
- IoT,
- social networks etc.