The solution described in this article combines a range of Azure services that will ingest, store, process, enrich, and serve data and insights from different sources (structured, semi-structured, unstructured, and streaming).
Architecture
Download a Visio file of this architecture.
Note
- The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features that are not covered by this design.
- Specific business requirements for your analytics use case could require the use of different services or features that are not considered in this design.
Dataflow
The analytics use cases covered by the architecture are illustrated by the different data sources on the left-hand side of the diagram. Data flows through the solution from the bottom up as follows:
Note
In the following sections, Azure Data Lake is used as the home for data throughout the various stages of the data lifecycle. Azure Data Lake is organized into different layers and containers as follows:
- The Raw layer is the landing area for data coming in from source systems. As the name implies, data in this layer is in raw, unfiltered, and unpurified form.
- In the next stage of the lifecycle, data moves to the Enriched layer where data is cleaned, filtered, and possibly transformed.
- Data then moves to the Curated layer, which is where consumer-ready data is maintained.
For a full review of Azure Data Lake layers and containers and their uses, see the Data lake zones and containers documentation.
Azure data services, cloud-native HTAP with Azure Cosmos DB and Dataverse
Process
Azure Synapse Link for Azure Cosmos DB and Azure Synapse Link for Dataverse enable you to run near real-time analytics over operational and business application data by using the analytics engines that are available from your Azure Synapse workspace: serverless SQL pools and Apache Spark pools.
When you use Azure Synapse Link for Azure Cosmos DB, use either a serverless SQL query or a Spark pool notebook to access the Azure Cosmos DB analytical store and then combine datasets from your near real-time operational data with data from your data lake or from your data warehouse.
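As an illustration, the following sketch shows how a Spark pool notebook might query the analytical store and join it with curated data from the lake. The linked service name, container name, storage account, and join column are hypothetical; the `spark` session is the one that's built into Synapse notebooks.

```python
# Minimal sketch, assuming a linked service named "CosmosDbLinkedService"
# and an "Orders" container. The "cosmos.olap" format reads the Cosmos DB
# analytical store, so the query doesn't consume transactional request units.
df_orders = (spark.read
    .format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")
    .option("spark.cosmos.container", "Orders")
    .load())

# Combine near real-time operational data with curated data from the lake.
df_customers = spark.read.parquet(
    "abfss://curated@datalakeaccount.dfs.core.windows.net/customers/")
df_insights = df_orders.join(df_customers, on="customerId", how="inner")
```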
When you use Azure Synapse Link for Dataverse, use either a serverless SQL query or a Spark pool notebook to access the selected Dataverse tables and then combine datasets from your near real-time business applications data with data from your data lake or from your data warehouse.
Store
- The resulting datasets from your serverless SQL queries can be persisted in your data lake. If you're using Spark notebooks, the resulting datasets can be persisted either in your data lake or in your data warehouse (SQL pool).
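Continuing the hypothetical Spark example above, persisting the combined dataset back to the Curated layer is a single write:

```python
# Persist the joined dataset to the Curated layer (path is hypothetical).
(df_insights.write
    .mode("overwrite")
    .parquet("abfss://curated@datalakeaccount.dfs.core.windows.net/order-insights/"))
```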
Serve
Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization and exploration. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Data can also be securely shared with other business units or external trusted partners by using Azure Data Share. Data consumers can choose the data format that they want to use and the compute engine that's best suited to process the shared datasets.
Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions that use AI to uncover valuable business insights across different document types and formats, including Office documents, PDFs, images, audio, forms, and web pages.
Relational databases
Ingest
- Use Azure Synapse pipelines to pull data from a wide variety of databases, both on-premises and in the cloud. Pipelines can be triggered based on a predefined schedule, in response to an event, or explicitly called through REST APIs.
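For example, the following sketch triggers a pipeline run through the Synapse REST API. The workspace and pipeline names are hypothetical, and the caller is assumed to hold a Synapse RBAC role that allows running pipelines.

```python
# Sketch: start a Synapse pipeline run via REST.
# Requires the requests and azure-identity packages.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://dev.azuresynapse.net/.default").token

workspace = "contoso-synapse"   # hypothetical workspace name
pipeline = "IngestSalesDb"      # hypothetical pipeline name
url = (f"https://{workspace}.dev.azuresynapse.net"
       f"/pipelines/{pipeline}/createRun?api-version=2020-12-01")

response = requests.post(url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print("Pipeline run ID:", response.json()["runId"])
```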
Store
Within the Raw data lake layer, organize your data lake by following the best practices around which layers to create, what folder structures to use in each layer, and what file formats to use for each analytics scenario.
From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the relational databases into the Raw layer of your Azure Data Lake Storage Gen2 data lake. You can save the data in delimited text format or compressed as Parquet files.
Process
Use data flows, serverless SQL queries, or Spark notebooks to validate, transform, and move the datasets from the Raw layer, through the Enriched layer, and into the Curated layer of your data lake (see the sketch after this list).
- As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL or Spark notebooks. These models can enrich your datasets and generate further business insights, and they can be consumed from Azure Cognitive Services or as custom ML models from Azure Machine Learning.
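The following sketch shows what a layer-to-layer transformation might look like in a Spark notebook. The storage account, container, and column names are hypothetical.

```python
# Sketch: move a dataset from Raw, through Enriched, to Curated.
from pyspark.sql import functions as F

raw = spark.read.parquet(
    "abfss://raw@datalakeaccount.dfs.core.windows.net/sales/")

# Enriched layer: clean, filter, and standardize.
enriched = (raw
    .dropDuplicates(["orderId"])
    .filter(F.col("amount") > 0)
    .withColumn("orderDate", F.to_date("orderDate")))
enriched.write.mode("overwrite").parquet(
    "abfss://enriched@datalakeaccount.dfs.core.windows.net/sales/")

# Curated layer: consumer-ready aggregate.
curated = (enriched
    .groupBy("orderDate", "region")
    .agg(F.sum("amount").alias("totalSales")))
curated.write.mode("overwrite").parquet(
    "abfss://curated@datalakeaccount.dfs.core.windows.net/daily-sales/")
```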
Serve
You can serve your final dataset directly from the data lake Curated layer, or you can use a Copy data activity to ingest the final dataset into your SQL pool tables by using the COPY command for fast ingestion.
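A minimal sketch of the COPY command follows, run here through pyodbc against a dedicated SQL pool. The server, database, table, and lake path are hypothetical, and the pool is assumed to have permission to read the storage account.

```python
# Sketch: fast ingestion into a dedicated SQL pool with the T-SQL COPY command.
# Requires pyodbc and ODBC Driver 18 for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=contoso-synapse.sql.azuresynapse.net;"
    "Database=SalesDW;Authentication=ActiveDirectoryInteractive;",
    autocommit=True)

conn.execute("""
COPY INTO dbo.DailySales
FROM 'https://datalakeaccount.dfs.core.windows.net/curated/daily-sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET')
""")
```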
Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Data can also be securely shared with other business units or external trusted partners by using Azure Data Share. Data consumers can choose the data format that they want to use and the compute engine that's best suited to process the shared datasets.
Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions that use AI to uncover valuable business insights across different document types and formats, including Office documents, PDFs, images, audio, forms, and web pages.
Semi-structured data sources
Ingest
Use Azure Synapse pipelines to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud. For example:
- Ingest data from file-based sources containing CSV or JSON files.
- Connect to NoSQL databases such as Azure Cosmos DB or MongoDB.
- Call REST APIs provided by SaaS applications that act as the data source for your pipeline.
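As an illustration of the REST API case, the following sketch pulls JSON from a hypothetical SaaS endpoint and lands it, unchanged, in the Raw layer. The API URL, storage account, and file path are all assumptions.

```python
# Sketch: land a SaaS API response in the Raw layer of the data lake.
# Requires the requests, azure-identity, and azure-storage-file-datalake packages.
import json
import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

payload = requests.get("https://api.example.com/v1/tickets").json()

service = DataLakeServiceClient(
    account_url="https://datalakeaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential())
file_client = (service
    .get_file_system_client("raw")
    .get_file_client("tickets/2024/06/01/tickets.json"))
file_client.upload_data(json.dumps(payload), overwrite=True)
```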
Store
Within the Raw data lake layer, organize your data lake by following the best practices around which layers to create, what folder structures to use in each layer, and what file formats to use for each analytics scenario.
From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the semi-structured data sources into the Raw layer of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.
Process
For batch/micro-batch pipelines, use data flows, serverless SQL queries, or Spark notebooks to validate, transform, and move your datasets into the Curated layer of your data lake. Serverless SQL queries expose underlying CSV, Parquet, or JSON files as external tables so that they can be queried by using T-SQL (see the sketch after this list).
- As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL or Spark notebooks. These models can enrich your datasets and generate further business insights, and they can be consumed from Azure Cognitive Services or as custom ML models from Azure Machine Learning.
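The following sketch runs an ad hoc OPENROWSET query against the serverless SQL endpoint from Python. The workspace name, storage account, and path are hypothetical.

```python
# Sketch: query Parquet files in the lake through the serverless SQL endpoint.
# Requires pyodbc and ODBC Driver 18 for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=contoso-synapse-ondemand.sql.azuresynapse.net;"
    "Database=master;Authentication=ActiveDirectoryInteractive;")

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://datalakeaccount.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""
for row in conn.execute(query):
    print(row)
```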
For near real-time telemetry and time-series analytics scenarios, use Data Explorer pools to easily ingest, consolidate, and correlate logs and IoT event data across multiple data sources. With Data Explorer pools, you can use Kusto Query Language (KQL) queries to perform time-series analysis, geospatial clustering, and machine-learning enrichment.
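For example, a KQL time-series query can be run against a Data Explorer pool from Python, as in the following sketch. The pool, workspace, database, table, and column names are hypothetical.

```python
# Sketch: time-series anomaly detection over IoT telemetry with KQL.
# Requires the azure-kusto-data package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://iotpool.contoso-synapse.kusto.azuresynapse.net")
client = KustoClient(kcsb)

query = """
DeviceTelemetry
| where Timestamp > ago(1h)
| make-series AvgTemp = avg(Temperature) default=0
    on Timestamp step 1m by DeviceId
| extend Anomalies = series_decompose_anomalies(AvgTemp)
"""
for row in client.execute("TelemetryDb", query).primary_results[0]:
    print(row)
```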
Serve
You can serve your final dataset directly from the data lake Curated layer, or you can use a Copy data activity to ingest the final dataset into your SQL pool tables by using the COPY command for fast ingestion.
Load relevant data from the Azure Synapse SQL pools, Data Explorer pools, or a data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Data can also be securely shared with other business units or external trusted partners by using Azure Data Share. Data consumers can choose the data format that they want to use and the compute engine that's best suited to process the shared datasets.
Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions that use AI to uncover valuable business insights across different document types and formats, including Office documents, PDFs, images, audio, forms, and web pages.
Non-structured data sources
Ingest
Use Azure Synapse pipelines to pull data from a wide variety of non-structured data sources, both on-premises and in the cloud. For example:
- Ingest video, image, audio, or free text from file-based sources that contain the source files.
- Call REST APIs provided by SaaS applications that act as the data source for your pipeline.
Store
Within the Raw data lake layer, organize your data lake by following the best practices around which layers to create, what folder structures to use in each layer, and what file formats to use for each analytics scenario.
From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the non-structured data sources into the Raw layer of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.
Process
Use Spark notebooks to validate, transform, enrich, and move your datasets from the Raw layer, through the Enriched layer, and into the Curated layer of your data lake.
- As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL or Spark notebooks. These models can enrich your datasets and generate further business insights, and they can be consumed from Azure Cognitive Services or as custom ML models from Azure Machine Learning (see the sketch after this list).
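For example, free text extracted from documents could be enriched with sentiment scores from Azure Cognitive Services. The endpoint and key below are placeholders.

```python
# Sketch: enrich free text with sentiment from Azure Cognitive Services.
# Requires the azure-ai-textanalytics package.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://contoso-language.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<cognitive-services-key>"))

documents = ["The new dashboard is fantastic.", "Checkout keeps timing out."]
for doc, result in zip(documents, client.analyze_sentiment(documents)):
    print(doc, "->", result.sentiment, result.confidence_scores)
```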
Serve
You can serve your final dataset directly from the data lake Curated layer, or you can use a Copy data activity to ingest the final dataset into your data warehouse tables by using the COPY command for fast ingestion.
Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.
Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
Data can also be securely shared with other business units or external trusted partners by using Azure Data Share. Data consumers can choose the data format that they want to use and the compute engine that's best suited to process the shared datasets.
Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions that use AI to uncover valuable business insights across different document types and formats, including Office documents, PDFs, images, audio, forms, and web pages.
Streaming
Ingest
- Use Azure Event Hubs or Azure IoT Hub to ingest data streams generated by client applications or IoT devices. Event Hubs or IoT Hub then ingests and stores the streaming data while preserving the sequence of events received. Consumers can connect to the Event Hubs or IoT Hub endpoints and retrieve messages for processing.
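For reference, this is roughly what a client application publishing to Event Hubs looks like. The connection string and hub name are placeholders.

```python
# Sketch: publish a telemetry event to Event Hubs.
# Requires the azure-eventhub package.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-connection-string>", eventhub_name="device-telemetry")

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(
        {"deviceId": "sensor-01", "temperature": 21.7})))
    producer.send_batch(batch)
```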
Store
Within the Raw data lake layer, organize your data lake by following the best practices around which layers to create, what folder structures to use in each layer, and what file formats to use for each analytics scenario.
Configure Event Hubs Capture or IoT Hub storage endpoints to save a copy of the events into the Raw layer of your Azure Data Lake Storage Gen2 data lake. This feature implements the "cold path" of the Lambda architecture pattern and allows you to perform historical and trend analysis on the stream data saved in your data lake by using serverless SQL queries or Spark notebooks, following the pattern for semi-structured data sources described earlier.
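Event Hubs Capture writes Avro files whose `Body` column holds the raw event payload. A Spark notebook on the cold path might unpack them as in the following sketch; the container, path, and event schema are hypothetical.

```python
# Sketch: read Event Hubs Capture files and unpack the event payload.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

event_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType())])

captured = (spark.read.format("avro")
    .option("recursiveFileLookup", "true")
    .load("abfss://raw@datalakeaccount.dfs.core.windows.net/eventhub-capture/"))

events = (captured
    .select(from_json(col("Body").cast("string"), event_schema).alias("e"))
    .select("e.*"))
events.createOrReplaceTempView("telemetry")
```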
Process
For real-time insights, use a Stream Analytics job to implement the "hot path" of the Lambda architecture pattern and derive insights from the stream data in transit. Define at least one input for the data stream coming from your Event Hubs or IoT Hub, one query to process the input data stream, and one Power BI output where the query results will be sent (see the sketch after this list).
- As part of your data processing with Stream Analytics, you can invoke machine-learning models to enrich your stream datasets and drive business decisions based on the predictions generated. These models can be consumed from Azure Cognitive Services or as custom ML models in Azure Machine Learning.
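A hot-path job query might look like the following sketch, shown here as a Python string for reference only; in practice, you define it in the Stream Analytics query editor. The input and output aliases and field names are hypothetical.

```python
# Sketch: a Stream Analytics query that averages device temperatures over
# 30-second tumbling windows and routes the results to a Power BI output.
hot_path_query = """
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO [powerbi-output]
FROM [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY DeviceId, TumblingWindow(second, 30)
"""
```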
Use other Stream Analytics job outputs to send processed events to Azure Synapse SQL pools or Data Explorer pools for further analytics use cases.
For near real-time telemetry and time-series analytics scenarios, use Data Explorer pools to easily ingest IoT events directly from Event Hubs or IoT Hub. With Data Explorer pools, you can use Kusto Query Language (KQL) queries to perform time-series analysis, geospatial clustering, and machine-learning enrichment.
Serve
Business analysts then use Power BI real-time datasets and dashboard capabilities to visualize the fast-changing insights generated by your Stream Analytics query.
Data can also be securely shared with other business units or external trusted partners by using Azure Data Share. Data consumers can choose the data format that they want to use and the compute engine that's best suited to process the shared datasets.
Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions that use AI to uncover valuable business insights across different document types and formats, including Office documents, PDFs, images, audio, forms, and web pages.
Components
The following Azure services are used in the architecture:
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Cosmos DB
- Azure Cognitive Services
- Azure Machine Learning
- Azure Event Hubs
- Azure IoT Hub
- Azure Stream Analytics
- Microsoft Purview
- Azure Data Share
- Microsoft Power BI
- Microsoft Entra ID
- Microsoft Cost Management
- Azure Key Vault
- Azure Monitor
- Microsoft Defender for Cloud
- Azure DevOps
- Azure Policy
- GitHub
Alternatives
In the architecture above, Azure Synapse pipelines are responsible for data pipeline orchestration. Azure Data Factory pipelines provide the same capabilities as those described in this article.
Azure Databricks can also be used as the compute engine used to process structured and unstructured data directly on the data lake.
In the architecture above, Azure Stream Analytics is the service responsible for processing streaming data. Azure Synapse Spark pools and Azure Databricks can also be used to perform the same role through the execution of notebooks.
Azure HDInsight Kafka clusters can also be used to ingest streaming data and provide the right level of performance and scalability required by large streaming workloads.
You can also use Azure Functions to invoke Azure Cognitive Services or custom Azure Machine Learning models from an Azure Synapse pipeline.
For comparisons of other alternatives, see:
Scenario details
This example scenario demonstrates how to use Azure Synapse Analytics with the extensive family of Azure Data Services to build a modern data platform that's capable of handling the most common data challenges in an organization.
Potential use cases
This approach can also be used to:
- Establish a data product architecture, which consists of a data warehouse for structured data and a data lake for semi-structured and unstructured data. You can choose to deploy a single data product for centralized environments or multiple data products for distributed environments such as Data Mesh. See more information about Data Management and Data Landing Zones.
- Integrate relational data sources with other unstructured datasets, with the use of big data processing technologies.
- Use semantic modeling and powerful visualization tools for simpler data analysis.
- Share datasets within the organization or with trusted external partners.
- Implement knowledge mining solutions to extract valuable business information hidden in images, PDFs, documents, and so on.
Recommendations
Discover and govern
Data governance is a common challenge in large enterprise environments. On one hand, business analysts need to be able to discover and understand data assets that can help them solve business problems. On the other hand, Chief Data Officers want insights into the privacy and security of business data.
Microsoft Purview
Use Microsoft Purview for data discovery and for insights into your data assets, data classification, and data sensitivity across the entire organizational data landscape.
Microsoft Purview can help you maintain a business glossary with the specific business terminology that users need to understand the semantics of datasets and how they're meant to be used across the organization.
You can register all your data sources and organize them into collections, which also serve as a security boundary for your metadata.
Set up regular scans to automatically catalog and update relevant metadata about data assets in the organization. Microsoft Purview can also automatically add data lineage information based on information from Azure Data Factory or Azure Synapse pipelines.
Data classification and data sensitivity labels can be added automatically to your data assets, based on preconfigured or custom rules that are applied during the regular scans.
Data governance professionals can use the reports and insights generated by Microsoft Purview to keep control over the entire data landscape and protect the organization against any security and privacy issues.
Platform services
To improve the quality of your Azure solutions, follow the recommendations and guidelines defined in the five pillars of architecture excellence of the Azure Well-Architected Framework: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.
Following these recommendations, the services below should be considered as part of the design:
- Microsoft Entra ID: identity services, single sign-on, and multifactor authentication across Azure workloads.
- Microsoft Cost Management: financial governance over your Azure workloads.
- Azure Key Vault: secure credential and certificate management. For example, Azure Synapse pipelines, Azure Synapse Spark pools, and Azure ML can retrieve from Azure Key Vault the credentials and certificates that are used to securely access data stores.
- Azure Monitor: collect, analyze, and act on telemetry information of your Azure resources to proactively identify problems and maximize performance and reliability.
- Microsoft Defender for Cloud: strengthen and monitor the security posture of your Azure workloads.
- Azure DevOps & GitHub: implement DevOps practices to enforce automation and compliance to your workload development and deployment pipelines for Azure Synapse and Azure ML.
- Azure Policy: implement organizational standards and governance for resource consistency, regulatory compliance, security, cost, and management.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
The technologies in this architecture were chosen because each of them provides the necessary functionality to handle the most common data challenges in an organization. These services meet the requirements for scalability and availability while helping you control costs. The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features not covered by this design.
Specific business requirements for your analytics use cases might also require the use of different services or features that aren't considered in this design.
A similar architecture can also be implemented for pre-production environments, where you can develop and test your workloads. Consider the specific requirements for your workloads and the capabilities of each service for a cost-effective pre-production environment.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
In general, use the Azure pricing calculator to estimate costs. The ideal individual pricing tier and the total overall cost of each service included in the architecture depend on the amount of data to be processed and stored and on the acceptable performance level expected. Use the following guide to learn more about how each service is priced:
- Azure Synapse Analytics serverless architecture allows you to scale your compute and storage levels independently. Compute resources are charged based on usage, and you can scale or pause these resources on demand. Storage resources are billed per terabyte, so your costs increase as you ingest more data.
- Azure Data Lake Storage Gen2 is charged based on the amount of data stored and on the number of transactions to read and write data.
- Azure Event Hubs and Azure IoT Hub are charged based on the amount of compute resources required to process your message streams.
- Azure Machine Learning charges come from the amount of compute resources used to train and deploy your machine-learning models.
- Cognitive Services is charged based on the number of calls that you make to the service APIs.
- Microsoft Purview is priced based on the number of data assets in the catalog and the amount of compute power required to scan them.
- Azure Stream Analytics is charged based on the amount of compute power required to process your stream queries.
- Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included in the pricing sample above.
- Azure Cosmos DB is priced based on the amount of storage and compute resources required by your databases.
Deploy this scenario
This deployment accelerator gives you the option to implement the entire reference architecture or choose what workloads you need for your analytics use case. You also have the option to select whether services are accessible via public endpoints or if they are to be accessed only via private endpoints.
Deploy the reference architecture by using the Azure portal.
For detailed information and additional deployment options, see the deployment accelerator GitHub repo with documentation and code used to define this solution.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Fabio Braga | Principal MTC Technical Architect
Next steps
Review the guidelines defined in the Azure data management and analytics scenario for scalable analytics environment in Azure.
Explore the Data Engineer learning paths at Microsoft Learn for further training content and labs on the services involved in this reference architecture.
Review the documentation and deploy the reference architecture using the deployment accelerator available from GitHub.