This article outlines a manageable solution for making large volumes of geospatial data available for analytics.
Architecture
Download a Visio file of this architecture.
The diagram contains several gray boxes, each with a different label. From left to right, the labels are Ingest, Prepare, Load, Serve, and Visualize and explore. A final box underneath the others is labeled Monitor and secure. Each box contains icons that represent Azure services. Numbered arrows connect the boxes according to the steps in the workflow.
Workflow
IoT data enters the system:
- Azure Event Hubs ingests streams of IoT data. The data contains coordinates or other information that identifies locations of devices.
- Event Hubs uses Azure Databricks for initial stream processing.
- Event Hubs stores the data in Azure Data Lake Storage.
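As a rough illustration of this ingestion path, the following PySpark sketch reads the Event Hubs stream from an Azure Databricks notebook and persists it to Data Lake Storage. It assumes the Azure Event Hubs connector for Spark, the notebook-provided `spark` and `sc` objects, and placeholder values for the connection string, payload schema, and storage paths.

```python
# Sketch only: stream IoT events that contain device coordinates from Event Hubs
# into Data Lake Storage. The connection string, schema fields, and abfss:// paths
# are placeholders, not values defined by this architecture.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

payload_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("eventTime", TimestampType()),
])

eh_conf = {
    # The Spark connector for Event Hubs expects an encrypted connection string.
    "eventhubs.connectionString": sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(
        "<event-hubs-connection-string>"),
}

events = (spark.readStream.format("eventhubs").options(**eh_conf).load()
    .select(F.from_json(F.col("body").cast("string"), payload_schema).alias("event"))
    .select("event.*"))

(events.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://iot@<storage-account>.dfs.core.windows.net/_checkpoints")
    .start("abfss://iot@<storage-account>.dfs.core.windows.net/raw"))
```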
GIS data enters the system:
Azure Data Factory ingests raster GIS data and vector GIS data of any format.
- Raster data consists of grids of values. Each pixel value represents a characteristic like the temperature or elevation of a geographic area.
- Vector data represents specific geographic features. Vertices, or discrete geometric locations, make up the vectors and define the shape of each spatial object.
Data Factory stores the data in Data Lake Storage.
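To make the distinction between the two data types concrete, here's a small illustration that reads one of each with common open-source Python libraries (GeoPandas and Rasterio). The libraries and file names are illustrative choices, not components that this architecture prescribes.

```python
# Illustrative only: the file names are placeholders.
import geopandas as gpd
import rasterio

# Vector data: discrete features whose shapes are defined by vertices.
parcels = gpd.read_file("parcels.geojson")
print(parcels.geometry.head())           # one geometry (point, line, or polygon) per feature

# Raster data: a grid of pixel values, such as elevation per cell.
with rasterio.open("elevation.tif") as src:
    elevation = src.read(1)              # 2-D array of pixel values
    print(elevation.shape, src.crs)      # grid dimensions and coordinate reference system
```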
Spark clusters in Azure Databricks use geospatial code libraries to transform and normalize the data.
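The following sketch shows what one such transformation step might look like. It assumes Apache Sedona as the geospatial Spark library (the article doesn't mandate a specific library, and Sedona's registration API varies by version) and uses placeholder storage paths.

```python
# Sketch only: parse WKT geometries and normalize them to WGS84 (EPSG:4326) so that
# downstream stores such as PostGIS and Azure Data Explorer receive a single CRS.
from sedona.register import SedonaRegistrator  # registration API differs in newer Sedona releases

SedonaRegistrator.registerAll(spark)           # `spark` is the Databricks notebook SparkSession

(spark.read.parquet("abfss://gis@<storage-account>.dfs.core.windows.net/raw/vector")
    .createOrReplaceTempView("raw_features"))

normalized = spark.sql("""
    SELECT id,
           ST_AsText(ST_Transform(ST_GeomFromWKT(wkt), 'epsg:3857', 'epsg:4326')) AS wkt_wgs84
    FROM raw_features
""")

normalized.write.mode("overwrite").parquet(
    "abfss://gis@<storage-account>.dfs.core.windows.net/prepared/vector")
```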
Data Factory loads the prepared vector and raster data into Azure Database for PostgreSQL. The solution uses the PostGIS extension with this database.
Data Factory loads the prepared vector and raster data into Azure Data Explorer.
Azure Database for PostgreSQL stores the GIS data. APIs make this data available in standardized formats:
- GeoJSON is based on JavaScript Object Notation (JSON). GeoJSON represents simple geographical features and their non-spatial properties.
- Well-known text (WKT) is a text markup language that represents vector geometry objects.
- Vector tiles are packets of geographic data. Their lightweight format improves mapping performance.
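To illustrate the first two formats, the snippet below represents the same point as WKT and as a GeoJSON geometry by using the Shapely library (an arbitrary choice for illustration). Vector tiles are a compact binary format, so they aren't shown here; the PostGIS sketch later in this article shows one way to generate them.

```python
# Illustration of WKT versus GeoJSON for one feature.
from shapely.geometry import Point, mapping

device_location = Point(8.5417, 47.3769)   # GeoJSON and WKT list longitude before latitude

print(device_location.wkt)        # WKT: POINT (8.5417 47.3769)
print(mapping(device_location))   # GeoJSON geometry: {'type': 'Point', 'coordinates': (8.5417, 47.3769)}
```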
A Redis cache improves performance by providing quick access to the data.
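A minimal caching sketch, assuming the redis-py client and a hypothetical helper that queries PostGIS; the host name (shown in Azure Cache for Redis style), key pattern, and time-to-live are placeholders:

```python
import json
import redis

cache = redis.Redis(host="<cache-name>.redis.cache.windows.net", port=6380,
                    password="<access-key>", ssl=True)

def get_feature_geojson(feature_id: str) -> dict:
    """Return a feature as GeoJSON, serving it from the cache when possible."""
    key = f"feature:{feature_id}:geojson"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    feature = query_postgis_for_geojson(feature_id)  # hypothetical helper; see the PostGIS examples
    cache.setex(key, 300, json.dumps(feature))       # keep the cached copy for five minutes
    return feature
```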
The Web Apps feature of Azure App Service works with Azure Maps to create visuals of the data.
Users analyze the data with Azure Data Explorer. The GIS features of this tool create insightful visualizations, such as scatterplots built from geospatial data.
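As a sketch of that kind of exploration, the following snippet runs a geospatial aggregation in Azure Data Explorer through the azure-kusto-data package. The cluster URI, database, and table names are placeholders, and the query simply buckets device positions into S2 cells.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://<cluster>.<region>.kusto.windows.net")
client = KustoClient(kcsb)

# Count telemetry points per S2 cell (level 9) to see where devices cluster.
query = """
DeviceTelemetry
| summarize points = count() by cell = geo_point_to_s2cell(longitude, latitude, 9)
| top 20 by points
"""

response = client.execute("<database>", query)
for row in response.primary_results[0]:
    print(row["cell"], row["points"])
```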
Power BI provides customized reports and business intelligence (BI). The Azure Maps visual for Power BI highlights the role of location data in business results.
Throughout the process:
- Azure Monitor collects information on events and performance.
- Log Analytics runs queries on Monitor logs and analyzes the results.
- Azure Key Vault secures passwords, connection strings, and secrets.
Components
Azure Event Hubs is a fully managed streaming platform for big data. This platform as a service (PaaS) offers a partitioned consumer model. Multiple applications can use this model to process the data stream at the same time.
Azure Data Factory is an integration service that works with data from disparate data stores. You can use this fully managed, serverless platform to create, schedule, and orchestrate data transformation workflows.
Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from multiple sources. Azure Databricks can transform geospatial data at large scale for use in analytics and data visualization.
Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data typically comes from multiple, heterogeneous sources and can be structured, semi-structured, or unstructured.
Azure Database for PostgreSQL is a fully managed relational database service that's based on the community edition of the open-source PostgreSQL database engine.
PostGIS is an extension for the PostgreSQL database that integrates with GIS servers. PostGIS can run SQL location queries that involve geographic objects.
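For example, a location query might find all stored features within a given distance of a point. The sketch below assumes psycopg2 and a hypothetical device_locations table with a geometry column named geom.

```python
import psycopg2

conn = psycopg2.connect(
    host="<server-name>.postgres.database.azure.com",
    dbname="gis", user="<user>", password="<password>", sslmode="require")

with conn, conn.cursor() as cur:
    # Find features within 1,000 meters of a point and return them as GeoJSON.
    cur.execute("""
        SELECT id, ST_AsGeoJSON(geom)
        FROM device_locations
        WHERE ST_DWithin(geom::geography,
                         ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                         %s)
    """, (8.5417, 47.3769, 1000))
    for feature_id, geojson in cur.fetchall():
        print(feature_id, geojson)
```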
Redis is an open-source, in-memory data store. Redis caches keep frequently accessed data in server memory. The caches can then quickly process large volumes of application requests that use the data.
Power BI is a collection of software services and apps. You can use Power BI to connect unrelated sources of data and create visuals of them.
The Azure Maps visual for Power BI provides a way to enhance maps with spatial data. You can use this visual to show how location data affects business metrics.
Azure App Service and its Web Apps feature provide a framework for building, deploying, and scaling web apps. The App Service platform offers built-in infrastructure maintenance, security patching, and scaling.
GIS data APIs in Azure Maps store and retrieve map data in formats like GeoJSON and vector tiles.
Azure Data Explorer is a fast, fully managed data analytics service that can work with large volumes of data. This service originally focused on time series and log analytics. It now also handles diverse data streams from applications, websites, IoT devices, and other sources. Geospatial functionality in Azure Data Explorer provides options for rendering map data.
Azure Monitor collects data on environments and Azure resources. This diagnostic information is helpful for maintaining availability and performance. Two data platforms make up Monitor:
- Azure Monitor Logs records and stores log and performance data.
- Azure Monitor Metrics collects numerical values at regular intervals.
Log Analytics is an Azure portal tool that runs queries on Monitor log data. Log Analytics also provides features for charting and statistically analyzing query results.
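Those queries can also run outside the portal. The sketch below uses the azure-monitor-query package to pull one day of App Service request counts from a Log Analytics workspace; the workspace ID is a placeholder, and the table depends on which diagnostic settings you enable.

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

response = client.query_workspace(
    workspace_id="<workspace-id>",
    query="AppServiceHTTPLogs | summarize requests = count() by bin(TimeGenerated, 1h)",
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```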
Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.
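For example, an App Service web app or a Databricks job can read its database credentials from Key Vault at run time instead of keeping them in configuration files. A minimal sketch with the Azure SDK for Python, using placeholder vault and secret names:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up a managed identity in Azure or a developer login locally.
client = SecretClient(vault_url="https://<vault-name>.vault.azure.net",
                      credential=DefaultAzureCredential())

postgres_connection_string = client.get_secret("postgres-connection-string").value
```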
Alternatives
Instead of developing your own APIs, consider using Martin. This open-source tile server makes vector tiles available to web apps. Written in Rust, Martin connects to PostgreSQL tables. You can deploy it as a container.
If your goal is to provide a standardized interface for GIS data, consider using GeoServer. This open framework implements industry-standard Open Geospatial Consortium (OGC) protocols such as Web Feature Service (WFS). It also integrates with common spatial data sources. You can deploy GeoServer as a container on a virtual machine. When customized web apps and exploratory queries are secondary, GeoServer provides a straightforward way to publish geospatial data.
Various Spark libraries are available for working with geospatial data on Azure Databricks, and this solution uses several of them. Other approaches also exist for processing and scaling geospatial workloads with Azure Databricks.
Vector tiles provide an efficient way to display GIS data on maps. This solution uses PostGIS to dynamically query vector tiles. This approach works well for simple queries and result sets that contain well under 1 million records. But in the following cases, a different approach may be better:
- Your queries are computationally expensive.
- Your data doesn't change frequently.
- You're displaying large data sets.
In these situations, consider using Tippecanoe to generate vector tiles. You can run Tippecanoe as part of your data processing flow, either as a container or with Azure Functions. You can make the resulting tiles available through APIs.
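For reference, here's a minimal sketch of the dynamic PostGIS approach described earlier in this section: building a single vector tile per request. It assumes PostGIS 3.0 or later, psycopg2, and a hypothetical features table whose geometries are stored in Web Mercator (EPSG:3857).

```python
import psycopg2

TILE_SQL = """
WITH bounds AS (
    SELECT ST_TileEnvelope(%(z)s, %(x)s, %(y)s) AS geom
),
mvt_rows AS (
    SELECT f.id,
           ST_AsMVTGeom(f.geom, bounds.geom) AS geom
    FROM features AS f, bounds
    WHERE f.geom && bounds.geom
)
SELECT ST_AsMVT(mvt_rows.*, 'features') FROM mvt_rows;
"""

def get_tile(conn, z: int, x: int, y: int) -> bytes:
    """Build one Mapbox Vector Tile (protobuf bytes) for the requested z/x/y address."""
    with conn.cursor() as cur:
        cur.execute(TILE_SQL, {"z": z, "x": x, "y": y})
        return bytes(cur.fetchone()[0])
```

An API layer can serve the returned bytes directly to a map client; the precomputed Tippecanoe approach replaces this per-request query with static tile files.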
Like Event Hubs, Azure IoT Hub can ingest large amounts of data. But IoT Hub also offers bi-directional communication capabilities with devices. If you receive data directly from devices but also send commands and policies back to devices, consider IoT Hub instead of Event Hubs.
To streamline the solution, omit these components:
- Azure Data Explorer
- Power BI
Scenario details
Many possibilities exist for working with geospatial data, or information that includes a geographic component. For instance, geographic information system (GIS) software and standards are widely available. These technologies can store, process, and provide access to geospatial data. But it's often hard to configure and maintain systems that work with geospatial data. You also need expert knowledge to integrate those systems with other systems.
This article outlines a manageable solution for making large volumes of geospatial data available for analytics. The approach is based on the Advanced Analytics Reference Architecture and uses these Azure services:
- Azure Databricks with GIS Spark libraries processes data.
- Azure Database for PostgreSQL queries data that users request through APIs.
- Azure Data Explorer runs fast exploratory queries.
- Azure Maps creates visuals of geospatial data in web applications.
- The Azure Maps visual for Power BI provides customized reports.
Potential use cases
This solution applies to many areas:
- Processing, storing, and providing access to large amounts of raster data, such as maps or climate data.
- Identifying the geographic position of enterprise resource planning (ERP) system entities.
- Combining entity location data with GIS reference data.
- Storing Internet of Things (IoT) telemetry from moving devices.
- Running analytical geospatial queries.
- Embedding curated and contextualized geospatial data in web apps.
Considerations
The following considerations, based on the Microsoft Azure Well-Architected Framework, apply to this solution.
Availability
Event Hubs spreads failure risk across clusters.
- Use a namespace with availability zones turned on to spread risk across three physically separated facilities.
- Consider using the geo-disaster recovery feature of Event Hubs. This feature replicates the entire configuration of a namespace from a primary to a secondary namespace.
See the business continuity features that Azure Database for PostgreSQL offers. These features cover a range of recovery objectives.
App Service diagnostics alerts you to problems in apps, such as downtime. Use this service to identify, troubleshoot, and resolve issues like outages.
Consider using App Service to back up application files. But be careful with backed-up files, which include app settings in plain text. Those settings can contain secrets like connection strings.
Scalability
This solution's implementation meets these conditions:
- Processes up to 10 million data sets per day. The data sets include batch or streaming events.
- Stores 100 million data sets in an Azure Database for PostgreSQL database.
- Queries 1 million or fewer data sets at the same time. A maximum of 30 users run the queries.
The environment uses this configuration:
- An Azure Databricks cluster with four F8s_V2 worker nodes.
- A memory-optimized instance of Azure Database for PostgreSQL.
- An App Service plan with two Standard S2 instances.
Consider these factors to determine which adjustments to make for your implementation:
- Your data ingestion rate.
- Your volume of data.
- Your query volume.
- The number of parallel queries you need to support.
You can scale Azure components independently:
- Event Hubs automatically scales up to meet usage needs. But take steps to manage throughput units and optimize partitions.
- Data Factory handles large amounts of data. Its serverless architecture supports parallelism at different levels.
- Azure Database for PostgreSQL offers high-performance horizontal scaling.
- Azure Data Explorer can elastically scale to terabytes of data in minutes.
The autoscale feature of Monitor also provides scaling functionality. You can configure this feature to add resources to handle increases in load. It can also remove resources to save money.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
Protect vector tile data. Vector tiles embed coordinates and attributes for multiple entities in one file. If you generate vector tiles, use a dedicated set of tiles for each permission level in your access control system. With this approach, only users within each permission level have access to that level's data file.
To improve security, use Key Vault to store and control access to the secrets that this solution uses, such as connection strings, passwords, and API keys.
See Security in Azure App Service for information on how App Service helps secure web apps.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
- To estimate the cost of implementing this solution, see a sample cost profile. This profile is for a single implementation of the environment described in the Scalability section. It doesn't include the cost of Azure Data Explorer.
- To adjust the parameters and explore the cost of running this solution in your environment, use the Azure pricing calculator.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Richard Bumann | Solution Architect
Next steps
Product documentation:
- About Azure Event Hubs
- Azure Databricks concepts
- Introduction to Azure Data Lake Storage
- What is Azure Data Factory?
- Azure App Service overview
To start implementing this solution, see this information:
- Connect a WFS to Azure Maps
- Process OpenStreetMap data with Spark
- Explore ways to display data with Azure Maps
Information on processing geospatial data:
- Functions for querying PostGIS for vector tiles
- Functions for loading PostGIS rasters
- Azure Data Explorer geospatial functions
- Data sources for vector tiles in Azure Maps
- Approaches for processing geospatial data in Databricks
Related resources
Related architectures
- Big data analytics with Azure Data Explorer
- Health data consortium on Azure
- DataOps for the modern data warehouse
- Azure Data Explorer interactive analytics
- Geospatial reference architecture - Azure Orbital
- Geospatial analysis for telecom
- Spaceborne data analysis with Azure Synapse Analytics
Related guides
- Compare the machine learning products and technologies from Microsoft - Azure Databricks
- Machine learning operations (MLOps) framework to scale up machine learning lifecycle with Azure Machine Learning
- Azure Machine Learning decision guide for optimal tool selection
- Monitor Azure Databricks