Selecting the right VM size for your Azure HDInsight cluster
This article discusses how to select the right VM size for the various nodes in your HDInsight cluster.
Begin by understanding how the properties of a virtual machine such as CPU processing, RAM size, and network latency affect the processing of your workloads. Next, think about your application and how it matches with what different VM families are optimized for. Make sure that the VM family that you would like to use is compatible with the cluster type that you plan to deploy. For a list of all supported and recommended VM sizes for each cluster type, see Azure HDInsight supported node configurations. Lastly, you can use a benchmarking process to test some sample workloads and check which SKU within that family is right for you.
For more information on planning other aspects of your cluster such as selecting a storage type or cluster size, see Capacity planning for HDInsight clusters.
VM properties and big data workloads
The VM size and type are determined by CPU processing power, RAM size, and network latency:
CPU: The VM size dictates the number of cores. The more cores, the greater the degree of parallel computation each node can achieve. Also, some VM types have faster cores.
RAM: The VM size also dictates the amount of RAM available in the VM. For workloads that store data in memory for processing, rather than reading from disk, ensure your worker nodes have enough memory to fit the data.
Network: For most cluster types, the data processed by the cluster isn't on local disk, but rather in an external storage service such as Data Lake Storage or Azure Storage. Consider the network bandwidth and throughput between the node VM and the storage service. The network bandwidth available to a VM typically increases with larger sizes. For details, see VM sizes overview.
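The RAM consideration above can be turned into a rough sizing estimate. The following is a minimal sketch, not an HDInsight formula: the function name `workers_needed` and the `usable_fraction` parameter are assumptions introduced for illustration, on the premise that only part of a node's RAM is available for data because the operating system and cluster services use the rest.

```python
import math


def workers_needed(dataset_gb: float, ram_per_node_gb: float,
                   usable_fraction: float = 0.5) -> int:
    """Estimate how many worker nodes are needed to hold a dataset in memory.

    usable_fraction is a hypothetical planning figure (the share of each
    node's RAM left for data); tune it from your own measurements.
    """
    if dataset_gb < 0 or ram_per_node_gb <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_per_node_gb = ram_per_node_gb * usable_fraction
    # Round up: a partially filled node is still a whole node.
    return max(1, math.ceil(dataset_gb / usable_per_node_gb))


if __name__ == "__main__":
    # Example: a 500 GB working set on nodes with 64 GB of RAM each.
    print(workers_needed(500, 64))
```

An estimate like this is only a starting point; benchmarking with sample workloads is what confirms whether the chosen size is adequate.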
Understanding VM optimization
Virtual machine families in Azure are optimized to suit different use cases. The following table lists some of the most popular use cases and the VM families that match them.
Type | Sizes | Description |
---|---|---|
Entry-level | Av2 | Have CPU performance and memory configurations best suited for entry level workloads like development and test. They're economical and provide a low-cost option to get started with Azure. |
General purpose | D, DSv2, Dv2 | Balanced CPU-to-memory ratio. Ideal for testing and development, small to medium databases, and low to medium traffic web servers. |
Compute optimized | F | High CPU-to-memory ratio. Good for medium traffic web servers, network appliances, batch processes, and application servers. |
Memory optimized | Esv3, Ev3 | High memory-to-CPU ratio. Great for relational database servers, medium to large caches, and in-memory analytics. |
- For information about pricing of available VM instances across HDInsight supported regions, see HDInsight Pricing.
Cost saving VM types for light workloads
If you have light processing requirements, the F-series can be a good choice to get started with HDInsight. At a lower per-hour list price, the F-series is the best value in price-performance in the Azure portfolio based on the Azure Compute Unit (ACU) per vCPU.
The following table lists the cluster types and node types that can be created with Fsv2-series VMs.
Cluster Type | Version | Worker Node | Head Node | Zookeeper Node |
---|---|---|---|---|
Spark | All | F4 and above | no | no |
Hadoop | All | F4 and above | no | no |
Kafka | All | F4 and above | no | no |
HBase | All | F4 and above | no | no |
LLAP | disabled | no | no | no |
To see the specifications of each F-series SKU, see F-series VM sizes.
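When you script cluster deployments, the support table above can be encoded as a lookup so that an unsupported combination is caught before deployment. This is a minimal sketch: the dictionary only transcribes the table, and the names `F_SERIES_SUPPORT` and `f_series_minimum` are hypothetical helpers, not part of any Azure SDK.

```python
# Transcribed from the table above. None means the table lists "no" for that
# node role; a string is the smallest size the table allows ("F4 and above").
F_SERIES_SUPPORT = {
    "Spark":  {"worker": "F4", "head": None, "zookeeper": None},
    "Hadoop": {"worker": "F4", "head": None, "zookeeper": None},
    "Kafka":  {"worker": "F4", "head": None, "zookeeper": None},
    "HBase":  {"worker": "F4", "head": None, "zookeeper": None},
    "LLAP":   {"worker": None, "head": None, "zookeeper": None},
}


def f_series_minimum(cluster_type: str, node_role: str):
    """Return the smallest allowed size for the role, or None if unsupported."""
    if cluster_type not in F_SERIES_SUPPORT:
        raise ValueError(f"Unknown cluster type: {cluster_type!r}")
    roles = F_SERIES_SUPPORT[cluster_type]
    if node_role not in roles:
        raise ValueError(f"Unknown node role: {node_role!r}")
    return roles[node_role]


if __name__ == "__main__":
    for cluster in F_SERIES_SUPPORT:
        print(cluster, "worker:", f_series_minimum(cluster, "worker"))
```

Treat the dictionary as a snapshot of the table; the list of supported node configurations remains the authoritative source.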
Benchmarking
Benchmarking is the process of running simulated workloads on different VMs to measure how well they perform for your production workloads.
For more information on benchmarking for VM SKUs and cluster sizes, see Cluster capacity planning in Azure HDInsight.
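Once you have timed the same sample workload on several candidate SKUs, the results can be compared on both run time and cost. The following is a minimal sketch under stated assumptions: the function `compare_skus`, the SKU labels, the durations, and the prices are all placeholders for illustration, and `hourly_price` is assumed to be the hourly cost of the whole test cluster for that SKU.

```python
from statistics import median


def compare_skus(durations_s: dict, hourly_price: dict) -> list:
    """Rank SKUs by the cost of one benchmark run, cheapest first.

    durations_s maps a SKU label to a list of measured run times in seconds.
    hourly_price maps the same label to the cluster's hourly price.
    Returns a list of (sku, median_seconds, cost_per_run) tuples.
    """
    results = []
    for sku, times in durations_s.items():
        if not times:
            raise ValueError(f"No measurements for {sku!r}")
        # The median is less sensitive than the mean to one slow outlier run.
        med = median(times)
        cost_per_run = med / 3600 * hourly_price[sku]
        results.append((sku, med, cost_per_run))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Placeholder numbers, not real measurements or Azure prices.
    timings = {"sku_a": [3600, 3500, 3700], "sku_b": [1800, 1900, 1750]}
    prices = {"sku_a": 1.0, "sku_b": 3.0}
    for sku, med, cost in compare_skus(timings, prices):
        print(f"{sku}: median {med:.0f} s, cost per run {cost:.2f}")
```

A faster SKU is not always the cheaper one per run, which is why the sketch ranks by cost while still reporting the median duration.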