Management operations in Azure Managed Instance for Apache Cassandra

Azure Managed Instance for Apache Cassandra is a fully managed service for pure open-source Apache Cassandra clusters. The service also lets you override configurations to suit the specific needs of each workload, giving you flexibility and control where needed. This article defines the management operations and features provided by the service. It also explains the separation of responsibilities between the Azure support team and customers when maintaining hybrid clusters.

Compaction

  • There are different types of compaction. We currently perform a minor compaction via repair (see Maintenance). Repair performs a Merkle tree compaction, which is a special kind of compaction.
  • Depending on the compaction strategy that was set on the table using CQL (for example, WITH compaction = { 'class' : 'LeveledCompactionStrategy' }), Cassandra automatically compacts when the table reaches a specific size. We recommend that you carefully select a compaction strategy for your workload, and don't run any manual compactions outside that strategy.
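
As a minimal sketch of how a strategy is set per table, the following Python snippet builds the CQL text only; it doesn't connect to a cluster. The helper name build_compaction_cql and the keyspace and table names are hypothetical. The listed class names are the compaction strategies bundled with open-source Apache Cassandra; which one suits a workload is not something this sketch decides.

```python
# Compaction strategies bundled with open-source Apache Cassandra 3.11 and 4.0.
KNOWN_STRATEGIES = {
    "SizeTieredCompactionStrategy",
    "LeveledCompactionStrategy",
    "TimeWindowCompactionStrategy",
}


def build_compaction_cql(keyspace: str, table: str, strategy: str) -> str:
    """Return an ALTER TABLE statement that sets the table's compaction strategy."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown compaction strategy: {strategy}")
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH compaction = {{ 'class' : '{strategy}' }};"
    )


if __name__ == "__main__":
    # Hypothetical keyspace and table names, for illustration only.
    print(build_compaction_cql("store", "orders", "LeveledCompactionStrategy"))
```

The generated statement can then be run from any CQL client, such as cqlsh.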

Patching

  • Operating system-level patches are applied automatically on an approximately two-week cadence.

  • Apache Cassandra software-level patches are applied when security vulnerabilities are identified. The patching cadence may vary.

  • During patching, machines are rebooted one rack at a time. You shouldn't experience any degradation on the application side as long as the ALL consistency level isn't being used and the replication factor is 3 or higher.

  • Apache Cassandra versions use the format X.Y.Z. You can control the deployment of major (X) and minor (Y) versions manually via service tools, whereas the Cassandra patches (Z) that may be required for that major/minor version combination are applied automatically.
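
The rack-by-rack reboot guidance above can be checked with a little arithmetic. The sketch below assumes that each rack holds one replica of every partition, so rebooting one rack takes exactly one replica offline; that assumption and the function names are illustrative and not taken from the service. QUORUM is computed as floor(RF / 2) + 1, which is Cassandra's standard definition.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a request to succeed."""
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def survives_rack_reboot(consistency_level: str, replication_factor: int) -> bool:
    """True if requests still succeed with one replica offline.

    Assumption of this sketch: one replica per rack, one rack down at a time.
    """
    replicas_up = replication_factor - 1
    return replicas_up >= required_replicas(consistency_level, replication_factor)


if __name__ == "__main__":
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, survives_rack_reboot(level, replication_factor=3))
```

With a replication factor of 3, QUORUM needs 2 of 3 replicas and so tolerates one rack being down, while ALL needs all 3 and does not. With a replication factor of 2, QUORUM also needs both replicas, which is why a replication factor of 3 or higher matters.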

Note

The service currently supports Cassandra versions 3.11 and 4.0. Both versions are GA. See our Azure CLI Quickstart (step 5) for specifying Cassandra version during cluster deployment.

Maintenance

  • Nodetool repair is run automatically by the service using Reaper, once every week. You may wish to disable it if you use your own repair service for a hybrid deployment.

  • Node health monitoring consists of:

    • Actively monitoring each node's membership in the Cassandra ring.
    • Autodetecting and automitigating infrastructure issues like virtual machine, network, storage, Linux, and support software failures.
    • Proactively monitoring CPU, disk, quorum loss, and other resource issues.
    • Automatically bringing up failed nodes where possible, and manually bringing up nodes in response to auto-generated warnings.

Support

Azure Managed Instance for Apache Cassandra provides an SLA for the availability of data centers in a managed cluster. If you encounter any issues with using the service, file a support request in the Azure portal.

Our support benefits include:

  • Single point of contact for Cassandra infrastructure issues - no need to raise support cases with IaaS teams (disk, compute, networking) separately.
  • Proactive advice via email on performance bottlenecks, sizing, and other resource constraint issues.
  • 24x7 support coverage, including autogenerated incidents for any severe outage issues.
  • Community-approved patch support (see Patching).
  • In-house Java JDK/JVM engineering team support.
  • Linux Operating System support with software supply chain security.

Important

We will investigate and diagnose any issues reported via support case, and resolve or mitigate them where possible. However, you are ultimately responsible for any Apache Cassandra configuration-level usage that causes CPU, disk, or network problems.

Examples of such issues include:

  • Inefficient query operations.
  • Throughput that exceeds capacity.
  • Ingesting data that exceeds storage capacity.
  • Incorrect keyspace configuration settings.
  • Poor data model or partition key strategy.

If we investigate a support case and discover that the root cause of the issue is at the Apache Cassandra configuration level (and not in any underlying platform-level aspects we maintain), we will still provide recommendations and guidance on remediation or mitigation (when possible) before closing the case.

We recommend that you enable metrics and become familiar with our Azure Monitor integration in order to prevent common application- and configuration-level issues in Apache Cassandra, such as those listed above.

Warning

Azure Managed Instance for Apache Cassandra also lets you run nodetool and sstable commands for routine DBA administration - see article here. Some of these commands can destabilize the Cassandra cluster and should only be run carefully, after being tested in non-production environments. Where possible, use a --dry-run option first. Microsoft cannot offer any SLA or support for issues caused by running commands that alter the default database configuration and/or tables.

Backup and restore

Snapshot backups are enabled by default and taken every 24 hours. Backups are stored in an internal Azure Blob Storage account and are retained for up to 2 days (48 hours). There's no cost for the initial 2 backups. Extra backups are charged; see pricing. To change the backup interval or retention period, you can edit the policy in the portal:

Screenshot of backup schedule configuration page.
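
To get a rough feel for how the interval and retention settings interact, the sketch below estimates how many snapshots are held at one time and how many of those exceed the 2 free backups. It assumes that the retained count is simply the retention period divided by the interval, rounded up, which is an approximation and not a documented formula. The function names are hypothetical, and actual charges are determined by the pricing page, not by this calculation.

```python
import math

FREE_BACKUPS = 2  # from the text: there's no cost for the initial 2 backups


def retained_backups(interval_hours: int, retention_hours: int) -> int:
    """Approximate number of snapshots held at one time (assumed formula)."""
    if interval_hours <= 0 or retention_hours <= 0:
        raise ValueError("interval and retention must be positive")
    return math.ceil(retention_hours / interval_hours)


def extra_backups(interval_hours: int, retention_hours: int) -> int:
    """Approximate number of retained snapshots beyond the free allowance."""
    return max(0, retained_backups(interval_hours, retention_hours) - FREE_BACKUPS)


if __name__ == "__main__":
    # Default policy from the text: every 24 hours, retained for 48 hours.
    print(retained_backups(24, 48), extra_backups(24, 48))
    # Hypothetical custom policy: every 12 hours, retained for 48 hours.
    print(retained_backups(12, 48), extra_backups(12, 48))
```

Under these assumptions, the default policy stays within the free allowance, while a shorter interval or a longer retention period increases the number of retained backups.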

To restore from an existing backup, file a support request in the Azure portal. When filing the support case, you need to:

  1. Provide the backup ID of the backup you want to restore. You can find this ID in the portal:

    Screenshot of backup schedule configuration page highlighting backup ID.

  2. If you don't need to restore the whole cluster, provide the keyspace and table (if applicable) that need to be restored.

  3. Advise whether you want the backup to be restored in the existing cluster, or in a new cluster.

  4. If you want to restore to a new cluster, you need to create the new cluster first. Ensure that the target cluster matches the source cluster in terms of the number of data centers, and that each corresponding data center has the same number of nodes. You can also decide whether to keep the credentials (username/password) in the new target cluster, or allow the restore to override the username/password with what was originally created.

  5. You can also decide whether to keep system_auth keyspace in the new target cluster or allow the restore to overwrite it with data from the backup. The system_auth keyspace in Cassandra contains authorization and internal authentication data, including roles, role permissions, and passwords. Note that our default restore process overwrites the system_auth keyspace.
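
Before filing a request to restore into a new cluster, it can help to verify the topology requirement from step 4 yourself. The following is a minimal sketch, not a service tool: it assumes you have already collected the node count of each data center by hand, and that corresponding data centers appear at the same position in both lists. The function name topology_mismatches is hypothetical.

```python
from typing import List


def topology_mismatches(source_nodes: List[int], target_nodes: List[int]) -> List[str]:
    """Return a description of each difference between two cluster layouts.

    Each list holds the node count of every data center, with corresponding
    data centers at the same index (an assumption of this sketch).
    """
    problems = []
    if len(source_nodes) != len(target_nodes):
        problems.append(
            f"data center count differs: source has {len(source_nodes)}, "
            f"target has {len(target_nodes)}"
        )
    # zip stops at the shorter list, so only paired data centers are compared.
    for index, (src, tgt) in enumerate(zip(source_nodes, target_nodes)):
        if src != tgt:
            problems.append(
                f"data center {index}: source has {src} nodes, target has {tgt}"
            )
    return problems


if __name__ == "__main__":
    print(topology_mismatches([3, 3], [3, 3]))  # matching layouts
    print(topology_mismatches([3, 6], [3, 3]))  # second data center differs
```

An empty result means the two layouts match on both conditions from step 4; any returned entry describes something to fix in the target cluster first.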

Note

The time it takes to respond to a request to restore from backup depends both on the severity of the support case you raise (and its corresponding SLA for response time) and on the amount of data to be restored. However, we do not provide an SLA for the time to complete the restore, as this depends heavily on the volume of data being restored.

Warning

Backups are intended for accidental deletion scenarios, and are not geo-redundant. They are therefore not recommended for use as a disaster recovery (DR) strategy in case of a total regional outage. To safeguard against region-wide outages, we recommend a multi-region deployment. Take a look at our quickstart for multi-region deployments.

Security

Azure Managed Instance for Apache Cassandra provides many built-in explicit security controls and features:

  • Hardened Linux Virtual Machine images with a controlled supply chain.
  • Common Vulnerability & Exposure (CVE) monitoring at the Operating System level.
  • Certificate rotation for both Apache Cassandra and Prometheus software hosted on the managed Virtual Machines.
  • Active vulnerability scanning.
  • Active virus scanning.
  • Secure coding practices.

For more information on security features, see our article here.

Hybrid support

When a hybrid cluster is configured, automated Reaper operations running in the service benefit the whole cluster. This includes data centers that aren't provisioned by the service. Beyond this, it is your responsibility to maintain your on-premises or externally hosted data center.

Next steps

Get started with one of our quickstarts: