Project Flash update: Advancing Azure Virtual Machine availability monitoring

Flash, as the project is internally known, derives its name from our steadfast commitment to building a robust, reliable, and rapid mechanism for customers to monitor virtual machine (VM) health.

Our primary objective is to ensure customers can reliably access actionable and precise telemetry, promptly receive alerts on changes, and periodically monitor data at scale. We also place strong emphasis on developing a centralized and coherent experience that customers can conveniently use to meet their unique observability requirements.

Secure Virtual Machine health with Azure

To get started on your observability journey, you can explore the suite of Azure products to which we emit high-quality VM health data. These products include resource health, activity logs, Azure resource graph, Azure Monitor metrics, and Azure event grid.

We’re thrilled to reveal the exciting developments our team has been crafting over the past year! Here’s a glimpse of what we’ve been working on:

  • Improved VM availability monitoring: We’ve introduced a new feature that keeps a watchful eye for degradation in VM availability. It proactively warns you of potential impact to availability or performance.
  • Public preview of HealthResources event grid: We’re launching a public preview of HealthResources event grid system topic. This feature offers low-latency notifications on VM availability changes, empowering you to take quick mitigation actions when needed.
  • Enhanced visibility into application freezes: We’re now sending notifications when application freezes occur during select network and storage agent updates. This enhanced visibility helps you manage disruptions with greater clarity.

Our commitment to quality remains unwavering. We aim to maintain 100 percent data consistency and uphold rigorous quality standards across all Flash experiences.

“Last year, we provided an update on Project Flash in the Advancing Reliability blog series, emphasizing our dedication to empowering Azure customers to diagnose disruptions to virtual machine (VM) availability conveniently and swiftly. Today, we’re thrilled to share the latest advancements in improving VM availability monitoring, which customers can rely on confidently for seamless operation of their workloads on Azure. I’ve asked Senior Technical Program Manager, Pujitha Desiraju, from the Azure Core Platform Fundamentals team to share the latest investments made as part of Project Flash.” - Mark Russinovich, CTO, Azure.

Introducing degraded VM availability state for improved VM availability monitoring

As a result of our ongoing efforts to enhance VM health detection, we’re excited to reveal a significant improvement in quality with the introduction of the degraded VM availability state. This new feature harnesses machine learning-based anomaly detection models to predict VM degradations due to hardware issues affecting the underlying host server, such as central processing unit (CPU), disk, or memory problems. We have seamlessly integrated this feature into Azure resource graph, event grid, resource health, and activity logs, complementing the already flowing VM health annotations.

With the addition of this feature, monitoring your VM’s health and understanding why it’s degraded has become easier than ever. The views provided across all Flash experiences make it easier to discover whether the VM degradation is a result of a planned or unplanned event. The views also pinpoint the specific component responsible, offer actionable mitigation steps, and provide a precise redeployment date to avoid any operational disruptions.

Looking forward to 2024, we plan to expand our focus to encompass inoperable accelerated networking and new scenarios of hardware failure predictions. Additionally, we plan to incorporate the degraded state as a dimension within the VM availability metric in Azure Monitor, enhancing the accuracy of downtime attribution.

Public preview of low-latency event grid notifications on VM availability changes

To ensure seamless operation of business-critical applications, it’s crucial to have real time awareness of any event that might adversely impact VM availability. This awareness enables you to swiftly take remedial actions to shield end-users from any disruption. To support you in your daily operations, we’re delighted to announce the public preview of the HealthResources event grid system topic with newly added VM health annotations!

This system topic provides in-depth VM health data, giving you immediate insights into changes in VM availability states along with the necessary context. You can receive events on single-instance VMs and Virtual Machine Scale Set VMs for the Azure subscription on which this topic has been created. Data is published to this topic by Azure Resource Notifications (ARN), our state-of-the-art publisher-subscriber service, equipped with robust Role-Based Access Control (RBAC) and advanced filtering capabilities. You can subscribe to the event grid system topic and use the advanced filtering capabilities provided by event grid to route relevant events to downstream tools in real time, so that you can respond to and mitigate issues quickly.

Getting started

Step 1:

Users start by creating a system topic within the Azure subscription for which they want to receive notifications.

Step 2:

Users then create an event subscription within the system topic from Step 1. During this step, they specify the endpoint (such as Event Hubs) to which the events will be routed. Users also have the option to configure event filters to narrow down the scope of delivered events.

As you start subscribing to events from the HealthResources system topic, consider the following best practices:

  1. Choose an appropriate destination or event handler based on the anticipated scale and size of events.
  2. For fan-in scenarios where notifications from multiple system topics need to be consolidated, event hubs are highly recommended as a destination. This is especially useful for real-time processing scenarios to maintain data freshness and for periodic processing for analytics, with configurable retention periods.
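As a rough illustration of what a downstream handler might do with delivered events, the sketch below picks out availability changes that report an unavailable or degraded VM. It runs on an in-memory sample and does not call any Azure service. The event type string and the payload field names (`resourceInfo`, `availabilityState`, `previousAvailabilityState`, `targetResourceId`) are assumptions about the preview event schema and should be checked against the current event grid documentation; `actionable_change` and `ACTIONABLE_STATES` are hypothetical names introduced only for this example.

```python
from typing import Any, Optional

# Assumed event type for availability changes on the HealthResources topic.
AVAILABILITY_CHANGED = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusChanged"
)

# States this sketch treats as needing attention (an illustrative choice).
ACTIONABLE_STATES = {"Unavailable", "Degraded"}


def actionable_change(event: dict[str, Any]) -> Optional[dict[str, str]]:
    """Return a short summary if the event reports an actionable state, else None."""
    if event.get("type") != AVAILABILITY_CHANGED:
        return None
    # Field names below are assumptions about the event payload.
    props = event.get("data", {}).get("resourceInfo", {}).get("properties", {})
    state = props.get("availabilityState")
    if state not in ACTIONABLE_STATES:
        return None
    return {
        "vm": props.get("targetResourceId", event.get("subject", "")),
        "state": state,
        "previous": props.get("previousAvailabilityState", "Unknown"),
    }


# Hand-written sample event with placeholder identifiers, not real service output.
VM_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-demo/providers/Microsoft.Compute/virtualMachines/vm-1"
)
sample_event = {
    "type": AVAILABILITY_CHANGED,
    "subject": VM_ID,
    "data": {
        "resourceInfo": {
            "properties": {
                "targetResourceId": VM_ID,
                "previousAvailabilityState": "Available",
                "availabilityState": "Degraded",
            }
        }
    },
}

summary = actionable_change(sample_event)
print(summary)
```

Filtering in the handler complements, rather than replaces, the event filters configured on the event subscription in Step 2: subscription filters reduce what is delivered, while handler logic like this decides what to act on.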

Looking ahead to 2024, we plan to move the preview to general availability.

Enhanced visibility into application freezes

It’s crucial to have visibility into events that might require a system reboot or those that could lead to system freezes, especially when running sensitive workloads. We’re thrilled to introduce VM health annotations that report freeze impact after it has occurred, in specific scenarios of planned network and storage agent updates. These notifications are delivered to resource health, Azure resource graph, and event grid.

With this new feature, you’ll have access to detailed insights regarding the impact and attribution of system freezes. This information includes whether the activity was planned or unplanned, whether it was successfully completed, the precise duration of the impact as observed by you, and details about the type of update applied. This empowers you to monitor and investigate observed application freezes while also receiving targeted alerts for any freeze events.

Looking ahead to 2024, we’re committed to expanding the range of scenarios for which these notifications are emitted.

Flash solution summary

Over the years, the Flash initiative has been dedicated to developing solutions that cater to the diverse monitoring needs of our customers. To help you determine the most suitable Flash monitoring solution(s) for your specific requirements, refer to the summary below:

Azure resource graph—HealthResources

Currently generally available. It is particularly useful for conducting large-scale investigations. It offers a highly user-friendly experience for information retrieval with its use of Kusto Query Language (KQL). It can also serve as a central hub for resource information and allows easy retrieval of historical data.
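For example, a large-scale investigation might start by listing the VMs that currently report a degraded state. The sketch below only assembles the KQL text in Python; it does not call Azure. The resource type string and the property names (`availabilityState`, `targetResourceId`) are assumptions about the HealthResources table schema, and `degraded_vm_query` is a hypothetical helper introduced for this example.

```python
def degraded_vm_query(limit: int = 100) -> str:
    """Build a KQL query listing resources whose availability state is Degraded.

    Table, type, and property names are assumptions about the HealthResources
    schema; verify them against the Azure resource graph documentation.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return "\n".join(
        [
            "HealthResources",
            "| where type =~ 'microsoft.resourcehealth/availabilitystatuses'",
            "| where tostring(properties.availabilityState) =~ 'Degraded'",
            "| project targetResourceId = tostring(properties.targetResourceId),"
            " availabilityState = tostring(properties.availabilityState)",
            f"| limit {limit}",
        ]
    )


query = degraded_vm_query(limit=50)
print(query)
```

The printed query text could then be pasted into the Azure resource graph query experience in the portal or passed to whichever client you already use to run resource graph queries.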

Azure event grid system topic—HealthResources

Currently in public preview. It is useful for triggering time-sensitive and critical mitigation actions, such as redeployment and VM restart, to prevent end-user disruptions. Customers can receive alerts within seconds of critical changes in resource availability.

Azure monitor—VM availability metric

Currently in public preview. It’s well-suited for tracking trends, aggregating platform metrics (such as CPU and disk usage) and configuring precise threshold-based alerts. Customers can utilize this out-of-the-box VM availability metric in Azure Monitor.
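To make the threshold idea concrete, here is a minimal sketch of the kind of check a threshold-based alert performs. It assumes the availability metric is sampled as 1 when the VM is available and 0 when it is not, with `None` standing in for a missing sample; that encoding, and the helper names `availability_ratio` and `breaches_threshold`, are assumptions made for illustration. In practice the alert rule itself would be configured in Azure Monitor rather than in custom code.

```python
from typing import Optional, Sequence


def availability_ratio(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Average the known samples; return None when no sample carries a value."""
    known = [s for s in samples if s is not None]
    if not known:
        return None
    return sum(known) / len(known)


def breaches_threshold(
    samples: Sequence[Optional[float]], threshold: float = 1.0
) -> bool:
    """True when the averaged availability over the window falls below threshold."""
    ratio = availability_ratio(samples)
    return ratio is not None and ratio < threshold


# Hand-written window of samples: one unavailable reading and one gap.
window = [1, 1, 0, None, 1]
print(availability_ratio(window), breaches_threshold(window))
```

Treating missing samples as unknown rather than as downtime is a design choice in this sketch; a stricter policy could count gaps as unavailable.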

Azure resource health

Currently generally available. It offers immediate and user-friendly health checks for individual resources through the portal. Customers can quickly access the resource health blade on the portal and also review a 30-day historical record of health checks, making it an excellent tool for fast and straightforward troubleshooting.

Facilitating holistic VM availability monitoring

For a holistic approach to monitoring VM availability, including scenarios of routine maintenance, live migration, service healing, and VM degradation, we recommend you utilize both scheduled events (SE) and Flash health events.

Scheduled events are designed to offer an early warning, giving up to 15 minutes of advance notice prior to maintenance activities. This lead time enables you to make informed decisions regarding upcoming downtime, allowing you to either avoid or prepare for it. You have the flexibility to either acknowledge these events or delay actions during this 15-minute period, depending on your readiness for the upcoming maintenance.
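The acknowledge-or-wait decision can be sketched as follows. The example works on a hand-written scheduled events document and only builds the acknowledgement body; it makes no network call. The endpoint URL, the `Metadata: true` header, and the `Events`, `EventId`, `EventType`, `EventStatus`, and `StartRequests` names follow the scheduled events documentation as best understood here and should be verified before use; `build_start_requests` is a hypothetical helper.

```python
import json
from typing import Any, Iterable

# Endpoint reachable only from inside the VM, with the header "Metadata: true".
# Shown for context; this sketch never contacts it.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def build_start_requests(
    document: dict[str, Any], ready_types: Iterable[str]
) -> dict[str, list[dict[str, str]]]:
    """Acknowledge pending events of the types the workload is ready for.

    Events of other types are left alone, so they proceed only after their
    notice period elapses.
    """
    ready = set(ready_types)
    requests = [
        {"EventId": event["EventId"]}
        for event in document.get("Events", [])
        if event.get("EventStatus") == "Scheduled"
        and event.get("EventType") in ready
    ]
    return {"StartRequests": requests}


# Hand-written sample document with placeholder identifiers.
sample_document = {
    "DocumentIncarnation": 2,
    "Events": [
        {"EventId": "A1", "EventType": "Freeze", "EventStatus": "Scheduled"},
        {"EventId": "B2", "EventType": "Reboot", "EventStatus": "Scheduled"},
        {"EventId": "C3", "EventType": "Freeze", "EventStatus": "Started"},
    ],
}

payload = build_start_requests(sample_document, ready_types={"Freeze"})
print(json.dumps(payload))
```

In this sample the workload tolerates freezes, so only the pending freeze event is acknowledged; the reboot is left to run its notice period while the workload prepares.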

On the other hand, Flash health events are focused on real-time tracking of ongoing and completed availability disruptions, including VM degradation. This feature empowers you to effectively monitor and manage downtime, supporting automated mitigation, investigations, and post-mortem analysis.

To get started on your observability journey, you can explore the suite of Azure products to which we emit high-quality VM health data. These products include resource health, activity logs, Azure resource graph, Azure Monitor metrics, and the Azure event grid system topic.

Learn more about the Flash initiative

Please stay tuned for more announcements on the Flash initiative by tracking updates to the Advancing Reliability series!