Compare commits

...

4 Commits

Author SHA1 Message Date
Dominik Eisenberg
56c29a6280 mscluster: Add virtual disk metrics sub-collector (#2296)
Signed-off-by: Dominik Eisenberg <d.business@outlook.de>
Signed-off-by: Jan-Otto Kröpke <mail@jkroepke.de>
Signed-off-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
Co-authored-by: Jan-Otto Kröpke <mail@jkroepke.de>
Co-authored-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
2026-02-08 21:01:53 +01:00
Dominik Eisenberg
78395afc67 mscluster: add shared volumes collector and update documentation (#2301)
Signed-off-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
Signed-off-by: Jan-Otto Kröpke <mail@jkroepke.de>
Co-authored-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
Co-authored-by: Jan-Otto Kröpke <mail@jkroepke.de>
2026-02-08 12:53:40 +00:00
Dominik Eisenberg
e951e516de docs: add alerting examples for CPU and CSV (#2317)
Signed-off-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
Co-authored-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
2026-02-08 13:46:02 +01:00
Dominik Eisenberg
ec6f705410 os: rename install_time_timestamp to install_time_timestamp_seconds (#2315)
Co-authored-by: EisenbergD <dominik.eisenberg@beiersdorf.com>
2026-02-04 13:23:28 +01:00
9 changed files with 430 additions and 18 deletions

View File

@@ -80,7 +80,36 @@ avg by(instance) (
## Alerting examples
**prometheus.rules**
#### Average CPU utilization over 1 hour exceeds 80% (New CPU metric)
```yaml
# Alert on hosts with 1h avg CPU more than 80%
- alert: HighCPUUtilization
expr: |
avg_over_time(
(
sum by (instance) (
(
rate(windows_cpu_processor_utility_total{}[1m])
/
rate(windows_cpu_processor_rtc_total{}[1m])
)
) /
count by (instance) (
windows_cpu_processor_utility_total{}
)
)[1h:]
) > 80
for: 1m
labels:
severity: warning
metric_name: CPUUtilization
annotations:
summary: "High CPU utilization on {{ $labels.instance }}"
description: |
CPU utilization on {{ $labels.instance }} has averaged more than 80% over the last hour (current value: {{ printf "%.2f" $value }})
```
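The expression above divides each core's utility rate by its RTC rate, sums the ratios per instance, and divides by the number of cores. A minimal Go sketch of that arithmetic follows; the `coreRates` type and the sample rates are invented for illustration and are not part of the exporter.

```go
package main

import "fmt"

// coreRates is a hypothetical holder for the two per-core rates that the
// PromQL expression divides: rate(utility)/rate(rtc).
type coreRates struct {
	utility float64
	rtc     float64
}

// avgUtilization mirrors "sum by (instance) (utility/rtc) / count by (instance)".
// It assumes every rtc value is non-zero.
func avgUtilization(cores []coreRates) float64 {
	if len(cores) == 0 {
		return 0
	}
	var sum float64
	for _, c := range cores {
		sum += c.utility / c.rtc
	}
	return sum / float64(len(cores))
}

func main() {
	// Invented sample rates for a two-core host.
	cores := []coreRates{{utility: 90, rtc: 1}, {utility: 70, rtc: 1}}
	fmt.Println(avgUtilization(cores)) // 80
}
```

The alert then averages this per-instance value over one hour with `avg_over_time` and compares it against 80.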
#### Average CPU utilization over 1 hour exceeds 80% (Old CPU metric)
```yaml
# Alert on hosts with more than 80% CPU usage over a 10 minute period
- alert: CpuUsage
@@ -91,6 +120,10 @@ avg by(instance) (
annotations:
summary: "CPU Usage (instance {{ $labels.instance }})"
description: "CPU Usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```
#### CPU not using boost frequencies
```yaml
# Alert on hosts which are not boosting their CPU frequencies
- alert: NoCpuTurbo
expr: |

View File

@@ -5,14 +5,14 @@ The MSCluster_Cluster class is a dynamic WMI class that represents a cluster.
|||
-|-
Metric name prefix | `mscluster`
Classes | `MSCluster_Cluster`,`MSCluster_Network`,`MSCluster_Node`,`MSCluster_Resource`,`MSCluster_ResourceGroup`
Classes | `MSCluster_Cluster`,`MSCluster_Network`,`MSCluster_Node`,`MSCluster_Resource`,`MSCluster_ResourceGroup`,`MSCluster_DiskPartition`,`MSFT_VirtualDisk`
Enabled by default? | No
## Flags
### `--collectors.mscluster.enabled`
Comma-separated list of collectors to use, for example:
`--collectors.mscluster.enabled=cluster,network,node,resource,resouregroup`.
`--collectors.mscluster.enabled=cluster,network,node,resource,resourcegroup,shared_volumes,virtualdisk`.
Matching is case-sensitive.
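Because matching is case-sensitive, a value such as `VirtualDisk` would not enable the `virtualdisk` sub-collector. The sketch below illustrates this with `slices.Contains`, which the collector also uses; `parseEnabled` and `isEnabled` are hypothetical helpers, and how the exporter actually splits the flag value is an assumption here.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// parseEnabled is a hypothetical helper that splits the comma-separated
// flag value into sub-collector names.
func parseEnabled(flagValue string) []string {
	return strings.Split(flagValue, ",")
}

// isEnabled reports whether name appears in the list.
// slices.Contains compares strings exactly, so casing matters.
func isEnabled(enabled []string, name string) bool {
	return slices.Contains(enabled, name)
}

func main() {
	enabled := parseEnabled("cluster,node,shared_volumes,virtualdisk")

	fmt.Println(isEnabled(enabled, "virtualdisk")) // true
	fmt.Println(isEnabled(enabled, "VirtualDisk")) // false
}
```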
## Metrics
@@ -170,17 +170,68 @@ Matching is case-sensitive.
| `mscluster_resourcegroup_State` | The current state of the resource group. -1: Unknown; 0: Online; 1: Offline; 2: Failed; 3: Partial Online; 4: Pending | gauge | `name` |
| `mscluster_resourcegroup_UpdateDomain` | | gauge | `name` |
### Shared Volumes
| Name | Description | Type | Labels |
|------------------------------------------|----------------------------------------------------------------|-------|-----------------------------|
| `mscluster_shared_volumes_info` | Cluster Shared Volumes information (value is always 1) | gauge | `name`,`path`,`volume_guid` |
| `mscluster_shared_volumes_total_bytes` | Total size of the Cluster Shared Volume in bytes | gauge | `name`,`volume_guid` |
| `mscluster_shared_volumes_free_bytes` | Free space on the Cluster Shared Volume in bytes | gauge | `name`,`volume_guid` |
### Virtual Disk
| Name | Description | Type | Labels |
|-----------------------------------------------------------|------------------------------------------------------------------------------------------------|-------|--------|
| `mscluster_virtualdisk_info` | Virtual disk information (value is always 1) | gauge | `name`, `unique_id` |
| `mscluster_virtualdisk_health_status` | Health status of the virtual disk. 0: Healthy, 1: Warning, 2: Unhealthy, 5: Unknown | gauge | `name`, `unique_id` |
| `mscluster_virtualdisk_size_bytes` | Total size of the virtual disk in bytes | gauge | `name`, `unique_id` |
| `mscluster_virtualdisk_footprint_on_pool_bytes` | Physical storage consumed by the virtual disk on the storage pool in bytes | gauge | `name`, `unique_id` |
| `mscluster_virtualdisk_storage_efficiency_percent` | Storage efficiency percentage (Size / FootprintOnPool * 100) | gauge | `name`, `unique_id` |
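The `mscluster_virtualdisk_health_status` gauge carries a numeric code. A small Go sketch of turning the documented codes into names, for example in a dashboard helper, could look like this; `healthStatusName` is a hypothetical function, and values outside the documented set are reported as unrecognized rather than guessed.

```go
package main

import "fmt"

// healthStatusName maps the gauge values listed in the table above to
// their names. Any other value is returned as "unrecognized".
func healthStatusName(v uint16) string {
	switch v {
	case 0:
		return "Healthy"
	case 1:
		return "Warning"
	case 2:
		return "Unhealthy"
	case 5:
		return "Unknown"
	default:
		return fmt.Sprintf("unrecognized (%d)", v)
	}
}

func main() {
	for _, v := range []uint16{0, 1, 2, 5, 3} {
		fmt.Printf("%d -> %s\n", v, healthStatusName(v))
	}
}
```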
### Example metric
Query the state of all cluster resources owned by node1
```
windows_mscluster_resource_owner_node{node_name="node1"}
```
Query virtual disk storage efficiency for thin provisioned disks
```
windows_mscluster_virtualdisk_storage_efficiency_percent
```
## Useful queries
Counts the number of Network Name cluster resources
```
count(windows_mscluster_resource_state{type="Network Name"})
```
Find virtual disks with low storage efficiency (over-provisioned)
```
windows_mscluster_virtualdisk_storage_efficiency_percent < 50
```
Calculate total virtual disk capacity vs physical usage
```
sum(windows_mscluster_virtualdisk_size_bytes) / sum(windows_mscluster_virtualdisk_footprint_on_pool_bytes) * 100
```
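To make the efficiency values easier to interpret, here is a minimal Go sketch of the documented formula (Size / FootprintOnPool * 100) with a zero-footprint guard. The function name and the sample sizes are illustrative assumptions; a footprint of twice the size, as one might see with a two-way mirror, yields 50.

```go
package main

import "fmt"

// storageEfficiencyPercent applies Size / FootprintOnPool * 100 and
// returns 0 when the footprint is zero to avoid dividing by zero.
func storageEfficiencyPercent(sizeBytes, footprintBytes uint64) float64 {
	if footprintBytes == 0 {
		return 0
	}
	return float64(sizeBytes) / float64(footprintBytes) * 100
}

func main() {
	const tib = 1 << 40

	// Invented sample sizes.
	fmt.Printf("%.1f\n", storageEfficiencyPercent(1*tib, 2*tib)) // 50.0
	fmt.Printf("%.1f\n", storageEfficiencyPercent(1*tib, 3*tib)) // 33.3
	fmt.Printf("%.1f\n", storageEfficiencyPercent(1*tib, 0))     // 0.0
}
```

Values above 100 are possible whenever the footprint is smaller than the nominal size.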
## Alerting examples
_This collector does not yet have alerting examples, we would appreciate your help adding them!_
#### Low free space on cluster shared volume
```yaml
# Alerts if volume has less than 20% free space
- alert: LowCSVFreeSpace
expr: |
(
max by (name, cluster) (windows_mscluster_shared_volumes_free_bytes{name!="ClusterPerformanceHistory"})
/
max by (name, cluster) (windows_mscluster_shared_volumes_total_bytes{name!="ClusterPerformanceHistory"})
) * 100 < 20
for: 10m
labels:
severity: warning
annotations:
summary: "Low CSV free space on {{ $labels.name }}"
description: |
Cluster Shared Volume {{ $labels.name }} on cluster {{ $labels.cluster }} has less than 20% free space (current: {{ printf "%.2f" $value }}%)
```
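The alert condition reduces to a free-space percentage compared against a threshold. A small Go sketch of the same arithmetic, using invented volume sizes and hypothetical helper names:

```go
package main

import "fmt"

// freePercent mirrors (free_bytes / total_bytes) * 100 from the rule.
// Returning 0 for an empty volume is a choice made for this sketch.
func freePercent(freeBytes, totalBytes float64) float64 {
	if totalBytes == 0 {
		return 0
	}
	return freeBytes / totalBytes * 100
}

// lowFreeSpace reports whether the volume would match the "< 20" condition.
func lowFreeSpace(freeBytes, totalBytes float64) bool {
	return freePercent(freeBytes, totalBytes) < 20
}

func main() {
	const gib = 1 << 30

	// Invented sample: 128 GiB free on a 1 TiB volume.
	fmt.Println(freePercent(128*gib, 1024*gib))  // 12.5
	fmt.Println(lowFreeSpace(128*gib, 1024*gib)) // true
	fmt.Println(lowFreeSpace(512*gib, 1024*gib)) // false
}
```

In the rule itself the condition must also hold for 10 minutes (`for: 10m`) before the alert fires.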

View File

@@ -14,11 +14,11 @@ None
## Metrics
| Name | Description | Type | Labels |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------------------------------------------------------------------------------------------------|
| `windows_os_hostname` | Labelled system hostname information as provided by ComputerSystem.DNSHostName and ComputerSystem.Domain | gauge | `domain`, `fqdn`, `hostname` |
| `windows_os_info` | Contains full product name & version in labels. Note that the `major_version` for Windows 11 is "10"; a build number greater than 22000 represents Windows 11. | gauge | `product`, `version`, `major_version`, `minor_version`, `build_number`, `revision`, `installation_type` |
| `windows_os_install_time_timestamp` | Unix timestamp of OS installation time | gauge | None |
| Name | Description | Type | Labels |
|----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------------------------------------------------------------------------------------------------|
| `windows_os_hostname` | Labelled system hostname information as provided by ComputerSystem.DNSHostName and ComputerSystem.Domain | gauge | `domain`, `fqdn`, `hostname` |
| `windows_os_info` | Contains full product name & version in labels. Note that the `major_version` for Windows 11 is "10"; a build number greater than 22000 represents Windows 11. | gauge | `product`, `version`, `major_version`, `minor_version`, `build_number`, `revision`, `installation_type` |
| `windows_os_install_time_timestamp_seconds` | Unix timestamp of OS installation time | gauge | None |
### Example metric
@@ -29,13 +29,28 @@ windows_os_hostname{domain="",fqdn="PC",hostname="PC"} 1
# HELP windows_os_info Contains full product name & version in labels. Note that the "major_version" for Windows 11 is \\"10\\"; a build number greater than 22000 represents Windows 11.
# TYPE windows_os_info gauge
windows_os_info{build_number="19045",installation_type="Client",major_version="10",minor_version="0",product="Windows 10 Pro",revision="4842",version="10.0.19045"} 1
# HELP windows_os_install_time_timestamp Unix timestamp of OS installation time
# TYPE windows_os_install_time_timestamp gauge
windows_os_install_time_timestamp 1.6725312e+09
# HELP windows_os_install_time_timestamp_seconds Unix timestamp of OS installation time
# TYPE windows_os_install_time_timestamp_seconds gauge
windows_os_install_time_timestamp_seconds 1.6725312e+09
```
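The `_seconds` suffix signals that the value is a Unix timestamp in seconds. As a rough illustration, the sample value above can be converted to a calendar date like this (`installTime` is a hypothetical helper, not exporter code):

```go
package main

import (
	"fmt"
	"time"
)

// installTime converts the gauge value (Unix seconds) to a UTC time.
func installTime(seconds float64) time.Time {
	return time.Unix(int64(seconds), 0).UTC()
}

func main() {
	// Sample value taken from the example metric output.
	fmt.Println(installTime(1.6725312e+09).Format(time.RFC3339)) // 2023-01-01T00:00:00Z
}
```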
## Useful queries
_This collector does not yet have useful queries, we would appreciate your help adding them!_
## Alerting examples
_This collector does not yet have alerting examples, we would appreciate your help adding them!_
#### Hyper-V host is down for more than 5 minutes
```yaml
# Alerts if Agent/Host is down for 5min
- alert: HypervHostDown
expr: up{app="hyper-v"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: Hyper-V host {{ $labels.instance }} is down
description: |
Hyper-V host {{ $labels.instance }} has been unreachable for more than 5 minutes.
Job: {{ $labels.job }}
```

View File

@@ -38,6 +38,8 @@ const (
subCollectorNode = "node"
subCollectorResource = "resource"
subCollectorResourceGroup = "resourcegroup"
subCollectorSharedVolumes = "shared_volumes"
subCollectorVirtualDisk = "virtualdisk"
)
type Config struct {
@@ -52,6 +54,8 @@ var ConfigDefaults = Config{
subCollectorNode,
subCollectorResource,
subCollectorResourceGroup,
subCollectorSharedVolumes,
subCollectorVirtualDisk,
},
}
@@ -62,6 +66,8 @@ type Collector struct {
collectorNode
collectorResource
collectorResourceGroup
collectorSharedVolumes
collectorVirtualDisk
config Config
miSession *mi.Session
@@ -156,6 +162,18 @@ func (c *Collector) Build(_ *slog.Logger, miSession *mi.Session) error {
}
}
if slices.Contains(c.config.CollectorsEnabled, subCollectorSharedVolumes) {
if err := c.buildSharedVolumes(); err != nil {
errs = append(errs, fmt.Errorf("failed to build shared_volumes collector: %w", err))
}
}
if slices.Contains(c.config.CollectorsEnabled, subCollectorVirtualDisk) {
if err := c.buildVirtualDisk(); err != nil {
errs = append(errs, fmt.Errorf("failed to build virtualdisk collector: %w", err))
}
}
return errors.Join(errs...)
}
@@ -166,10 +184,10 @@ func (c *Collector) Collect(ch chan<- prometheus.Metric) error {
return nil
}
errCh := make(chan error, 5)
errCh := make(chan error, 7) // the shared_volumes/virtualdisk goroutine can send two errors
wg := sync.WaitGroup{}
wg.Add(5)
wg.Add(6)
go func() {
defer wg.Done()
@@ -226,6 +244,22 @@ func (c *Collector) Collect(ch chan<- prometheus.Metric) error {
}()
}()
go func() {
defer wg.Done()
if slices.Contains(c.config.CollectorsEnabled, subCollectorSharedVolumes) {
if err := c.collectSharedVolumes(ch); err != nil {
errCh <- fmt.Errorf("failed to collect shared_volumes metrics: %w", err)
}
}
if slices.Contains(c.config.CollectorsEnabled, subCollectorVirtualDisk) {
if err := c.collectVirtualDisk(ch); err != nil {
errCh <- fmt.Errorf("failed to collect virtualdisk metrics: %w", err)
}
}
}()
wg.Wait()
close(errCh)
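Because `wg.Wait()` runs before `errCh` is drained, the channel buffer has to hold every error that can be sent; otherwise a sender blocks and the wait never returns. The sketch below shows the same fan-out pattern with the buffer sized from the number of tasks. `task`, `runAll`, `failing`, and `succeeding` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// task is a hypothetical stand-in for one sub-collector's collect call.
type task func() error

// failing and succeeding build sample tasks for the demonstration.
func failing(msg string) task { return func() error { return errors.New(msg) } }
func succeeding() task        { return func() error { return nil } }

// runAll runs each task in its own goroutine and joins the errors.
// Each task sends at most one error, so a buffer of len(tasks) guarantees
// that no send blocks while wg.Wait() is pending.
func runAll(tasks []task) error {
	errCh := make(chan error, len(tasks))

	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t task) {
			defer wg.Done()
			if err := t(); err != nil {
				errCh <- err
			}
		}(t)
	}

	wg.Wait()
	close(errCh)

	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errors.Join(errs...) // nil when no task failed
}

func main() {
	err := runAll([]task{succeeding(), failing("virtualdisk failed"), succeeding()})
	fmt.Println(err) // virtualdisk failed
}
```

If one goroutine can report more than one error, the buffer needs one slot per possible error rather than one per goroutine.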

View File

@@ -0,0 +1,122 @@
// SPDX-License-Identifier: Apache-2.0
//
// Copyright The Prometheus Authors
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build windows
package mscluster
import (
"fmt"
"strings"
"github.com/prometheus-community/windows_exporter/internal/mi"
"github.com/prometheus-community/windows_exporter/internal/types"
"github.com/prometheus/client_golang/prometheus"
)
const nameSharedVolumes = Name + "_shared_volumes"
type collectorSharedVolumes struct {
sharedVolumesMIQuery mi.Query
sharedVolumesInfo *prometheus.Desc
sharedVolumesTotalSize *prometheus.Desc
sharedVolumesFreeSpace *prometheus.Desc
}
// msClusterDiskPartition represents the MSCluster_DiskPartition WMI class
type msClusterDiskPartition struct {
Name string `mi:"Name"`
Path string `mi:"Path"`
TotalSize uint64 `mi:"TotalSize"`
FreeSpace uint64 `mi:"FreeSpace"`
Volume string `mi:"VolumeLabel"`
VolumeGuid string `mi:"VolumeGuid"`
}
func (c *Collector) buildSharedVolumes() error {
sharedVolumesMIQuery, err := mi.NewQuery("SELECT Name, Path, TotalSize, FreeSpace, VolumeLabel, VolumeGuid FROM MSCluster_DiskPartition")
if err != nil {
return fmt.Errorf("failed to create WMI query: %w", err)
}
c.sharedVolumesMIQuery = sharedVolumesMIQuery
c.sharedVolumesInfo = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameSharedVolumes, "info"),
"Cluster Shared Volumes information (value is always 1)",
[]string{"name", "path", "volume_guid"},
nil,
)
c.sharedVolumesTotalSize = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameSharedVolumes, "total_bytes"),
"Total size of the Cluster Shared Volume in bytes",
[]string{"name", "volume_guid"},
nil,
)
c.sharedVolumesFreeSpace = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameSharedVolumes, "free_bytes"),
"Free space on the Cluster Shared Volume in bytes",
[]string{"name", "volume_guid"},
nil,
)
var dst []msClusterDiskPartition
if err := c.miSession.Query(&dst, mi.NamespaceRootMSCluster, c.sharedVolumesMIQuery); err != nil {
return fmt.Errorf("WMI query failed: %w", err)
}
return nil
}
func (c *Collector) collectSharedVolumes(ch chan<- prometheus.Metric) error {
var dst []msClusterDiskPartition
if err := c.miSession.Query(&dst, mi.NamespaceRootMSCluster, c.sharedVolumesMIQuery); err != nil {
return fmt.Errorf("WMI query failed: %w", err)
}
for _, partition := range dst {
volume := strings.TrimRight(partition.Volume, " ")
ch <- prometheus.MustNewConstMetric(
c.sharedVolumesInfo,
prometheus.GaugeValue,
1.0,
volume,
partition.Path,
partition.VolumeGuid,
)
ch <- prometheus.MustNewConstMetric(
c.sharedVolumesTotalSize,
prometheus.GaugeValue,
float64(partition.TotalSize)*1024*1024, // Convert from MB to bytes
volume,
partition.VolumeGuid,
)
ch <- prometheus.MustNewConstMetric(
c.sharedVolumesFreeSpace,
prometheus.GaugeValue,
float64(partition.FreeSpace)*1024*1024, // Convert from MB to bytes
volume,
partition.VolumeGuid,
)
}
return nil
}
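The collector multiplies `TotalSize` and `FreeSpace` by `1024*1024` and trims trailing spaces from the volume label. The following sketch isolates those two steps. It assumes the WMI class reports sizes in megabytes, which is inferred from the multiplier rather than confirmed by this diff; `mibToBytes` and `trimLabel` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// bytesPerMiB is the multiplier used by the collector (1024*1024).
const bytesPerMiB = 1024 * 1024

// mibToBytes converts a size reported in megabytes (assumed unit) to bytes.
func mibToBytes(mib uint64) float64 {
	return float64(mib) * bytesPerMiB
}

// trimLabel removes trailing spaces, as the collector does for VolumeLabel.
func trimLabel(label string) string {
	return strings.TrimRight(label, " ")
}

func main() {
	fmt.Printf("%.0f\n", mibToBytes(512))       // 536870912
	fmt.Printf("%q\n", trimLabel("Volume1   ")) // "Volume1"
}
```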

View File

@@ -0,0 +1,156 @@
// SPDX-License-Identifier: Apache-2.0
//
// Copyright The Prometheus Authors
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//go:build windows
package mscluster
import (
"fmt"
"github.com/prometheus-community/windows_exporter/internal/mi"
"github.com/prometheus-community/windows_exporter/internal/types"
"github.com/prometheus/client_golang/prometheus"
)
const nameVirtualDisk = Name + "_virtualdisk"
type collectorVirtualDisk struct {
virtualDiskMIQuery mi.Query
virtualDiskInfo *prometheus.Desc
virtualDiskHealthStatus *prometheus.Desc
virtualDiskSize *prometheus.Desc
virtualDiskFootprintOnPool *prometheus.Desc
virtualDiskStorageEfficiency *prometheus.Desc
}
// msftVirtualDisk represents the MSFT_VirtualDisk WMI class
type msftVirtualDisk struct {
FriendlyName string `mi:"FriendlyName"`
UniqueId string `mi:"UniqueId"`
HealthStatus uint16 `mi:"HealthStatus"`
Size uint64 `mi:"Size"`
FootprintOnPool uint64 `mi:"FootprintOnPool"`
// OperationalStatus []uint16 `mi:"OperationalStatus"` Not supported by mi query: https://github.com/prometheus-community/windows_exporter/pull/2296#issuecomment-3736584632
}
func (c *Collector) buildVirtualDisk() error {
wmiSelect := "FriendlyName,UniqueId,HealthStatus,Size,FootprintOnPool"
virtualDiskMIQuery, err := mi.NewQuery(fmt.Sprintf("SELECT %s FROM MSFT_VirtualDisk", wmiSelect))
if err != nil {
return fmt.Errorf("failed to create WMI query: %w", err)
}
c.virtualDiskMIQuery = virtualDiskMIQuery
c.virtualDiskInfo = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameVirtualDisk, "info"),
"Virtual Disk information (value is always 1)",
[]string{"name", "unique_id"},
nil,
)
c.virtualDiskHealthStatus = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameVirtualDisk, "health_status"),
"Health status of the virtual disk. 0: Healthy, 1: Warning, 2: Unhealthy, 5: Unknown",
[]string{"name", "unique_id"},
nil,
)
c.virtualDiskSize = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameVirtualDisk, "size_bytes"),
"Total size of the virtual disk in bytes",
[]string{"name", "unique_id"},
nil,
)
c.virtualDiskFootprintOnPool = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameVirtualDisk, "footprint_on_pool_bytes"),
"Physical storage consumed by the virtual disk on the storage pool in bytes",
[]string{"name", "unique_id"},
nil,
)
c.virtualDiskStorageEfficiency = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, nameVirtualDisk, "storage_efficiency_percent"),
"Storage efficiency percentage (Size / FootprintOnPool * 100)",
[]string{"name", "unique_id"},
nil,
)
return nil
}
func (c *Collector) collectVirtualDisk(ch chan<- prometheus.Metric) error {
var dst []msftVirtualDisk
if err := c.miSession.Query(&dst, mi.NamespaceRootStorage, c.virtualDiskMIQuery); err != nil {
return fmt.Errorf("WMI query failed: %w", err)
}
for _, vdisk := range dst {
ch <- prometheus.MustNewConstMetric(
c.virtualDiskInfo,
prometheus.GaugeValue,
1.0,
vdisk.FriendlyName,
vdisk.UniqueId,
)
ch <- prometheus.MustNewConstMetric(
c.virtualDiskHealthStatus,
prometheus.GaugeValue,
float64(vdisk.HealthStatus),
vdisk.FriendlyName,
vdisk.UniqueId,
)
ch <- prometheus.MustNewConstMetric(
c.virtualDiskSize,
prometheus.GaugeValue,
float64(vdisk.Size),
vdisk.FriendlyName,
vdisk.UniqueId,
)
ch <- prometheus.MustNewConstMetric(
c.virtualDiskFootprintOnPool,
prometheus.GaugeValue,
float64(vdisk.FootprintOnPool),
vdisk.FriendlyName,
vdisk.UniqueId,
)
// Calculate storage efficiency (avoid division by zero)
var storageEfficiency float64
if vdisk.FootprintOnPool > 0 {
storageEfficiency = float64(vdisk.Size) / float64(vdisk.FootprintOnPool) * 100
} else {
storageEfficiency = 0
}
ch <- prometheus.MustNewConstMetric(
c.virtualDiskStorageEfficiency,
prometheus.GaugeValue,
storageEfficiency,
vdisk.FriendlyName,
vdisk.UniqueId,
)
}
return nil
}

View File

@@ -122,7 +122,7 @@ func (c *Collector) Build(_ *slog.Logger, _ *mi.Session) error {
)
c.installTime = prometheus.NewDesc(
prometheus.BuildFQName(types.Namespace, Name, "install_time_timestamp"),
prometheus.BuildFQName(types.Namespace, Name, "install_time_timestamp_seconds"),
"Unix timestamp of OS installation time",
nil,
nil,

View File

@@ -54,6 +54,7 @@ var (
NamespaceRootWebAdministration = utils.Must(NewNamespace("root/WebAdministration"))
NamespaceRootMSCluster = utils.Must(NewNamespace("root/MSCluster"))
NamespaceRootMicrosoftDNS = utils.Must(NewNamespace("root/MicrosoftDNS"))
NamespaceRootStorage = utils.Must(NewNamespace("root/Microsoft/Windows/Storage"))
)
type Query *uint16

View File

@@ -287,8 +287,8 @@ windows_exporter_collector_timeout{collector="udp"} 0
# TYPE windows_os_hostname gauge
# HELP windows_os_info Contains full product name & version in labels. Note that the "major_version" for Windows 11 is \\"10\\"; a build number greater than 22000 represents Windows 11.
# TYPE windows_os_info gauge
# HELP windows_os_install_time_timestamp Unix timestamp of OS installation time
# TYPE windows_os_install_time_timestamp gauge
# HELP windows_os_install_time_timestamp_seconds Unix timestamp of OS installation time
# TYPE windows_os_install_time_timestamp_seconds gauge
# HELP windows_pagefile_free_bytes Number of bytes that can be mapped into the operating system paging files without causing any other pages to be swapped out
# TYPE windows_pagefile_free_bytes gauge
# HELP windows_pagefile_limit_bytes Number of bytes that can be stored in the operating system paging files. 0 (zero) indicates that there are no paging files