gpu: add metrics collector and related types (#2052) (#2059)

This commit is contained in:
小荣
2025-06-03 01:45:18 +08:00
committed by GitHub
parent e673f192d2
commit dcf85032ca
9 changed files with 798 additions and 48 deletions

139
docs/collector.gpu.md Normal file
View File

@@ -0,0 +1,139 @@
# gpu collector
The gpu collector exposes metrics about GPU usage and memory consumption, both at the adapter (physical GPU) and
per-process level.
| | |
|---------------------|--------------------------------------|
| Metric name prefix | `gpu` |
| Data source | Perflib |
| Counters | GPU Engine, GPU Adapter, GPU Process |
| Enabled by default? | No |
## Flags
None
## Metrics
These metrics are available on supported versions of Windows with compatible GPUs and drivers:
### Adapter-level Metrics
| Name | Description | Type | Labels |
|----------------------------------------------|----------------------------------------------------------|-------|--------|
| `windows_gpu_adapter_memory_committed_bytes` | Total committed GPU memory in bytes per physical GPU | gauge | `phys` |
| `windows_gpu_adapter_memory_dedicated_bytes` | Dedicated GPU memory usage in bytes per physical GPU | gauge | `phys` |
| `windows_gpu_adapter_memory_shared_bytes` | Shared GPU memory usage in bytes per physical GPU | gauge | `phys` |
| `windows_gpu_local_adapter_memory_bytes` | Local adapter memory usage in bytes per physical GPU | gauge | `phys` |
| `windows_gpu_non_local_adapter_memory_bytes` | Non-local adapter memory usage in bytes per physical GPU | gauge | `phys` |
### Per-process Metrics
| Name | Description | Type | Labels |
|----------------------------------------------|-------------------------------------------------|---------|----------------------------------------|
| `windows_gpu_engine_time_seconds` | Total running time of the GPU engine in seconds | counter | `phys`, `eng`, `engtype`, `process_id` |
| `windows_gpu_process_memory_committed_bytes` | Total committed GPU memory in bytes per process | gauge | `phys`,`process_id` |
| `windows_gpu_process_memory_dedicated_bytes` | Dedicated GPU memory usage in bytes per process | gauge | `phys`,`process_id` |
| `windows_gpu_process_memory_local_bytes` | Local GPU memory usage in bytes per process | gauge | `phys`,`process_id` |
| `windows_gpu_process_memory_non_local_bytes` | Non-local GPU memory usage in bytes per process | gauge | `phys`,`process_id` |
| `windows_gpu_process_memory_shared_bytes` | Shared GPU memory usage in bytes per process | gauge | `phys`,`process_id` |
## Metric Labels
* `phys`: Physical GPU index (e.g., "0")
* `eng`: GPU engine index (e.g., "0", "1", ...)
* `engtype`: GPU engine type (e.g., "3D", "Copy", "VideoDecode", etc.)
* `process_id`: Process ID
## Example Metric
These are basic queries to help you get started with GPU monitoring on Windows using Prometheus.
**Show total dedicated GPU memory (in bytes) usage on GPU 0:**
```promql
windows_gpu_adapter_memory_dedicated_bytes{phys="0"}
```
**Aggregate GPU utilization across all processes for a physical GPU (3D engine):**
```promql
sum by (phys) (
rate(windows_gpu_engine_time_seconds{phys="0", engtype="3D"}[1m])
) * 100
```
**Show GPU utilization for a specific process (3D engine):**
```promql
sum by (phys, process_id) (
rate(windows_gpu_engine_time_seconds{process_id="1234", engtype="3D"}[1m])
) * 100
```
**Show dedicated GPU memory per process:**
```promql
windows_gpu_adapter_memory_dedicated_bytes
```
## Useful Queries
**Show top 5 processes by GPU utilization (all engines):**
```promql
topk(5, sum by (process_id) (
rate(windows_gpu_engine_time_seconds[1m])
) * 100)
```
**Show GPU memory usage per physical GPU:**
```promql
sum by (phys) (
windows_gpu_adapter_memory_dedicated_bytes
)
```
Show GPU engine time with process owner and command line:
```promql
windows_gpu_engine_time_seconds * on(process_id) group_left(owner, cmdline) windows_process_info
```
## Alerting Examples
**prometheus.rules**
```yaml
# Alert on processes using more than 80% of a GPU's capacity over 10 minutes
- alert: HighGpuUtilization
expr: |
sum by (process_id) (
rate(windows_gpu_engine_time_seconds[1m])
) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High GPU Utilization (process {{ $labels.process_id }})"
description: "Process is using more than 80% of GPU resources\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```
## Notes
* Per-process metrics allow you to identify which processes are consuming GPU resources.
* Adapter-level metrics provide an overview of total GPU memory usage.
* For overall GPU utilization, aggregate per-process metrics in Prometheus using queries such as `sum()`.
* The collector relies on Windows performance counters; ensure your system and drivers support these counters.
## Enabling the Collector
To enable the GPU collector, add `gpu` to the list of enabled collectors in your windows_exporter configuration.
Example (command line):
```shell
windows_exporter.exe --collectors.enabled=gpu
```