mirror of
https://github.com/prometheus-community/windows_exporter.git
synced 2026-02-07 21:46:37 +00:00
docs/collector.gpu.md
# gpu collector

The gpu collector exposes metrics about GPU usage and memory consumption, both at the adapter (physical GPU) and
per-process level.

|                     |                                      |
|---------------------|--------------------------------------|
| Metric name prefix  | `gpu`                                |
| Data source         | Perflib                              |
| Counters            | GPU Engine, GPU Adapter, GPU Process |
| Enabled by default? | No                                   |

## Flags

None

## Metrics

These metrics are available on supported versions of Windows with compatible GPUs and drivers:

### Adapter-level Metrics

| Name                                         | Description                                              | Type  | Labels |
|----------------------------------------------|----------------------------------------------------------|-------|--------|
| `windows_gpu_adapter_memory_committed_bytes` | Total committed GPU memory in bytes per physical GPU     | gauge | `phys` |
| `windows_gpu_adapter_memory_dedicated_bytes` | Dedicated GPU memory usage in bytes per physical GPU     | gauge | `phys` |
| `windows_gpu_adapter_memory_shared_bytes`    | Shared GPU memory usage in bytes per physical GPU        | gauge | `phys` |
| `windows_gpu_local_adapter_memory_bytes`     | Local adapter memory usage in bytes per physical GPU     | gauge | `phys` |
| `windows_gpu_non_local_adapter_memory_bytes` | Non-local adapter memory usage in bytes per physical GPU | gauge | `phys` |

### Per-process Metrics

| Name                                         | Description                                     | Type    | Labels                                 |
|----------------------------------------------|-------------------------------------------------|---------|----------------------------------------|
| `windows_gpu_engine_time_seconds`            | Total running time of the GPU engine in seconds | counter | `phys`, `eng`, `engtype`, `process_id` |
| `windows_gpu_process_memory_committed_bytes` | Total committed GPU memory in bytes per process | gauge   | `phys`, `process_id`                   |
| `windows_gpu_process_memory_dedicated_bytes` | Dedicated GPU memory usage in bytes per process | gauge   | `phys`, `process_id`                   |
| `windows_gpu_process_memory_local_bytes`     | Local GPU memory usage in bytes per process     | gauge   | `phys`, `process_id`                   |
| `windows_gpu_process_memory_non_local_bytes` | Non-local GPU memory usage in bytes per process | gauge   | `phys`, `process_id`                   |
| `windows_gpu_process_memory_shared_bytes`    | Shared GPU memory usage in bytes per process    | gauge   | `phys`, `process_id`                   |

## Metric Labels

* `phys`: Physical GPU index (e.g., "0")
* `eng`: GPU engine index (e.g., "0", "1", ...)
* `engtype`: GPU engine type (e.g., "3D", "Copy", "VideoDecode", etc.)
* `process_id`: Process ID

## Example Metric

These are basic queries to help you get started with GPU monitoring on Windows using Prometheus.

**Show total dedicated GPU memory usage (in bytes) on GPU 0:**

```promql
windows_gpu_adapter_memory_dedicated_bytes{phys="0"}
```

**Aggregate GPU utilization across all processes for a physical GPU (3D engine):**

```promql
sum by (phys) (
  rate(windows_gpu_engine_time_seconds{phys="0", engtype="3D"}[1m])
) * 100
```

**Show GPU utilization for a specific process (3D engine):**

```promql
sum by (phys, process_id) (
  rate(windows_gpu_engine_time_seconds{process_id="1234", engtype="3D"}[1m])
) * 100
```

**Show dedicated GPU memory per process:**

```promql
windows_gpu_process_memory_dedicated_bytes
```

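The utilization queries above multiply a per-second rate of `windows_gpu_engine_time_seconds` by 100. As a rough illustration of that arithmetic, the following Python sketch computes the same percentage from two counter samples. The function name and the sample values are hypothetical, and the real `rate()` also handles counter resets and window extrapolation, which this sketch does not.

```python
def utilization_percent(t0: float, v0: float, t1: float, v1: float) -> float:
    """Approximate rate(counter[window]) * 100 from two counter samples.

    t0/t1 are sample timestamps in seconds; v0/v1 are the cumulative
    engine running times in seconds reported at those timestamps.
    """
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # A counter reset (e.g. an exporter restart). The real rate()
        # compensates for this; the sketch simply refuses to guess.
        raise ValueError("counter went backwards")
    busy_seconds = v1 - v0
    elapsed_seconds = t1 - t0
    return busy_seconds / elapsed_seconds * 100.0


# Hypothetical samples: the engine accumulated 15 s of running time in 60 s.
print(utilization_percent(0.0, 120.0, 60.0, 135.0))  # 25.0
```

A value of 25 here means the engine was busy for a quarter of the sampled interval.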
## Useful Queries

**Show top 5 processes by GPU utilization (all engines):**

```promql
topk(5, sum by (process_id) (
  rate(windows_gpu_engine_time_seconds[1m])
) * 100)
```

**Show GPU memory usage per physical GPU:**

```promql
sum by (phys) (
  windows_gpu_adapter_memory_dedicated_bytes
)
```

**Show GPU engine time with process owner and command line:**

```promql
windows_gpu_engine_time_seconds * on(process_id) group_left(owner, cmdline) windows_process_info
```

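The last query above joins two metrics on `process_id` and copies `owner` and `cmdline` from the info series onto each engine series. The Python sketch below imitates that matching on in-memory samples to show what the join produces. The data, type alias, and function name are hypothetical, and it assumes at most one info series per `process_id` (PromQL would report an error otherwise).

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def join_process_info(engine: List[Sample], info: List[Sample]) -> List[Sample]:
    """Mimic `engine * on(process_id) group_left(owner, cmdline) info`."""
    # Index the "one" side of the join by the matching label.
    by_pid = {labels["process_id"]: (labels, value) for labels, value in info}
    joined = []
    for labels, value in engine:
        match = by_pid.get(labels["process_id"])
        if match is None:
            continue  # no matching info series: the sample is dropped
        info_labels, info_value = match
        merged = dict(labels)
        merged["owner"] = info_labels["owner"]
        merged["cmdline"] = info_labels["cmdline"]
        # Info metrics carry the value 1, so the product keeps the engine time.
        joined.append((merged, value * info_value))
    return joined


# Hypothetical samples for illustration only.
engine_samples = [
    ({"phys": "0", "eng": "0", "engtype": "3D", "process_id": "1234"}, 42.0),
    ({"phys": "0", "eng": "1", "engtype": "Copy", "process_id": "9999"}, 7.0),
]
info_samples = [
    ({"process_id": "1234", "owner": "EXAMPLE\\alice", "cmdline": "render.exe"}, 1.0),
]
for labels, value in join_process_info(engine_samples, info_samples):
    print(labels["process_id"], labels["owner"], labels["cmdline"], value)
```

Process 9999 has no matching info series in this example, so it does not appear in the result, which is also how the PromQL join behaves.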
## Alerting Examples

**prometheus.rules**

```yaml
# Alert on processes using more than 80% of a GPU's capacity over 10 minutes
- alert: HighGpuUtilization
  expr: |
    sum by (process_id) (
      rate(windows_gpu_engine_time_seconds[1m])
    ) * 100 > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High GPU Utilization (process {{ $labels.process_id }})"
    description: "Process is using more than 80% of GPU resources\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```

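The `for: 10m` clause in the rule above means the expression must stay above the threshold across consecutive evaluations for 10 minutes before the alert fires. The following Python sketch models that pending-then-firing behaviour in a simplified form. The function name, evaluation interval, and sample values are hypothetical, and it is not how Prometheus is implemented internally.

```python
from typing import Iterable, Optional, Tuple


def firing_since(
    evaluations: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the evaluation timestamp at which the alert starts firing.

    `evaluations` yields (timestamp_seconds, value) pairs in time order.
    Returns None if the condition never holds for `hold_seconds`.
    """
    pending_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # alert becomes pending
            if timestamp - pending_since >= hold_seconds:
                return timestamp  # held long enough: alert fires
        else:
            pending_since = None  # condition cleared: start over
    return None


# Hypothetical utilization values evaluated every 5 minutes.
steady = [(0.0, 85.0), (300.0, 90.0), (600.0, 88.0)]
dipping = [(0.0, 85.0), (300.0, 70.0), (600.0, 90.0), (900.0, 95.0)]
print(firing_since(steady, threshold=80.0, hold_seconds=600.0))
print(firing_since(dipping, threshold=80.0, hold_seconds=600.0))
```

In the second series the dip to 70 resets the pending state, so a short spike in GPU usage does not trigger the alert.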
## Notes

* Per-process metrics allow you to identify which processes are consuming GPU resources.
* Adapter-level metrics provide an overview of total GPU memory usage.
* For overall GPU utilization, aggregate the per-process metrics in Prometheus with an aggregation operator such as `sum()`.
* The collector relies on Windows performance counters; ensure your system and drivers support these counters.

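As an illustration of the aggregation mentioned in the notes above, this Python sketch adds up per-process engine rates into one utilization figure per physical GPU, roughly what `sum by (phys) (...) * 100` does for a single engine type. The function name and input values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_phys(
    rates: Iterable[Tuple[Dict[str, str], float]], engtype: str
) -> Dict[str, float]:
    """Sum per-process rates (busy seconds per second) into percent per GPU."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, per_second in rates:
        if labels["engtype"] == engtype:  # label filter, like {engtype="3D"}
            totals[labels["phys"]] += per_second * 100.0
    return dict(totals)


# Hypothetical per-process rates for two GPUs.
rates = [
    ({"phys": "0", "engtype": "3D", "process_id": "1234"}, 0.25),
    ({"phys": "0", "engtype": "3D", "process_id": "5678"}, 0.5),
    ({"phys": "0", "engtype": "Copy", "process_id": "1234"}, 0.125),
    ({"phys": "1", "engtype": "3D", "process_id": "1234"}, 0.125),
]
print(sum_by_phys(rates, "3D"))
```

Only series with the requested engine type contribute, so the Copy engine sample is ignored in this call.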
## Enabling the Collector

To enable the GPU collector, add `gpu` to the list of enabled collectors in your windows_exporter configuration.

Example (command line):

```shell
windows_exporter.exe --collectors.enabled=gpu
```

Passing only `gpu` replaces the default collector set; to keep the default collectors as well, use `--collectors.enabled "[defaults],gpu"`.
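Once the collector is enabled, one way to confirm that it is producing data is to scan the exporter's metrics output for names with the `windows_gpu_` prefix. The Python sketch below does that. The helper names and the sample text are hypothetical, and the URL assumes the exporter is running locally on its default port 9182.

```python
from typing import Set
from urllib.request import urlopen


def gpu_metric_names(exposition_text: str) -> Set[str]:
    """Collect the distinct windows_gpu_* metric names from exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("windows_gpu_"):
            names.add(name)
    return names


def fetch_metrics(url: str = "http://localhost:9182/metrics") -> str:
    """Download the exposition text; the URL is an assumption about your setup."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Hypothetical exposition snippet, used here instead of a live request.
sample = """\
# TYPE windows_gpu_adapter_memory_dedicated_bytes gauge
windows_gpu_adapter_memory_dedicated_bytes{phys="0"} 1048576
windows_cpu_time_total{core="0,0",mode="idle"} 12.5
"""
print(sorted(gpu_metric_names(sample)))
```

Against a live exporter, `gpu_metric_names(fetch_metrics())` returning an empty set would suggest that the collector is not enabled or that the system does not expose the GPU performance counters.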