A Guide to Common nvidia-smi Commands

2021-02-19 AliceWanderAI

What is nvidia-smi?

nvidia-smi is NVIDIA's system management interface (smi stands for System Management Interface). It reports information at various levels of detail, including GPU memory usage, and can also enable or disable GPU configuration options such as ECC memory.

nvidia-smi, abbreviated NVSMI, can query a wide range of the card's hardware metrics. It supports 64-bit Windows and Linux, and it ships with the NVIDIA driver, so it is available as soon as the driver is installed.

Common command examples

Entering the following command in a Windows shell or Linux terminal displays basic information for all current GPUs.

$nvidia-smi

List all available NVIDIA devices:

$nvidia-smi -L
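The -L output is one line per device, typically of the form `GPU 0: <name> (UUID: GPU-…)`. Below is a minimal Python sketch that parses such lines into tuples; the sample text is illustrative, not captured from a real machine:

```python
import re

# Illustrative sample of `nvidia-smi -L` output (format assumed from typical drivers).
SAMPLE = """\
GPU 0: Tesla K80 (UUID: GPU-d1df32f2-8a6e-4a7e-a6a9-000000000000)
GPU 1: Tesla K80 (UUID: GPU-0e37f8a1-1c2b-4d3e-9f00-000000000001)
"""

LINE_RE = re.compile(r"GPU (\d+): (.+) \(UUID: (\S+)\)")

def parse_gpu_list(text):
    """Return a list of (index, name, uuid) tuples from `nvidia-smi -L` text."""
    return [(int(m.group(1)), m.group(2), m.group(3))
            for m in map(LINE_RE.match, text.splitlines()) if m]

print(parse_gpu_list(SAMPLE))
```

This kind of parser is handy for scripts that need to enumerate GPU indices before launching per-device jobs.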

Show the system topology:
$nvidia-smi topo --matrix

Show each GPU's current clock speed, default clock speed, and maximum possible clock speed:
$nvidia-smi -q -d CLOCK

List the available clock speeds for each GPU:

$nvidia-smi -q -d SUPPORTED_CLOCKS

Show the current vGPU status:

$nvidia-smi vgpu

Continuously display how applications in virtual desktops consume GPU resources:

$nvidia-smi vgpu -p

Show detailed information for all GPUs; a specific GPU can be targeted with the -i option:
$nvidia-smi -q
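The -q report is human-readable, indented `Key : Value` text. A small Python sketch that pulls fields out of such a report; the sample lines are illustrative, based on the usual layout of the memory section:

```python
# Illustrative fragment of `nvidia-smi -q -d MEMORY` output (layout assumed).
SAMPLE = """\
    FB Memory Usage
        Total                       : 16280 MiB
        Used                        : 0 MiB
        Free                        : 16280 MiB
"""

def parse_query_fields(text):
    """Collect `Key : Value` pairs from `nvidia-smi -q`-style output."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

info = parse_query_fields(SAMPLE)
print(info["Total"])  # prints "16280 MiB"
```

Note that section headers (here "FB Memory Usage") carry no colon and are skipped, so duplicate keys from different sections would overwrite each other; for robust scripting, the --query-gpu CSV options described later are the better interface.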

***********************************************

Running nvidia-smi --help in a Linux terminal prints a more detailed reference.

$nvidia-smi --help

NVIDIA System Management Interface -- v370.28

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available. The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
- All Tesla products, starting with the Fermi architecture
- All Quadro products, starting with the Fermi architecture
- All GRID products, starting with the Kepler architecture
- GeForce Titan products, starting with the Kepler architecture
- Limited Support
- All GeForce products, starting with the Fermi architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

-h, --help Print usage information and exit.

LIST OPTIONS:

-L, --list-gpus Display a list of GPUs connected to the system.

SUMMARY OPTIONS:

<no arguments> Show a summary of GPUs connected to the system.

[plus any of]

-i, --id= Target a specific GPU.
-f, --filename= Log to a specified file, rather than to stdout.
-l, --loop= Probe until Ctrl+C at specified second interval.

QUERY OPTIONS:

-q, --query Display GPU or Unit info.

[plus any of]

-u, --unit Show unit, rather than GPU, attributes.
-i, --id= Target a specific GPU or Unit.
-f, --filename= Log to a specified file, rather than to stdout.
-x, --xml-format Produce XML output.
--dtd When showing xml output, embed DTD.
-d, --display= Display only selected information: MEMORY,
UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
PAGE_RETIREMENT, ACCOUNTING.
Flags can be combined with comma e.g. ECC,POWER.
Sampling data with max/min/avg is also returned
for POWER, UTILIZATION and CLOCK display types.
Doesn't work with -u or -x flags.
-l, --loop= Probe until Ctrl+C at specified second interval.

-lms, --loop-ms= Probe until Ctrl+C at specified millisecond interval.

SELECTIVE QUERY OPTIONS:

Allows the caller to pass an explicit list of properties to query.

[one of]

--query-gpu= Information about GPU.
Call --help-query-gpu for more info.
--query-supported-clocks= List of supported clocks.
Call --help-query-supported-clocks for more info.
--query-compute-apps= List of currently active compute processes.
Call --help-query-compute-apps for more info.
--query-accounted-apps= List of accounted compute processes.
Call --help-query-accounted-apps for more info.
--query-retired-pages= List of device memory pages that have been retired.
Call --help-query-retired-pages for more info.

[mandatory]

--format= Comma separated list of format options:
csv - comma separated values (MANDATORY)
noheader - skip the first line with column headers
nounits - don't print units for numerical
values

[plus any of]

-i, --id= Target a specific GPU or Unit.
-f, --filename= Log to a specified file, rather than to stdout.
-l, --loop= Probe until Ctrl+C at specified second interval.
-lms, --loop-ms= Probe until Ctrl+C at specified millisecond interval.

DEVICE MODIFICATION OPTIONS:

[any one of]

-pm, --persistence-mode= Set persistence mode: 0/DISABLED, 1/ENABLED
-e, --ecc-config= Toggle ECC support: 0/DISABLED, 1/ENABLED
-p, --reset-ecc-errors= Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
-c, --compute-mode= Set MODE for compute applications:
0/DEFAULT, 1/EXCLUSIVE_PROCESS,
2/PROHIBITED
--gom= Set GPU Operation Mode:
0/ALL_ON, 1/COMPUTE, 2/LOW_DP
-r --gpu-reset Trigger reset of the GPU.
Can be used to reset the GPU HW state in situations
that would otherwise require a machine reboot.
Typically useful if a double bit ECC error has
occurred.
Reset operations are not guaranteed to work in
all cases and should be used with caution.
--id= switch is mandatory for this switch
-vm --virt-mode= Switch GPU Virtualization Mode:
Sets GPU virtualization mode to 3/VGPU or 4/VSGA
Virtualization mode of a GPU can only be set when
it is running on a hypervisor.
-ac --applications-clocks= Specifies <memory,graphics> clocks as a
pair (e.g. 2000,800) that defines GPU's
speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
Toggles permission requirements for -ac and -rac commands:
0/UNRESTRICTED, 1/RESTRICTED
-pl --power-limit= Specifies maximum power management limit in watts.
-am --accounting-mode= Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-caa --clear-accounted-apps
Clears all the accounted PIDs in the buffer.
--auto-boost-default= Set the default auto boost policy to 0/DISABLED
or 1/ENABLED, enforcing the change only after the
last boost client has exited.
--auto-boost-permission=
Allow non-admin/root control over auto boost mode:
0/UNRESTRICTED, 1/RESTRICTED
[plus optional]

-i, --id= Target a specific GPU.

UNIT MODIFICATION OPTIONS:

-t, --toggle-led= Set Unit LED state: 0/GREEN, 1/AMBER

[plus optional]

-i, --id= Target a specific Unit.

SHOW DTD OPTIONS:

--dtd Print device DTD and exit.

[plus optional]

-f, --filename= Log to a specified file, rather than to stdout.
-u, --unit Show unit, rather than device, DTD.

--debug= Log encrypted debug information to a specified file.

STATISTICS: (EXPERIMENTAL)
stats Displays device statistics. "nvidia-smi stats -h" for more information.

Device Monitoring:
dmon Displays device stats in scrolling format.
"nvidia-smi dmon -h" for more information.

daemon Runs in background and monitor devices as a daemon process.
This is an experimental feature.
"nvidia-smi daemon -h" for more information.

replay Used to replay/extract the persistent stats generated by daemon.
This is an experimental feature.
"nvidia-smi replay -h" for more information.

Process Monitoring:
pmon Displays process stats in scrolling format.
"nvidia-smi pmon -h" for more information.

TOPOLOGY:
topo Displays device/system topology. "nvidia-smi topo -h" for more information.

NVLINK:
nvlink Displays device nvlink information. "nvidia-smi nvlink -h" for more information.

CLOCKS:
clocks Control and query clock information. "nvidia-smi clocks -h" for more information.

Please see the nvidia-smi(1) manual page for more detailed information.
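The selective query options documented above combine well with scripting: --query-gpu with --format=csv,noheader,nounits produces stable, machine-readable rows. A hedged Python sketch follows; the subprocess call assumes nvidia-smi is on PATH, and the parsing is demonstrated on an illustrative sample rather than live output:

```python
import subprocess

FIELDS = ["index", "name", "memory.used", "memory.total"]

def query_gpus(raw_csv=None):
    """Return one dict per GPU from `nvidia-smi --query-gpu` CSV output.

    If raw_csv is None, nvidia-smi is invoked directly (requires the
    driver to be installed and the binary to be on PATH).
    """
    if raw_csv is None:
        raw_csv = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader,nounits"],
            text=True)
    rows = []
    for line in raw_csv.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows

# Illustrative sample row (not captured from a real machine):
print(query_gpus("0, Tesla K80, 0, 11441\n"))
```

With nounits the values are plain numbers in the driver's default units (MiB for memory), which makes downstream arithmetic straightforward.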

