nvidia-device-plugin daemon fails to start
Feb 21, 2020 10:00 · 1077 words · 3 minute read
Symptom
After running kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml, the nvidia-device-plugin container fails on startup with the following error:
Error: failed to start container "nvidia-device-plugin-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown
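This message comes from the container's status and can also be pulled from the Pod's events with kubectl describe (assuming the manifest's default label name=nvidia-device-plugin-ds in the kube-system namespace):
$ kubectl -n kube-system describe pod -l name=nvidia-device-plugin-ds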
Troubleshooting
Replacing the Pod image in the DaemonSet from https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml with busybox and running sh -c sleep 100000 instead lets the container start normally:
spec:
  tolerations:
  # This toleration is deprecated. Kept here for backward compatibility
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  # Mark this pod as a critical add-on; when enabled, the critical add-on
  # scheduler reserves resources for critical add-on pods so that they can
  # be rescheduled after a failure.
  # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
  priorityClassName: "system-node-critical"
  containers:
  - image: busybox:latest
    name: nvidia-device-plugin-ctr
    command:
    - sh
    - -c
    - sleep 100000
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: device-plugin
      mountPath: /var/lib/kubelet/device-plugins
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
Looking at the Dockerfile of the nvidia/k8s-device-plugin:1.0.0-beta4 image, the process the container starts is nvidia-device-plugin.
FROM centos:7 as build
RUN yum install -y \
        gcc-c++ \
        ca-certificates \
        wget && \
    rm -rf /var/cache/yum/*
ENV GOLANG_VERSION 1.10.3
RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
    | tar -C /usr/local -xz
ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH
WORKDIR /go/src/nvidia-device-plugin
COPY . .
RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && \
    go install -ldflags="-s -w" -v nvidia-device-plugin
FROM centos:7
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=utility
COPY --from=build /go/bin/nvidia-device-plugin /usr/bin/nvidia-device-plugin
CMD ["nvidia-device-plugin"]
Note that this image sets NVIDIA_VISIBLE_DEVICES=all, so when it starts under the nvidia runtime the prestart hook invokes nvidia-container-cli to wire up GPU access; busybox sets no such variable, the hook does nothing, and the container starts fine. First, confirm that the container runtime has been switched to the nvidia runtime:
$ docker info | grep runc
Runtimes: nvidia runc
runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
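For reference, the nvidia runtime is normally registered through /etc/docker/daemon.json; a typical nvidia-docker2 setup looks roughly like the following (the binary path may differ on your host):
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}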
Next, confirm that the NVIDIA driver is installed and working:
$ nvidia-smi
Thu Feb 20 17:37:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   22C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The driver version is 440.33.01.
Manually start an nvidia-smi container to inspect the GPU:
$ nvidia-docker run --rm -it nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
This is exactly the same error the nvidia-device-plugin daemon logged.
When this happens, we can run nvidia-container-cli -k -d /dev/tty info to see what exactly is failing:
$ nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0220 06:48:43.030305 26747 nvc.c:281] initializing library context (version=1.0.7, build=b71f87c04b8eca8a16bf60995506c35c937347d9)
I0220 06:48:43.030587 26747 nvc.c:255] using root /
I0220 06:48:43.030612 26747 nvc.c:256] using ldcache /etc/ld.so.cache
I0220 06:48:43.030631 26747 nvc.c:257] using unprivileged user 65534:65534
I0220 06:48:43.036513 26748 nvc.c:191] loading kernel module nvidia
I0220 06:48:43.037482 26748 nvc.c:203] loading kernel module nvidia_uvm
E0220 06:48:43.048209 26748 nvc.c:205] could not load kernel module nvidia_uvm
I0220 06:48:43.048247 26748 nvc.c:211] loading kernel module nvidia_modeset
E0220 06:48:43.054149 26748 nvc.c:213] could not load kernel module nvidia_modeset
I0220 06:48:43.054703 26751 driver.c:133] starting driver service
E0220 06:48:43.055263 26751 driver.c:197] could not start driver service: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory
I0220 06:48:43.055432 26747 driver.c:233] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request
The libcuda.so dynamic library is missing.
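To double-check this on the host, look the library up in the dynamic linker cache; empty output from both commands below confirms that no CUDA driver library is installed:
$ ldconfig -p | grep libcuda
$ find /usr -name 'libcuda.so*' 2>/dev/null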
Solution
Install the CUDA driver:
$ wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
$ rpm -i cuda-repo-rhel7-10-2-local-10.2.89-440.33.01-1.0-1.x86_64.rpm
$ yum -y install nvidia-driver-latest-dkms cuda
$ yum -y install cuda-drivers
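Before retrying the container, it is worth re-running the earlier diagnostics; with the driver packages in place, libcuda.so.1 should now resolve and the libcuda error should be gone:
$ ldconfig -p | grep libcuda.so.1
$ nvidia-container-cli -k -d /dev/tty info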
After installation, start the nvidia-smi container again to inspect the GPU:
$ nvidia-docker run --rm -it nvidia/cuda:9.0-base nvidia-smi
Thu Feb 20 09:53:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:06.0 Off |                    0 |
| N/A   22C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
At this point the nvidia-device-plugin DaemonSet starts successfully:
$ kubectl get ds -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-daemonset   1         1         1       1            1           <none>          5h49m
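To be sure the plugin actually registered with the kubelet, its logs can be checked as well (again assuming the manifest's default label name=nvidia-device-plugin-ds); a healthy plugin reports that it is serving on its socket under /var/lib/kubelet/device-plugins and has registered with the kubelet:
$ kubectl -n kube-system logs -l name=nvidia-device-plugin-ds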
Check the GPU resources on the Kubernetes node:
$ kubectl describe node k8s_node
Capacity:
  cpu:                4
  ephemeral-storage:  16765932Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4044788Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  15451482906
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3942388Ki
  nvidia.com/gpu:     1
  pods:               110
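As a final end-to-end check, a Pod that requests one GPU should get scheduled onto this node and be able to run nvidia-smi; a minimal test manifest (the Pod name gpu-test is just a placeholder, and any CUDA base image will do) might look like this:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1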