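Open the console and go to the Create Worker Node Group page.
Select a GPU node specification.
For more details, see the Create Worker Node Group help document.
Note: GPU instance types are currently available only as a public beta in a single availability zone of the North China (Beijing) region.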
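After the worker node group has been created and is in the Running state, you need to install the NVIDIA device plugin and the NVIDIA GPU driver on the GPU nodes.
The NVIDIA device plugin runs as a DaemonSet on the Kubernetes nodes and communicates with the kubelet over gRPC. It reports the number of GPUs on each node to the kubelet and health-checks the GPUs so that containers using GPUs in the cluster run properly. Install the NVIDIA device plugin as follows: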
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: "nvidia-device-plugin-ds"
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: jdcloud-cn-north-1.jcr.service.jdcloud.com/k8s-device-plugin:1.11  # image name of the NVIDIA device plugin
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
```shell
kubectl create -f https://jke-component-cn-north-1.s3.cn-north-1.jdcloud-oss.com/apps/gpu/spec/nvidia-device-plugin.yml
```

```shell
kubectl get daemonset -n kube-system
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-daemonset   5         5         5       5            5           <none>          34m
```
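
The readiness check on the `kubectl get daemonset` output can also be scripted. The following is a minimal sketch, assuming the default column order shown in that output (DESIRED in column 2, READY in column 4). The helper name `ds_ready` is hypothetical and is not part of kubectl.

```shell
#!/bin/sh
# ds_ready NAME: reads `kubectl get daemonset` table output on stdin and
# succeeds only if the row for NAME exists, DESIRED > 0 and READY == DESIRED.
# Assumes the default column order: NAME DESIRED CURRENT READY ...
ds_ready() {
  awk -v name="$1" '
    $1 == name { found = 1; ok = ($2 > 0 && $2 == $4) }
    END { if (found && ok) exit 0; exit 1 }
  '
}

# Usage against a live cluster (not executed here):
#   kubectl get daemonset -n kube-system | ds_ready nvidia-device-plugin-daemonset \
#     && echo "device plugin is ready on all scheduled nodes"
```

The helper only inspects text on stdin, so it can be used with any DaemonSet name in the same table.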
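You need to install the NVIDIA GPU driver on the GPU nodes. In a Kubernetes cluster, the driver installer runs on the GPU nodes as a DaemonSet.
JD Cloud provides a GPU driver image for two NVIDIA GPU models, Tesla P40 and Tesla V100. The image is built from NVIDIA-Linux-x86_64-410.104.run. To use this version of the NVIDIA GPU driver, set the image name to jdcloud-cn-north-1.jcr.service.jdcloud.com/nvidia-gpu-driver-installer:v1.
This example uses the driver image above to show how to install the NVIDIA GPU driver on JD Cloud GPU nodes of the NVIDIA Tesla P40 type.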
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
  labels:
    k8s-app: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:  # node affinity rules that constrain where the pod is scheduled
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Node selector rule that the node labels must satisfy. JD Cloud adds a default
              # label to GPU nodes, for example: jdcloud.com/jke-accelerator=nvidia-tesla-p40
              - key: jdcloud.com/jke-accelerator  # label key that JD Cloud adds to GPU nodes by default
                operator: In
                values:
                - nvidia-tesla-p40  # label value that JD Cloud adds by default; it depends on the GPU model
      tolerations:
      - operator: "Exists"  # an empty key with operator Exists matches all keys, values and effects
      hostNetwork: true
      hostPID: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      - name: root-mount
        hostPath:
          path: /
      initContainers:
      - image: jdcloud-cn-north-1.jcr.service.jdcloud.com/nvidia-gpu-driver-installer:v1  # JD Cloud GPU driver image built from NVIDIA-Linux-x86_64-410.104.run
        name: nvidia-driver-installer
        securityContext:
          privileged: true
        env:
        - name: NVIDIA_INSTALL_DIR_HOST
          value: /home/kubernetes/bin/nvidia
        - name: NVIDIA_INSTALL_DIR_CONTAINER
          value: /usr/local/nvidia
        - name: ROOT_MOUNT_DIR
          value: /root
        volumeMounts:
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
      containers:
      - image: "jdcloud-cn-north-1.jcr.service.jdcloud.com/k8s/pause-amd64:3.1"
        name: pause
```
Note: the following table maps GPU node specifications to node labels:
GPU specification | Node label key | Node label value |
---|---|---|
p.n1p40 series | jdcloud.com/jke-accelerator | nvidia-tesla-p40 |
p.n1v100 series | jdcloud.com/jke-accelerator | nvidia-tesla-v100 |
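
The mapping in the table can be expressed as a small lookup, which is handy when choosing the `values` entry of the node affinity rule or when listing the matching nodes. This is only a sketch: the helper name `accelerator_label` and the spec name `p.n1p40.large` are hypothetical and used for illustration, while `kubectl get nodes -l` is the standard label selector flag.

```shell
#!/bin/sh
# accelerator_label SPEC: prints the jdcloud.com/jke-accelerator label value
# for a GPU instance spec, following the table above.
# Returns 1 for specs that are not in the table.
accelerator_label() {
  case "$1" in
    p.n1p40*)  echo "nvidia-tesla-p40" ;;
    p.n1v100*) echo "nvidia-tesla-v100" ;;
    *)         return 1 ;;
  esac
}

# Usage against a live cluster (not executed here); "p.n1p40.large" is a
# hypothetical spec name:
#   kubectl get nodes -l "jdcloud.com/jke-accelerator=$(accelerator_label p.n1p40.large)"
```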
```shell
kubectl create -f https://jke-component-cn-north-1.s3.cn-north-1.jdcloud-oss.com/apps/gpu/spec/nvidia-driver-installer-daemonset.yaml
```

```shell
kubectl get daemonset -n kube-system
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-driver-installer   5         5         5       5            5           <none>          13s
```
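
Once both DaemonSets are ready, the device plugin should advertise each node's GPUs to the kubelet as the `nvidia.com/gpu` resource. A minimal sketch for checking this is shown below. It relies on kubectl's `custom-columns` output, in which nodes without the resource print `<none>`. The helper names `list_gpus` and `total_gpus` are hypothetical.

```shell
#!/bin/sh
# list_gpus: prints each node name with its allocatable nvidia.com/gpu count.
# Requires access to the cluster; the dot in "nvidia.com" must be escaped.
list_gpus() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# total_gpus: sums the GPU column of list_gpus output read from stdin,
# skipping the header row and nodes that report "<none>".
total_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Usage against a live cluster (not executed here):
#   list_gpus | total_gpus
```

A total of 0 after the DaemonSets are ready suggests that the driver or the device plugin is not yet working on the GPU nodes.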
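```yaml
# Test Pod that requests a GPU and prints nvidia-smi output in a loop
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1  # must be at least 1 so that the device plugin allocates a GPU to the container
    command:
    - /bin/sh
    - -c
    - "while true; do nvidia-smi; sleep 1; done"
```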
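After a short wait, run kubectl logs gpu-pod and compare the output with the figure below.
You can also build a custom image based on your NVIDIA GPU model and the latest driver version officially provided by NVIDIA, and use it to install the NVIDIA driver. For details, see Deploying a Custom Version of the NVIDIA GPU Driver.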