## DeepSpeed Docker image build

```shell
docker-compose -f docker-compose_pytorch1.13.yml build
docker-compose -f docker-compose_pytorch2.3.yml build
```

## Update the kernel on the physical host

```shell
uname -r                   # Show the running kernel version
sudo apt-get update        # Refresh the repository package lists
sudo apt-get upgrade       # Upgrade the installed packages
sudo apt-get dist-upgrade  # Upgrade, adding or removing packages as needed
sudo reboot                # May be needed: entries left over from an upgrade/dist-upgrade are sometimes only fixed by a reboot
sudo apt-get install linux-headers-$(uname -r)  # Should work now: installs headers for the running kernel
```

## Test command

```shell
docker run -it --gpus all --name deepspeed_test --shm-size=1gb --rm hotwa/deepspeed:latest /bin/bash
```

## [Query the GPU architecture and assign it to a variable](https://blog.csdn.net/zong596568821xp/article/details/106411024)

```shell
git clone https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps.git
cd deepstream_tlt_apps/TRT-OSS/x86
nvcc deviceQuery.cpp -o deviceQuery
./deviceQuery
```

H100 output:

```shell
(base) root@node19:~/bgpt/deepstream_tlt_apps/TRT-OSS/x86# ./deviceQuery
Detected 8 CUDA Capable device(s)

Device 0: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 1: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 2: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 3: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 4: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 5: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 6: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

Device 7: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0

```

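The compute capability reported by `deviceQuery` can be captured in a shell variable rather than copied by hand. The sketch below is one possible way to do that. The helper name `extract_cuda_arch` is hypothetical, and storing the result in `TORCH_CUDA_ARCH_LIST` is an assumption, since this document does not say which variable the value is meant for. The sample text is taken from the H100 output.

```shell
# Hypothetical helper: read deviceQuery output on stdin and print the
# unique compute capabilities, joined with ";".
extract_cuda_arch() {
  awk -F': *' '/CUDA Capability Major\/Minor version number/ { print $2 }' \
    | tr -d ' \r' | sort -u | paste -sd ';' -
}

# Sample lines from the H100 output, used here so the sketch runs without a GPU.
sample_output='Device 0: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0
Device 1: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version 12.4 / 10.1
CUDA Capability Major/Minor version number: 9.0'

# Assumption: the value is wanted in TORCH_CUDA_ARCH_LIST for building CUDA extensions.
TORCH_CUDA_ARCH_LIST=$(printf '%s\n' "$sample_output" | extract_cuda_arch)
export TORCH_CUDA_ARCH_LIST
echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"
```

On a real node, replace the sample text with the live output: `TORCH_CUDA_ARCH_LIST=$(./deviceQuery | extract_cuda_arch)`.
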
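## DeepSpeed hostfile distribution

To distribute DeepSpeed manually to the hosts listed in a hostfile and install it on each of them, follow these steps:

1. Prepare the hostfile
Make sure the hostfile lists every participating host and its configuration (`slots` is the number of GPUs available on that host).

Example hostfile contents:

```plaintext
host1 slots=4
host2 slots=4
host3 slots=8
```
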
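As a quick sanity check on a hostfile, the host list and the total slot count can be derived from it with `awk`. This is a minimal sketch that assumes the `hostname slots=N` layout of the example. It writes the example to a temporary file so it runs on its own, and `total_slots` is a name introduced here for illustration only.

```shell
# Write the example hostfile to a temporary file so the sketch is self-contained.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
host1 slots=4
host2 slots=4
host3 slots=8
EOF

# Comma-separated host list (first column), skipping blank lines.
hosts=$(awk 'NF { print $1 }' "$hostfile" | paste -sd ',' -)

# Sum the N of every "slots=N" entry (second column).
total_slots=$(awk 'NF { split($2, kv, "="); total += kv[2] } END { print total + 0 }' "$hostfile")

echo "hosts=${hosts}"
echo "total_slots=${total_slots}"

rm -f "$hostfile"
```

The comma-separated form is the one `pdsh -w` accepts, so the same `hosts` value can be reused for the distribution commands.
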
2. Make sure SSH is configured correctly
Make sure you can log in to every host over SSH without a password. You can set up SSH keys with `ssh-keygen` and `ssh-copy-id`.

Generate an SSH key (if you have not already generated one):

```shell
ssh-keygen -t rsa
```

Copy the SSH public key to each host:

```shell
ssh-copy-id user@host1
ssh-copy-id user@host2
ssh-copy-id user@host3
```

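Before running `pdsh`, it can help to confirm that every host really accepts key-based login. The function below is a sketch of one way to check this. `check_passwordless_ssh` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice. It relies on the OpenSSH option `BatchMode=yes`, which makes `ssh` fail instead of prompting for a password.

```shell
# Hypothetical helper: print OK/FAILED per host and return non-zero if any
# host cannot be reached without a password prompt.
check_passwordless_ssh() {
  local failed=0 host
  for host in "$@"; do
    # BatchMode=yes disables password prompts; "true" is a no-op remote command.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "OK      $host"
    else
      echo "FAILED  $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example call, using the host names from the ssh-copy-id commands:
# check_passwordless_ssh user@host1 user@host2 user@host3
```

Any host reported as `FAILED` needs `ssh-copy-id` run again (or its network access checked) before continuing.
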
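3. Create a temporary directory and copy the wheel file
Create a temporary directory on every host to hold the distributed wheel file.

```shell
# Make pdsh/pdcp use SSH as the remote command transport
export PDSH_RCMD_TYPE=ssh
# Build a comma-separated host list from the first column of the hostfile
hosts=$(awk '{print $1}' /path/to/your/hostfile | paste -sd "," -)
tmp_wheel_path="/tmp/deepspeed_wheels"

# Create the directory on every host, then copy the wheel and requirements into it
pdsh -w "$hosts" "mkdir -pv ${tmp_wheel_path}"
pdcp -w "$hosts" dist/deepspeed*.whl "${tmp_wheel_path}/"
pdcp -w "$hosts" requirements/requirements.txt "${tmp_wheel_path}/"
```
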
4. Install DeepSpeed and its dependencies on each host
Install DeepSpeed and the required dependencies on all hosts.

```shell
pdsh -w "$hosts" "pip install ${tmp_wheel_path}/deepspeed*.whl"
pdsh -w "$hosts" "pip install -r ${tmp_wheel_path}/requirements.txt"
```

5. Clean up the temporary files
Once the installation has finished, delete the temporary files on all hosts.

```shell
pdsh -w "$hosts" "rm -rf ${tmp_wheel_path}"
```

Detailed steps

Make sure SSH is configured correctly:

```shell
ssh-keygen -t rsa
ssh-copy-id user@host1
ssh-copy-id user@host2
ssh-copy-id user@host3
```

Create the temporary directory and copy the files:

```shell
export PDSH_RCMD_TYPE=ssh
hosts=$(awk '{print $1}' /path/to/your/hostfile | paste -sd "," -)
tmp_wheel_path="/tmp/deepspeed_wheels"

pdsh -w "$hosts" "mkdir -pv ${tmp_wheel_path}"
pdcp -w "$hosts" dist/deepspeed*.whl "${tmp_wheel_path}/"
pdcp -w "$hosts" requirements/requirements.txt "${tmp_wheel_path}/"
```

Install DeepSpeed and its dependencies on all hosts:

```shell
pdsh -w "$hosts" "pip install ${tmp_wheel_path}/deepspeed*.whl"
pdsh -w "$hosts" "pip install -r ${tmp_wheel_path}/requirements.txt"
```

Clean up the temporary files:

```shell
pdsh -w "$hosts" "rm -rf ${tmp_wheel_path}"
```

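With these steps, you can manually distribute DeepSpeed to the hosts in the hostfile and install it, together with its dependencies, on each of them. This keeps the environment consistent across hosts, which distributed training or deployment depends on.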