cdc_dockerfile

lingyuzeng/cdc_dockerfile

Your Name d47f32d3c5 update dockerfile

2024-07-04 01:45:31 +00:00

accelerate-gpu-deepspeed.Dockerfile

update

2024-06-12 16:53:36 +08:00

binbbt.tar.gz

add deepspeed test code in bgpt

2024-06-29 08:46:36 +00:00

configure_gpu.sh

update

2024-07-02 08:00:41 +00:00

deepspeed.Dockerfile

update

2024-06-12 16:53:36 +08:00

docker-compose_nccl.yml

update

2024-07-02 06:08:08 +00:00

docker-compose_pytorch1.13.yml

update

2024-06-21 15:12:44 +08:00

docker-compose_pytorch2.3_device.yml

update

2024-07-02 06:08:08 +00:00

docker-compose_pytorch2.3.yml

Change the port to 22 and change how ssh is started

2024-06-29 02:10:48 +00:00

docker-compose_stack1.yml

update

2024-07-02 10:17:29 +00:00

docker-compose_stack2.yml

update SRIOV

2024-07-03 08:16:01 +00:00

docker-compose_stack.yml

update

2024-07-02 10:17:29 +00:00

docker-compose_swarm.yml

change networks

2024-06-29 06:24:11 +00:00

docker-compose.yml

update

2024-06-20 16:17:56 +08:00

Dockerfile

update dockerfile

2024-07-04 01:45:31 +00:00

Dockerfile.bak

Change the port to 22 and change how ssh is started

2024-06-29 02:10:48 +00:00

hostfile

update

2024-06-22 10:43:53 +00:00

id_rsa_finetune

Change the port to 22 and change how ssh is started

2024-06-29 02:10:48 +00:00

id_rsa.pub

Change the port to 22 and change how ssh is started

2024-06-29 02:10:48 +00:00

peft-gpu-bnb-multi-source.Dockerfile

update

2024-06-12 16:53:36 +08:00

README.md

update

2024-07-04 01:45:12 +00:00

requirements.txt

add pip list version

2024-06-20 16:18:14 +08:00

setup_ssh.sh

add setup ssh

2024-06-29 08:56:51 +00:00

test.txt

add pip list version

2024-06-20 16:18:14 +08:00

transformer.Dockerfile

update

2024-06-12 16:53:36 +08:00

update_sriov_vf.sh

add auto script in configure in /etc/docker/daemon.json

2024-07-03 04:30:55 +00:00

README.md

deepspeed docker image build

docker-compose -f docker-compose_pytorch1.13.yml build
docker-compose -f docker-compose_pytorch2.3.yml build

Updating the kernel on the physical host

uname -r # 5.4.0-144-generic
lsb_release -a
sudo apt-get update # Refresh the package repository lists
sudo apt-get upgrade # Upgrade the installed packages on the system
sudo apt-get dist-upgrade # Upgrade, adding or removing packages as needed
sudo reboot # May be needed: an upgrade/dist-upgrade can leave entries behind that are only fixed after a reboot
sudo apt-get install linux-headers-$(uname -r) # Install headers for the running kernel; this should work now

test command

docker run -it --gpus all --name deepspeed_test --shm-size=1gb --rm hotwa/deepspeed:latest /bin/bash

Query the GPU architecture to set the variable

git clone https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps.git
cd deepstream_tlt_apps/TRT-OSS/x86
nvcc deviceQuery.cpp -o deviceQuery
./deviceQuery

H100 output

(base) root@node19:~/bgpt/deepstream_tlt_apps/TRT-OSS/x86# ./deviceQuery
Detected 8 CUDA Capable device(s)

Device 0: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 1: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 2: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 3: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 4: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 5: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 6: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0

Device 7: "NVIDIA H100 80GB HBM3"
  CUDA Driver Version / Runtime Version          12.4 / 10.1
  CUDA Capability Major/Minor version number:    9.0
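The compute capability reported above (9.0 for the H100) is the value that gets assigned to a variable. The source does not name that variable, so the sketch below assumes it is `TORCH_CUDA_ARCH_LIST`, which PyTorch and DeepSpeed extension builds read. The helper name `extract_cuda_arch` and the sample text are illustrative only.

```shell
#!/bin/sh
# Read deviceQuery-style output on stdin, keep the unique compute
# capabilities, and join them with ";".
# extract_cuda_arch is a hypothetical helper, not part of any toolkit.
extract_cuda_arch() {
    sed -n 's|.*CUDA Capability Major/Minor version number:[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\).*|\1|p' \
        | sort -u \
        | paste -sd ';' -
}

# Sample input copied from the H100 output above (two of the eight devices).
sample_output='Device 0: "NVIDIA H100 80GB HBM3"
  CUDA Capability Major/Minor version number:    9.0
Device 1: "NVIDIA H100 80GB HBM3"
  CUDA Capability Major/Minor version number:    9.0'

# Assumption: TORCH_CUDA_ARCH_LIST is the variable the build reads.
TORCH_CUDA_ARCH_LIST=$(printf '%s\n' "$sample_output" | extract_cuda_arch)
export TORCH_CUDA_ARCH_LIST
echo "TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST"
```

On a real machine the sample text would be replaced by `./deviceQuery | extract_cuda_arch`. Duplicate entries collapse, so eight identical GPUs yield a single value.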

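DeepSpeed hostfile distribution

To distribute the hostfile manually and perform a distributed installation, follow these steps:

Prepare the hostfile. Make sure the hostfile lists every participating host and its configuration.

Example hostfile contents: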

host1 slots=4
host2 slots=4
host3 slots=8
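Before launching anything, it can help to check that every hostfile line is well formed and to see how many slots (GPUs) are declared in total. The following is a minimal sketch under the assumption that each line has the form `<host> slots=<n>` and that lines starting with `#` are comments. The helper name `hostfile_total_slots` is hypothetical.

```shell
#!/bin/sh
# Sum the slots declared in a hostfile with lines of the form "<host> slots=<n>".
# Exits non-zero if a non-comment line has no valid slots=<n> field.
# hostfile_total_slots is a hypothetical helper, not part of DeepSpeed.
hostfile_total_slots() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and comments
        {
            n = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) n = substr($i, 7) + 0
            if (n == 0) bad = 1            # missing or zero slot count
            total += n
        }
        END { print total + 0; exit (bad ? 1 : 0) }
    ' "$1"
}

# Demo using the example hostfile contents, written to a temporary file.
demo_hostfile=$(mktemp)
printf 'host1 slots=4\nhost2 slots=4\nhost3 slots=8\n' > "$demo_hostfile"
total_slots=$(hostfile_total_slots "$demo_hostfile")
echo "total slots: $total_slots"
rm -f "$demo_hostfile"
```

For the three-host example this prints `total slots: 16`.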

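Make sure SSH is configured correctly. You must be able to log in to every host over SSH without a password. You can use ssh-keygen and ssh-copy-id to set up the SSH keys.

Generate an SSH key (if you have not already done so):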

ssh-keygen -t rsa

Copy the SSH public key to each host:

ssh-copy-id user@host1
ssh-copy-id user@host2
ssh-copy-id user@host3

Create a temporary directory and copy the wheel files. Create a temporary directory on all hosts to hold the distributed wheel files.

export PDSH_RCMD_TYPE=ssh
hosts=$(awk '{print $1}' /path/to/your/hostfile | paste -sd "," -)
tmp_wheel_path="/tmp/deepspeed_wheels"

pdsh -w $hosts "mkdir -pv ${tmp_wheel_path}"
pdcp -w $hosts dist/deepspeed*.whl ${tmp_wheel_path}/
pdcp -w $hosts requirements/requirements.txt ${tmp_wheel_path}/
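The host list built above picks up an empty entry for every blank line in the hostfile, and comment lines are treated as host names. A slightly more defensive variant might look like the sketch below. It assumes comment lines start with `#`, and the helper name `hosts_from_hostfile` is hypothetical.

```shell
#!/bin/sh
# Build the comma-separated host list passed to `pdsh -w` and `pdcp -w`,
# skipping blank lines and "#" comments in the hostfile.
# hosts_from_hostfile is a hypothetical helper, not part of pdsh or DeepSpeed.
hosts_from_hostfile() {
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1" | paste -sd ',' -
}

# Demo hostfile with a comment and a blank line mixed in.
demo_hostfile=$(mktemp)
printf '# training nodes\nhost1 slots=4\n\nhost2 slots=4\nhost3 slots=8\n' > "$demo_hostfile"
hosts=$(hosts_from_hostfile "$demo_hostfile")
rm -f "$demo_hostfile"

# Stop early rather than run pdsh against an empty target list.
if [ -z "$hosts" ]; then
    echo "no hosts found in hostfile" >&2
    exit 1
fi
echo "$hosts"
```

The demo prints `host1,host2,host3`. The resulting `$hosts` value can be passed to the `pdsh -w` and `pdcp -w` commands unchanged.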

Install DeepSpeed and its dependencies on each host. Install DeepSpeed and the required dependencies on all hosts.

pdsh -w $hosts "pip install ${tmp_wheel_path}/deepspeed*.whl"
pdsh -w $hosts "pip install -r ${tmp_wheel_path}/requirements.txt"

Clean up the temporary files. Once installation is complete, delete the temporary files on all hosts.

pdsh -w $hosts "rm -rf ${tmp_wheel_path}"

Detailed steps. Make sure SSH is configured correctly:

ssh-keygen -t rsa
ssh-copy-id user@host1
ssh-copy-id user@host2
ssh-copy-id user@host3

Create the temporary directory and copy the files:

export PDSH_RCMD_TYPE=ssh
hosts=$(awk '{print $1}' /path/to/your/hostfile | paste -sd "," -)
tmp_wheel_path="/tmp/deepspeed_wheels"

pdsh -w $hosts "mkdir -pv ${tmp_wheel_path}"
pdcp -w $hosts dist/deepspeed*.whl ${tmp_wheel_path}/
pdcp -w $hosts requirements/requirements.txt ${tmp_wheel_path}/

Install DeepSpeed and its dependencies on all hosts:

pdsh -w $hosts "pip install ${tmp_wheel_path}/deepspeed*.whl"
pdsh -w $hosts "pip install -r ${tmp_wheel_path}/requirements.txt"

Clean up the temporary files:

pdsh -w $hosts "rm -rf ${tmp_wheel_path}"

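With these steps, you can manually distribute the hostfile and install DeepSpeed and its dependencies on multiple hosts. This approach keeps the environment configuration consistent across hosts, which supports distributed training or deployment.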
