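## Environment setup before launch

Set up a uv virtual environment and a Hugging Face mirror reachable from mainland China. `HF_HUB_OFFLINE=1` restricts `huggingface_hub` to files already in the local cache, so leave it unset on a first run that still needs to download models through the mirror.

```bash
# Create and activate a Python 3.12 virtual environment with uv
UV_PYTHON=python3.12 uv venv .venv
source .venv/bin/activate

# Install dependencies from the Tsinghua (TUNA) PyPI mirror
uv pip install embedding-atlas ipykernel anywidget notebook rdkit pandas selfies==2.1.1 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

# Export the Hugging Face settings first so the app inherits them
export HF_HUB_OFFLINE=1
export HF_ENDPOINT=https://hf-mirror.com
streamlit run app.py
```
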
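Because `ipykernel` and `notebook` are installed above, the same settings may be needed inside a notebook whose kernel was not started from this shell. A minimal sketch, assuming the variables are set before `huggingface_hub` (or any library that imports it) is imported; the helper name `configure_hf_mirror` is hypothetical and not part of this repository:

```python
import os


def configure_hf_mirror(endpoint="https://hf-mirror.com", offline=False):
    """Set the Hugging Face environment variables for the current process.

    Hypothetical helper: call it before importing huggingface_hub, since the
    library reads these variables when it is imported.
    """
    os.environ["HF_ENDPOINT"] = endpoint
    if offline:
        # Use only files that are already in the local cache.
        os.environ["HF_HUB_OFFLINE"] = "1"
    else:
        # Allow downloads through the mirror.
        os.environ.pop("HF_HUB_OFFLINE", None)
    return {key: os.environ.get(key) for key in ("HF_ENDPOINT", "HF_HUB_OFFLINE")}


# First run: downloads allowed. Switch to offline=True once the cache is warm.
settings = configure_hf_mirror(offline=False)
```
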
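## Generate an interactive embedding visualization from the command line

```bash
# Launch the interactive viewer, using the `smiles` column as the text column
embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
# Export the visualization as an application archive
embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --export-application data/my_visualization.zip
```
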
`embedding-atlas` command-line usage:

```bash
# Local file
embedding-atlas dataset.parquet
# Hugging Face dataset
embedding-atlas huggingface_org/dataset_name
# Specify the text column
embedding-atlas dataset.parquet --text text_column
# Precomputed coordinates
embedding-atlas dataset.parquet --x projection_x --y projection_y
```

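Before launching, it can help to confirm that the column passed to `--text` (or to `--x` / `--y`) actually exists in the file. A minimal sketch for CSV input only (Parquet would need a different reader); `missing_columns` is a hypothetical helper, not part of `embedding-atlas`:

```python
import csv


def missing_columns(csv_path, required):
    """Return the names in `required` that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

For example, `missing_columns("data/drugbank_pre_filtered_mordred_qed_id_selfies.csv", ["smiles"])` returning an empty list would indicate that `--text smiles` refers to an existing column.
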
## Split the dataset for the first round of MolGen fine-tuning

```bash
python script/split_drugbank.py \
  --in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
  --out-dir splits_v2 --seed 20250922 \
  --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
  --n_qed_bins 5 --n_mw_bins 5 --largest-first
```

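The flags above suggest that the split is stratified over QED and molecular-weight bins. The script itself is not shown here, so the following is only a sketch of one way such a split could work, not the implementation in `script/split_drugbank.py`: the column names `qed` and `mw`, the equal-frequency binning, and the per-cell rounding are all assumptions, and `--largest-first` is not modeled.

```python
import random
from collections import defaultdict


def quantile_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def stratified_split(rows, seed=20250922, ratios=(0.8, 0.1, 0.1), n_bins=5):
    """Split rows into train/val/test, sharing out each (QED bin, MW bin) cell by ratio.

    `rows` is a list of dicts with hypothetical numeric keys "qed" and "mw".
    """
    qed_bins = quantile_bins([row["qed"] for row in rows], n_bins)
    mw_bins = quantile_bins([row["mw"] for row in rows], n_bins)

    # Group rows by their joint (QED bin, MW bin) cell.
    cells = defaultdict(list)
    for row, qed_bin, mw_bin in zip(rows, qed_bins, mw_bins):
        cells[(qed_bin, mw_bin)].append(row)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(cells):  # fixed iteration order keeps the result reproducible
        members = cells[key]
        rng.shuffle(members)
        n_val = round(len(members) * ratios[1])
        n_test = round(len(members) * ratios[2])
        splits["val"] += members[:n_val]
        splits["test"] += members[n_val:n_val + n_test]
        splits["train"] += members[n_val + n_test:]  # the remainder goes to train
    return splits
```

Each row lands in exactly one split, and because every cell contributes to all three splits in roughly the requested proportions, the joint QED/MW distribution of the held-out sets tends to follow that of the training set.
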
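Outputs: `split_train.csv`, `split_val.csv`, and `split_test.csv`.

Molecules in `split_val.csv` and `split_test.csv` never appear in the training set, and their overall QED/MW distributions stay close to those of the training set. This makes them suitable for the later step of running conditional generation from unseen reference molecules and observing how well their neighborhoods are covered.
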
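The no-overlap property can be checked directly on the output files. A minimal sketch, assuming molecules are identified by the exact string in a `smiles` column; two different SMILES strings can encode the same molecule, so a stricter check would canonicalize them first (for example with RDKit), which is not done here. `find_leaks` is a hypothetical helper:

```python
import csv


def read_column(csv_path, column):
    """Return the set of values found in `column` of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def find_leaks(train_csv, held_out_csvs, column="smiles"):
    """Map each held-out file to the `column` values it shares with the training file."""
    train_values = read_column(train_csv, column)
    return {path: read_column(path, column) & train_values for path in held_out_csvs}
```

Assuming the files are written to the `--out-dir` given above, `find_leaks("splits_v2/split_train.csv", ["splits_v2/split_val.csv", "splits_v2/split_test.csv"])` should return an empty set for each file.
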
## Merge the splits for visualization

Merge the datasets:

```bash
python3 ./script/merge_splits.py --input-dir splits_v2/ --output data/drugbank_split_merge.csv
```

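A merged file is most useful for visualization if each row records which split it came from, so that points can be colored by split. Whether `script/merge_splits.py` does this is not stated here; the following is a hedged sketch of that idea, assuming the three files share the same header and using a hypothetical `split` column name:

```python
import csv
from pathlib import Path


def merge_splits(input_dir, output_csv, names=("train", "val", "test")):
    """Concatenate split_<name>.csv files, tagging each row with its split name.

    Returns the number of rows written. Assumes every file is non-empty and
    has the same header, and that no existing column is called "split".
    """
    fieldnames, rows = None, []
    for name in names:
        path = Path(input_dir) / f"split_{name}.csv"
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = list(reader.fieldnames) + ["split"]
            for row in reader:
                row["split"] = name
                rows.append(row)

    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```
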
Visualize the merged dataset:

```bash
embedding-atlas data/drugbank_split_merge.csv --text smiles
```