embedding_atlas/README.md
2025-09-22 20:06:39 +08:00

## Environment setup before launch
Set up a uv virtual environment and a Hugging Face mirror that is reachable from mainland China.
```bash
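# Create and activate a Python 3.12 virtual environment with uv
UV_PYTHON=python3.12 uv venv .venv
source .venv/bin/activate
# Install dependencies from the TUNA PyPI mirror
uv pip install embedding-atlas ipykernel anywidget notebook rdkit pandas selfies==2.1.1 -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
# Export the Hugging Face settings before launching so the app inherits them
export HF_ENDPOINT=https://hf-mirror.com
# Offline mode uses only the local cache; leave it unset for the first download
export HF_HUB_OFFLINE=1
streamlit run app.py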
```
## Generate an interactive embedding visualization from the command line
```bash
embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --text smiles
embedding-atlas data/drugbank_pre_filtered_mordred_qed_id_selfies.csv --export-application data/my_visualization.zip
```
`embedding-atlas` command-line usage:
```bash
# Local file
embedding-atlas dataset.parquet
# Hugging Face dataset
embedding-atlas huggingface_org/dataset_name
# Specify the text column
embedding-atlas dataset.parquet --text text_column
# Use precomputed coordinates
embedding-atlas dataset.parquet --x projection_x --y projection_y
```
## Split the dataset for the first round of MolGen fine-tuning
```bash
python script/split_drugbank.py \
--in-csv data/drugbank_pre_filtered_mordred_qed_id_selfies.csv \
--out-dir splits_v2 --seed 20250922 \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 \
--n_qed_bins 5 --n_mw_bins 5 --largest-first
```
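Outputs: `split_train.csv`, `split_val.csv`, `split_test.csv`.
Molecules in `split_val` and `split_test` never appear in the training set, and their overall QED/MW distributions stay close to those of the training set. This supports the later experiment of running conditional generation from unseen reference molecules and observing neighborhood coverage.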
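
The no-leakage property can be spot-checked after splitting. The following is a minimal sketch, not part of `script/split_drugbank.py`: the helper names `load_smiles` and `find_leaks` are hypothetical, and it assumes each split file has a `smiles` column (the column passed to `--text` elsewhere in this README).

```python
import csv
from pathlib import Path


def load_smiles(csv_path, column="smiles"):
    """Return the set of values found in `column` of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def find_leaks(split_dir, column="smiles"):
    """Map each held-out split to the molecules it shares with the training split.

    Assumes the files are named split_train.csv, split_val.csv and
    split_test.csv, as listed under "Outputs" above.
    """
    split_dir = Path(split_dir)
    train = load_smiles(split_dir / "split_train.csv", column)
    return {
        name: load_smiles(split_dir / f"split_{name}.csv", column) & train
        for name in ("val", "test")
    }
```

Calling `find_leaks("splits_v2")` should return an empty set for both `val` and `test` if the split behaves as described. The comparison is on exact strings, so two different SMILES spellings of the same molecule would not be detected unless the column is canonicalized first (for example with RDKit).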
## Merge the splits for visualization
Merge the splits:
```bash
python3 ./script/merge_splits.py --input-dir splits_v2/ --output data/drugbank_split_merge.csv
```
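
For readers without access to `script/merge_splits.py`, the sketch below shows one plausible way such a merge could work. It is an assumption, not the actual script: the function name `merge_splits` and the added `split` column are hypothetical, chosen so that the visualization can show which split each molecule belongs to.

```python
import csv
from pathlib import Path


def merge_splits(input_dir, output_csv, names=("train", "val", "test")):
    """Concatenate split_<name>.csv files and tag every row with its split.

    The `split` column name is an illustrative assumption. Returns the
    number of rows written.
    """
    input_dir = Path(input_dir)
    rows, fieldnames = [], []
    for name in names:
        path = input_dir / f"split_{name}.csv"
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Keep the union of columns, in first-seen order.
            for column in reader.fieldnames or []:
                if column != "split" and column not in fieldnames:
                    fieldnames.append(column)
            for row in reader:
                row["split"] = name
                rows.append(row)
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames + ["split"], restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With a label column like this in the merged file, the train, validation, and test molecules can be told apart in the embedding view.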
Visualize:
```bash
embedding-atlas data/drugbank_split_merge.csv --text smiles
```