General updates, file comments, etc.
508
README.md
@@ -1,34 +1,502 @@
# SIME
# SIME - Structure-Informed Macrolide Expansion

HTML GUI for SIME and flask.
SIME is a tool for structural expansion and antimicrobial activity prediction of macrolide compounds.

## Setup
For better results, create a conda environment and activate it like:
## Table of Contents

```sh
conda create -n SIME python
- [Existing Features](#existing-features)
- [MolE Antimicrobial Activity Prediction](#mole-antimicrobial-activity-prediction)
  - [Quick Start](#quick-start)
  - [Installing Dependencies](#installing-dependencies)
  - [Usage](#usage)
  - [Output](#output)
- [Project Structure](#project-structure)
- [FAQ](#faq)

---
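
## Existing Features

SIME provides structural design and synthetic route analysis for macrolide compounds.

---

## MolE Antimicrobial Activity Prediction

This tool integrates the MolE (Molecular Embeddings) model, which predicts the broad-spectrum antimicrobial activity of small molecules.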

### Quick Start

#### Using uv (recommended)

```bash
# 1. Create a virtual environment (Python 3.12)
uv venv --python 3.12 --seed .venv

# 2. Activate the environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

# 3. Install dependencies with uv
uv pip install -r requirements-mole.txt

# 4. Verify the installation
python verify_setup.py

# 5. Run a prediction
python utils/mole_predictor.py Data/fragment/Frags-Enamine-18M.csv
```

then:
```sh
conda activate SIME
#### Using the pyproject.toml configuration (recommended with uv)

The project provides two environment configurations:

1. **Original SIME environment** - for macrolide structure design

```bash
# Create the default environment with uv
uv sync
```

Then install all needed dependencies from `requirements.txt` in the following way:
```sh
pip install -r requirements.txt
2. **MolE prediction environment** - for antimicrobial activity prediction

```bash
# Create the MolE environment with uv
uv sync --extra mole
```

## How to Use
Start Flask server:
```sh
python main.py
#### Using the pixi configuration (recommended for conda users)

If you use conda or need better package management, you can use pixi:

```bash
# Install pixi (if it is not installed yet)
curl -fsSL https://pixi.sh/install.sh | bash

# Create the original SIME environment
pixi install

# Create the MolE prediction environment
pixi install -e mole

# Activate the MolE environment
pixi shell -e mole

# Run a prediction inside the pixi environment
pixi run -e mole predict Data/fragment/test_100.csv
```

Then type the following line in your browser to load SIME software GUI.
### Installing Dependencies

[localhost:5000](http://localhost:5000)
#### Method 1: uv (recommended)

Use the parameters as needed, and click **Submit** to generate your *in-silico* library of fully assembled macrolides.
```bash
# Create a virtual environment
uv venv --python 3.12 .venv
source .venv/bin/activate

The resulting info file and SMILES file(s) will be found in the LIBRARIES folder, with the time stamp in the file names.
# Install dependencies
uv pip install -r requirements-mole.txt
```

#### Method 2: pixi

```bash
# Initialize a pixi workspace
pixi init

# Base environment
pixi add python=3.12

# NVIDIA CUDA toolchain
pixi workspace channel add nvidia
pixi add nvidia::cuda-toolkit=12.8

# Scientific computing (installing pandas also pulls in numpy),
# torch-geometric, and the remaining conda-forge dependencies
pixi add conda-forge::pandas conda-forge::torch-geometric conda-forge::xgboost conda-forge::pyyaml conda-forge::rdkit conda-forge::pip conda-forge::click conda-forge::openpyxl

# PyTorch (from a dedicated index)
# The conda pytorch channel is too old, so PyPI is used instead:
# pixi workspace channel add pytorch
# pixi add pytorch::pytorch=2.6 pytorch::pytorch-cuda=12.4
pixi add --pypi torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0

# Then edit pixi.toml by hand so that the entries read:
# [pypi-dependencies]
# torch = { version = "==2.8.0", index = "https://download.pytorch.org/whl/cu128" }
# torchvision = { version = "==0.23.0", index = "https://download.pytorch.org/whl/cu128" }
# torchaudio = { version = "==2.8.0", index = "https://download.pytorch.org/whl/cu128" }

# Install dependencies
pixi install

# Activate the environment
pixi shell
```

#### RDKit installation advice

Installing RDKit with conda is recommended:

```bash
conda install -c conda-forge rdkit
```

### Usage

#### 1. Command line

**Basic usage:**

```bash
# Predict a CSV file
python utils/mole_predictor.py input.csv

# Specify the output path
python utils/mole_predictor.py input.csv output.csv

# Use custom column names
python utils/mole_predictor.py input.csv output.csv \
    --smiles-column SMILES \
    --id-column compound_id

# Use GPU acceleration
python utils/mole_predictor.py input.csv --device cuda:0

# Adjust the batch size and the number of worker processes
python utils/mole_predictor.py input.csv \
    --batch-size 200 \
    --n-workers 8
```

**Show all options:**

```bash
python utils/mole_predictor.py --help
```

**Predict the project data:**

```bash
# Predict Frags-Enamine-18M.csv
python utils/mole_predictor.py Data/fragment/Frags-Enamine-18M.csv

# Predict GDB11-27M.csv
python utils/mole_predictor.py Data/fragment/GDB11-27M.csv
```

#### 2. Python API

**Predict a single file:**

```python
from utils.mole_predictor import predict_csv_file

# Basic usage
df_result = predict_csv_file(
    input_path="Data/fragment/Frags-Enamine-18M.csv",
    output_path="results/predictions.csv",
    smiles_column="smiles",
    batch_size=100,
    device="auto"
)

# Inspect the results
print(f"Total molecules: {len(df_result)}")
print(f"Broad-spectrum molecules: {df_result['broad_spectrum'].sum()}")
```

**Predict multiple files in one call:**

```python
from utils.mole_predictor import predict_multiple_files

input_files = [
    "Data/fragment/Frags-Enamine-18M.csv",
    "Data/fragment/GDB11-27M.csv"
]

results = predict_multiple_files(
    input_paths=input_files,
    output_dir="results/",
    smiles_column="smiles",
    batch_size=100,
    device="auto"
)
```

**Use the predictor directly:**

```python
from models import (
    ParallelBroadSpectrumPredictor,
    PredictionConfig,
    MoleculeInput
)

# Create the configuration
config = PredictionConfig(
    batch_size=100,
    device="auto"  # or "cpu", "cuda:0"
)

# Create the predictor
predictor = ParallelBroadSpectrumPredictor(config)

# Predict a single molecule
molecule = MoleculeInput(smiles="CCO", chem_id="ethanol")
result = predictor.predict_single(molecule)

print(f"Compound ID: {result.chem_id}")
print(f"Broad spectrum: {result.broad_spectrum}")
print(f"Antimicrobial score: {result.apscore_total:.3f}")
print(f"Strains inhibited: {result.ginhib_total}")

# Predict a batch
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
chem_ids = ["ethanol", "benzene", "acetic_acid"]

results = predictor.predict_from_smiles(smiles_list, chem_ids)

for r in results:
    print(f"{r.chem_id}: broad_spectrum={r.broad_spectrum}, "
          f"apscore={r.apscore_total:.3f}")
```
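
### Output

Prediction adds the following 7 columns to the results:

| Column | Type | Description |
|------|------|------|
| `apscore_total` | float | Overall antimicrobial potential score (log scale; higher values indicate stronger antimicrobial activity) |
| `apscore_gnegative` | float | Antimicrobial potential score against Gram-negative bacteria |
| `apscore_gpositive` | float | Antimicrobial potential score against Gram-positive bacteria |
| `ginhib_total` | int | Total number of inhibited strains |
| `ginhib_gnegative` | int | Number of inhibited Gram-negative strains |
| `ginhib_gpositive` | int | Number of inhibited Gram-positive strains |
| `broad_spectrum` | int | Whether the molecule is broad-spectrum (1 = yes, 0 = no) |

#### Broad-spectrum criterion

By default, a molecule is considered broad-spectrum if it inhibits **10 or more strains** (`ginhib_total >= 10`).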
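
The criterion above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the project's implementation: the names `MIN_NKILL` and `is_broad_spectrum` and the toy rows are assumptions, and only the `ginhib_total` column name comes from the table.

```python
# Assumed default threshold: 10 or more inhibited strains counts as broad-spectrum.
MIN_NKILL = 10


def is_broad_spectrum(ginhib_total: int, min_nkill: int = MIN_NKILL) -> int:
    """Return 1 if the inhibited-strain count meets the threshold, else 0."""
    return int(ginhib_total >= min_nkill)


# Toy rows for illustration only (not real predictions).
rows = [
    {"chem_id": "mol_a", "ginhib_total": 3},
    {"chem_id": "mol_b", "ginhib_total": 10},
    {"chem_id": "mol_c", "ginhib_total": 27},
]
flags = [is_broad_spectrum(row["ginhib_total"]) for row in rows]
print(flags)
```

A molecule sitting exactly on the threshold (`mol_b` here) is flagged, because the comparison is inclusive.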

#### Output file location

By default, the output file name is the input file name with a `_predicted` suffix:

- Input: `Data/fragment/Frags-Enamine-18M.csv`
- Output: `Data/fragment/Frags-Enamine-18M_predicted.csv`
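
One way to derive such a default path is with `pathlib`, as in the sketch below. The helper name `default_output_path` is hypothetical, and the tool may build the path differently.

```python
from pathlib import Path


def default_output_path(input_path: str, suffix: str = "_predicted") -> Path:
    """Insert the suffix between the file stem and its extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}{suffix}{path.suffix}")


out = default_output_path("Data/fragment/Frags-Enamine-18M.csv")
print(out.as_posix())  # Data/fragment/Frags-Enamine-18M_predicted.csv
```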

---

## Project Structure

```
SIME/
├── models/                          # MolE prediction models
│   ├── __init__.py
│   ├── broad_spectrum_predictor.py  # Core predictor
│   ├── dataset_representation.py    # Dataset representation
│   ├── ginet_concat.py              # GIN neural network
│   └── mole_representation.py       # MolE representation generation
│
├── utils/
│   ├── mole_predictor.py            # Prediction tool script
│   └── ... (other utilities)
│
├── Data/
│   └── fragment/                    # Data to be predicted
│       ├── Frags-Enamine-18M.csv
│       └── GDB11-27M.csv
│
├── pyproject.toml                   # uv project configuration
├── requirements.txt                 # Original SIME dependencies
├── requirements-mole.txt            # MolE prediction dependencies
│
├── verify_setup.py                  # Setup verification tool
├── check_mole_dependencies.py       # Dependency check tool
└── test_mole_predictor.py           # Functional tests
```

---

## Dependencies

### Original SIME dependencies (requirements.txt)

Used by the macrolide structure design features.

### MolE prediction dependencies (requirements-mole.txt)

Used for antimicrobial activity prediction. The main packages are:

- **Deep learning**: torch, torch-geometric
- **Scientific computing**: numpy, pandas, scipy
- **Machine learning**: scikit-learn, xgboost
- **Cheminformatics**: rdkit
- **Other**: openpyxl, pyyaml, click
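
A quick way to see which of these packages are importable in the current environment is sketched below. It is a hypothetical stand-in, not the contents of `verify_setup.py`. The import names are assumptions where they differ from the package names (for example, scikit-learn imports as `sklearn` and pyyaml as `yaml`).

```python
import importlib.util

# Import names assumed to correspond to the packages listed above.
MOLE_MODULES = [
    "torch", "torch_geometric", "numpy", "pandas", "scipy",
    "sklearn", "xgboost", "rdkit", "openpyxl", "yaml", "click",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(MOLE_MODULES)
print("Missing modules:", missing or "none")
```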

---

## Verification and Testing

### Verify the installation

```bash
# Check the Python dependencies
python verify_setup.py

# Check the model files
python check_mole_dependencies.py
```

### Run the tests

```bash
# Functional tests (using a small test data set)
python test_mole_predictor.py
```

---

## FAQ

### Q1: How do I handle large files?

**Option 1:** Increase the batch size and the number of worker processes

```bash
python utils/mole_predictor.py large_file.csv \
    --batch-size 500 \
    --n-workers 8
```

**Option 2:** Extract part of the data and test on that first

```bash
# Extract the header plus the first 1000 rows
head -n 1001 large_file.csv > test_1000.csv
python utils/mole_predictor.py test_1000.csv
```
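
Where `head` is not available (for example on Windows), the same sampling step can be sketched with Python's standard `csv` module. The function name `write_sample` is hypothetical and not part of the project.

```python
import csv
from itertools import islice


def write_sample(src: str, dst: str, n_rows: int = 1000) -> int:
    """Copy the header plus the first n_rows data rows; return rows copied."""
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty input file
            return 0
        writer.writerow(header)
        count = 0
        for row in islice(reader, n_rows):
            writer.writerow(row)
            count += 1
        return count
```

Calling `write_sample("large_file.csv", "test_1000.csv")` would then produce a file suitable for a quick trial run. Unlike `head`, this counts CSV records rather than physical lines, so quoted fields containing newlines stay intact.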

### Q2: How do I use the CPU only?

```bash
python utils/mole_predictor.py input.csv --device cpu
```
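
For context, a `device="auto"` setting is commonly resolved along the lines below. This is an assumption about typical behaviour, not the project's actual logic; `resolve_device` is a hypothetical helper that falls back to the CPU when PyTorch or CUDA is unavailable.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map "auto" to a concrete device string; pass explicit choices through."""
    if requested != "auto":
        return requested
    try:
        import torch  # optional dependency; absent in CPU-only minimal setups
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(resolve_device("cpu"))
print(resolve_device("auto"))
```

Passing `--device cpu` explicitly, as shown above, skips any such detection.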

### Q3: Does column-name case matter?

The tool matches column names case-insensitively, so `SMILES`, `smiles`, and `Smiles` are all recognized.
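
Case-insensitive matching of this kind can be sketched as follows. The helper `find_column` is hypothetical, and the tool's own matching code may differ.

```python
def find_column(columns, wanted):
    """Return the actual column name matching `wanted` regardless of case, or None."""
    lookup = {name.lower(): name for name in columns}
    return lookup.get(wanted.lower())


print(find_column(["ID", "Smiles"], "SMILES"))
```

If two columns differ only by case, this sketch keeps the later one, so such files are best cleaned up before prediction.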

### Q4: ModuleNotFoundError?

Make sure the dependencies are installed:

```bash
uv pip install -r requirements-mole.txt
```

For RDKit, installing with conda is recommended:

```bash
conda install -c conda-forge rdkit
```

### Q5: How do I use custom model paths?

```python
from models import PredictionConfig, ParallelBroadSpectrumPredictor

config = PredictionConfig(
    xgboost_model_path="/path/to/model.pkl",
    mole_model_path="/path/to/mole_model",
    strain_categories_path="/path/to/strain_data.tsv.gz",
    gram_info_path="/path/to/gram_info.xlsx",
    app_threshold=0.044,
    min_nkill=10,
    batch_size=100,
    device="auto"
)

predictor = ParallelBroadSpectrumPredictor(config)
```

### Q6: Running out of GPU memory?

Reduce the batch size:

```bash
python utils/mole_predictor.py input.csv --batch-size 50
```
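
The batch size caps how many molecules are processed at once, which is why lowering it reduces peak memory. The sketch below shows only the partitioning idea; `iter_batches` is a hypothetical helper, not the project's data loader.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCC"]
batches = list(iter_batches(smiles, 2))
print([len(batch) for batch in batches])
```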

### Q7: Where are the model files?

The model files live in the neighboring `mole_broad_spectrum_parallel` project:

```
../mole_broad_spectrum_parallel/
├── pretrained_model/model_ginconcat_btwin_100k_d8000_l0.0001/
│   ├── config.yaml
│   └── model.pth
├── data/03.model_evaluation/MolE-XGBoost-08.03.2024_14.20.pkl
└── ...
```

Run `python check_mole_dependencies.py` to check whether the files exist.
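
The same check can be approximated by hand with `pathlib`. The sketch below is a hypothetical stand-in for that script, using only the relative paths listed in the tree above. The function name `missing_model_files` is an assumption.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "pretrained_model/model_ginconcat_btwin_100k_d8000_l0.0001/config.yaml",
    "pretrained_model/model_ginconcat_btwin_100k_d8000_l0.0001/model.pth",
    "data/03.model_evaluation/MolE-XGBoost-08.03.2024_14.20.pkl",
]


def missing_model_files(root):
    """Return the expected files that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_model_files("../mole_broad_spectrum_parallel")
print("Missing:", missing or "none")
```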

---

## Performance Tips

- **Use a GPU**: setting `--device cuda:0` speeds things up considerably (requires CUDA)
- **Tune the batch size**: larger batches (100-500) are usually faster
- **Multiple processes**: use `--n-workers` to set the number of worker processes
- **First load**: the first run has to load the model (~30 s); later runs are faster

### Performance reference

| Molecules | CPU (8 cores) | GPU (CUDA) |
|---------|----------|------------|
| 100 | ~30 s | ~10 s |
| 1,000 | ~5 min | ~1 min |
| 10,000 | ~50 min | ~8 min |
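
For rough planning, the 10,000-molecule row can be extrapolated linearly, as in the sketch below. Linear scaling is an assumption: it ignores model loading time and hardware differences, so the result is an order-of-magnitude guide at best. The names used are hypothetical.

```python
# Approximate throughput implied by the 10,000-molecule row of the table above.
MOLECULES_PER_MINUTE = {
    "cpu": 10_000 / 50,  # ~50 minutes on an 8-core CPU
    "gpu": 10_000 / 8,   # ~8 minutes on a CUDA GPU
}


def estimate_minutes(n_molecules: int, device: str) -> float:
    """Linearly extrapolate the run time in minutes for n_molecules."""
    return n_molecules / MOLECULES_PER_MINUTE[device]


for device in ("cpu", "gpu"):
    print(device, round(estimate_minutes(100_000, device)), "minutes")
```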

---

## System Requirements

- **Python**: 3.7 or later (3.12 recommended)
- **Memory**: at least 8 GB RAM
- **Storage**: at least 2 GB of free space
- **GPU**: optional but strongly recommended (requires CUDA support)
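
The Python requirement can be checked programmatically before installing anything. This is a small sketch in which the function name and return labels are illustrative.

```python
import sys

MIN_VERSION = (3, 7)   # minimum stated above
RECOMMENDED = (3, 12)  # recommended version stated above


def check_python(version=None) -> str:
    """Classify a (major, minor) version against the stated requirements."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MIN_VERSION:
        return "too old"
    if major_minor < RECOMMENDED:
        return "ok"
    return "recommended or newer"


print(check_python())
```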

---

## Support

If you run into problems:

1. Review the verification results: `python verify_setup.py`
2. Check the model files: `python check_mole_dependencies.py`
3. Run the functional tests: `python test_mole_predictor.py`

---

## License

See the LICENSE file.

## Citation

If you use this tool, please cite the relevant papers.

---

**Last updated**: 2025-10-16
**Version**: 1.0.0
48
pyproject.toml
Normal file
@@ -0,0 +1,48 @@
[project]
name = "SIME"
version = "1.0.0"
description = "Structure-Informed Macrolide Expansion with MolE antimicrobial prediction"
readme = "README.md"
requires-python = ">=3.7"
license = {text = "MIT"}

dependencies = [
    "certifi>=2019.11.28",
    "Click>=8.0.0",
    "Flask>=1.1.1",
    "itsdangerous>=1.1.0",
    "Jinja2>=2.10.3",
    "MarkupSafe>=1.1.1",
    "MolVS>=0.1.1",
    "numpy>=1.20.0",
    "olefile>=0.46",
    "pandas>=1.3.0",
    "Pillow>=7.0.0",
    "pycairo>=1.19.0",
    "python-dateutil>=2.8.1",
    "pytz>=2019.3",
    "six>=1.14.0",
    "Werkzeug>=0.16.0",
]

[project.optional-dependencies]
mole = [
    "torch>=1.9.0",
    "torch-geometric>=2.0.0",
    "scipy>=1.7.0",
    "scikit-learn>=1.0.0",
    "xgboost>=1.5.0",
    "rdkit>=2022.03.1",
    "openpyxl>=3.0.0",
    "pyyaml>=5.4.0",
]

[build-system]
requires = ["setuptools>=45", "wheel"]
build-backend = "setuptools.build_meta"

[tool.uv]
dev-dependencies = []

[tool.uv.sources]