diff --git a/1d-qsar/cuda/README.md b/1d-qsar/cuda/README.md
new file mode 100644
index 0000000..7e986e1
--- /dev/null
+++ b/1d-qsar/cuda/README.md
@@ -0,0 +1,22 @@
+## Optimizing the GPU version
+
+[Installation guide](https://docs.rapids.ai/install/#selector)
+
+```shell
+micromamba create -n rapids-25.02 -c rapidsai -c conda-forge -c nvidia \
+    rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8' \
+    'pytorch=*=*cuda*' jupyterlab
+micromamba activate rapids-25.02
+```
+
+### Approach 1: Replace scikit-learn with RAPIDS cuML
+
+RAPIDS cuML provides a scikit-learn-compatible GPU implementation of RandomForestRegressor, which can speed up training substantially.
+
+### Approach 2: Use XGBoost for feature selection
+
+XGBoost supports GPU training, so XGBRegressor can replace RandomForestRegressor during feature selection.
+
+### Approach 3: Optimize feature selection with Optuna
+
+Optuna is an automatic hyperparameter optimization library that can search for the best feature subset on the GPU.
\ No newline at end of file
diff --git a/1d-qsar/cuda/RFE_cuml_permutation.log b/1d-qsar/cuda/RFE_cuml_permutation.log
new file mode 100644
index 0000000..c3a08ed
--- /dev/null
+++ b/1d-qsar/cuda/RFE_cuml_permutation.log
@@ -0,0 +1,399 @@
+(rapids-25.02) root@DESK4090:~/project/qsar/1d-qsar/cuda# python RFE_cuml_permutation.py
+训练样本数: 81, 特征数量: 156
+/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:363: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
+ return init_func(self, *args, **kwargs)
+/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: The number of bins, `n_bins` is greater than the number of samples used for training.
Changing `n_bins` to number of training samples. + ret = func(*args, **kwargs) +/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator + ret = func(*args, **kwargs) +n_estimators: 50, CV MSE: 5.2602 +n_estimators: 100, CV MSE: 5.0974 +n_estimators: 150, CV MSE: 5.1422 +n_estimators: 200, CV MSE: 5.0988 +n_estimators: 250, CV MSE: 4.9330 +n_estimators: 300, CV MSE: 4.9259 +n_estimators: 350, CV MSE: 4.9253 +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 19981 (\N{CJK UNIFIED IDEOGRAPH-4E0D}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 21516 (\N{CJK UNIFIED IDEOGRAPH-540C}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 26641 (\N{CJK UNIFIED IDEOGRAPH-6811}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 25968 (\N{CJK UNIFIED IDEOGRAPH-6570}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 37327 (\N{CJK UNIFIED IDEOGRAPH-91CF}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 19979 (\N{CJK UNIFIED IDEOGRAPH-4E0B}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 30340 (\N{CJK UNIFIED IDEOGRAPH-7684}) missing from font(s) DejaVu Sans. 
+ plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 20132 (\N{CJK UNIFIED IDEOGRAPH-4EA4}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 21449 (\N{CJK UNIFIED IDEOGRAPH-53C9}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 39564 (\N{CJK UNIFIED IDEOGRAPH-9A8C}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:143: UserWarning: Glyph 35777 (\N{CJK UNIFIED IDEOGRAPH-8BC1}) missing from font(s) DejaVu Sans. + plt.savefig("tree_vs_cv_mse.png", dpi=300) +最佳随机森林树数量确定为: 350 +当前特征数: 156, CV MSE: 4.9253 +剔除特征索引: 133,置换重要性: -0.0004 +当前特征数: 155, CV MSE: 4.9970 +剔除特征索引: 124,置换重要性: -0.0009 +当前特征数: 154, CV MSE: 4.7611 +剔除特征索引: 152,置换重要性: -0.0006 +当前特征数: 153, CV MSE: 4.7928 +剔除特征索引: 47,置换重要性: -0.0001 +当前特征数: 152, CV MSE: 4.7032 +剔除特征索引: 107,置换重要性: -0.0010 +当前特征数: 151, CV MSE: 4.7571 +剔除特征索引: 155,置换重要性: -0.0000 +当前特征数: 150, CV MSE: 4.7501 +剔除特征索引: 11,置换重要性: -0.0004 +当前特征数: 149, CV MSE: 4.8415 +剔除特征索引: 52,置换重要性: -0.0001 +当前特征数: 148, CV MSE: 4.7807 +剔除特征索引: 143,置换重要性: -0.0001 +当前特征数: 147, CV MSE: 4.7937 +剔除特征索引: 12,置换重要性: -0.0008 +当前特征数: 146, CV MSE: 4.7589 +剔除特征索引: 2,置换重要性: -0.0007 +当前特征数: 145, CV MSE: 4.8444 +剔除特征索引: 125,置换重要性: -0.0000 +当前特征数: 144, CV MSE: 4.8518 +剔除特征索引: 119,置换重要性: -0.0019 +当前特征数: 143, CV MSE: 4.6953 +剔除特征索引: 67,置换重要性: -0.0000 +当前特征数: 142, CV MSE: 4.7689 +剔除特征索引: 77,置换重要性: -0.0001 +当前特征数: 141, CV MSE: 4.7446 +剔除特征索引: 55,置换重要性: -0.0001 +当前特征数: 140, CV MSE: 4.8110 +剔除特征索引: 66,置换重要性: -0.0011 +当前特征数: 139, CV MSE: 4.7815 +剔除特征索引: 0,置换重要性: 0.0000 +当前特征数: 138, CV MSE: 4.8477 +剔除特征索引: 148,置换重要性: -0.0003 +当前特征数: 137, CV MSE: 4.7395 +剔除特征索引: 48,置换重要性: -0.0000 +当前特征数: 136, CV MSE: 4.6840 
+剔除特征索引: 150,置换重要性: -0.0000 +当前特征数: 135, CV MSE: 4.8117 +剔除特征索引: 60,置换重要性: -0.0001 +当前特征数: 134, CV MSE: 4.7988 +剔除特征索引: 149,置换重要性: -0.0000 +当前特征数: 133, CV MSE: 4.7597 +剔除特征索引: 76,置换重要性: -0.0000 +当前特征数: 132, CV MSE: 4.7059 +剔除特征索引: 69,置换重要性: -0.0000 +当前特征数: 131, CV MSE: 4.7498 +剔除特征索引: 51,置换重要性: -0.0000 +当前特征数: 130, CV MSE: 4.7434 +剔除特征索引: 142,置换重要性: -0.0000 +当前特征数: 129, CV MSE: 4.7211 +剔除特征索引: 105,置换重要性: -0.0001 +当前特征数: 128, CV MSE: 4.8107 +剔除特征索引: 53,置换重要性: -0.0000 +当前特征数: 127, CV MSE: 4.7079 +剔除特征索引: 86,置换重要性: -0.0009 +当前特征数: 126, CV MSE: 4.7144 +剔除特征索引: 136,置换重要性: -0.0000 +当前特征数: 125, CV MSE: 4.7590 +剔除特征索引: 113,置换重要性: -0.0013 +当前特征数: 124, CV MSE: 4.8185 +剔除特征索引: 146,置换重要性: -0.0000 +当前特征数: 123, CV MSE: 4.7051 +剔除特征索引: 153,置换重要性: -0.0000 +当前特征数: 122, CV MSE: 4.6048 +剔除特征索引: 138,置换重要性: -0.0006 +当前特征数: 121, CV MSE: 4.5805 +剔除特征索引: 147,置换重要性: -0.0000 +当前特征数: 120, CV MSE: 4.8172 +剔除特征索引: 62,置换重要性: -0.0001 +当前特征数: 119, CV MSE: 4.8393 +剔除特征索引: 127,置换重要性: -0.0002 +当前特征数: 118, CV MSE: 4.7780 +剔除特征索引: 1,置换重要性: 0.0000 +当前特征数: 117, CV MSE: 4.7227 +剔除特征索引: 3,置换重要性: 0.0000 +当前特征数: 116, CV MSE: 4.7864 +剔除特征索引: 4,置换重要性: 0.0000 +当前特征数: 115, CV MSE: 4.6388 +剔除特征索引: 5,置换重要性: 0.0000 +当前特征数: 114, CV MSE: 4.5150 +剔除特征索引: 140,置换重要性: -0.0001 +当前特征数: 113, CV MSE: 4.6324 +剔除特征索引: 6,置换重要性: 0.0000 +当前特征数: 112, CV MSE: 4.7494 +剔除特征索引: 24,置换重要性: -0.0009 +当前特征数: 111, CV MSE: 4.6774 +剔除特征索引: 80,置换重要性: -0.0001 +当前特征数: 110, CV MSE: 4.7463 +剔除特征索引: 70,置换重要性: -0.0000 +当前特征数: 109, CV MSE: 4.7211 +剔除特征索引: 54,置换重要性: -0.0000 +当前特征数: 108, CV MSE: 4.6060 +剔除特征索引: 7,置换重要性: 0.0000 +当前特征数: 107, CV MSE: 4.5989 +剔除特征索引: 88,置换重要性: -0.0001 +当前特征数: 106, CV MSE: 4.5530 +剔除特征索引: 106,置换重要性: -0.0008 +当前特征数: 105, CV MSE: 4.6703 +剔除特征索引: 8,置换重要性: 0.0000 +当前特征数: 104, CV MSE: 4.4274 +剔除特征索引: 145,置换重要性: -0.0000 +当前特征数: 103, CV MSE: 4.7372 +剔除特征索引: 9,置换重要性: 0.0000 +当前特征数: 102, CV MSE: 4.6272 +剔除特征索引: 129,置换重要性: -0.0004 +当前特征数: 101, CV MSE: 4.6820 +剔除特征索引: 10,置换重要性: 0.0000 +当前特征数: 100, CV MSE: 4.5780 +剔除特征索引: 13,置换重要性: 
-0.0000 +当前特征数: 99, CV MSE: 4.6734 +剔除特征索引: 139,置换重要性: -0.0002 +当前特征数: 98, CV MSE: 4.6922 +剔除特征索引: 50,置换重要性: -0.0000 +当前特征数: 97, CV MSE: 4.6520 +剔除特征索引: 134,置换重要性: -0.0002 +当前特征数: 96, CV MSE: 4.6574 +剔除特征索引: 14,置换重要性: 0.0000 +当前特征数: 95, CV MSE: 4.7120 +剔除特征索引: 15,置换重要性: 0.0000 +当前特征数: 94, CV MSE: 4.6882 +剔除特征索引: 16,置换重要性: 0.0000 +当前特征数: 93, CV MSE: 4.5839 +剔除特征索引: 17,置换重要性: 0.0000 +当前特征数: 92, CV MSE: 4.6231 +剔除特征索引: 18,置换重要性: 0.0000 +当前特征数: 91, CV MSE: 4.6144 +剔除特征索引: 154,置换重要性: 0.0000 +当前特征数: 90, CV MSE: 4.4886 +剔除特征索引: 19,置换重要性: 0.0000 +当前特征数: 89, CV MSE: 4.5828 +剔除特征索引: 20,置换重要性: 0.0000 +当前特征数: 88, CV MSE: 4.4967 +剔除特征索引: 144,置换重要性: -0.0003 +当前特征数: 87, CV MSE: 4.5881 +剔除特征索引: 21,置换重要性: 0.0000 +当前特征数: 86, CV MSE: 4.4972 +剔除特征索引: 22,置换重要性: 0.0000 +当前特征数: 85, CV MSE: 4.5007 +剔除特征索引: 137,置换重要性: -0.0010 +当前特征数: 84, CV MSE: 4.5097 +剔除特征索引: 23,置换重要性: 0.0000 +当前特征数: 83, CV MSE: 4.4847 +剔除特征索引: 25,置换重要性: 0.0000 +当前特征数: 82, CV MSE: 4.6722 +剔除特征索引: 26,置换重要性: 0.0000 +当前特征数: 81, CV MSE: 4.5776 +剔除特征索引: 27,置换重要性: 0.0000 +当前特征数: 80, CV MSE: 4.5790 +剔除特征索引: 28,置换重要性: 0.0000 +当前特征数: 79, CV MSE: 4.5349 +剔除特征索引: 29,置换重要性: 0.0000 +当前特征数: 78, CV MSE: 4.5542 +剔除特征索引: 30,置换重要性: 0.0000 +当前特征数: 77, CV MSE: 4.5299 +剔除特征索引: 31,置换重要性: 0.0000 +当前特征数: 76, CV MSE: 4.6087 +剔除特征索引: 32,置换重要性: 0.0000 +当前特征数: 75, CV MSE: 4.6009 +剔除特征索引: 33,置换重要性: 0.0000 +当前特征数: 74, CV MSE: 4.4684 +剔除特征索引: 34,置换重要性: 0.0000 +当前特征数: 73, CV MSE: 4.5094 +剔除特征索引: 35,置换重要性: 0.0000 +当前特征数: 72, CV MSE: 4.4700 +剔除特征索引: 49,置换重要性: -0.0001 +当前特征数: 71, CV MSE: 4.5176 +剔除特征索引: 36,置换重要性: 0.0000 +当前特征数: 70, CV MSE: 4.4335 +剔除特征索引: 37,置换重要性: -0.0000 +当前特征数: 69, CV MSE: 4.4646 +剔除特征索引: 38,置换重要性: 0.0000 +当前特征数: 68, CV MSE: 4.4792 +剔除特征索引: 39,置换重要性: -0.0000 +当前特征数: 67, CV MSE: 4.4398 +剔除特征索引: 40,置换重要性: 0.0000 +当前特征数: 66, CV MSE: 4.3681 +剔除特征索引: 41,置换重要性: 0.0001 +当前特征数: 65, CV MSE: 4.3967 +剔除特征索引: 42,置换重要性: 0.0000 +当前特征数: 64, CV MSE: 4.4340 +剔除特征索引: 43,置换重要性: 0.0000 +当前特征数: 63, CV MSE: 4.4324 +剔除特征索引: 44,置换重要性: 0.0000 +当前特征数: 62, CV 
MSE: 4.5365 +剔除特征索引: 45,置换重要性: 0.0000 +当前特征数: 61, CV MSE: 4.5531 +剔除特征索引: 46,置换重要性: 0.0000 +当前特征数: 60, CV MSE: 4.4241 +剔除特征索引: 56,置换重要性: 0.0000 +当前特征数: 59, CV MSE: 4.4180 +剔除特征索引: 57,置换重要性: -0.0000 +当前特征数: 58, CV MSE: 4.3972 +剔除特征索引: 58,置换重要性: 0.0000 +当前特征数: 57, CV MSE: 4.2306 +剔除特征索引: 59,置换重要性: 0.0000 +当前特征数: 56, CV MSE: 4.2263 +剔除特征索引: 61,置换重要性: 0.0000 +当前特征数: 55, CV MSE: 4.4408 +剔除特征索引: 63,置换重要性: 0.0002 +当前特征数: 54, CV MSE: 4.4805 +剔除特征索引: 64,置换重要性: 0.0003 +当前特征数: 53, CV MSE: 4.4570 +剔除特征索引: 65,置换重要性: 0.0000 +当前特征数: 52, CV MSE: 4.4523 +剔除特征索引: 68,置换重要性: -0.0000 +当前特征数: 51, CV MSE: 4.4382 +剔除特征索引: 71,置换重要性: 0.0000 +当前特征数: 50, CV MSE: 4.4065 +剔除特征索引: 72,置换重要性: 0.0000 +当前特征数: 49, CV MSE: 4.3169 +剔除特征索引: 73,置换重要性: 0.0000 +当前特征数: 48, CV MSE: 4.2661 +剔除特征索引: 74,置换重要性: 0.0000 +当前特征数: 47, CV MSE: 4.3875 +剔除特征索引: 75,置换重要性: 0.0004 +当前特征数: 46, CV MSE: 4.3418 +剔除特征索引: 78,置换重要性: -0.0000 +当前特征数: 45, CV MSE: 4.4599 +剔除特征索引: 79,置换重要性: 0.0000 +当前特征数: 44, CV MSE: 4.5502 +剔除特征索引: 81,置换重要性: 0.0000 +当前特征数: 43, CV MSE: 4.4299 +剔除特征索引: 135,置换重要性: 0.0025 +当前特征数: 42, CV MSE: 4.4397 +剔除特征索引: 82,置换重要性: 0.0000 +当前特征数: 41, CV MSE: 4.4255 +剔除特征索引: 83,置换重要性: 0.0003 +当前特征数: 40, CV MSE: 4.4042 +剔除特征索引: 84,置换重要性: 0.0000 +当前特征数: 39, CV MSE: 4.4017 +剔除特征索引: 85,置换重要性: 0.0000 +当前特征数: 38, CV MSE: 4.3463 +剔除特征索引: 87,置换重要性: 0.0003 +当前特征数: 37, CV MSE: 4.5079 +剔除特征索引: 89,置换重要性: -0.0001 +当前特征数: 36, CV MSE: 4.4404 +剔除特征索引: 90,置换重要性: 0.0002 +当前特征数: 35, CV MSE: 4.4802 +剔除特征索引: 91,置换重要性: -0.0000 +当前特征数: 34, CV MSE: 4.5477 +剔除特征索引: 92,置换重要性: 0.0000 +当前特征数: 33, CV MSE: 4.5129 +剔除特征索引: 93,置换重要性: 0.0000 +当前特征数: 32, CV MSE: 4.4643 +剔除特征索引: 94,置换重要性: 0.0004 +当前特征数: 31, CV MSE: 4.6173 +剔除特征索引: 95,置换重要性: 0.0000 +当前特征数: 30, CV MSE: 4.5979 +剔除特征索引: 96,置换重要性: 0.0002 +当前特征数: 29, CV MSE: 4.4120 +剔除特征索引: 97,置换重要性: 0.0000 +当前特征数: 28, CV MSE: 4.4958 +剔除特征索引: 98,置换重要性: 0.0004 +当前特征数: 27, CV MSE: 4.5448 +剔除特征索引: 99,置换重要性: 0.0002 +当前特征数: 26, CV MSE: 4.3483 +剔除特征索引: 100,置换重要性: 0.0014 +当前特征数: 25, CV MSE: 4.3617 +剔除特征索引: 101,置换重要性: 
0.0001 +当前特征数: 24, CV MSE: 4.4947 +剔除特征索引: 102,置换重要性: 0.0017 +当前特征数: 23, CV MSE: 4.2328 +剔除特征索引: 103,置换重要性: 0.0022 +当前特征数: 22, CV MSE: 4.3158 +剔除特征索引: 104,置换重要性: 0.0000 +当前特征数: 21, CV MSE: 4.3848 +剔除特征索引: 108,置换重要性: 0.0000 +当前特征数: 20, CV MSE: 4.3861 +剔除特征索引: 109,置换重要性: 0.0009 +当前特征数: 19, CV MSE: 4.1140 +剔除特征索引: 110,置换重要性: 0.0002 +当前特征数: 18, CV MSE: 4.0750 +剔除特征索引: 111,置换重要性: 0.0000 +当前特征数: 17, CV MSE: 3.9148 +剔除特征索引: 112,置换重要性: 0.0004 +当前特征数: 16, CV MSE: 4.0441 +剔除特征索引: 114,置换重要性: 0.0016 +当前特征数: 15, CV MSE: 3.9886 +剔除特征索引: 115,置换重要性: 0.0000 +当前特征数: 14, CV MSE: 4.1169 +剔除特征索引: 116,置换重要性: 0.0000 +当前特征数: 13, CV MSE: 4.1846 +剔除特征索引: 117,置换重要性: 0.0000 +当前特征数: 12, CV MSE: 4.2129 +剔除特征索引: 118,置换重要性: 0.0020 +当前特征数: 11, CV MSE: 4.2006 +剔除特征索引: 120,置换重要性: 0.0024 +当前特征数: 10, CV MSE: 3.9221 +剔除特征索引: 121,置换重要性: 0.0032 +当前特征数: 9, CV MSE: 3.9000 +剔除特征索引: 122,置换重要性: 0.0015 +当前特征数: 8, CV MSE: 4.1202 +剔除特征索引: 123,置换重要性: 0.0020 +当前特征数: 7, CV MSE: 4.0491 +剔除特征索引: 126,置换重要性: 0.0002 +当前特征数: 6, CV MSE: 3.8452 +剔除特征索引: 128,置换重要性: 0.0003 +当前特征数: 5, CV MSE: 3.8135 +剔除特征索引: 131,置换重要性: 0.0013 +当前特征数: 4, CV MSE: 3.8181 +剔除特征索引: 130,置换重要性: 0.0106 +当前特征数: 3, CV MSE: 3.9118 +剔除特征索引: 132,置换重要性: -0.0000 +当前特征数: 2, CV MSE: 3.9524 +剔除特征索引: 141,置换重要性: 0.0000 +当前特征数: 1, CV MSE: 3.9524 + +手动RFECV选择了 5 个特征,CV MSE: 3.8135 +选定特征名称: ['fr_Ndealkylation2', 'fr_aldehyde', 'fr_alkyl_halide', 'fr_halogen', 'fr_piperdine'] +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 34987 (\N{CJK UNIFIED IDEOGRAPH-88AB}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 21076 (\N{CJK UNIFIED IDEOGRAPH-5254}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 38500 (\N{CJK UNIFIED IDEOGRAPH-9664}) missing from font(s) DejaVu Sans. 
+ plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 29305 (\N{CJK UNIFIED IDEOGRAPH-7279}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 24449 (\N{CJK UNIFIED IDEOGRAPH-5F81}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 30340 (\N{CJK UNIFIED IDEOGRAPH-7684}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 32622 (\N{CJK UNIFIED IDEOGRAPH-7F6E}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 25442 (\N{CJK UNIFIED IDEOGRAPH-6362}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 37325 (\N{CJK UNIFIED IDEOGRAPH-91CD}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 35201 (\N{CJK UNIFIED IDEOGRAPH-8981}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 24615 (\N{CJK UNIFIED IDEOGRAPH-6027}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 36807 (\N{CJK UNIFIED IDEOGRAPH-8FC7}) missing from font(s) DejaVu Sans. 
+ plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 31243 (\N{CJK UNIFIED IDEOGRAPH-7A0B}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 20013 (\N{CJK UNIFIED IDEOGRAPH-4E2D}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 21464 (\N{CJK UNIFIED IDEOGRAPH-53D8}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 21270 (\N{CJK UNIFIED IDEOGRAPH-5316}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 36845 (\N{CJK UNIFIED IDEOGRAPH-8FED}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 20195 (\N{CJK UNIFIED IDEOGRAPH-4EE3}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 27425 (\N{CJK UNIFIED IDEOGRAPH-6B21}) missing from font(s) DejaVu Sans. + plt.savefig("rfecv_perm_importance.png", dpi=300) +/root/project/qsar/1d-qsar/cuda/RFE_cuml_permutation.py:208: UserWarning: Glyph 25968 (\N{CJK UNIFIED IDEOGRAPH-6570}) missing from font(s) DejaVu Sans. 
+ plt.savefig("rfecv_perm_importance.png", dpi=300)
+
+最终模型(最佳特征、n_estimators=350)CV MSE: 3.8135
+
+
+During RFECV, the procedure does not simply keep the features with the highest permutation importance; it iteratively removes the features that contribute least to the model (i.e., those whose removal changes the CV MSE the least). The 5 features left at the end are exactly the combination whose retention gives the lowest cross-validated mean squared error.
+
+In other words, these 5 features are not necessarily the ones with the largest individual permutation importance; rather, once the other features are gone, the model's predictive performance drops noticeably without them. Removing any one of them raises the CV MSE, so keeping them achieves the best performance. The final 5 features survive because, as a combination, they matter most for the prediction task.
\ No newline at end of file
diff --git a/1d-qsar/cuda/RFE_cuml_permutation.py b/1d-qsar/cuda/RFE_cuml_permutation.py
new file mode 100644
index 0000000..3b09858
--- /dev/null
+++ b/1d-qsar/cuda/RFE_cuml_permutation.py
@@ -0,0 +1,217 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+'''
+@file: RFE_cuml_permutation.py
+@Description: GPU (cuML) feature selection: a manual RFECV that removes the least important
+              feature at each step based on permutation importance, determines a reasonable
+              number of random-forest trees via cross-validation (growing the candidate count
+              until the CV MSE converges), saves the plots, and trains with the best tree count.
+@Date: 2025/03/01
+@Author: lyzeng (modified)
+'''
+
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
+# ---------------------------
+# Data loading and feature construction
+# ---------------------------
+data = pd.read_csv("../data_smi.csv", index_col="Entry ID")
+target_data = data[["SMILES", "S.aureus ATCC25923"]]
+target_data.columns = ["SMILES", "TARGET"]
+
+from rdkit import Chem
+from rdkit.Chem import Descriptors
+
+def getMolDescriptors(smiles: str, missingVal=None):
+    mol = Chem.MolFromSmiles(smiles)
+    res = {}
+    for nm, fn in Descriptors._descList:
+        try:
+            val = fn(mol)
+        except Exception:
+            val = missingVal
+        res[nm] = val
+    return res
+
+# Build the feature matrix and the target variable
+X_df = pd.DataFrame([getMolDescriptors(smi) for smi in target_data['SMILES']])
+y = target_data['TARGET'].values
+
+# Drop invalid features (columns with a single unique value)
+invalid_features = [col for col in X_df.columns if X_df[col].nunique() <= 1]
+X_df.drop(columns=invalid_features, inplace=True)
+X = X_df.values
+
+print(f"训练样本数: {len(y)}, 特征数量: {X.shape[1]}")
+
+# ---------------------------
+# Convert to GPU data format (cupy arrays)
+# ---------------------------
+import cupy as cp
+X_gpu = cp.asarray(X)
+y_gpu = cp.asarray(y)
+
+# ---------------------------
+# Helper: compute a model's MSE
+# ---------------------------
+def model_mse(model, X_data, y_data):
+    y_pred = model.predict(X_data)
+    mse = cp.mean((y_data - y_pred) ** 2)
+    return mse.get()
+
+# ---------------------------
+# Permutation importance (on GPU data)
+# ---------------------------
+def permutation_importance_gpu(model, X_data, y_data, n_repeats=3, random_state=0):
+    """
+    For a given model, compute each feature's permutation importance:
+        importance = (mean MSE after shuffling the feature) - (baseline MSE)
+    Each feature's values are shuffled at random, and the resulting drop in
+    model performance measures that feature's importance.
+    """
+    X_cpu = cp.asnumpy(X_data)  # copy to CPU for shuffling
+    baseline = model_mse(model, X_data, y_data)
+    importances = np.zeros(X_cpu.shape[1])
+    rng = np.random.RandomState(random_state)
+    for j in range(X_cpu.shape[1]):
+        scores = []
+        for _ in range(n_repeats):
+            X_permuted = X_cpu.copy()
+            rng.shuffle(X_permuted[:, j])
+            X_permuted_gpu = cp.asarray(X_permuted)
+            score = model_mse(model, X_permuted_gpu, y_data)
+            scores.append(score)
+        importances[j] = np.mean(scores) - baseline
+    return importances
+
+# ---------------------------
+# Cross-validation function that accepts the number of trees
+# ---------------------------
+from cuml.ensemble import RandomForestRegressor as cuRF
+from sklearn.model_selection import KFold
+
+def cross_val_score_gpu(X_data, y_data, cv=5, n_estimators=100):
+    """
+    Split the data with KFold and evaluate the CV mean squared error of the
+    current feature subset with a cuML random forest; the number of trees,
+    n_estimators, is configurable.
+    """
+    kf = KFold(n_splits=cv, shuffle=True, random_state=0)
+    mse_scores = []
+    X_np = cp.asnumpy(X_data)  # split using CPU indices
+    for train_index, test_index in kf.split(X_np):
+        X_train = X_data[train_index, :]
+        X_test = X_data[test_index, :]
+        y_train = y_data[train_index]
+        y_test = y_data[test_index]
+        model = cuRF(n_estimators=n_estimators, random_state=0)
+        model.fit(X_train, y_train)
+        mse = model_mse(model, X_test, y_test)
+        mse_scores.append(mse)
+    return np.mean(mse_scores)
+
+# ---------------------------
+# Dynamically evaluate the number of trees (grow the candidate count until CV MSE converges)
+# ---------------------------
+def evaluate_n_estimators_dynamic(X_data, y_data, cv=5, start=50, step=50, threshold=1e-3, max_estimators=1000):
+    """
+    Starting from `start`, increase by `step` until two consecutive CV MSE values
+    differ by less than `threshold`, or until `max_estimators` is reached.
+    Returns a dict {n_estimators: CV MSE}.
+    """
+    results = {}
+    candidate = start
+    prev_mse = None
+    while candidate <= max_estimators:
+        mse = cross_val_score_gpu(X_data, y_data, cv=cv, n_estimators=candidate)
+        results[candidate] = mse
+        print(f"n_estimators: {candidate}, CV MSE: {mse:.4f}")
+        if prev_mse is not None and abs(prev_mse - mse) < threshold:
+            break
+        prev_mse = mse
+        candidate += step
+    return results
+
+# Run the dynamic evaluation
+tree_results = evaluate_n_estimators_dynamic(X_gpu, y_gpu, cv=5, start=50, step=50, threshold=1e-3, max_estimators=1000)
+
+# Plot n_estimators vs. CV MSE and save the figure
+plt.figure(figsize=(8, 5))
+plt.plot(list(tree_results.keys()), list(tree_results.values()), marker='o')
+plt.xlabel('Random Forest 树数量 (n_estimators)')
+plt.ylabel('CV MSE')
+plt.title('不同树数量下的交叉验证MSE')
+plt.grid(True)
+plt.savefig("tree_vs_cv_mse.png", dpi=300)
+plt.close()
+
+# Determine the best number of trees (the one with the lowest CV MSE)
+best_n_estimators = min(tree_results, key=tree_results.get)
+print(f"最佳随机森林树数量确定为: {best_n_estimators}")
+
+# ---------------------------
+# Manual RFECV, recording how permutation importance evolves;
+# the best tree count best_n_estimators is used for model training
+# ---------------------------
+def manual_rfecv(X_data, y_data, cv=5, n_estimators=100):
+    """
+    Manual RFECV:
+      - compute the CV MSE on the current feature subset
+      - evaluate each feature's permutation importance
+      - remove the feature with the lowest permutation importance and record it
+      - keep track of the feature subset with the lowest CV MSE
+    Returns the best feature subset, the best CV MSE, the score history,
+    and the permutation-importance history.
+    """
+    n_features = X_data.shape[1]
+    features = list(range(n_features))
+    best_score = float('inf')
+    best_features = features.copy()
+    scores_history = []
+    perm_imp_history = []  # permutation importance of each removed feature
+
+    while len(features) > 0:
+        current_score = cross_val_score_gpu(X_data[:, features], y_data, cv=cv, n_estimators=n_estimators)
+        scores_history.append((features.copy(), current_score))
+        print(f"当前特征数: {len(features)}, CV MSE: {current_score:.4f}")
+        if current_score < best_score:
+            best_score = current_score
+            best_features = features.copy()
+        if len(features) == 1:
+            break
+        # Train a model and compute permutation importance
+        model = cuRF(n_estimators=n_estimators, random_state=0)
+        model.fit(X_data[:, features], y_data)
+        imp = permutation_importance_gpu(model, X_data[:, features], y_data, n_repeats=3, random_state=0)
+        idx_to_remove = int(np.argmin(imp))
+        removed_feature = features[idx_to_remove]
+        removed_imp = imp[idx_to_remove]
+        perm_imp_history.append(removed_imp)
+        print(f"剔除特征索引: {removed_feature},置换重要性: {removed_imp:.4f}")
+        del features[idx_to_remove]
+
+    return best_features, best_score, scores_history, perm_imp_history
+
+# Run the manual RFECV with 5-fold cross-validation and the best tree count
+cv_folds = 5
+best_features_rfecv, best_mse_rfecv, history, perm_imp_history = manual_rfecv(X_gpu, y_gpu, cv=cv_folds, n_estimators=best_n_estimators)
+
+print(f"\n手动RFECV选择了 {len(best_features_rfecv)} 个特征,CV MSE: {best_mse_rfecv:.4f}")
+selected_feature_names = [X_df.columns[i] for i in best_features_rfecv]
+print("选定特征名称:", selected_feature_names)
+
+# Plot the permutation importance of the removed features and save the figure
+plt.figure(figsize=(8, 5))
+iterations = list(range(1, len(perm_imp_history) + 1))
+plt.plot(iterations, perm_imp_history, marker='o')
+plt.xlabel('RFECV 迭代次数')
+plt.ylabel('被剔除特征的置换重要性')
+plt.title('RFECV过程中置换重要性变化')
+plt.grid(True)
+plt.savefig("rfecv_perm_importance.png", dpi=300)
+plt.close()
+
+# ---------------------------
+# Final model: train on the best features with the best number of trees
+# ---------------------------
+final_model = cuRF(n_estimators=best_n_estimators, random_state=0)
+final_model.fit(X_gpu[:, best_features_rfecv], y_gpu)
+final_cv_mse = cross_val_score_gpu(X_gpu[:, best_features_rfecv], y_gpu, cv=cv_folds, n_estimators=best_n_estimators)
+print(f"\n最终模型(最佳特征、n_estimators={best_n_estimators})CV MSE: {final_cv_mse:.4f}")
diff --git a/1d-qsar/cuda/RFE_cuml_permutation_update.log b/1d-qsar/cuda/RFE_cuml_permutation_update.log
new file mode 100644
index 0000000..0636052
--- /dev/null
+++ b/1d-qsar/cuda/RFE_cuml_permutation_update.log
@@ -0,0 +1,38 @@
+nohup: ignoring input
+Number of training samples: 81, Number of features: 156
+/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:363: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set + return init_func(self, *args, **kwargs) +/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: The number of bins, `n_bins` is greater than the number of samples used for training. Changing `n_bins` to number of training samples. + ret = func(*args, **kwargs) +/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator + ret = func(*args, **kwargs) +n_estimators: 50, CV MSE: 5.2602 +n_estimators: 100, CV MSE: 5.0974 +n_estimators: 150, CV MSE: 5.1422 +n_estimators: 200, CV MSE: 5.0988 +n_estimators: 250, CV MSE: 4.9330 +n_estimators: 300, CV MSE: 4.9259 +n_estimators: 350, CV MSE: 4.9253 +Optimal number of trees determined: 350 +Current number of features: 156, CV MSE: 4.9253 +Removed feature index: 133, feature name: fr_allylic_oxid, permutation importance: -0.0004 +Current number of features: 155, CV MSE: 4.9970 +Removed feature index: 124, feature name: fr_Ar_N, permutation importance: -0.0009 +Current number of features: 154, CV MSE: 4.7611 +Removed feature index: 152, feature name: fr_pyridine, permutation importance: -0.0006 +Current number of features: 153, CV MSE: 4.7928 +Removed feature index: 47, feature name: PEOE_VSA10, permutation importance: -0.0001 +Current number of features: 152, CV MSE: 4.7032 +Removed feature index: 107, feature name: NumAromaticHeterocycles, permutation importance: -0.0010 +Current number of features: 151, CV MSE: 4.7571 +Removed feature index: 155, feature name: fr_unbrch_alkane, 
permutation importance: -0.0000
+Current number of features: 150, CV MSE: 4.7501
+Removed feature index: 11, feature name: MinPartialCharge, permutation importance: -0.0004
+Current number of features: 149, CV MSE: 4.8415
+Removed feature index: 52, feature name: PEOE_VSA2, permutation importance: -0.0001
+Current number of features: 148, CV MSE: 4.7807
+Removed feature index: 143, feature name: fr_ketone, permutation importance: -0.0001
+Current number of features: 147, CV MSE: 4.7937
+Removed feature index: 12, feature name: MaxAbsPartialCharge, permutation importance: -0.0008
+Current number of features: 146, CV MSE: 4.7589
+Removed feature index: 2, feature name: MinAbsEStateIndex, permutation importance: -0.0007
diff --git a/1d-qsar/cuda/RFE_cuml_permutation_update.py b/1d-qsar/cuda/RFE_cuml_permutation_update.py
new file mode 100644
index 0000000..29c6f0b
--- /dev/null
+++ b/1d-qsar/cuda/RFE_cuml_permutation_update.py
@@ -0,0 +1,234 @@
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
+'''
+@file: RFE_cuml_permutation_update.py
+@Description: Use GPU cuML to perform feature selection via a manual RFECV procedure,
+              removing the least important features based on permutation importance,
+              dynamically determining the optimal number of trees (until CV MSE converges),
+              saving the plots, and finally training the final model using the best n_estimators.
+@Date: 2025/03/01
+@Author: lyzeng (modified)
+nohup python3 -u RFE_cuml_permutation_update.py > RFE_cuml_permutation_update.log 2>&1 &
+'''
+
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+
+# ---------------------------
+# Data loading and feature construction
+# ---------------------------
+data = pd.read_csv("../data_smi.csv", index_col="Entry ID")
+target_data = data[["SMILES", "S.aureus ATCC25923"]]
+target_data.columns = ["SMILES", "TARGET"]
+
+from rdkit import Chem
+from rdkit.Chem import Descriptors
+
+def getMolDescriptors(smiles: str, missingVal=None):
+    mol = Chem.MolFromSmiles(smiles)
+    res = {}
+    for nm, fn in Descriptors._descList:
+        try:
+            val = fn(mol)
+        except Exception:
+            val = missingVal
+        res[nm] = val
+    return res
+
+# Construct feature matrix and target variable
+X_df = pd.DataFrame([getMolDescriptors(smi) for smi in target_data['SMILES']])
+y = target_data['TARGET'].values
+
+# Remove invalid features (columns with only one unique value)
+invalid_features = [col for col in X_df.columns if X_df[col].nunique() <= 1]
+X_df.drop(columns=invalid_features, inplace=True)
+X = X_df.values
+
+print(f"Number of training samples: {len(y)}, Number of features: {X.shape[1]}")
+
+# ---------------------------
+# Convert data to GPU arrays (cupy arrays)
+# ---------------------------
+import cupy as cp
+X_gpu = cp.asarray(X)
+y_gpu = cp.asarray(y)
+
+# ---------------------------
+# Helper function: Calculate model MSE
+# ---------------------------
+def model_mse(model, X_data, y_data):
+    y_pred = model.predict(X_data)
+    mse = cp.mean((y_data - y_pred) ** 2)
+    return mse.get()
+
+# ---------------------------
+# Permutation importance function (GPU version)
+# ---------------------------
+def permutation_importance_gpu(model, X_data, y_data, n_repeats=3, random_state=0):
+    """
+    For the given model, compute permutation importance for each feature:
+    Importance = (Mean MSE after permuting the feature) - (Original MSE)
+    The function randomly shuffles each feature and observes the change in model performance.
+    """
+    X_cpu = cp.asnumpy(X_data)  # Copy data to CPU for shuffling
+    baseline = model_mse(model, X_data, y_data)
+    importances = np.zeros(X_cpu.shape[1])
+    rng = np.random.RandomState(random_state)
+    for j in range(X_cpu.shape[1]):
+        scores = []
+        for _ in range(n_repeats):
+            X_permuted = X_cpu.copy()
+            rng.shuffle(X_permuted[:, j])
+            X_permuted_gpu = cp.asarray(X_permuted)
+            score = model_mse(model, X_permuted_gpu, y_data)
+            scores.append(score)
+        importances[j] = np.mean(scores) - baseline
+    return importances
+
+# ---------------------------
+# Modified cross-validation function: allow setting number of trees
+# ---------------------------
+from cuml.ensemble import RandomForestRegressor as cuRF
+from sklearn.model_selection import KFold
+
+def cross_val_score_gpu(X_data, y_data, cv=5, n_estimators=100):
+    """
+    Use KFold to split the data and evaluate the CV MSE of the current feature subset
+    using cuML RandomForestRegressor with a specified number of trees.
+    """
+    kf = KFold(n_splits=cv, shuffle=True, random_state=0)
+    mse_scores = []
+    X_np = cp.asnumpy(X_data)  # Use CPU indices for splitting
+    for train_index, test_index in kf.split(X_np):
+        X_train = X_data[train_index, :]
+        X_test = X_data[test_index, :]
+        y_train = y_data[train_index]
+        y_test = y_data[test_index]
+        model = cuRF(n_estimators=n_estimators, random_state=0)
+        model.fit(X_train, y_train)
+        mse = model_mse(model, X_test, y_test)
+        mse_scores.append(mse)
+    return np.mean(mse_scores)
+
+# ---------------------------
+# Dynamically evaluate the number of trees (expand candidate numbers until CV MSE converges)
+# ---------------------------
+def evaluate_n_estimators_dynamic(X_data, y_data, cv=5, start=50, step=50, threshold=1e-3, max_estimators=1000):
+    """
+    Starting from 'start', increase by 'step' until the difference in consecutive CV MSE is below 'threshold'
+    or until max_estimators is reached.
+    Returns a dict {n_estimators: CV MSE}.
+    """
+    results = {}
+    candidate = start
+    prev_mse = None
+    while candidate <= max_estimators:
+        mse = cross_val_score_gpu(X_data, y_data, cv=cv, n_estimators=candidate)
+        results[candidate] = mse
+        print(f"n_estimators: {candidate}, CV MSE: {mse:.4f}")
+        if prev_mse is not None and abs(prev_mse - mse) < threshold:
+            break
+        prev_mse = mse
+        candidate += step
+    return results
+
+# Evaluate the optimal number of trees
+tree_results = evaluate_n_estimators_dynamic(X_gpu, y_gpu, cv=5, start=50, step=50, threshold=1e-3, max_estimators=1000)
+
+# Plot CV MSE vs Number of Trees and save the figure
+plt.figure(figsize=(8, 5))
+plt.plot(list(tree_results.keys()), list(tree_results.values()), marker='o')
+plt.xlabel('Random Forest n_estimators')
+plt.ylabel('CV MSE')
+plt.title('CV MSE vs Number of Trees')
+plt.grid(True)
+plt.savefig("tree_vs_cv_mse.png", dpi=300)
+plt.close()
+
+# Determine the optimal number of trees (the one with the lowest CV MSE)
+best_n_estimators = min(tree_results, key=tree_results.get)
+print(f"Optimal number of trees determined: {best_n_estimators}")
+
+# ---------------------------
+# Manual RFECV implementation with permutation importance
+# Now includes printing the feature name (from rdkit Descriptors) along with the index.
+# Also records removed feature names for plotting annotation.
+# ---------------------------
+def manual_rfecv(X_data, y_data, feature_names, cv=5, n_estimators=100):
+    """
+    Manual RFECV:
+    - Compute CV MSE on the current feature subset.
+    - Compute permutation importance for each feature.
+    - Remove the feature with the lowest permutation importance and record it.
+    - Record the feature subset that yields the lowest CV MSE.
+    Returns the best feature subset, best CV MSE, history, permutation importance history,
+    and the list of removed feature names (in the order of removal).
+    """
+    n_features = X_data.shape[1]
+    features = list(range(n_features))
+    best_score = float('inf')
+    best_features = features.copy()
+    scores_history = []
+    perm_imp_history = []  # Record the permutation importance of each removed feature
+    removed_feature_names = []  # Record the corresponding feature names
+
+    while len(features) > 0:
+        current_score = cross_val_score_gpu(X_data[:, features], y_data, cv=cv, n_estimators=n_estimators)
+        scores_history.append((features.copy(), current_score))
+        print(f"Current number of features: {len(features)}, CV MSE: {current_score:.4f}")
+        if current_score < best_score:
+            best_score = current_score
+            best_features = features.copy()
+        if len(features) == 1:
+            break
+        # Train the model and compute permutation importance
+        model = cuRF(n_estimators=n_estimators, random_state=0)
+        model.fit(X_data[:, features], y_data)
+        imp = permutation_importance_gpu(model, X_data[:, features], y_data, n_repeats=3, random_state=0)
+        idx_to_remove = int(np.argmin(imp))
+        removed_feature = features[idx_to_remove]
+        removed_imp = imp[idx_to_remove]
+        # Get the feature name using the original feature_names list
+        removed_feature_name = feature_names[removed_feature]
+        perm_imp_history.append(removed_imp)
+        removed_feature_names.append(removed_feature_name)
+        print(f"Removed feature index: {removed_feature}, feature name: {removed_feature_name}, permutation importance: {removed_imp:.4f}")
+        del features[idx_to_remove]
+
+    return best_features, best_score, scores_history, perm_imp_history, removed_feature_names
+
+# Execute manual RFECV using 5-fold CV and the optimal number of trees, also pass the feature names
+cv_folds = 5
+best_features_rfecv, best_mse_rfecv, history, perm_imp_history, removed_feature_names = manual_rfecv(
+    X_gpu, y_gpu, feature_names=list(X_df.columns), cv=cv_folds, n_estimators=best_n_estimators
+)
+
+print(f"\nManual RFECV selected {len(best_features_rfecv)} features, CV MSE: {best_mse_rfecv:.4f}")
+selected_feature_names = [X_df.columns[i] for i in best_features_rfecv]
+print("Selected feature names:", selected_feature_names)
+
+# Plot permutation importance change during RFECV with English labels
+plt.figure(figsize=(8, 5))
+iterations = list(range(1, len(perm_imp_history) + 1))
+plt.plot(iterations, perm_imp_history, marker='o')
+plt.xlabel('RFECV Iteration')
+plt.ylabel('Permutation Importance of Removed Feature')
+plt.title('Permutation Importance during RFECV')
+plt.grid(True)
+
+# Annotation threshold: annotate points where absolute permutation importance > 0.002
+annotation_threshold = 0.002
+for i, imp_val in enumerate(perm_imp_history):
+    if abs(imp_val) > annotation_threshold:
+        plt.annotate(removed_feature_names[i], (iterations[i], imp_val), textcoords="offset points", xytext=(0,5), ha='center')
+
+plt.savefig("rfecv_perm_importance.png", dpi=300)
+plt.close()
+
+# ---------------------------
+# Final model training: train final model using best feature subset and optimal number of trees
+# ---------------------------
+final_model = cuRF(n_estimators=best_n_estimators, random_state=0)
+final_model.fit(X_gpu[:, best_features_rfecv], y_gpu)
+final_cv_mse = cross_val_score_gpu(X_gpu[:, best_features_rfecv], y_gpu, cv=cv_folds, n_estimators=best_n_estimators)
+print(f"\nFinal model (selected features, n_estimators={best_n_estimators}) CV MSE: {final_cv_mse:.4f}")
diff --git a/1d-qsar/cuda/feature.log b/1d-qsar/cuda/feature.log
new file mode 100644
index 0000000..4d06789
--- /dev/null
+++ b/1d-qsar/cuda/feature.log
@@ -0,0 +1,322 @@
+(rapids-25.02) root@DESK4090:~/project/qsar/1d-qsar/cuda# python qssar1d.py
+训练样本数: 81, 特征数量: 156
+/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:363: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended.
If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set + return init_func(self, *args, **kwargs) +/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: The number of bins, `n_bins` is greater than the number of samples used for training. Changing `n_bins` to number of training samples. + ret = func(*args, **kwargs) +/root/micromamba/envs/rapids-25.02/lib/python3.12/site-packages/cuml/internals/api_decorators.py:188: UserWarning: To use pickling first train using float32 data to fit the estimator + ret = func(*args, **kwargs) +当前特征数: 156, CV MSE: 5.0974 +剔除特征索引: 13,置换重要性: -0.0022 +当前特征数: 155, CV MSE: 5.2003 +剔除特征索引: 133,置换重要性: -0.0016 +当前特征数: 154, CV MSE: 5.1318 +剔除特征索引: 107,置换重要性: -0.0188 +当前特征数: 153, CV MSE: 5.1432 +剔除特征索引: 77,置换重要性: -0.0082 +当前特征数: 152, CV MSE: 5.1745 +剔除特征索引: 126,置换重要性: -0.0005 +当前特征数: 151, CV MSE: 4.9942 +剔除特征索引: 106,置换重要性: -0.0072 +当前特征数: 150, CV MSE: 5.0801 +剔除特征索引: 108,置换重要性: -0.0077 +当前特征数: 149, CV MSE: 4.9464 +剔除特征索引: 101,置换重要性: -0.0008 +当前特征数: 148, CV MSE: 4.7651 +剔除特征索引: 67,置换重要性: -0.0028 +当前特征数: 147, CV MSE: 4.9519 +剔除特征索引: 62,置换重要性: -0.0005 +当前特征数: 146, CV MSE: 4.8742 +剔除特征索引: 86,置换重要性: -0.0032 +当前特征数: 145, CV MSE: 4.7400 +剔除特征索引: 119,置换重要性: -0.0016 +当前特征数: 144, CV MSE: 4.7471 +剔除特征索引: 148,置换重要性: -0.0018 +当前特征数: 143, CV MSE: 4.8387 +剔除特征索引: 138,置换重要性: -0.0006 +当前特征数: 142, CV MSE: 5.0176 +剔除特征索引: 143,置换重要性: -0.0008 +当前特征数: 141, CV MSE: 4.8092 +剔除特征索引: 124,置换重要性: -0.0070 +当前特征数: 140, CV MSE: 4.8574 +剔除特征索引: 111,置换重要性: -0.0046 +当前特征数: 139, CV MSE: 5.0330 +剔除特征索引: 128,置换重要性: -0.0012 +当前特征数: 138, CV MSE: 5.0798 +剔除特征索引: 139,置换重要性: -0.0009 +当前特征数: 137, CV MSE: 4.9377 +剔除特征索引: 52,置换重要性: -0.0004 +当前特征数: 136, CV MSE: 5.1192 +剔除特征索引: 155,置换重要性: -0.0000 +当前特征数: 135, CV MSE: 5.2678 +剔除特征索引: 152,置换重要性: -0.0040 +当前特征数: 134, CV MSE: 4.8579 +剔除特征索引: 93,置换重要性: -0.0031 +当前特征数: 133, CV MSE: 4.8737 +剔除特征索引: 125,置换重要性: -0.0000 +当前特征数: 132, CV 
MSE: 5.1150 +剔除特征索引: 84,置换重要性: -0.0009 +当前特征数: 131, CV MSE: 5.0172 +剔除特征索引: 10,置换重要性: -0.0006 +当前特征数: 130, CV MSE: 5.0235 +剔除特征索引: 21,置换重要性: -0.0018 +当前特征数: 129, CV MSE: 4.7640 +剔除特征索引: 76,置换重要性: -0.0007 +当前特征数: 128, CV MSE: 4.8049 +剔除特征索引: 74,置换重要性: -0.0037 +当前特征数: 127, CV MSE: 4.7546 +剔除特征索引: 55,置换重要性: -0.0003 +当前特征数: 126, CV MSE: 4.7790 +剔除特征索引: 66,置换重要性: -0.0018 +当前特征数: 125, CV MSE: 4.9159 +剔除特征索引: 80,置换重要性: -0.0003 +当前特征数: 124, CV MSE: 4.8962 +剔除特征索引: 115,置换重要性: -0.0012 +当前特征数: 123, CV MSE: 4.7899 +剔除特征索引: 53,置换重要性: -0.0000 +当前特征数: 122, CV MSE: 4.9718 +剔除特征索引: 27,置换重要性: -0.0007 +当前特征数: 121, CV MSE: 4.7772 +剔除特征索引: 12,置换重要性: -0.0009 +当前特征数: 120, CV MSE: 4.9935 +剔除特征索引: 49,置换重要性: -0.0002 +当前特征数: 119, CV MSE: 4.7858 +剔除特征索引: 47,置换重要性: -0.0002 +当前特征数: 118, CV MSE: 4.8866 +剔除特征索引: 113,置换重要性: -0.0019 +当前特征数: 117, CV MSE: 4.6379 +剔除特征索引: 63,置换重要性: -0.0002 +当前特征数: 116, CV MSE: 4.7249 +剔除特征索引: 65,置换重要性: -0.0002 +当前特征数: 115, CV MSE: 4.9689 +剔除特征索引: 88,置换重要性: -0.0005 +当前特征数: 114, CV MSE: 4.8946 +剔除特征索引: 48,置换重要性: -0.0004 +当前特征数: 113, CV MSE: 4.7893 +剔除特征索引: 11,置换重要性: -0.0007 +当前特征数: 112, CV MSE: 4.7882 +剔除特征索引: 15,置换重要性: -0.0035 +当前特征数: 111, CV MSE: 4.6776 +剔除特征索引: 24,置换重要性: -0.0027 +当前特征数: 110, CV MSE: 4.8426 +剔除特征索引: 51,置换重要性: -0.0000 +当前特征数: 109, CV MSE: 4.9341 +剔除特征索引: 127,置换重要性: -0.0014 +当前特征数: 108, CV MSE: 4.7765 +剔除特征索引: 31,置换重要性: -0.0006 +当前特征数: 107, CV MSE: 4.9951 +剔除特征索引: 36,置换重要性: -0.0010 +当前特征数: 106, CV MSE: 5.2174 +剔除特征索引: 32,置换重要性: -0.0009 +当前特征数: 105, CV MSE: 4.7904 +剔除特征索引: 140,置换重要性: -0.0012 +当前特征数: 104, CV MSE: 5.0252 +剔除特征索引: 9,置换重要性: -0.0004 +当前特征数: 103, CV MSE: 4.9320 +剔除特征索引: 70,置换重要性: -0.0001 +当前特征数: 102, CV MSE: 5.2598 +剔除特征索引: 25,置换重要性: -0.0008 +当前特征数: 101, CV MSE: 4.9394 +剔除特征索引: 54,置换重要性: -0.0001 +当前特征数: 100, CV MSE: 4.9273 +剔除特征索引: 16,置换重要性: -0.0007 +当前特征数: 99, CV MSE: 4.8924 +剔除特征索引: 42,置换重要性: -0.0020 +当前特征数: 98, CV MSE: 5.0755 +剔除特征索引: 150,置换重要性: -0.0000 +当前特征数: 97, CV MSE: 4.8551 +剔除特征索引: 116,置换重要性: -0.0013 +当前特征数: 96, CV MSE: 4.9251 
+剔除特征索引: 153,置换重要性: -0.0000 +当前特征数: 95, CV MSE: 4.9058 +剔除特征索引: 0,置换重要性: -0.0001 +当前特征数: 94, CV MSE: 4.8297 +剔除特征索引: 144,置换重要性: -0.0016 +当前特征数: 93, CV MSE: 4.7873 +剔除特征索引: 19,置换重要性: -0.0041 +当前特征数: 92, CV MSE: 4.9690 +剔除特征索引: 105,置换重要性: -0.0003 +当前特征数: 91, CV MSE: 4.8730 +剔除特征索引: 136,置换重要性: -0.0001 +当前特征数: 90, CV MSE: 4.7316 +剔除特征索引: 154,置换重要性: -0.0000 +当前特征数: 89, CV MSE: 4.6072 +剔除特征索引: 145,置换重要性: -0.0000 +当前特征数: 88, CV MSE: 4.4552 +剔除特征索引: 29,置换重要性: -0.0018 +当前特征数: 87, CV MSE: 4.7021 +剔除特征索引: 149,置换重要性: -0.0000 +当前特征数: 86, CV MSE: 4.4475 +剔除特征索引: 129,置换重要性: -0.0000 +当前特征数: 85, CV MSE: 4.6533 +剔除特征索引: 1,置换重要性: 0.0000 +当前特征数: 84, CV MSE: 4.3199 +剔除特征索引: 135,置换重要性: -0.0003 +当前特征数: 83, CV MSE: 4.2181 +剔除特征索引: 142,置换重要性: -0.0000 +当前特征数: 82, CV MSE: 4.4828 +剔除特征索引: 69,置换重要性: -0.0001 +当前特征数: 81, CV MSE: 4.4320 +剔除特征索引: 137,置换重要性: -0.0019 +当前特征数: 80, CV MSE: 4.5423 +剔除特征索引: 2,置换重要性: 0.0000 +当前特征数: 79, CV MSE: 4.5011 +剔除特征索引: 3,置换重要性: 0.0000 +当前特征数: 78, CV MSE: 4.4884 +剔除特征索引: 64,置换重要性: -0.0018 +当前特征数: 77, CV MSE: 4.6200 +剔除特征索引: 4,置换重要性: 0.0000 +当前特征数: 76, CV MSE: 4.5752 +剔除特征索引: 5,置换重要性: 0.0000 +当前特征数: 75, CV MSE: 4.4515 +剔除特征索引: 41,置换重要性: -0.0010 +当前特征数: 74, CV MSE: 4.6762 +剔除特征索引: 6,置换重要性: 0.0000 +当前特征数: 73, CV MSE: 4.3106 +剔除特征索引: 7,置换重要性: 0.0000 +当前特征数: 72, CV MSE: 4.1087 +剔除特征索引: 147,置换重要性: -0.0000 +当前特征数: 71, CV MSE: 4.4017 +剔除特征索引: 91,置换重要性: -0.0030 +当前特征数: 70, CV MSE: 4.3070 +剔除特征索引: 146,置换重要性: -0.0000 +当前特征数: 69, CV MSE: 4.5710 +剔除特征索引: 8,置换重要性: 0.0000 +当前特征数: 68, CV MSE: 4.3590 +剔除特征索引: 122,置换重要性: -0.0013 +当前特征数: 67, CV MSE: 4.3564 +剔除特征索引: 14,置换重要性: 0.0000 +当前特征数: 66, CV MSE: 4.5206 +剔除特征索引: 134,置换重要性: -0.0004 +当前特征数: 65, CV MSE: 4.6562 +剔除特征索引: 60,置换重要性: -0.0038 +当前特征数: 64, CV MSE: 4.4382 +剔除特征索引: 17,置换重要性: 0.0000 +当前特征数: 63, CV MSE: 4.5168 +剔除特征索引: 18,置换重要性: -0.0001 +当前特征数: 62, CV MSE: 4.6560 +剔除特征索引: 20,置换重要性: 0.0000 +当前特征数: 61, CV MSE: 4.6232 +剔除特征索引: 22,置换重要性: 0.0000 +当前特征数: 60, CV MSE: 4.5546 +剔除特征索引: 23,置换重要性: 0.0000 +当前特征数: 59, CV MSE: 4.6368 +剔除特征索引: 
26,置换重要性: -0.0000 +当前特征数: 58, CV MSE: 4.7763 +剔除特征索引: 28,置换重要性: 0.0000 +当前特征数: 57, CV MSE: 4.5347 +剔除特征索引: 30,置换重要性: -0.0001 +当前特征数: 56, CV MSE: 4.5139 +剔除特征索引: 33,置换重要性: 0.0000 +当前特征数: 55, CV MSE: 4.4100 +剔除特征索引: 34,置换重要性: 0.0000 +当前特征数: 54, CV MSE: 4.3154 +剔除特征索引: 35,置换重要性: 0.0000 +当前特征数: 53, CV MSE: 4.4305 +剔除特征索引: 37,置换重要性: 0.0000 +当前特征数: 52, CV MSE: 4.5358 +剔除特征索引: 50,置换重要性: 0.0000 +当前特征数: 51, CV MSE: 4.5094 +剔除特征索引: 38,置换重要性: 0.0022 +当前特征数: 50, CV MSE: 4.5153 +剔除特征索引: 39,置换重要性: -0.0004 +当前特征数: 49, CV MSE: 4.3441 +剔除特征索引: 40,置换重要性: -0.0001 +当前特征数: 48, CV MSE: 4.4674 +剔除特征索引: 43,置换重要性: 0.0000 +当前特征数: 47, CV MSE: 4.3391 +剔除特征索引: 131,置换重要性: -0.0003 +当前特征数: 46, CV MSE: 4.6788 +剔除特征索引: 44,置换重要性: 0.0000 +当前特征数: 45, CV MSE: 4.2720 +剔除特征索引: 45,置换重要性: 0.0000 +当前特征数: 44, CV MSE: 4.3232 +剔除特征索引: 46,置换重要性: 0.0000 +当前特征数: 43, CV MSE: 4.1793 +剔除特征索引: 56,置换重要性: 0.0000 +当前特征数: 42, CV MSE: 4.4771 +剔除特征索引: 57,置换重要性: 0.0000 +当前特征数: 41, CV MSE: 4.1601 +剔除特征索引: 58,置换重要性: 0.0000 +当前特征数: 40, CV MSE: 4.3411 +剔除特征索引: 59,置换重要性: 0.0000 +当前特征数: 39, CV MSE: 4.3668 +剔除特征索引: 61,置换重要性: -0.0000 +当前特征数: 38, CV MSE: 4.3295 +剔除特征索引: 68,置换重要性: 0.0000 +当前特征数: 37, CV MSE: 4.4400 +剔除特征索引: 71,置换重要性: -0.0000 +当前特征数: 36, CV MSE: 4.1841 +剔除特征索引: 72,置换重要性: 0.0000 +当前特征数: 35, CV MSE: 4.2940 +剔除特征索引: 73,置换重要性: 0.0000 +当前特征数: 34, CV MSE: 4.1931 +剔除特征索引: 75,置换重要性: 0.0000 +当前特征数: 33, CV MSE: 4.0917 +剔除特征索引: 78,置换重要性: 0.0000 +当前特征数: 32, CV MSE: 4.5247 +剔除特征索引: 79,置换重要性: -0.0000 +当前特征数: 31, CV MSE: 4.5843 +剔除特征索引: 81,置换重要性: 0.0000 +当前特征数: 30, CV MSE: 4.4505 +剔除特征索引: 82,置换重要性: 0.0000 +当前特征数: 29, CV MSE: 4.3227 +剔除特征索引: 83,置换重要性: 0.0000 +当前特征数: 28, CV MSE: 4.1382 +剔除特征索引: 85,置换重要性: 0.0000 +当前特征数: 27, CV MSE: 4.4850 +剔除特征索引: 87,置换重要性: 0.0000 +当前特征数: 26, CV MSE: 4.3147 +剔除特征索引: 89,置换重要性: 0.0001 +当前特征数: 25, CV MSE: 4.3994 +剔除特征索引: 90,置换重要性: 0.0000 +当前特征数: 24, CV MSE: 4.4345 +剔除特征索引: 92,置换重要性: 0.0000 +当前特征数: 23, CV MSE: 4.4428 +剔除特征索引: 94,置换重要性: 0.0000 +当前特征数: 22, CV MSE: 4.8050 +剔除特征索引: 95,置换重要性: 0.0000 +当前特征数: 21, 
CV MSE: 4.5067 +剔除特征索引: 96,置换重要性: 0.0004 +当前特征数: 20, CV MSE: 4.5489 +剔除特征索引: 97,置换重要性: 0.0001 +当前特征数: 19, CV MSE: 4.6281 +剔除特征索引: 98,置换重要性: 0.0009 +当前特征数: 18, CV MSE: 4.4093 +剔除特征索引: 99,置换重要性: 0.0039 +当前特征数: 17, CV MSE: 4.3291 +剔除特征索引: 100,置换重要性: 0.0007 +当前特征数: 16, CV MSE: 4.3191 +剔除特征索引: 102,置换重要性: 0.0002 +当前特征数: 15, CV MSE: 4.6593 +剔除特征索引: 103,置换重要性: -0.0002 +当前特征数: 14, CV MSE: 4.6185 +剔除特征索引: 104,置换重要性: 0.0000 +当前特征数: 13, CV MSE: 4.4200 +剔除特征索引: 109,置换重要性: 0.0000 +当前特征数: 12, CV MSE: 4.3259 +剔除特征索引: 110,置换重要性: -0.0000 +当前特征数: 11, CV MSE: 4.0543 +剔除特征索引: 112,置换重要性: 0.0002 +当前特征数: 10, CV MSE: 3.9899 +剔除特征索引: 114,置换重要性: 0.0000 +当前特征数: 9, CV MSE: 4.0719 +剔除特征索引: 117,置换重要性: 0.0062 +当前特征数: 8, CV MSE: 4.3854 +剔除特征索引: 118,置换重要性: 0.0000 +当前特征数: 7, CV MSE: 4.2360 +剔除特征索引: 120,置换重要性: 0.0022 +当前特征数: 6, CV MSE: 3.8872 +剔除特征索引: 121,置换重要性: 0.0034 +当前特征数: 5, CV MSE: 3.8296 +剔除特征索引: 123,置换重要性: 0.0022 +当前特征数: 4, CV MSE: 3.8317 +剔除特征索引: 130,置换重要性: 0.0007 +当前特征数: 3, CV MSE: 3.9287 +剔除特征索引: 132,置换重要性: 0.0000 +当前特征数: 2, CV MSE: 3.9652 +剔除特征索引: 141,置换重要性: 0.0000 +当前特征数: 1, CV MSE: 3.9652 + +手动RFECV选择了 5 个特征,CV MSE: 3.8296 +选定特征名称: ['fr_Al_OH_noTert', 'fr_Ndealkylation2', 'fr_alkyl_halide', 'fr_halogen', 'fr_piperdine'] \ No newline at end of file diff --git a/1d-qsar/cuda/qssar1d.py b/1d-qsar/cuda/qssar1d.py new file mode 100644 index 0000000..dc93cfa --- /dev/null +++ b/1d-qsar/cuda/qssar1d.py @@ -0,0 +1,152 @@ +#!/usr/bin/env python +# -*- encoding: utf-8 -*- +''' +@file: RFE_cuml_permutation.py +@Description: 使用GPU cuML实现特征选择,手动实现RFECV,并通过置换重要性计算来剔除最不重要的特征 +@Date: 2025/03/01 +@Author: lyzeng +''' + +import numpy as np +import pandas as pd + +# --------------------------- +# 数据加载与特征构造 +# --------------------------- +data = pd.read_csv("../data_smi.csv", index_col="Entry ID") +target_data = data[["SMILES", "S.aureus ATCC25923"]] +target_data.columns = ["SMILES", "TARGET"] + +from rdkit import Chem +from rdkit.Chem import Descriptors + +def getMolDescriptors(smiles: str, 
missingVal=None):
+    mol = Chem.MolFromSmiles(smiles)
+    res = {}
+    for nm, fn in Descriptors._descList:
+        try:
+            val = fn(mol)
+        except Exception:
+            val = missingVal
+        res[nm] = val
+    return res
+
+# Build the feature matrix and target variable
+X_df = pd.DataFrame([getMolDescriptors(smi) for smi in target_data['SMILES']])
+y = target_data['TARGET'].values
+
+# Remove invalid features (columns with a single unique value)
+invalid_features = [col for col in X_df.columns if X_df[col].nunique() <= 1]
+X_df.drop(columns=invalid_features, inplace=True)
+X = X_df.values
+
+print(f"训练样本数: {len(y)}, 特征数量: {X.shape[1]}")
+
+# ---------------------------
+# Convert to GPU data format (cupy arrays)
+# ---------------------------
+import cupy as cp
+X_gpu = cp.asarray(X)
+y_gpu = cp.asarray(y)
+
+# ---------------------------
+# Helper function: compute the model's MSE
+# ---------------------------
+def model_mse(model, X_data, y_data):
+    y_pred = model.predict(X_data)
+    mse = cp.mean((y_data - y_pred) ** 2)
+    return mse.get()
+
+# ---------------------------
+# Permutation importance function (on the GPU)
+# ---------------------------
+def permutation_importance_gpu(model, X_data, y_data, n_repeats=3, random_state=0):
+    """
+    For a given model, compute the permutation importance of each feature:
+    importance = (mean MSE after shuffling the feature) - (original MSE)
+    Each feature is shuffled at random and the resulting drop in model performance
+    measures its importance: if shuffling a feature clearly degrades performance,
+    that feature matters to the model; if shuffling has no visible effect, it does not.
+    """
+    # Copy the data to the CPU for shuffling first (cheap for small datasets)
+    X_cpu = cp.asnumpy(X_data)
+    baseline = model_mse(model, X_data, y_data)
+    importances = np.zeros(X_cpu.shape[1])
+    rng = np.random.RandomState(random_state)
+    for j in range(X_cpu.shape[1]):
+        scores = []
+        for _ in range(n_repeats):
+            X_permuted = X_cpu.copy()
+            rng.shuffle(X_permuted[:, j])
+            X_permuted_gpu = cp.asarray(X_permuted)
+            score = model_mse(model, X_permuted_gpu, y_data)
+            scores.append(score)
+        importances[j] = np.mean(scores) - baseline
+    return importances
+
+# ---------------------------
+# Manual RFECV with cuML random forests and permutation importance
+# ---------------------------
+from cuml.ensemble import RandomForestRegressor as cuRF
+from sklearn.model_selection import KFold
+
+def cross_val_score_gpu(X_data, y_data, cv=5):
+    """
+    Split the data with KFold and use a cuML random forest to evaluate the
+    CV mean squared error of the current feature subset.
+    """
+    kf = KFold(n_splits=cv, shuffle=True, random_state=0)
+    mse_scores = []
+    # Use CPU indices for splitting
+    X_np = cp.asnumpy(X_data)
+    for train_index, test_index in kf.split(X_np):
+        X_train = X_data[train_index, :]
+        X_test = X_data[test_index, :]
+        y_train = y_data[train_index]
+        y_test = y_data[test_index]
+        model = cuRF(n_estimators=100, random_state=0)
+        model.fit(X_train, y_train)
+        mse = model_mse(model, X_test, y_test)
+        mse_scores.append(mse)
+    return np.mean(mse_scores)
+
+def manual_rfecv(X_data, y_data, cv=5):
+    """
+    Manual RFECV:
+    - compute the CV MSE on the current feature subset
+    - rank the features with permutation importance
+    - remove the feature with the lowest permutation importance
+    - keep track of the feature subset with the smallest CV MSE
+    The subset shrinks by one feature per iteration; the goal is the subset that
+    minimizes the cross-validation MSE, which is then used to train the final model.
+    Note: set max_features explicitly when comparing with scikit-learn; the cuML
+    default does not necessarily match scikit-learn's RandomForestRegressor, which
+    by default considers all features at each split.
+    """
+    n_features = X_data.shape[1]
+    features = list(range(n_features))
+    best_score = float('inf')
+    best_features = features.copy()
+    scores_history = []
+
+    while len(features) > 0:
+        current_score = cross_val_score_gpu(X_data[:, features], y_data, cv=cv)
+        scores_history.append((features.copy(), current_score))
+        print(f"当前特征数: {len(features)}, CV MSE: {current_score:.4f}")
+        if current_score < best_score:
+            best_score = current_score
+            best_features = features.copy()
+        if len(features) == 1:
+            break
+        # Train the model and compute permutation importances
+        model = cuRF(n_estimators=100, random_state=0)
+        model.fit(X_data[:, features], y_data)
+        imp = permutation_importance_gpu(model, X_data[:, features], y_data, n_repeats=3, random_state=0)
+        idx_to_remove = int(np.argmin(imp))
+        removed_feature = features[idx_to_remove]
+        print(f"剔除特征索引: {removed_feature},置换重要性: {imp[idx_to_remove]:.4f}")
+        del features[idx_to_remove]
+
+    return best_features, best_score, scores_history
+
+# Run the manual RFECV with 5-fold cross-validation
+cv_folds = 5
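The manual permutation-importance loop above can be sanity-checked on the CPU: scikit-learn ships the same procedure as `sklearn.inspection.permutation_importance` (permuted score minus baseline score, averaged over `n_repeats`). A minimal sketch on synthetic data, with a plain scikit-learn `RandomForestRegressor` standing in for the cuML model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic regression data: 5 informative features among 10
X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# With scoring="neg_mean_squared_error", importances_mean[j] follows the same sign
# convention as the manual GPU version: permuted MSE minus baseline MSE
result = permutation_importance(
    model, X, y, n_repeats=3, random_state=0, scoring="neg_mean_squared_error"
)

ranking = np.argsort(result.importances_mean)  # least important feature first
print("Least important feature index:", int(ranking[0]))
```

As in the script, the model is evaluated on its own training data here, so the absolute importances are optimistic; it is the ranking that drives the feature elimination.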
+best_features_rfecv, best_mse_rfecv, history = manual_rfecv(X_gpu, y_gpu, cv=cv_folds) + +print(f"\n手动RFECV选择了 {len(best_features_rfecv)} 个特征,CV MSE: {best_mse_rfecv:.4f}") +selected_feature_names = [X_df.columns[i] for i in best_features_rfecv] +print("选定特征名称:", selected_feature_names) diff --git a/1d-qsar/cuda/random_forest_demo.ipynb b/1d-qsar/cuda/random_forest_demo.ipynb new file mode 100644 index 0000000..eb4e6e7 --- /dev/null +++ b/1d-qsar/cuda/random_forest_demo.ipynb @@ -0,0 +1,294 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Random Forest and Pickling\n", + "The Random Forest algorithm is a classification method which builds several decision trees, and aggregates each of their outputs to make a prediction.\n", + "\n", + "In this notebook we will train a scikit-learn and a cuML Random Forest Classification model. Then we save the cuML model for future use with Python's `pickling` mechanism and demonstrate how to re-load it for prediction. 
We also compare the results of the scikit-learn, non-pickled and pickled cuML models.\n", + "\n", + "Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.\n", + "\n", + "For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable)\n", + "\n", + "For additional information on cuML's random forest model: https://docs.rapids.ai/api/cuml/stable/api.html#random-forest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cudf\n", + "import numpy as np\n", + "import pandas as pd\n", + "import pickle\n", + "\n", + "from cuml.ensemble import RandomForestClassifier as curfc\n", + "from cuml.metrics import accuracy_score\n", + "\n", + "from sklearn.ensemble import RandomForestClassifier as skrfc\n", + "from sklearn.datasets import make_classification\n", + "from sklearn.model_selection import train_test_split" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Define Parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The speedup obtained by using cuML's Random Forest implementation\n", + "# becomes much higher when using larger datasets. 
Uncomment and use the n_samples\n", + "# value provided below to see the difference in the time required to run\n", + "# Scikit-learn's vs cuML's implementation with a large dataset.\n", + "\n", + "# n_samples = 2**17\n", + "n_samples = 2**12\n", + "n_features = 399\n", + "n_info = 300\n", + "data_type = np.float32" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generate Data\n", + "\n", + "### Host" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "X,y = make_classification(n_samples=n_samples,\n", + " n_features=n_features,\n", + " n_informative=n_info,\n", + " random_state=123, n_classes=2)\n", + "\n", + "X = pd.DataFrame(X.astype(data_type))\n", + "# cuML Random Forest Classifier requires the labels to be integers\n", + "y = pd.Series(y.astype(np.int32))\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", + " test_size = 0.2,\n", + " random_state=0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### GPU" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n", + "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n", + "\n", + "y_cudf_train = cudf.Series(y_train.values)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scikit-learn Model\n", + "\n", + "### Fit" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "sk_model = skrfc(n_estimators=40,\n", + " max_depth=16,\n", + " max_features=1.0,\n", + " random_state=10)\n", + "\n", + "sk_model.fit(X_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n",
+ "sk_predict = sk_model.predict(X_test)\n", + "sk_acc = accuracy_score(y_test, sk_predict)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## cuML Model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fit" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "cuml_model = curfc(n_estimators=40,\n", + " max_depth=16,\n", + " max_features=1.0,\n", + " random_state=10)\n", + "\n", + "cuml_model.fit(X_cudf_train, y_cudf_train)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "fil_preds_orig = cuml_model.predict(X_cudf_test)\n", + "\n", + "fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pickle the cuML random forest classification model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "filename = 'cuml_random_forest_model.sav'\n", + "# save the trained cuml model into a file\n", + "pickle.dump(cuml_model, open(filename, 'wb'))\n", + "# delete the previous model to ensure that there is no leakage of pointers.\n", + "# this is not strictly necessary but just included here for demo purposes.\n", + "del cuml_model\n", + "# load the previously saved cuml model from a file\n", + "pickled_cuml_model = pickle.load(open(filename, 'rb'))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Predict using the pickled model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n", + "\n", + "fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), 
pred_after_pickling)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compare Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"CUML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n", + "print(\"CUML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"SKL accuracy: %s\" % sk_acc)\n", + "print(\"CUML accuracy before pickling: %s\" % fil_acc_orig)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}
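
The round-trip the notebook demonstrates with cuML — dump a fitted forest, delete it, reload it, and get identical predictions — can be exercised without a GPU. A minimal CPU-only sketch using scikit-learn's `RandomForestClassifier` as a stand-in for `curfc`:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=123)

model = RandomForestClassifier(n_estimators=40, max_depth=16, random_state=10).fit(X, y)
preds_before = model.predict(X)

# Round-trip through pickle bytes (equivalent to dump/load via a file)
restored = pickle.loads(pickle.dumps(model))
preds_after = restored.predict(X)

# A fitted forest is fully serialized, so predictions must match exactly
print("identical:", np.array_equal(preds_before, preds_after))
```

The same check is what the notebook's final cells perform by comparing `fil_acc_orig` with `fil_acc_after_pickling`.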