qsar/1d-qsar/cuda/random_forest_demo.ipynb
mm644706215 4cb2d9f56c add 1dqsar
2025-03-03 20:23:09 +08:00


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forest and Pickling\n",
"The Random Forest algorithm is an ensemble classification method that builds several decision trees and aggregates their outputs to make a prediction.\n",
"\n",
"In this notebook we train a scikit-learn and a cuML Random Forest classification model. We then save the cuML model for future use with Python's `pickle` module and demonstrate how to reload it for prediction. Finally, we compare the results of the scikit-learn model, the original cuML model, and the pickled cuML model.\n",
"\n",
"Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.\n",
"\n",
"For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
"\n",
"For additional information on cuML's random forest model, see https://docs.rapids.ai/api/cuml/stable/api.html#random-forest"
]
},
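{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the aggregation step concrete, the next cell is a minimal sketch of majority voting over hypothetical per-tree predictions. The `tree_votes` array is made up for illustration and is not produced by either library. Real implementations may aggregate differently; scikit-learn, for example, averages the per-tree class probabilities rather than counting hard votes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical class votes from three trees (rows) for four samples (columns).\n",
"tree_votes = np.array([[0, 1, 1, 0],\n",
"                       [0, 1, 0, 0],\n",
"                       [1, 1, 1, 0]])\n",
"\n",
"# For each sample, count the votes per class and keep the most frequent label.\n",
"majority = np.array([np.bincount(col).argmax() for col in tree_votes.T])\n",
"print(majority)  # prints [0 1 1 0]"
]
},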
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pickle\n",
"\n",
"from cuml.ensemble import RandomForestClassifier as curfc\n",
"from cuml.metrics import accuracy_score\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier as skrfc\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The speedup obtained by using cuML's Random Forest implementation\n",
"# grows with dataset size. Uncomment and use the n_samples value\n",
"# provided below to see the difference in the time required to run\n",
"# scikit-learn's vs cuML's implementation with a large dataset.\n",
"\n",
"# n_samples = 2**17\n",
"n_samples = 2**12\n",
"n_features = 399\n",
"n_info = 300\n",
"data_type = np.float32"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Data\n",
"\n",
"### Host"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"X, y = make_classification(n_samples=n_samples,\n",
"                           n_features=n_features,\n",
"                           n_informative=n_info,\n",
"                           random_state=123, n_classes=2)\n",
"\n",
"X = pd.DataFrame(X.astype(data_type))\n",
"# cuML Random Forest Classifier requires the labels to be integers\n",
"y = pd.Series(y.astype(np.int32))\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
"                                                    test_size=0.2,\n",
"                                                    random_state=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n",
"X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
"\n",
"y_cudf_train = cudf.Series(y_train.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn Model\n",
"\n",
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"sk_model = skrfc(n_estimators=40,\n",
"                 max_depth=16,\n",
"                 max_features=1.0,\n",
"                 random_state=10)\n",
"\n",
"sk_model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"sk_predict = sk_model.predict(X_test)\n",
"sk_acc = accuracy_score(y_test, sk_predict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cuML Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"cuml_model = curfc(n_estimators=40,\n",
"                   max_depth=16,\n",
"                   max_features=1.0,\n",
"                   random_state=10)\n",
"\n",
"cuml_model.fit(X_cudf_train, y_cudf_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"fil_preds_orig = cuml_model.predict(X_cudf_test)\n",
"\n",
"fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pickle the cuML random forest classification model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"filename = 'cuml_random_forest_model.sav'\n",
"# save the trained cuML model to a file; the context manager closes the file handle\n",
"with open(filename, 'wb') as f:\n",
"    pickle.dump(cuml_model, f)\n",
"# delete the original model to ensure that there is no leakage of pointers.\n",
"# this is not strictly necessary; it is included here for demo purposes.\n",
"del cuml_model\n",
"# load the previously saved cuML model from the file\n",
"with open(filename, 'rb') as f:\n",
"    pickled_cuml_model = pickle.load(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict using the pickled model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n",
"\n",
"fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"cuML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n",
"print(\"cuML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)"
]
},
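{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equal accuracies do not by themselves show that the original and reloaded models predict identically. As a minimal sketch, assuming `fil_preds_orig` and `pred_after_pickling` from the cells above are still in memory, the next cell scores one set of predictions against the other with the same `accuracy_score` imported earlier. A value of 1.0 would mean every prediction matches. The name `agreement` is introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Treat the pre-pickling predictions as the reference and the\n",
"# post-pickling predictions as the candidate; the score is the\n",
"# fraction of test samples on which the two models agree.\n",
"agreement = accuracy_score(fil_preds_orig, pred_after_pickling)\n",
"print(\"Agreement between original and reloaded cuML model: %s\" % agreement)"
]
},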
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"scikit-learn accuracy: %s\" % sk_acc)\n",
"print(\"cuML accuracy before pickling: %s\" % fil_acc_orig)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}