295 lines
7.3 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random Forest and Pickling\n",
    "The Random Forest algorithm is an ensemble classification method that builds several decision trees and aggregates their outputs to make a prediction.\n",
    "\n",
    "In this notebook we train a scikit-learn and a cuML Random Forest classification model. We then save the cuML model for future use with Python's `pickle` module and demonstrate how to reload it for prediction. We also compare the results of the scikit-learn model with those of the non-pickled and pickled cuML models.\n",
    "\n",
    "Note that the underlying algorithm cuML uses for tree node splits differs from the one used in scikit-learn.\n",
    "\n",
    "For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n",
    "\n",
    "For additional information on cuML's random forest model, see https://docs.rapids.ai/api/cuml/stable/api.html#random-forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pickle\n",
    "\n",
    "from cuml.ensemble import RandomForestClassifier as curfc\n",
    "from cuml.metrics import accuracy_score\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier as skrfc\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The speedup obtained by using cuML's Random Forest implementation\n",
    "# grows as the dataset gets larger. Uncomment and use the n_samples\n",
    "# value provided below to see the difference in the time required to run\n",
    "# scikit-learn's vs cuML's implementation with a large dataset.\n",
    "\n",
    "# n_samples = 2**17\n",
    "n_samples = 2**12\n",
    "n_features = 399\n",
    "n_info = 300\n",
    "data_type = np.float32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X, y = make_classification(n_samples=n_samples,\n",
    "                           n_features=n_features,\n",
    "                           n_informative=n_info,\n",
    "                           random_state=123, n_classes=2)\n",
    "\n",
    "X = pd.DataFrame(X.astype(data_type))\n",
    "# cuML Random Forest Classifier requires the labels to be integers\n",
    "y = pd.Series(y.astype(np.int32))\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    test_size=0.2,\n",
    "                                                    random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n",
    "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
    "\n",
    "y_cudf_train = cudf.Series(y_train.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sk_model = skrfc(n_estimators=40,\n",
    "                 max_depth=16,\n",
    "                 max_features=1.0,\n",
    "                 random_state=10)\n",
    "\n",
    "sk_model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sk_predict = sk_model.predict(X_test)\n",
    "sk_acc = accuracy_score(y_test, sk_predict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cuml_model = curfc(n_estimators=40,\n",
    "                   max_depth=16,\n",
    "                   max_features=1.0,\n",
    "                   random_state=10)\n",
    "\n",
    "cuml_model.fit(X_cudf_train, y_cudf_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "fil_preds_orig = cuml_model.predict(X_cudf_test)\n",
    "\n",
    "fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pickle the cuML random forest classification model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = 'cuml_random_forest_model.sav'\n",
    "# save the trained cuML model to a file; the with-block closes the file handle\n",
    "with open(filename, 'wb') as f:\n",
    "    pickle.dump(cuml_model, f)\n",
    "# delete the previous model to ensure that there is no leakage of pointers.\n",
    "# this is not strictly necessary but is included here for demo purposes.\n",
    "del cuml_model\n",
    "# load the previously saved cuML model from the file\n",
    "with open(filename, 'rb') as f:\n",
    "    pickled_cuml_model = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict using the pickled model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n",
    "\n",
    "fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"cuML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n",
    "print(\"cuML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"scikit-learn accuracy: %s\" % sk_acc)\n",
    "print(\"cuML accuracy before pickling: %s\" % fil_acc_orig)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}