{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Forest and Pickling\n", "The Random Forest algorithm is a classification method that builds several decision trees and aggregates their outputs to make a prediction.\n", "\n", "In this notebook we train a scikit-learn and a cuML Random Forest classification model. We then save the cuML model for future use with Python's `pickle` module and demonstrate how to reload it for prediction. Finally, we compare the results of the scikit-learn model and of the cuML model before and after pickling.\n", "\n", "Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.\n", "\n", "For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).\n", "\n", "For additional information on cuML's random forest model, see https://docs.rapids.ai/api/cuml/stable/api.html#random-forest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cudf\n", "import numpy as np\n", "import pandas as pd\n", "import pickle\n", "\n", "from cuml.ensemble import RandomForestClassifier as curfc\n", "from cuml.metrics import accuracy_score\n", "\n", "from sklearn.ensemble import RandomForestClassifier as skrfc\n", "from sklearn.datasets import make_classification\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The speedup obtained by using cuML's Random Forest implementation\n", "# becomes much higher when using larger datasets.
Uncomment and use the n_samples\n", "# value provided below to see the difference in the time required to run\n", "# Scikit-learn's vs cuML's implementation with a large dataset.\n", "\n", "# n_samples = 2**17\n", "n_samples = 2**12\n", "n_features = 399\n", "n_info = 300\n", "data_type = np.float32" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Data\n", "\n", "### Host" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "X, y = make_classification(n_samples=n_samples,\n", " n_features=n_features,\n", " n_informative=n_info,\n", " random_state=123, n_classes=2)\n", "\n", "X = pd.DataFrame(X.astype(data_type))\n", "# cuML Random Forest Classifier requires the labels to be integers\n", "y = pd.Series(y.astype(np.int32))\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", " test_size=0.2,\n", " random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n", "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n", "\n", "y_cudf_train = cudf.Series(y_train.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn Model\n", "\n", "### Fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sk_model = skrfc(n_estimators=40,\n", " max_depth=16,\n", " max_features=1.0,\n", " random_state=10)\n", "\n", "sk_model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sk_predict = sk_model.predict(X_test)\n", "sk_acc = accuracy_score(y_test, sk_predict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## cuML Model" ] }, { "cell_type": "markdown",
"metadata": {}, "source": [ "### Fit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "cuml_model = curfc(n_estimators=40,\n", " max_depth=16,\n", " max_features=1.0,\n", " random_state=10)\n", "\n", "cuml_model.fit(X_cudf_train, y_cudf_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "fil_preds_orig = cuml_model.predict(X_cudf_test)\n", "\n", "fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pickle the cuML random forest classification model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = 'cuml_random_forest_model.sav'\n", "# Save the trained cuML model to a file; the context manager closes\n", "# (and flushes) the file before it is read back.\n", "with open(filename, 'wb') as f:\n", "    pickle.dump(cuml_model, f)\n", "# Delete the in-memory model so the prediction below can only come from the\n", "# reloaded copy. This is not strictly necessary; it is included for demo purposes.\n", "del cuml_model\n", "# Load the previously saved cuML model from the file\n", "with open(filename, 'rb') as f:\n", "    pickled_cuml_model = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using the pickled model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n", "\n", "fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compare Results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"CUML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n", "print(\"CUML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)" ] }, { "cell_type":
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"SKL accuracy: %s\" % sk_acc)\n", "print(\"CUML accuracy before pickling: %s\" % fil_acc_orig)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }