qsar/1d-qsar/cuda/random_forest_demo.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random Forest and Pickling\n",
    "The Random Forest algorithm is a classification method which builds several decision trees, and aggregates each of their outputs to make a prediction.\n",
    "\n",
    "In this notebook we will train a scikit-learn and a cuML Random Forest Classification model. Then we save the cuML model for future use with Python's `pickling` mechanism and demonstrate how to re-load it for prediction. We also compare the results of the scikit-learn, non-pickled and pickled cuML models.\n",
    "\n",
    "Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.\n",
    "\n",
    "For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable)\n",
    "\n",
    "For additional information cuML's random forest model: https://docs.rapids.ai/api/cuml/stable/api.html#random-forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pickle\n",
    "\n",
    "from cuml.ensemble import RandomForestClassifier as curfc\n",
    "from cuml.metrics import accuracy_score\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier as skrfc\n",
    "from sklearn.datasets import make_classification\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The speedup obtained by using cuML'sRandom Forest implementation\n",
    "# becomes much higher when using larger datasets. Uncomment and use the n_samples\n",
    "# value provided below to see the difference in the time required to run\n",
    "# Scikit-learn's vs cuML's implementation with a large dataset.\n",
    "\n",
    "# n_samples = 2*17\n",
    "n_samples = 2**12\n",
    "n_features = 399\n",
    "n_info = 300\n",
    "data_type = np.float32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X,y = make_classification(n_samples=n_samples,\n",
    "                          n_features=n_features,\n",
    "                          n_informative=n_info,\n",
    "                          random_state=123, n_classes=2)\n",
    "\n",
    "X = pd.DataFrame(X.astype(data_type))\n",
    "# cuML Random Forest Classifier requires the labels to be integers\n",
    "y = pd.Series(y.astype(np.int32))\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    test_size = 0.2,\n",
    "                                                    random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n",
    "X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
    "\n",
    "y_cudf_train = cudf.Series(y_train.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sk_model = skrfc(n_estimators=40,\n",
    "                 max_depth=16,\n",
    "                 max_features=1.0,\n",
    "                 random_state=10)\n",
    "\n",
    "sk_model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sk_predict = sk_model.predict(X_test)\n",
    "sk_acc = accuracy_score(y_test, sk_predict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cuml_model = curfc(n_estimators=40,\n",
    "                   max_depth=16,\n",
    "                   max_features=1.0,\n",
    "                   random_state=10)\n",
    "\n",
    "cuml_model.fit(X_cudf_train, y_cudf_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "fil_preds_orig = cuml_model.predict(X_cudf_test)\n",
    "\n",
    "fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pickle the cuML random forest classification model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = 'cuml_random_forest_model.sav'\n",
    "# save the trained cuml model into a file\n",
    "pickle.dump(cuml_model, open(filename, 'wb'))\n",
    "# delete the previous model to ensure that there is no leakage of pointers.\n",
    "# this is not strictly necessary but just included here for demo purposes.\n",
    "del cuml_model\n",
    "# load the previously saved cuml model from a file\n",
    "pickled_cuml_model = pickle.load(open(filename, 'rb'))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict using the pickled model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n",
    "\n",
    "fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"CUML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n",
    "print(\"CUML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"SKL accuracy: %s\" % sk_acc)\n",
    "print(\"CUML accuracy before pickling: %s\" % fil_acc_orig)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}