Commit

Part 2 Notebooks
gabhijith authored Jul 19, 2023
1 parent eac9b9f commit e72be04
Showing 2 changed files with 790 additions and 0 deletions.
356 changes: 356 additions & 0 deletions part2/2.JetRegressionMLP.ipynb
@@ -0,0 +1,356 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Rf258xT0XIwV"
},
"source": [
"# Training a Jet Regression with **DNN** \n",
"\n",
"---\n",
"In this notebook, we perform a Jet identification task using a multiclass classifier based on a \n",
"Dense Neural Network (DNN), also called multi-layer perceptron (MLP). The problem consists is \n",
"regression of $tau_{32}$, given $tau_3$ and $tau_2$.\n",
"\n",
"For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf \n",
"\n",
"For details on the dataset, see Notebook1\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "4OMAZgtyXIwY"
},
"outputs": [],
"source": [
"import os\n",
"import h5py\n",
"import glob\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2lbB-J3hXIwb"
},
"source": [
"# Preparation of the training and validation samples\n",
"\n",
"---\n",
"In order to import the dataset, we now\n",
"- clone the dataset repository (to import the data in Colab)\n",
"- load the h5 files in the data/ repository\n",
"- extract the data we need: a target and jetImage \n",
"\n",
"To type shell commands, we start the command line with !\n",
"\n",
"**nb, if you are running locally and you have already downloaded the datasets you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jWjxFaRPXIwb"
},
"outputs": [],
"source": [
"! curl https://cernbox.cern.ch/s/6Ec5pGFEpFWeH6S/download -o Data-MLtutorial.tar.gz\n",
"! tar -xvzf Data-MLtutorial.tar.gz \n",
"! ls Data-MLtutorial/JetDataset/\n",
"! rm Data-MLtutorial.tar.gz "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "cCGhrKdwXIwc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5\n",
"(50000,) (50000, 2)\n"
]
}
],
"source": [
"target = np.array([])\n",
"features = np.array([])\n",
"ptype = np.array([])\n",
"# we cannot load all data on Colab. So we just take a few files\n",
"datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
" 'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
" 'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
" 'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
" 'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
"# if you are running locallt, you can use the full dataset doing\n",
"# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
"for fileIN in datafiles:\n",
" print(\"Appending %s\" %fileIN)\n",
" f = h5py.File(fileIN)\n",
" myFeatures = np.array(f.get(\"jets\")[:,[5,6]])\n",
" myptype = np.array(f.get('jets')[0:,-6:-1])\n",
" mytarget = np.array(f.get('jets')[0:,10])\n",
" features = np.concatenate([features, myFeatures], axis=0) if features.size else myFeatures\n",
" target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
" ptype = np.concatenate([ptype, myptype], axis=0) if ptype.size else myptype\n",
" f.close()\n",
"print(target.shape, features.shape)"
]
},
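{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (our addition, not in the original notebook): if the\n",
"# target is the conventional ratio tau_32 = tau_3 / tau_2, and if the two\n",
"# feature columns are ordered (tau_2, tau_3) -- an assumption -- then the\n",
"# target should match features[:,1] / features[:,0] up to numerical precision.\n",
"# Swap the indices if the check fails.\n",
"ratio = features[:, 1] / np.clip(features[:, 0], 1e-9, None)\n",
"print(\"max |target - tau3/tau2|:\", np.abs(target - ratio).max())"
]
},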
{
"cell_type": "markdown",
"metadata": {
"id": "6a333RYPXIwe"
},
"source": [
"The dataset consists of 50000 jets, each represented by 16 features\n",
"\n",
"---\n",
"\n",
"We now shuffle the data, splitting them into a training and a validation dataset with 2:1 ratio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZBqFs1eBXIwf"
},
"outputs": [],
"source": [
"features = features[ptype[:,-1]>0]\n",
"target = target[ptype[:,-1]>0]\n",
"ptype = ptype[ptype[:,-1]>0]\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)\n",
"print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)\n",
"del features, target"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GkNz5UAhXIwg"
},
"source": [
"# DNN model building"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tTSDOiEHXIwh"
},
"outputs": [],
"source": [
"# keras imports\n",
"from tensorflow.keras.models import Model\n",
"from tensorflow.keras.layers import Dense, Input, Dropout, Flatten, Activation\n",
"from tensorflow.keras.utils import plot_model\n",
"from tensorflow.keras import backend as K\n",
"from tensorflow.keras import metrics\n",
"from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rAl0DZTxXIwi"
},
"outputs": [],
"source": [
"input_shape = X_train.shape[1]\n",
"dropoutRate = 0.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2l492G8BXIwj"
},
"outputs": [],
"source": [
"####\n",
"inputArray = Input(shape=(input_shape,))\n",
"#\n",
"x = Dense(40, activation='relu')(inputArray)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(20)(x)\n",
"x = Activation('relu')(x)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(10, activation='relu')(x)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(5, activation='relu')(x)\n",
"#\n",
"output = Dense(1)(x)\n",
"####\n",
"model = Model(inputs=inputArray, outputs=output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xu8rRUkhXIwj"
},
"outputs": [],
"source": [
"model.compile(loss='mse', optimizer='adam')\n",
"model.summary()"
]
},
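{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: draw the network graph with the plot_model utility imported above.\n",
"# This is a sketch, not part of the original notebook; it needs the pydot and\n",
"# graphviz packages, so skip it if they are not installed in your environment.\n",
"plot_model(model, show_shapes=True)"
]
},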
{
"cell_type": "markdown",
"metadata": {
"id": "2HfKWoOtXIwk"
},
"source": [
"We now train the model with these settings:\n",
"\n",
"- the **batch size** is a hyperparameter of gradient descent that controls the number of training samples to work through before the model internal parameters are updated\n",
" - batch size = 1 results in fast computation but noisy training that is slow to converge\n",
" - batch size = dataset size results in slow computation but faster convergence)\n",
"\n",
"- the **number of epochs** controls the number of complete passes through the full training dataset -- at each epoch gradients are computed for each of the mini batches and model internal parameters are updated.\n",
"\n",
"- the **callbacks** are algorithms used to optimize the training (full list [here](https://keras.io/api/callbacks/)):\n",
" - *EarlyStopping*: stop training when a monitored metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
" - *ReduceLROnPlateau*: reduce learning rate when a metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
" - *TerminateOnNaN*: terminates training when a NaN loss is encountered"
]
},
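{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A short worked example of the batch/epoch arithmetic described above (our\n",
"# addition): with validation_split=0.2, Keras fits on 80% of X_train, so each\n",
"# epoch performs ceil(n_fit / batch_size) parameter updates (128 is the batch\n",
"# size used in the training cell below).\n",
"import math\n",
"n_fit = int(0.8 * len(X_train))\n",
"print(\"updates per epoch:\", math.ceil(n_fit / 128))"
]
},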
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KzO-lyLEXIwk"
},
"outputs": [],
"source": [
"batch_size = 128\n",
"n_epochs = 100\n",
"\n",
"# train \n",
"history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose = 2,\n",
" validation_split=0.2,\n",
" callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
" ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
" TerminateOnNaN()]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "044bCLqVXIwl"
},
"outputs": [],
"source": [
"# plot training history\n",
"plt.plot(history.history['loss'])\n",
"plt.plot(history.history['val_loss'])\n",
"plt.yscale('log')\n",
"plt.title('Training History')\n",
"plt.ylabel('loss')\n",
"plt.xlabel('epoch')\n",
"plt.legend(['training', 'validation'], loc='upper right')\n",
"plt.show()"
]
},
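{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small addition: EarlyStopping does not restore the best weights by default\n",
"# (restore_best_weights=False in Keras), so the model keeps the last epoch's\n",
"# weights; report the best validation loss seen during training for reference.\n",
"best = int(np.argmin(history.history['val_loss']))\n",
"print(\"best epoch: %d, val_loss: %.5f\" % (best + 1, history.history['val_loss'][best]))"
]
},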
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predict = model.predict(X_test)"
]
},
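{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quantify the regression performance on the test set (a minimal addition, not\n",
"# in the original notebook): mean squared error and mean absolute error.\n",
"residual = y_test - predict.flatten()\n",
"print(\"test MSE: %.5f, test MAE: %.5f\" % (np.mean(residual**2), np.mean(np.abs(residual))))"
]
},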
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(y_test,predict.flatten(),s=0.1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(y_test,bins=100,range=(0,1),histtype='step',label='true')\n",
"plt.hist(predict.flatten(),bins=100,range=(0,1),histtype='step',label='predict')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oESSmNLxXIwm"
},
"source": [
"# Plot performce with 2D histograms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lzbQ-d0RKVmV"
},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"name": "Notebook2_JetID_DNN.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 0
}