{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Rf258xT0XIwV"
},
"source": [
"# Training a Jet Regression with a **DNN**\n",
"\n",
"---\n",
"In this notebook, we perform a jet regression task using a \n",
"Dense Neural Network (DNN), also called a multi-layer perceptron (MLP). The task consists in the \n",
"regression of $\\tau_{32}$, given $\\tau_3$ and $\\tau_2$.\n",
"\n",
"For details on the physics problem, see https://arxiv.org/pdf/1804.06913.pdf\n",
"\n",
"For details on the dataset, see Notebook1\n",
"\n",
"---"
]
},
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"id": "4OMAZgtyXIwY" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"import h5py\n", | ||
"import glob\n", | ||
"import numpy as np\n", | ||
"import matplotlib.pyplot as plt\n", | ||
"%matplotlib inline" | ||
] | ||
}, | ||
{
"cell_type": "markdown",
"metadata": {
"id": "2lbB-J3hXIwb"
},
"source": [
"# Preparation of the training and validation samples\n",
"\n",
"---\n",
"In order to import the dataset, we now\n",
"- download the dataset archive (to import the data in Colab)\n",
"- load the h5 files from the Data-MLtutorial/JetDataset/ folder\n",
"- extract the data we need: the regression target and the input features\n",
"\n",
"To type shell commands, we start the command line with !\n",
"\n",
"**N.B.: if you are running locally and you have already downloaded the datasets, you can skip the cell below and, if needed, change the paths later to point to the folder with your previous download of the datasets.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jWjxFaRPXIwb"
},
"outputs": [],
"source": [
"! curl https://cernbox.cern.ch/s/6Ec5pGFEpFWeH6S/download -o Data-MLtutorial.tar.gz\n",
"! tar -xvzf Data-MLtutorial.tar.gz\n",
"! ls Data-MLtutorial/JetDataset/\n",
"! rm Data-MLtutorial.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "cCGhrKdwXIwc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5\n",
"Appending Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5\n",
"(50000,) (50000, 2)\n"
]
}
],
"source": [
"target = np.array([])\n",
"features = np.array([])\n",
"ptype = np.array([])\n",
"# we cannot load all data on Colab, so we just take a few files\n",
"datafiles = ['Data-MLtutorial/JetDataset/jetImage_7_100p_30000_40000.h5',\n",
"             'Data-MLtutorial/JetDataset/jetImage_7_100p_60000_70000.h5',\n",
"             'Data-MLtutorial/JetDataset/jetImage_7_100p_50000_60000.h5',\n",
"             'Data-MLtutorial/JetDataset/jetImage_7_100p_10000_20000.h5',\n",
"             'Data-MLtutorial/JetDataset/jetImage_7_100p_0_10000.h5']\n",
"# if you are running locally, you can use the full dataset doing\n",
"# for fileIN in glob.glob(\"tutorials/HiggsSchool/data/*h5\"):\n",
"for fileIN in datafiles:\n",
"    print(\"Appending %s\" % fileIN)\n",
"    f = h5py.File(fileIN, 'r')\n",
"    # input features used for the regression: columns 5 and 6 of the 'jets' array\n",
"    myFeatures = np.array(f.get(\"jets\")[:,[5,6]])\n",
"    # jet-type one-hot labels: columns -6 to -2 of the 'jets' array\n",
"    myptype = np.array(f.get('jets')[0:,-6:-1])\n",
"    # regression target: column 10 of the 'jets' array\n",
"    mytarget = np.array(f.get('jets')[0:,10])\n",
"    features = np.concatenate([features, myFeatures], axis=0) if features.size else myFeatures\n",
"    target = np.concatenate([target, mytarget], axis=0) if target.size else mytarget\n",
"    ptype = np.concatenate([ptype, myptype], axis=0) if ptype.size else myptype\n",
"    f.close()\n",
"print(target.shape, features.shape)"
]
},
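{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what else these files contain, we can list the datasets stored in one of them. The sketch below simply re-opens the first file in `datafiles` and prints the name and shape of each entry."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# inspect the content of one of the h5 files (sketch)\n",
"with h5py.File(datafiles[0], 'r') as f:\n",
"    for name in f.keys():\n",
"        # print the shape for datasets; groups have no shape attribute\n",
"        print(name, f[name].shape if hasattr(f[name], 'shape') else '(group)')"
]
},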
{
"cell_type": "markdown",
"metadata": {
"id": "6a333RYPXIwe"
},
"source": [
"The dataset consists of 50000 jets, each represented here by 2 input features and the regression target\n",
"\n",
"---\n",
"\n",
"We now shuffle the data, splitting them into a training and a test dataset with an 80:20 ratio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZBqFs1eBXIwf"
},
"outputs": [],
"source": [
"# keep only the jets flagged in the last one-hot label column\n",
"features = features[ptype[:,-1]>0]\n",
"target = target[ptype[:,-1]>0]\n",
"ptype = ptype[ptype[:,-1]>0]\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)\n",
"print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)\n",
"del features, target"
]
},
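{
"cell_type": "markdown",
"metadata": {},
"source": [
"The network below is trained on the raw features. As an optional preprocessing step, one could standardize the inputs to zero mean and unit variance; a minimal sketch with scikit-learn's `StandardScaler` (not used in the rest of this notebook) follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional input standardization (sketch only; the notebook trains on raw features)\n",
"from sklearn.preprocessing import StandardScaler\n",
"scaler = StandardScaler().fit(X_train)   # fit on the training set only\n",
"X_train_std = scaler.transform(X_train)  # apply the same transformation\n",
"X_test_std = scaler.transform(X_test)    # ... to the test set\n",
"print(X_train_std.mean(axis=0), X_train_std.std(axis=0))"
]
},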
{
"cell_type": "markdown",
"metadata": {
"id": "GkNz5UAhXIwg"
},
"source": [
"# DNN model building"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tTSDOiEHXIwh"
},
"outputs": [],
"source": [
"# keras imports\n",
"from tensorflow.keras.models import Model\n",
"from tensorflow.keras.layers import Dense, Input, Dropout, Flatten, Activation\n",
"from tensorflow.keras.utils import plot_model\n",
"from tensorflow.keras import backend as K\n",
"from tensorflow.keras import metrics\n",
"from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rAl0DZTxXIwi"
},
"outputs": [],
"source": [
"input_shape = X_train.shape[1]  # number of input features\n",
"dropoutRate = 0.0               # dropout probability (0 disables dropout)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2l492G8BXIwj"
},
"outputs": [],
"source": [
"####\n",
"inputArray = Input(shape=(input_shape,))\n",
"#\n",
"x = Dense(40, activation='relu')(inputArray)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(20)(x)\n",
"x = Activation('relu')(x)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(10, activation='relu')(x)\n",
"x = Dropout(dropoutRate)(x)\n",
"#\n",
"x = Dense(5, activation='relu')(x)\n",
"#\n",
"output = Dense(1)(x)  # single linear output node for the regression\n",
"####\n",
"model = Model(inputs=inputArray, outputs=output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xu8rRUkhXIwj"
},
"outputs": [],
"source": [
"model.compile(loss='mse', optimizer='adam')\n",
"model.summary()"
]
},
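{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `plot_model` utility imported above can draw the architecture as a diagram. A minimal sketch (optional: it requires the `pydot` package and Graphviz to be installed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# draw the network architecture (optional; needs pydot + Graphviz)\n",
"plot_model(model, show_shapes=True, to_file='dnn_regression.png')"
]
},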
{
"cell_type": "markdown",
"metadata": {
"id": "2HfKWoOtXIwk"
},
"source": [
"We now train the model with these settings:\n",
"\n",
"- the **batch size** is a hyperparameter of gradient descent that controls the number of training samples to work through before the model's internal parameters are updated\n",
"  - batch size = 1 results in fast computation but noisy training that is slow to converge\n",
"  - batch size = dataset size results in slow computation but faster convergence\n",
"\n",
"- the **number of epochs** controls the number of complete passes through the full training dataset -- at each epoch, gradients are computed for each of the mini-batches and the model's internal parameters are updated.\n",
"\n",
"- the **callbacks** are algorithms used to optimize the training (full list [here](https://keras.io/api/callbacks/)):\n",
"  - *EarlyStopping*: stop training when a monitored metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
"  - *ReduceLROnPlateau*: reduce the learning rate when a metric (`monitor`) has stopped improving in the last N epochs (`patience`)\n",
"  - *TerminateOnNaN*: terminate training when a NaN loss is encountered"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KzO-lyLEXIwk"
},
"outputs": [],
"source": [
"batch_size = 128\n",
"n_epochs = 100\n",
"\n",
"# train\n",
"history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose=2,\n",
"                    validation_split=0.2,\n",
"                    callbacks=[EarlyStopping(monitor='val_loss', patience=10, verbose=1),\n",
"                               ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1),\n",
"                               TerminateOnNaN()])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "044bCLqVXIwl"
},
"outputs": [],
"source": [
"# plot training history\n",
"plt.plot(history.history['loss'])\n",
"plt.plot(history.history['val_loss'])\n",
"plt.yscale('log')\n",
"plt.title('Training History')\n",
"plt.ylabel('loss')\n",
"plt.xlabel('epoch')\n",
"plt.legend(['training', 'validation'], loc='upper right')\n",
"plt.show()"
]
},
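{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since *ReduceLROnPlateau* is used, Keras also records the learning rate at each epoch under the `lr` key of `history.history` (the exact key can differ between Keras versions, hence the guard below). This lets us inspect the learning-rate schedule:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the learning-rate schedule recorded by ReduceLROnPlateau\n",
"if 'lr' in history.history:\n",
"    plt.plot(history.history['lr'])\n",
"    plt.yscale('log')\n",
"    plt.ylabel('learning rate')\n",
"    plt.xlabel('epoch')\n",
"    plt.show()"
]
},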
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predict = model.predict(X_test)"
]
},
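{
"cell_type": "markdown",
"metadata": {},
"source": [
"To quantify the regression performance beyond visual inspection, we can compute simple metrics on the test set, e.g. the mean squared error (our training loss) and the mean absolute error. A quick sketch using numpy only:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# simple regression metrics on the test set\n",
"residuals = predict.flatten() - y_test\n",
"mse = np.mean(residuals**2)       # mean squared error (the training loss)\n",
"mae = np.mean(np.abs(residuals))  # mean absolute error\n",
"print('MSE: %.5f  MAE: %.5f' % (mse, mae))"
]
},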
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# predicted vs. true target values\n",
"plt.scatter(y_test, predict.flatten(), s=0.1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(y_test, bins=100, range=(0,1), histtype='step', label='true')\n",
"plt.hist(predict.flatten(), bins=100, range=(0,1), histtype='step', label='predicted')\n",
"plt.legend(loc='upper right')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oESSmNLxXIwm"
},
"source": [
"# Plot performance with 2D histograms"
]
},
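{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation for the empty cell below: a 2D histogram of true vs. predicted values (a sketch; the binning and range are arbitrary choices)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 2D histogram of true vs. predicted target values (sketch)\n",
"plt.hist2d(y_test, predict.flatten(), bins=100, range=[[0,1],[0,1]])\n",
"plt.xlabel('true')\n",
"plt.ylabel('predicted')\n",
"plt.colorbar(label='jets')\n",
"plt.show()"
]
},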
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lzbQ-d0RKVmV"
},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"name": "Notebook2_JetID_DNN.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 0
}