From 2ec6c95de00c2041f28b467acea1e8875a1833c3 Mon Sep 17 00:00:00 2001
From: Dokyoon Yoon <32949939+Yoondokyoon@users.noreply.github.com>
Date: Thu, 25 Oct 2018 11:22:37 +0900
Subject: [PATCH] Update Fraud detection Kernel as Korean about 90% (#44)

translating
---
 Korean/Carvana_3rd_place_solution.ipynb | 226 ++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 Korean/Carvana_3rd_place_solution.ipynb
diff --git a/Korean/Carvana_3rd_place_solution.ipynb b/Korean/Carvana_3rd_place_solution.ipynb
new file mode 100644
index 0000000..e4b1add
--- /dev/null
+++ b/Korean/Carvana_3rd_place_solution.ipynb
@@ -0,0 +1,226 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Carvana-3rd place solution.ipynb",
+      "version": "0.3.2",
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "metadata": {
+        "id": "dHxeclomxAef",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "이렇게 흥미로운 컴페티션을 개최해 주셔서 감사합니다! 이번 컴페티션에 정말 재밌게 시간을 쏟을 수 있었습니다.\n",
+        "\n",
+        "그리고 @Peter의 유용한 코드와 @HengCher Keng의 훌륭한 아이디어에 감사드립니다.\n",
+        "\n",
+        "덕분에 많이 배웠습니다.\n",
+        "\n",
+        "컴페티션 저장소입니다. 여기에 두 개의 스크립트(network script, loss function script)를 저장했습니다.\n",
+        "\n",
+        "[https://github.com/lyakaap/Kaggle-Carvana-3rd-place-solution](https://github.com/lyakaap/Kaggle-Carvana-3rd-place-solution)"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "V4CebvOuxwTD",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Thanks for hosting such an exciting competition! I really enjoyed the time I spent on it. Thanks also to @Peter for the helpful code and to @HengCher Keng for the excellent ideas. I learned a lot from them.\n",
+        "\n",
+        "The competition repository is here. It contains two scripts (my network script and my loss functions script): https://github.com/lyakaap/Kaggle-Carvana-3rd-place-solution\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "tLHe2dpmxzzy",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Solution 개요\n",
+        "\n",
+        "\n",
+        "*   1536x1024 & 1920x1280 해상도를 사용했다.\n",
+        "*   U-net을 수정했습니다. bottleneck block에 몇개의 팽창한<sup>dilated</sup> convolution layer가 있다. (feature map의 resolution이 가장 낮은 곳)\n",
+        "\n",
+        "제 네트워크 아키텍처의 세부 사항이 여기에 있습니다\n",
+        "\n",
+        "My solution overview\n",
+        "\n",
+        "*   I used 1536x1024 & 1920x1280 resolutions.\n",
+        "\n",
+        "*   I used a modified U-Net. It has several dilated convolution layers in the bottleneck block (i.e. where the resolution of the feature maps is lowest).\n",
+        "\n",
+        "A detailed figure of my network architecture is shown below.\n",
+        "\n",
+        "![alt text](https://kaggle2.blob.core.windows.net/forum-message-attachments/225523/7428/network.png)\n",
+        "\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "ZZaWkBDyyqmf",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "이 모델의 최고 점수는 0.997193이며 파라미터는 약 850만개에 불과합니다.\n",
+        "\n",
+        "(6 fold 중 하나 학습, TTA나 ensemble을 사용안함, input 해상도: 1920x1280) 0.997193 모델의 두 가지 예측 값(TTA, 원본 영상&플립<sup>flipped</sup>영상)을 평균시키면 0.997222까지 도달할 수 있습니다.\n",
+        "\n",
+        "각각 LB에서 6위와 5위를 차지합니다."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "SUqg8sEHyrcB",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "The best score of this model is 0.997193, with only around 8.5 million parameters (a single model from one of 6 folds, no TTA, no ensemble, input resolution: 1920x1280). Averaging two predictions from the 0.997193 model (TTA: original image and flipped image) reached 0.997222. These scores rank 6th and 5th on the LB, respectively!"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "MuS2KtZF0C8a",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "저는 bottleneck 블록에서 팽창<sup>dilated</sup>된 convolution layer 대신에 일반적인 convolution layer를 시도했습니다.\n",
+        "\n",
+        "그리고 그 점수는 팽창<sup>dilated</sup>된 convolution을 사용하는 것보다 매우 낮습니다. (normal: 0.9905, 팽창된 conv: 0.9918, @256x256)\n",
+        "\n",
+        "저는 또한 층을 쌓는<sup>stacking</sup> 대신 평행하게 팽창<sup>dilated</sup>된 convolution layer를 사용하였습니다. \n",
+        "\n",
+        "그러나 이는 stacked architecture보다 낮은 점수가 나옵니다."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "n6dE5EWv0bLW",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "I tried normal convolution layers instead of dilated convolution layers in the bottleneck block, and the score was significantly lower than with dilated convolution (normal: 0.9905, dilated conv: 0.9918 @256x256).\n",
+        "\n",
+        "I also tried parallel dilated convolution layers instead of stacking them, but that gave a lower score than the stacked architecture."
+      ]
+    },
+    {
+      "metadata": {
+        "id": "yRjk8-J21jFV",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Optimizer: RMSprop lr=0.0002, ReduceLROnPlateau()를 사용하여 학습 속도를 줄입니다. Reducing factor는 0.2 & 0.5입니다.\n",
+        "\n",
+        "Data Augmentation: 수평 플립<sup>flip</sup>만 사용. 스케일링, 시프트<sup>shifting</sup>, 그리고 HSV 시프트는 overfitting 되어 나왔다.\n",
+        "\n",
+        "batchsize: 1, 그리고 BN 없음\n",
+        "\n",
+        "single 모델을 Training하는데 2일 정도 시간이 걸림.\n",
+        "\n",
+        "Pseudo Labeling: 동시에 학습하거나 pretraining phase에만 사용합니다.\n",
+        "\n",
+        "Loss function: bce + dice loss(weighing boundary pixel loss도 써봤지만, 비슷한 결과를 주었습니다. overfitting이 두려워 결국 사용하지는 않았습니다.)\n",
+        "\n",
+        "Ensemble : 5 fold ensemble @1536x1024 + 6 fold ensemble @1920x1280, 가중 평균, submission에서 LB 순위에 따라 가중치를 부여했다. \n",
+        "\n",
+        "TTA: 수평 플립만\n",
+        "\n",
+        "임계값 조정: validation set에서 최고 점수를 주는 임계값으로 정했습니다(0.508). LB에서는 겨우 0.000001밖에 점수가 오르지 않았습니다.\n",
+        "\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "C9kjXupa1meN",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "*   Optimizer: RMSprop with lr = 0.0002. The learning rate is reduced with ReduceLROnPlateau(), a Keras callback. The reducing factor is 0.2 & 0.5.\n",
+        "\n",
+        "*   Data augmentation: horizontal flip only. Scaling, shifting, and HSV shifting resulted in overfitting for me.\n",
+        "\n",
+        "*   Batch size: 1, and no BN.\n",
+        "\n",
+        "*   Training a single model takes around 2 days in total.\n",
+        "\n",
+        "*   Pseudo labeling: either trained on simultaneously, or used only in the pretraining phase.\n",
+        "\n",
+        "*   Loss function: bce + dice loss. (I also tried weighting the boundary pixel loss, which gave similar results. Fearing overfitting, I decided not to use it.)\n",
+        "\n",
+        "*   Ensemble: 5-fold ensemble @1536x1024 + 6-fold ensemble @1920x1280, weighted average. I set the weights by the LB ranking of my submissions.\n",
+        "\n",
+        "*   TTA: horizontal flip only.\n",
+        "\n",
+        "*   Threshold adjustment: I chose the threshold that gave the best score on the validation set, which was 0.508. On the LB, it improved the score by only 0.000001.\n",
+        "\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "qU3cw_Kt4QPx",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "# Other\n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "DfVVddSA4S_-",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "점수의 개선의 가장 큰 요인은 pseudo labeling data에서의 훈련이다.\n",
+        "\n",
+        "pseudo labelling data가 좋은 이유는 test data의 양 때문이라고 생각합니다. \n",
+        "\n",
+        "우리는 ground truth에 가까운 예측을 할 수 있습니다.\n",
+        "\n",
+        "자동차 이미지를 가리기<sup>mask</sup> 어려운 경우에만 pydensecrf를 사용하여 후처리하였다. \n",
+        "\n",
+        "하지만 그것은 아무런 성능 향상도 주지 못했습니다.\n",
+        "\n",
+        "\"어려운 이미지<sup>difficult images</sup>\"를 선택하는 방법으로 여러가지 예측에 대한  multi class version of dice coefficient로 계산했습니다. \n"
+      ]
+    },
+    {
+      "metadata": {
+        "id": "dDnDQ94v5swP",
+        "colab_type": "text"
+      },
+      "cell_type": "markdown",
+      "source": [
+        "Other\n",
+        "\n",
+        "One of the biggest contributors to the score improvement was training on pseudo-labeled data. I think pseudo labeling contributed so much because of the amount of test data, and because we can get predictions close to the ground truth.\n",
+        "\n",
+        "I tried post-processing with pydensecrf only on car images that were difficult to mask, but it gave me no improvement. To choose the \"difficult images\", I calculated a multi-class version of the dice coefficient (I'm not sure that is the right name for it) over the predictions of several models."
+      ]
+    }
+  ]
+}
\ No newline at end of file