---
title: "IML assignment 1"
author: "Emil Skindersø, Christopher Fjeldberg Jensen and Hans Vinther-Larsen"
date: "2024-02-26"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo=FALSE, message=FALSE,warning=FALSE}
library(tidyverse)
source("utils.R")
source("libraries.R")
source("extra_pipeops.R")
loadData()
```
# Introduction
In this project we use different machine learning models to predict the claim amount for individual customers of an insurance company. In particular, we are interested in predicting the claim amount for customers who have held their insurance for at least a whole year (i.e. Exposure = 1). Aside from choosing suitable factor encodings and models, we want to optimize our choices so that the predictions are best at Exposure = 1. To this end, we employ weightings and custom measures (see the following sections).
The code for this project can be found on GitHub at https://github.com/HVinther/IML_assignment_1.
# Data exploration, encoding and weighting
The dataset is a subset of the freMPL datasets from the CASdatasets package. It is split into training and testing sets, with 80% of the data used for training.
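A minimal sketch of how such an 80/20 split could be made is shown below; in the project the split is prepared by the setup code (loadData()), so the seed and variable handling here are illustrative assumptions rather than the actual code.
```{r, eval = FALSE}
# Illustrative 80/20 train/test split; the project's real split is created in loadData(),
# so the seed and the sampled indices here are assumptions.
set.seed(2024)
idx_train <- sample(nrow(freMPL), size = floor(0.8 * nrow(freMPL)))
train <- freMPL[idx_train, ]
test  <- freMPL[-idx_train, ]
```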
```{r}
head(freMPL)
```
The data contains variables such as vehicle price, vehicle age and social class, which can be encoded either as factors or as numerics. For the categorical variables we additionally try both dummy and target encoding.
Since we are primarily interested in performance at Exposure = 1, we construct weight functions such that individuals whose Exposure is closer to 1 carry more weight than individuals with low Exposure.
## Encoding of dates and handling of missing values
Two `<Date>` columns are present in the dataset, RecordBeg and RecordEnd, representing the beginning and end dates of the record. Several choices have to be made when handling these variables. First, the `as_task` functions from `mlr3` cannot handle the `<Date>` datatype. A quick re-encoding as `POSIXct` overcomes this initial hurdle; further encoding is still needed, as most models cannot handle the `POSIXct` class directly, but this can now be done inside a GraphLearner object.
Before settling on an encoding strategy, a second issue has to be considered. The RecordEnd variable has an abundance of missing values: in both the full dataset and the training dataset, about $46\%$ of the observations of this variable are missing. These appear to be the result of right censoring; since all observations lie within the year 2004, the missing values correspond to records that had not ended within that year. Imputing them based on the non-missing values would therefore be expected to underestimate their true values.
We suggest and investigate the following encoding strategies for these variables.
The first is a numerical encoding, imputing missing values of RecordEnd with the latest observed date and adding an indicator for censoring:
```{r, eval = FALSE}
# Interprets RecordEnd as being right censored.
# Thus imputes using the maximal observed time,
# and adds column indicating censoring.
po_RecEnd_rc<-po(
"mutate",
mutation = list(
RecordEnd = ~as.numeric(ifelse(is.na(RecordEnd),
max(na.omit(RecordEnd)),
RecordEnd)),
RecordEnd_censored = ~is.na(RecordEnd)
),
id = "RecEnd_rc"
)
# Reencodes RecordBeg (POSIXct) to numeric
po_RecBeg_num<-po(
"mutate",
mutation = list(
RecordBeg = ~as.numeric(RecordBeg)
),
id = "RecBeg_num"
)
```
An alternative method would be to use the `datefeatures` pipeop from `mlr3pipelines`, as seen below.
```{r, eval = FALSE}
po("missind") %>>%
po("select",
selector = selector_invert(selector_name("RecordEnd")))%>>%
po("datefeatures")
```
## Encoding of remaining covariates
We present some ad hoc encodings for a few of the remaining variables, mainly SocioCateg, VehPrice and VehAge, as these have a natural ordering not captured by their default factor ordering.
SocioCateg describes the "social category" of the individual claimant and ranges from CSP1 to CSP99, with the trailing number indicating the level and a higher number indicating a higher status. We propose to strip away the letters and encode the remaining number as an integer.
VehAge describes the age of the vehicle and contains the levels 0 through 5 as well as 6-7, 8-9 and finally 10+. We propose to encode these as 0 through 5, 6.5, 8.5 and finally 10, all as numerics.
Finally, VehPrice indicates a pricing category for the vehicle using letters, with "A" apparently indicating a cheaper vehicle while later letters indicate higher prices. We simply propose to encode this as an integer instead of a factor.
The implementations are shown below and can also be found in the file extra_pipeops.R in the GitHub repository.
```{r, eval = FALSE}
# Reencodes SocioCateg as an integer using the numbering provided in the factor
po_SocCat_int<-po(
"mutate",
id = "SocioCateg_as_integer",
mutation = list(
SocioCateg = ~as.integer(substr(SocioCateg,4,5)))
)
# Reencodes VehPrice as integer using the default ordering
po_VehPrice_int<-po(
"mutate",
id = "VehPrice_as_integer",
mutation = list(
VehPrice = ~as.integer(VehPrice))
)
# Reencodes VehAge as numeric using the "mean" of provided intervals
po_VehAge_num<-po(
"mutate",
id = "VehAge_as_num",
mutation = list(
VehAge = ~dplyr::case_match(VehAge,
"6-7" ~ 6.5,
"8-9" ~ 8.5,
"10+" ~ 10,
# convert to character first, so the single-year levels map to their face value
# rather than to the underlying factor codes
.default = as.numeric(as.character(VehAge))))
)
```
## Weighting
Some measures and learners allow the use of weights. We will explore two use cases of this feature.
First, we might weight by interest. As we are really only interested in approximating $\mathbb{E}[\text{ClaimAmount} \mid \text{Exposure} = 1]$, we can choose a transformation function $f:[0,1]\longrightarrow\mathbb{R}_{\geq 0}$ and evaluate it at the Exposure of each observation to express our interest in that observation. With a good choice of $f$, this should allow us to prioritize a precise fit on the observations of high interest, while hopefully still extracting some information from the remaining dataset. For the function $f$ we propose a sigmoid function of the form
$$w_i = f_{k,l}(x_i):= \frac{\exp((x_i-l)\cdot k)}{1+\exp((x_i-l)\cdot k)}.$$
The parameters $k$ and $l$ represent a stretching and a shifting, respectively, of the standard sigmoid function. Other functions, as well as tuning of these hyperparameters, could be considered, but this is deemed out of scope for this project. We set the parameters ad hoc to $k = 12$ and $l = 0.5$, yielding the following transformation function.
```{r, echo=FALSE}
x<- seq(from = 0,
to = 1,
length.out = 101)
sigmoid <- function(x,k = 12){
inner <- exp((x-0.5)*k)
return(inner/(1+inner))
}
ggplot()+
geom_line(
aes(x = x,
y = sigmoid(x)))+
labs(x = "Exposure",
y = "weight")
```
An alternative weighting approach, more in line with standard methods, would be to weight in order to tackle the imbalance with respect to the sign of `ClaimAmount`, as seen in the table below.
```{r, echo = FALSE}
train |>
group_by(sign(ClaimAmount))|>
summarize(rel_freq = n()/nrow(train))|>
mutate(reciprocal_rel_freq = 1/rel_freq)
```
The vast majority of observations, about $94.7\%$, have `ClaimAmount` equal to zero, while just under $5\%$ have a positive sign, leaving less than $0.5\%$ of the training data with a negative sign. One way to deal with this would be to weight using the reciprocal relative frequency, which should lead to an equal balance between the three groups.
One problem with the latter weighting is that it depends on the response, which makes it troublesome to include in `mlr3` pipelines. To amend this we have created the function `add_weight`, which takes a Task or data.frame and outputs an object of the same class but with the specified weights. The weighting thus becomes part of the task instead of part of the learning pipeline. If we want to use both weights, we simply combine them by multiplication; the default of `add_weight` is to add both weightings. The implementation can be seen below or found in the file extra_pipeops.R in the GitHub repository.
```{r, eval = FALSE}
## Frequency weight can be added using the add_weight function. The sigmoid function below
## is just a helper function for add_weight
.sigmoid <- function(x,k = 12, location = 0.5){
inner <- exp((x-location)*k)
return(inner/(1+inner))
}
## The add_weight function can be called on a data.frame or a Task. It can e.g. be used on
## an unweighted Task inside a benchmark call, i.e.:
## benchmark(benchmark_grid(add_weight(task), learner, resampler))
add_weight<-function(dataset,
weighting = c("interest","frequency"),
case_values = c(213,1,21),
k = 12,
location = 0.5){
stopifnot(any(weighting %in% c("interest","frequency")))
if("data.frame" %in% class(dataset)){
w<-rep(1,times = nrow(dataset))
if("frequency"%in% weighting){
w<-w*case_match(sign(dataset$ClaimAmount),
-1 ~ case_values[1],
0 ~ case_values[2],
1 ~ case_values[3])
}
if("interest" %in% weighting){
w<-w*.sigmoid(dataset$Exposure,k, location)
}
return(mutate(dataset,weights = w))
} else if("Task" %in% class(dataset)){
data <- dataset$data() |> as.data.frame()
w<-rep(1,times = nrow(data))
if("frequency"%in% weighting){
w<-w*case_match(sign(data$ClaimAmount),
-1 ~ case_values[1],
0 ~ case_values[2],
1 ~ case_values[3])
}
if("interest" %in% weighting){
w<-w*.sigmoid(data$Exposure,k, location)
}
out<-as_task_regr(
mutate(data,weights = w),
target = dataset$col_roles$target)
out$set_col_roles("weights",roles = "weight")
out$id <- paste(c(dataset$id,"weight",paste(weighting,collapse = "_")),collapse = "_")
return(out)
}
}
```
Finally, tasks with added weights will have the string "_weight", followed by the weightings used, appended to their original id.
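For example, a weighted copy of the training task might be created as follows (a minimal usage sketch; `train_task` denotes the training task used in the benchmarks later on).
```{r, eval = FALSE}
# Usage sketch: add the interest weighting to the training task.
wtask <- add_weight(train_task, weighting = "interest")
wtask$id               # original id with "_weight_interest" appended
wtask$col_roles$weight # the added "weights" column now has the weight role
```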
## Custom Measure
Since we are only interested in accurate predictions given Exposure = 1, the usual mean squared error might lead to wrong conclusions about performance when an algorithm is trained on the entire dataset. We therefore implement a measure that computes the mean squared error using only the observations with Exposure = 1. If we denote the set of row indices where Exposure is 1 by $S$, i.e. $S=\{i \in \{1,\dots,n\} \mid \text{Exposure}_i = 1\}$, and let $t$ be the vector of true responses and $p$ the vector of predictions, then
$$\text{mse\_inter}(t,p):=\left\{\begin{array}{cl}\frac{1}{|S|}\sum_{i\in S}(t_i-p_i)^2, & S\neq \emptyset\\ 0, & S= \emptyset\end{array} \right.$$
The measure is implemented as seen in the code below; in the project it is defined in the extra_pipeops.R file. Note that the current implementation is a bit hacky, as it does not support splitting inside a train method call, i.e. calls of the form '\$train(task, row_ids = split\$train)'. As this is not important for this project, the bug has not been fixed. It does, however, work with the standard use in auto tuners and benchmarks, where it will be used.
```{r, eval = FALSE}
# Measure to compare models on Exposure == 1 using mse
MeasureRegrMSEinterest = R6::R6Class("MeasureRegrMSEinterest",
inherit = mlr3::MeasureRegr, # regression measure
public = list(
initialize = function() { # initialize class
super$initialize(
id = "mse_inter", # unique ID
packages = character(), # no package dependencies
properties = "requires_task", # requires a task
predict_type = "response", # measures response prediction
range = c(0, Inf), # results in values between (0, Inf)
minimize = TRUE # smaller values are better
)
}
),
private = list(
# define score as private method
.score = function(prediction, task, ...) {
# exposure of the predicted rows
expo <- task$data()$Exposure[prediction$row_ids]
ind_of_interest <- (expo == 1)
# if S is empty, return 0 as in the definition above
if (!any(ind_of_interest)) return(0)
mean((prediction$truth[ind_of_interest] - prediction$response[ind_of_interest])^2)
}
)
)
# adds the measure to the dictionary
mlr_measures$add("regr.mse_inter",MeasureRegrMSEinterest)
```
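As a quick illustration of how the measure can be used together with a weighted task (a sketch only; the featureless learner merely stands in for the learners introduced later):
```{r, eval = FALSE}
# Usage sketch: cross-validate a baseline learner and score it with both measures.
rr <- resample(add_weight(train_task, weighting = "interest"),
               lrn("regr.featureless"),
               rsmp("cv", folds = 5))
rr$score(msrs(c("regr.mse", "regr.mse_inter")))
rr$aggregate(msrs(c("regr.mse", "regr.mse_inter")))
```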
# Choosing Learners and Encoding
When implementing our algorithms we decided to focus our attention on three different learners: glmnet, gradient boosting and random forest. We also considered a k-nearest neighbours learner, but we deemed the dimension too large for it to give good results.
To choose our best performing algorithm, we first compare encoding strategies within each learner and pick the best performing combination per learner, and then compare these best candidates against each other and against a featureless learner. When deciding which algorithm performs best we resample with 5-fold cross-validation and plot the mean squared error.
## Encoding strategy
For a fairer comparison of our algorithms we decided on a common set of encoding strategies across all learners. We begin with some simple baseline graphs, with no weighting and no custom encoding. Next we create graphs that add our own encoders for the `VehAge`, `VehPrice` and `SocioCateg` variables.
```{r, eval=FALSE}
Dummy_lrn <-
po("encode") %>>%
po_RecBeg_num %>>%
po_RecEnd_rc %>>%
po("scale") %>>%
lrn_obj |>
as_learner()
Target_lrn <-
po("encodeimpact") %>>%
po_RecBeg_num %>>%
po_RecEnd_rc %>>%
po("scale") %>>%
lrn_obj |>
as_learner()
Dummy_lrn_custom <-
po_VehAge_num %>>%
po_VehPrice_int %>>%
po_SocCat_int %>>%
po("encode") %>>%
po_RecBeg_num %>>%
po_RecEnd_rc %>>%
po("scale") %>>%
lrn_obj |>
as_learner()
Target_lrn_custom <-
po_VehAge_num %>>%
po_VehPrice_int %>>%
po_SocCat_int %>>%
po("encodeimpact") %>>%
po_RecBeg_num %>>%
po_RecEnd_rc %>>%
po("scale") %>>%
lrn_obj |>
as_learner()
```
We have chosen these strategies because dummy and target encoding are two commonly used encodings, so it is relevant to compare them. We also expect that turning the categorical features into numerical ones could have an effect, in the sense that assuming a linear effect of these features could improve performance.
For testing purposes we then compare these on the plain training task, on tasks with either our interest weighting or our frequency weighting, and on a task with both weightings combined.
```{r,eval=FALSE}
learner_BM <- benchmark_grid(
tasks = list(train_task,
add_weight(train_task,weighting = "interest"),
add_weight(train_task,weighting = "frequency"),
add_weight(train_task)),
learners = list(Dummy_lrn,
Target_lrn,
Dummy_lrn_custom,
Target_lrn_custom),
resamplings = rsmp("cv",folds=3))|>
benchmark()
```
## Glmnet, Gam and ensemble learner
In this section we will investigate the glmnet and gam learners. We will also consider an ensemble learner, where the outputs from other learners, namely a gam learner and a gradient boosting learner, are combined into one. The definitions of the learners as well as the benchmarking can be found in the GlmAndGams.R file in the GitHub repository.
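As an illustration, such an averaging ensemble could be put together with `mlr3pipelines` roughly as follows (a sketch only; the actual ensemble is defined in GlmAndGams.R, and `regr.gam` additionally requires the mlr3extralearners package).
```{r, eval = FALSE}
# Sketch of an averaging ensemble: the predictions of a gam and an xgboost learner are
# averaged by po("regravg"). Encoding pipeops would be prepended as in the other graphs.
ensemble_sketch <- gunion(list(
    lrn("regr.gam"),
    lrn("regr.xgboost")
  )) %>>%
  po("regravg", innum = 2) |>
  as_learner()
```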
Note that the naming of the encodings differs slightly from the one used in the Encoding strategy section. Here pipes between encodings are indicated by a ".", and the individual encodings are named as follows (a sketch of one such combination is given after the list):
<dl class="dl-horizontal">
<dt>df:</dt>
<dd>Indicate missing. Remove RecordEnd. Encode with datefeatures.</dd>
<dt>num:</dt>
<dd>Indicate missing. Impute RecordEnd with last day of the year. Transform to numeric.</dd>
<dt>adhoc:</dt>
<dd>The adhoc encodings described in section "Encoding of remaining covariates".</dd>
<dt>dummy:</dt>
<dd>dummy encoding of remaining factors.</dd>
<dt>impact:</dt>
<dd>impact encoding of remaining factors.</dd>
</dl>
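As an example, the num.impact combination could be composed from the pipeops defined earlier roughly as follows (a sketch; the actual graphs are constructed in GlmAndGams.R, and `lrn_obj` stands for an arbitrary regression learner as in the previous chunks).
```{r, eval = FALSE}
# Sketch of the num.impact encoding graph: indicate missingness, convert the date
# columns to numerics (imputing RecordEnd), impact-encode the remaining factors,
# scale, and attach a learner.
num_impact_lrn <- po("missind") %>>%
  po_RecBeg_num %>>%
  po_RecEnd_rc %>>%
  po("encodeimpact") %>>%
  po("scale") %>>%
  lrn_obj |>
  as_learner()
```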
As in the other sections, these learners are combined with different encoding strategies and then benchmarked. The predictions are scored using the ordinary mean squared error between predictions and truth (regr.mse), as well as our custom measure (mse_inter). The results from a 5-fold cross-validation can be seen in the following plot.
```{r, fig.width = 16, fig.height= 9, echo=FALSE}
readRDS("GnG_score_cv5.rds") |>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
ggplot()+
geom_boxplot(
aes(y = value,
x = learner,
fill = encoding),
position = position_dodge2(preserve = "single"))+
facet_grid(measure ~ task_id, scales = "free_y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom")
```
The comparison makes it clear that the choice of measure matters for model selection. In particular, a much larger variance is seen in the regular mse for the interest weighting (i.e. the sigmoid transformation of Exposure) than in the mse restricted to Exposure = 1 (mse_inter), and this weighting looks considerably more favorable in the light of the latter measure. From the plot, either no weighting or interest weighting seems preferable, perhaps with a slight edge to the weighted variant.
For the encodings, using the numeric encoding of the dates as opposed to the default datefeatures encoding also seems preferable. The effects of the remaining encodings are hard to distinguish in this plot.
For the learners, the main thing to note is that the tuning built into the cv_glmnet learner fares worse on this dataset than auto-tuning the s parameter of the plain glmnet learner, even though the latter practice is discouraged in the help page for the learner.
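For reference, auto-tuning the s parameter might look roughly as follows (a sketch; the actual tuning setup lives in GlmAndGams.R, and the search space, resolution and resampling below are assumptions).
```{r, eval = FALSE}
# Sketch: auto-tune the s (lambda) parameter of regr.glmnet against the custom measure.
at_glmnet <- auto_tuner(
  tuner      = tnr("grid_search", resolution = 20),
  learner    = lrn("regr.glmnet", s = to_tune(1e-4, 1e2, logscale = TRUE)),
  resampling = rsmp("cv", folds = 5),
  measure    = msr("regr.mse_inter")
)
```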
In the following plot the average performance is considered: the mse measures are plotted against the time it took to train the model. These plots are made for each combination of measure and model and then arranged in a grid. Note the different scales.
```{r, fig.width = 12, fig.height= 9, echo=FALSE}
aggr<-readRDS("GnG_aggr_cv5.rds")
aggr |>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
filter(encoding != "regr.featureless")|>
ggplot()+
geom_hline(aes(yintercept = value,
color = encoding,
linetype = task_id),
linewidth = 0.3)+
geom_point(aes(x = time_train,
y = value,
color = encoding,
shape = task_id),
size = 3)+
facet_grid(measure ~ learner, scale = "free")+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom")
```
From this plot we note again that the interest weighting performs better in the eyes of the custom measure, while no weighting takes the lead under the traditional mse. The ensemble model seems to have a slight edge in performance, but is quite close to the gam and the auto-tuned glmnet. There is, however, a stark difference in training time between the auto-tuned glmnet and the remaining models, with the extra training time yielding no visible performance gain compared to the gam and ensemble models.
A final thing to note is the choice of encoding: the num.impact encoding seems to give the fastest training times among those with the best predictive performance. Thus the winners of this experiment are the ensemble and gam learners with the num.impact encoding.
## Gradient Boost
We wanted to work with XGBoost because it is a widely used ML model that often has fairly strong predictive power.
There are a lot of hyperparameters to tune, and we select two of them. The first is the choice of booster, namely 'dart', 'gbtree' and 'gblinear', where the first two are tree-based and the third is a linear function approximation. The second is the max_depth of the trees, relevant when the booster is tree-based; this is tested on ten roughly equally spaced values between 1 and 500. We use 5-fold cross-validation for hyperparameter tuning.
In order to choose the weighting and encoding, as described above, we use all 4 weightings (no weight, frequency, interest and both) and all 4 factor encodings (dummy, target, numeric dummy and numeric target). The results are as follows:
```{r, echo = FALSE}
xg_learner = po_VehAge_num %>>% po_VehPrice_int %>>% po_SocCat_int %>>%
  po("encode") %>>% po_RecBeg_num %>>% po_RecEnd_rc %>>% po("scale") %>>%
  lrn("regr.xgboost",
      booster = to_tune(c("gbtree", "gblinear", "dart")),
      max_depth = to_tune(floor(seq(1, 500, length.out = 10)))) |>
  at_create()
bench_mark = readRDS("xgbench 10-fold")
#plot(bench_mark)
#ggplot(bench_mark$aggregate(list(msr("time_train"),msr("regr.mse"))))+
# geom_point(mapping=aes(x=time_train,y=regr.mse,colour=learner_id,shape=task_id))+
# geom_hline(mapping=aes(yintercept=regr.mse,colour=learner_id,linetype=task_id))
```
```{r, echo = FALSE}
bench_mark$score(msrs(c("regr.mse","regr.mse_inter")))|>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
ggplot()+
geom_boxplot(
aes(y = value,
x = learner_id),
position = position_dodge2(preserve = "single"))+
facet_grid(measure ~ task_id, scales = "free_y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r, echo = FALSE}
bench_mark$aggregate(msrs(c("regr.mse","regr.mse_inter","time_train")))|>
pivot_longer(cols = c("regr.mse","mse_inter"),names_to = "measure")|>
ggplot()+
geom_hline(mapping = aes(yintercept = value,
color = learner_id),
linetype = "dashed")+
geom_point(mapping = aes(x=time_train,
y=value,
color= learner_id,
shape = task_id),
size = 3)+
xlab("time")+
ylab("Mean Squared Error")+
facet_wrap(~measure,scales = "free")+
theme(legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical")
```
The first plot shows boxplots of the MSE across the different folds, and the second plot shows the average mse for the different combinations against the training time in seconds.
It can be seen that dummy encoding generally gives a higher run-time. More importantly, dummy encoding also tends to result in a lower average MSE, but the variance is comparable across all encoding strategies according to the boxplots. We see both a lower mean and a lower variance for the interest-weighting strategy, although the median is higher here. These values should be taken with a grain of salt, since they are based on only 5-fold CV and have no chance of converging to representative values.
Across both MSE measures, it is clear that interest weighting is preferable, but there is no clear choice of encoder. For MSE_inter, direct target encoding gives better performance, but for the regular MSE the numeric encoding scores lower (better). Since direct target encoding does well in both cases, and we care most about MSE_inter, we choose the direct target encoding.
The 5-fold cross-validation tells us that we ought to use a max_depth of around 300 and the gbtree booster.
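The selected configuration can be read off the trained auto tuner, e.g. as sketched below (`xg_at` is a hypothetical name for the AutoTuner built from `xg_learner` above, after training).
```{r, eval = FALSE}
# Sketch: inspect the tuning result of the trained AutoTuner (xg_at is a hypothetical name).
xg_at$tuning_result          # best booster / max_depth combination found by the CV
as.data.table(xg_at$archive) # all evaluated configurations with their scores
```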
## Random Forest
We chose a random forest because we wanted to include a tree-based method. Since we are interested in predictive performance, a random forest is a natural choice: it should perform better than a single decision tree and have lower variance than bagging.
The random forest implementation comes with well-performing default hyperparameters, so we did not deem it necessary to tune them.
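A representative graph learner could look roughly as follows (a sketch; the learners behind the benchmark object loaded below are defined elsewhere in the project, so the name and exact pipeline here are illustrative).
```{r, eval = FALSE}
# Sketch: ranger random forest with default hyperparameters, wrapped in the
# target-encoding pipeline used for the other learners.
rf_target_lrn <- po("encodeimpact") %>>%
  po_RecBeg_num %>>%
  po_RecEnd_rc %>>%
  po("scale") %>>%
  lrn("regr.ranger") |>
  as_learner()
```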
```{r, echo = FALSE}
load("RforestBM5.RData")
rforest_BM5$filter(learner_ids = list("dmy_Rfor","dmy_Rfor_cus","Trgt_Rfor","Trgt_Rfor_cus"))
```
```{r, echo = FALSE}
rforest_BM5$score(msrs(c("regr.mse","regr.mse_inter")))|>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
ggplot()+
geom_boxplot(
aes(y = value,
x = learner_id,
fill = learner_id),
position = position_dodge2(preserve = "single"))+
facet_grid(measure ~ task_id, scales = "free_y")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
We again see the same pattern as before, where the variance depends on which measure we use for scoring. We first note that the values we observe with the MSE are smaller than with the MSE_inter, which makes comparing the two measures a little more complicated. In general, though, the variance of the MSE_inter is smaller for learners trained on the tasks with frequency weighting and interest weighting, while the regular MSE has smaller variance for the learners trained on the task with both weightings. On the unweighted task the variance also seems smaller for MSE_inter, but our dummy encoding seems to have smaller variance for the MSE.
The boxplots show a few more things. First, the performance on the unweighted task scores considerably better when using the regular MSE. Which of the other learners performs best is hard to tell because of the difference in scales, so to get a better overview of the performances we plot the score measures against the training time.
```{r, echo = FALSE}
rforest_BM5$aggregate(msrs(c("regr.mse","regr.mse_inter","time_train")))|>
pivot_longer(cols = c("regr.mse","mse_inter"),names_to = "measure")|>
ggplot()+
geom_hline(mapping = aes(yintercept = value,
color = learner_id),
linetype = "dashed")+
geom_point(mapping = aes(x=time_train,
y=value,
color= learner_id,
shape = task_id),
size = 3)+
xlab("time")+
ylab("Mean Squared Error")+
facet_wrap(~measure,scales = "free")+
theme(legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical")
```
We note that the scales still differ, but they are now much more comparable. The first thing we notice is that, in general, the dummy encoding is both the slowest and the worst performing. Furthermore, our own encodings seem to have a positive impact when combined with dummy encoding, improving both the performance and the computation time.
What is clear from this plot is that some form of target encoding gives the best ranger learner. Our own encodings do not seem to have much impact on the performance of the random forest learner; what matters most is how the training task is weighted. The computation times for the target-encoding variants are very similar, with plain target encoding usually being slightly faster, but the difference is too small to matter when choosing the best implementation. Looking at the MSE_inter plot, the clear best option is our numeric target encoding, although its score is still higher than most of the scores we see under the regular MSE. The clear second best is plain target encoding on the frequency-weighted task, closely followed by the target encodings on the interest-weighted task, whose scores are near identical. Under the regular MSE, four configurations perform best: the worst of these is target encoding on the interest-weighted task, then numeric target encoding on the unweighted task, while the two best, which are almost tied, are numeric target encoding on the interest-weighted task and target encoding on the unweighted task. Considering performance across both measures together with computation time, we conclude that the numeric target encoding on the interest-weighted training task is the best of our random forest learners.
# Comparing learners
## Within the training set
Before training the chosen learners on the full training set and evaluating them on the testing set, the best performing learners from each of the previous sections are compared against each other. The interest-weighted gam, ensemble and xgboost learners with impact encoding are chosen, as these performed best under our measure of interest, while for the random forest the ad hoc encoding is used as well.
The four learners are compared using 10-fold cross-validation, and the results can be seen in the following plots.
```{r, echo = FALSE}
readRDS("bw_train_comp_scor.rds") |>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
ggplot()+
geom_boxplot(
aes(y = value,
x = learner_id),
position = position_dodge2(preserve = "single"))+
facet_wrap( ~ measure,scale = "free")+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom")
```
```{r, echo = FALSE}
readRDS("bw_train_comp_aggr.rds")|>
pivot_longer(cols = c("regr.mse","mse_inter"),
names_to = "measure")|>
ggplot()+
geom_hline(aes(yintercept = value,
color = learner_id),
linetype = "dashed",
linewidth = 0.3)+
geom_point(aes(x = time_train,
y = value,
color = learner_id),
size = 3)+
facet_wrap(~measure, scale = "free")+
theme(legend.position = "bottom")
```
Overall, the random forest learner seems to yield the best predictive performance, so we would expect it to perform best when trained on the entire training set and evaluated on the testing set.
## On the testing set
Having studied the four learners on the training set, we are now ready to train them on the entire training set and test them on the testing set. The results can be seen in the table below. A ranking column has been added after each measure; these should be read as in sports, i.e. first place, second place, etc.
```{r,echo = FALSE}
readRDS("prediction_results.rds") |>
mutate(ppr_inter = rank(mse_inter),
ppr_mse = rank(regr.mse),
r_tt = rank(time_train))|>
select(c("learner","mse_inter","ppr_inter","regr.mse","ppr_mse","time_train","r_tt"))
```
Here we note that while the gradient boosting model outperforms both the gam and the ensemble when predictions on the entire test set are considered, it performs worse than the gam on the measure of interest, under which the gam beats the ensemble learner as well, even though it had the worst predictive performance when considering any exposure. The big winner in predictive performance, with and without the exposure restriction, is as expected the random forest, which is still not the slowest to train; this is, however, probably mostly because of the hyperparameter tuning in the xgboost learner.
Next, in the table below we consider the different learners' predictions for selected indices in the test set, all of which have zero as their true response. The indices are 11386, 12286, 2119, 2238, 27833 and 27988; none of these made an insurance claim in 2004, so they all have a claim amount of 0. The mean squared error over these 6 observations can be seen in the mse column.
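Such predictions could be computed along the following lines (a sketch; `final_rf` is a hypothetical name for one of the learners trained on the full training set).
```{r, eval = FALSE}
# Sketch: predictions and MSE for the six selected test rows from one trained learner.
idx  <- c(11386, 12286, 2119, 2238, 27833, 27988)
pred <- final_rf$predict_newdata(test[idx, ])
mean((pred$truth - pred$response)^2)
```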
```{r,echo = FALSE}
readRDS("specific_predictions.rds")
```
The xgboost learner has by far the best predictive performance on these observations; however, as seen in the call below, none of them have Exposure equal to 1. Thus the performance ranking seems in line with the one described in the earlier table.
```{r}
test$Exposure[c(11386, 12286, 2119, 2238, 27833, 27988)]
```
It could also be interesting to predict the ClaimAmount for these individuals in the hypothetical case that they had made a claim. At the same time, we can see how well the different algorithms capture the fact that a claim is being made. First we set Exposure to 1 and change RecordEnd accordingly. A copy of these rows is then added where the claim indicator has been changed as well. The resulting predictions can be seen in the figure below.
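A sketch of how such counterfactual rows could be constructed is shown below (the actual rows behind what_if.rds were prepared in a separate script; the choice of RecordEnd and the handling of ClaimInd here are assumptions).
```{r, eval = FALSE}
# Sketch: counterfactual versions of the six rows with a full year of exposure,
# plus a copy where the claim indicator is switched on.
idx <- c(11386, 12286, 2119, 2238, 27833, 27988)
what_if_rows  <- test[idx, ] |>
  mutate(Exposure  = 1,
         RecordEnd = RecordBeg + lubridate::years(1))
what_if_claim <- mutate(what_if_rows, ClaimInd = 1)
```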
```{r,echo = FALSE}
what<- readRDS("what_if.rds")
what
for_plot <- pivot_longer(what,cols=3:8,names_to="Individual",values_to="Predicted_Claimamount")
for_plot$ClaimInd<-for_plot$ClaimInd |> as.factor()
ggplot(for_plot)+
  geom_point(mapping = aes(x = learner,
                           y = Predicted_Claimamount,
                           shape = ClaimInd,
                           color = Individual))+
  ylab("Predicted claim amount")
```
Somewhat worryingly, the models do not capture that ClaimInd = 0 corresponds exactly to a non-positive claim amount; xgboost seems to come closest to capturing this, while the gam and ensemble models are the worst in this regard. The xgboost does, however, generally predict much lower values than the other models, which are a bit more similar in their predictions, in particular when ClaimInd = 1, at least for these observations.
# Discussion
We have seen that many different approaches to learning from insurance data can be viable. In particular, weighting can lead to lower prediction errors, and the choice of weighting has a large influence on the fitted model, with the effect differing between models. The way the data is encoded also has some impact, though not nearly as significant as the weighting, and it can be hard to judge a priori which type of encoding is more appropriate. Additionally, it is difficult (if not impossible) to explain why a given combination of encoding, learner and weighting gives the performance it does.
Based on MSE, we found that the random forest has the best predictive performance. It is, however, quite a bit slower than the glmnet and gam learners, and when we look at individual predictions they seem to be in the same ballpark. XGBoost, on the other hand, better reflects the structure in the data, in the sense that it comes closest to capturing that people do not get a payout if they do not make a claim. The reason it captures this might simply be that it tends to give lower predictions.
Other approaches might also be considered. Perhaps one could include ClaimInd as part of the response, as this might be closer to the real-world setting, where it is not known a priori. One might use a staged approach, first predicting whether a customer will make a claim or not, and then using this to split the data into different branches. Here, better knowledge of why some ClaimAmounts are negative, i.e. a better understanding of the data-generating process, might help design a flow that better predicts the ClaimAmount.