Between 1974 and 1984, 424 PBC patients referred to the Mayo Clinic met the eligibility criteria for a randomized placebo-controlled trial of the drug D-penicillamine. The first 312 of these patients participated in the trial and have largely complete data. The remaining 112 patients did not join the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of them were lost to follow-up shortly after diagnosis, leaving data on 106 of these individuals in addition to the 312 who took part in the randomized trial.
The objective of this assignment is to predict the survival state of patients with liver cirrhosis.
For this assignment, the following libraries are used:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggcorrplot)
library(mlpack) # Provides the classification models used in this assignment
##
## Attaching package: 'mlpack'
## The following object is masked from 'package:stats':
##
## kmeans
## The following object is masked from 'package:base':
##
## det
library(caret) # Used to obtain all the metrics of the different models
## Loading required package: lattice
This assignment uses the cirrhosis dataset, which can be found in the datasets folder.
dataset <- read.csv("../datasets/cirrhosis.csv")
Let’s examine the dataset to see which variables will be useful to our study. First, let’s convert all the character columns to factors so that the summary also shows the NA counts.
temp_dataset <- dataset
temp_dataset[, c("Status", "Drug", "Sex", "Ascites", "Hepatomegaly", "Spiders", "Edema")] <- lapply(temp_dataset[, c("Status", "Drug", "Sex", "Ascites", "Hepatomegaly", "Spiders", "Edema")], as.factor)
summary(temp_dataset)
## ID N_Days Status Drug Age
## Min. : 1.0 Min. : 41 C :232 D-penicillamine:158 Min. : 9598
## 1st Qu.:105.2 1st Qu.:1093 CL: 25 Placebo :154 1st Qu.:15644
## Median :209.5 Median :1730 D :161 NA's :106 Median :18628
## Mean :209.5 Mean :1918 Mean :18533
## 3rd Qu.:313.8 3rd Qu.:2614 3rd Qu.:21272
## Max. :418.0 Max. :4795 Max. :28650
##
## Sex Ascites Hepatomegaly Spiders Edema Bilirubin
## F:374 N :288 N :152 N :222 N:354 Min. : 0.300
## M: 44 Y : 24 Y :160 Y : 90 S: 44 1st Qu.: 0.800
## NA's:106 NA's:106 NA's:106 Y: 20 Median : 1.400
## Mean : 3.221
## 3rd Qu.: 3.400
## Max. :28.000
##
## Cholesterol Albumin Copper Alk_Phos
## Min. : 120.0 Min. :1.960 Min. : 4.00 Min. : 289.0
## 1st Qu.: 249.5 1st Qu.:3.243 1st Qu.: 41.25 1st Qu.: 871.5
## Median : 309.5 Median :3.530 Median : 73.00 Median : 1259.0
## Mean : 369.5 Mean :3.497 Mean : 97.65 Mean : 1982.7
## 3rd Qu.: 400.0 3rd Qu.:3.770 3rd Qu.:123.00 3rd Qu.: 1980.0
## Max. :1775.0 Max. :4.640 Max. :588.00 Max. :13862.4
## NA's :134 NA's :108 NA's :106
## SGOT Tryglicerides Platelets Prothrombin
## Min. : 26.35 Min. : 33.00 Min. : 62.0 Min. : 9.00
## 1st Qu.: 80.60 1st Qu.: 84.25 1st Qu.:188.5 1st Qu.:10.00
## Median :114.70 Median :108.00 Median :251.0 Median :10.60
## Mean :122.56 Mean :124.70 Mean :257.0 Mean :10.73
## 3rd Qu.:151.90 3rd Qu.:151.00 3rd Qu.:318.0 3rd Qu.:11.10
## Max. :457.25 Max. :598.00 Max. :721.0 Max. :18.00
## NA's :106 NA's :136 NA's :11 NA's :2
## Stage
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :3.024
## 3rd Qu.:4.000
## Max. :4.000
## NA's :6
As we can see, there are a lot of missing values, a problem we will have to address in some way.
Another important step before dealing with the NA’s is checking the type of each variable:
str(dataset)
## 'data.frame': 418 obs. of 20 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ N_Days : int 400 4500 1012 1925 1504 2503 1832 2466 2400 51 ...
## $ Status : chr "D" "C" "D" "D" ...
## $ Drug : chr "D-penicillamine" "D-penicillamine" "D-penicillamine" "D-penicillamine" ...
## $ Age : int 21464 20617 25594 19994 13918 24201 20284 19379 15526 25772 ...
## $ Sex : chr "F" "F" "M" "F" ...
## $ Ascites : chr "Y" "N" "N" "N" ...
## $ Hepatomegaly : chr "Y" "Y" "N" "Y" ...
## $ Spiders : chr "Y" "Y" "N" "Y" ...
## $ Edema : chr "Y" "N" "S" "S" ...
## $ Bilirubin : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ Cholesterol : int 261 302 176 244 279 248 322 280 562 200 ...
## $ Albumin : num 2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
## $ Copper : int 156 54 210 64 143 50 52 52 79 140 ...
## $ Alk_Phos : num 1718 7395 516 6122 671 ...
## $ SGOT : num 137.9 113.5 96.1 60.6 113.2 ...
## $ Tryglicerides: int 172 88 55 92 72 63 213 189 88 143 ...
## $ Platelets : int 190 221 151 183 136 NA 204 373 251 302 ...
## $ Prothrombin : num 12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
## $ Stage : int 4 3 4 4 3 3 3 3 2 4 ...
Let’s rename the dataset to the shorter df, which also keeps the original unmodified.
df <- dataset
We can remove ID from the dataset because it is only used to index the rows and has no other meaningful relation to the rest of the data.
df <- subset(df, select = -ID)
As stated in the introduction, some patients took neither the drug nor the placebo; those values appear as NA’s, so let’s replace them with None.
df <- df %>% mutate(Drug = ifelse(is.na(Drug), "None", Drug))
Now that the NA’s in Drug are handled, we can plot it and confirm that, as stated in the introduction, 106 people took neither treatment.
ggplot(df, aes(x = Drug)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", size = 4) +
labs(title = "Drugs taken", x = "Drug", y = "Count")
So now let’s clean the rest of the data by removing the NA’s. For character columns we replace NA with None; for numeric columns we use the mean of the variable, except for Stage, where we use the median because half a stage does not make sense.
df <- df %>% mutate(Ascites = ifelse(is.na(Ascites), "None", Ascites))
df <- df %>% mutate(Hepatomegaly = ifelse(is.na(Hepatomegaly), "None", Hepatomegaly))
df <- df %>% mutate(Spiders = ifelse(is.na(Spiders), "None", Spiders))
df <- df %>% mutate(Cholesterol = ifelse(is.na(Cholesterol), mean(Cholesterol, na.rm = TRUE), Cholesterol))
df <- df %>% mutate(Alk_Phos = ifelse(is.na(Alk_Phos), mean(Alk_Phos, na.rm = TRUE), Alk_Phos))
df <- df %>% mutate(SGOT = ifelse(is.na(SGOT), mean(SGOT, na.rm = TRUE), SGOT))
df <- df %>% mutate(Copper = ifelse(is.na(Copper), mean(Copper, na.rm = TRUE), Copper))
df <- df %>% mutate(Tryglicerides = ifelse(is.na(Tryglicerides), mean(Tryglicerides, na.rm = TRUE), Tryglicerides))
df <- df %>% mutate(Platelets = ifelse(is.na(Platelets), mean(Platelets, na.rm = TRUE), Platelets))
df <- df %>% mutate(Prothrombin = ifelse(is.na(Prothrombin), mean(Prothrombin, na.rm = TRUE), Prothrombin))
df <- df %>% mutate(Stage = ifelse(is.na(Stage), median(Stage, na.rm = TRUE), Stage))
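As a side note, the repeated mutate calls above can be collapsed into a single step with dplyr’s across(); a minimal sketch (the impute_df name and the toy usage below are illustrative, not part of the assignment code):

```r
library(dplyr)

# Columns that are mean-imputed in the assignment.
num_cols <- c("Cholesterol", "Alk_Phos", "SGOT", "Copper",
              "Tryglicerides", "Platelets", "Prothrombin")

# Replace each column's NA's with its mean, and Stage's NA's with its median.
impute_df <- function(data) {
  data %>%
    mutate(across(all_of(num_cols),
                  ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
    mutate(Stage = ifelse(is.na(Stage), median(Stage, na.rm = TRUE), Stage))
}
```

Applied to df, this produces the same result as the eight separate mutate calls.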
summary(df)
## N_Days Status Drug Age
## Min. : 41 Length:418 Length:418 Min. : 9598
## 1st Qu.:1093 Class :character Class :character 1st Qu.:15644
## Median :1730 Mode :character Mode :character Median :18628
## Mean :1918 Mean :18533
## 3rd Qu.:2614 3rd Qu.:21272
## Max. :4795 Max. :28650
## Sex Ascites Hepatomegaly Spiders
## Length:418 Length:418 Length:418 Length:418
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Edema Bilirubin Cholesterol Albumin
## Length:418 Min. : 0.300 Min. : 120.0 Min. :1.960
## Class :character 1st Qu.: 0.800 1st Qu.: 273.0 1st Qu.:3.243
## Mode :character Median : 1.400 Median : 369.5 Median :3.530
## Mean : 3.221 Mean : 369.5 Mean :3.497
## 3rd Qu.: 3.400 3rd Qu.: 369.5 3rd Qu.:3.770
## Max. :28.000 Max. :1775.0 Max. :4.640
## Copper Alk_Phos SGOT Tryglicerides
## Min. : 4.00 Min. : 289 Min. : 26.35 Min. : 33.0
## 1st Qu.: 51.25 1st Qu.: 1016 1st Qu.: 91.00 1st Qu.: 95.0
## Median : 97.65 Median : 1717 Median :122.56 Median :124.7
## Mean : 97.65 Mean : 1983 Mean :122.56 Mean :124.7
## 3rd Qu.:100.75 3rd Qu.: 1983 3rd Qu.:135.75 3rd Qu.:127.8
## Max. :588.00 Max. :13862 Max. :457.25 Max. :598.0
## Platelets Prothrombin Stage
## Min. : 62.0 Min. : 9.00 Min. :1.000
## 1st Qu.:190.0 1st Qu.:10.00 1st Qu.:2.000
## Median :253.0 Median :10.60 Median :3.000
## Mean :257.0 Mean :10.73 Mean :3.024
## 3rd Qu.:315.5 3rd Qu.:11.10 3rd Qu.:4.000
## Max. :721.0 Max. :18.00 Max. :4.000
Now that all the data is cleaned, we can start to look at other things, for example the Status of the patients, which is the main feature we will use in this study to answer the question raised above.
ggplot(df, aes(x = Status)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", size = 4) +
labs(title = "Status", x = "Status", y = "Count")
df$AgeInYear <- df$Age / 365.25 # Age is recorded in days; 365.25 accounts for leap years
df$AgeInYear
## [1] 58.76523 56.44627 70.07255 54.74059 38.10541 66.25873 55.53457 53.05681
## [9] 42.50787 70.55989 53.71389 59.13758 45.68925 56.22177 64.64613 40.44353
## [17] 52.18344 53.93018 49.56057 59.95346 64.18891 56.27652 55.96715 44.52019
## [25] 45.07324 52.02464 54.43943 44.94730 63.87680 41.38535 41.55236 53.99589
## [33] 51.28268 52.06023 48.61875 56.41068 61.72758 36.62697 55.39220 46.66940
## [41] 33.63450 33.69473 48.87064 37.58248 41.79329 45.79877 47.42779 49.13621
## [49] 61.15264 53.50856 52.08761 50.54073 67.40862 39.19781 65.76318 33.61807
## [57] 53.57153 44.56947 40.39425 58.38193 43.89870 60.70637 46.62834 62.90760
## [65] 40.20260 46.45311 51.28816 32.61328 49.33881 56.39973 48.84600 32.49281
## [73] 38.49418 51.92060 43.51814 51.94251 49.82615 47.94524 46.51608 67.41136
## [81] 63.26352 67.31006 56.01369 55.83025 47.21697 52.75838 37.27858 41.39357
## [89] 52.44353 33.47570 45.60712 76.70910 36.53388 53.91650 46.39014 48.84600
## [97] 71.89322 28.88433 48.46817 51.46886 44.95003 56.56947 48.96372 43.01711
## [105] 34.03970 68.50924 62.52156 50.35729 44.06297 38.91034 41.15264 55.45791
## [113] 51.23340 52.82683 42.63929 61.07050 49.65640 48.85421 54.25599 35.15127
## [121] 67.90691 55.43600 45.82067 52.88980 47.18138 53.59890 44.10404 41.94935
## [129] 63.61396 44.22724 62.00137 40.55305 62.64476 42.33539 42.96783 55.96167
## [137] 62.86105 51.24983 46.76249 54.07529 47.03628 55.72621 46.10267 52.28747
## [145] 51.20055 33.86448 75.01164 30.86379 61.80424 34.98700 55.04175 69.94114
## [153] 49.60438 69.37714 43.55647 59.40862 48.75838 36.49281 45.76044 57.37166
## [161] 42.74333 58.81725 53.49760 43.41410 53.30595 41.35524 60.95825 47.75359
## [169] 35.49076 48.66256 52.66804 49.86995 30.27515 55.56742 52.15332 41.60986
## [177] 55.45243 70.00411 43.94251 42.56810 44.56947 56.94456 40.26010 37.60712
## [185] 48.36140 70.83641 35.79192 62.62286 50.64750 54.52704 52.69268 52.72005
## [193] 56.77207 44.39699 29.55510 57.04038 44.62697 35.79740 40.71732 32.23272
## [201] 41.09240 61.63997 37.05681 62.57906 48.97741 61.99042 72.77207 61.29500
## [209] 52.62423 49.76318 52.91444 47.26352 50.20397 69.34702 41.16906 59.16496
## [217] 36.07940 34.59548 42.71321 63.63039 56.62971 46.26420 61.24298 38.62012
## [225] 38.77070 56.69541 58.95140 36.92266 62.41478 34.60917 58.33539 50.18207
## [233] 42.68583 34.37919 33.18275 38.38193 59.76181 66.41205 46.78987 56.07940
## [241] 41.37440 64.57221 67.48802 44.82957 45.77139 32.95003 41.22108 55.41684
## [249] 47.98084 40.79124 56.97467 68.46270 78.43943 39.85763 35.31006 31.44422
## [257] 58.26420 51.48802 59.96988 74.52430 52.36413 42.78713 34.87474 44.13963
## [265] 46.38193 56.30938 70.90760 55.39493 45.08419 26.27789 50.47228 38.39836
## [273] 47.41958 47.98084 38.31622 50.10815 35.08830 32.50376 56.15332 46.15469
## [281] 65.88364 33.94387 62.86105 48.56400 46.34908 38.85284 58.64750 48.93634
## [289] 67.57290 65.98494 40.90075 50.24504 57.19644 60.53662 35.35113 31.38125
## [297] 55.98631 52.72553 38.09172 58.17112 45.21013 37.79877 60.65982 35.53457
## [305] 43.06639 56.39151 30.57358 61.18275 58.29979 62.33265 37.99863 33.15264
## [313] 60.00000 64.99932 54.00137 75.00068 62.00137 43.00068 46.00137 44.00000
## [321] 60.99932 64.00000 40.00000 63.00068 34.00137 52.00000 48.99932 54.00137
## [329] 63.00068 54.00137 46.00137 52.99932 56.00000 56.00000 55.00068 64.99932
## [337] 56.00000 47.00068 60.00000 52.99932 54.00137 50.00137 48.00000 36.00000
## [345] 48.00000 70.00137 51.00068 52.00000 54.00137 48.00000 66.00137 52.99932
## [353] 62.00137 59.00068 39.00068 67.00068 58.00137 64.00000 46.00137 64.00000
## [361] 40.99932 48.99932 44.00000 59.00068 63.00068 60.99932 64.00000 48.99932
## [369] 42.00137 50.00137 51.00068 36.99932 62.00137 51.00068 52.00000 44.00000
## [377] 32.99932 60.00000 63.00068 32.99932 40.99932 51.00068 36.99932 59.00068
## [385] 55.00068 54.00137 48.99932 40.00000 67.00068 68.00000 40.99932 68.99932
## [393] 52.00000 56.99932 36.00000 50.00137 64.00000 62.00137 42.00137 44.00000
## [401] 68.99932 52.00000 66.00137 40.00000 52.00000 46.00137 54.00137 51.00068
## [409] 43.00068 39.00068 51.00068 67.00068 35.00068 67.00068 39.00068 56.99932
## [417] 58.00137 52.99932
df$year_group <- cut(df$AgeInYear,
breaks = seq(0, 100, by = 10),
include.lowest = TRUE,
labels = FALSE
)
df$year_group
## [1] 6 6 8 6 4 7 6 6 5 8 6 6 5 6 7 5 6 6 5 6 7 6 6 5 5 6 6 5 7 5 5 6 6 6 5 6 7
## [38] 4 6 5 4 4 5 4 5 5 5 5 7 6 6 6 7 4 7 4 6 5 5 6 5 7 5 7 5 5 6 4 5 6 5 4 4 6
## [75] 5 6 5 5 5 7 7 7 6 6 5 6 4 5 6 4 5 8 4 6 5 5 8 3 5 6 5 6 5 5 4 7 7 6 5 4 5
## [112] 6 6 6 5 7 5 5 6 4 7 6 5 6 5 6 5 5 7 5 7 5 7 5 5 6 7 6 5 6 5 6 5 6 6 4 8 4
## [149] 7 4 6 7 5 7 5 6 5 4 5 6 5 6 6 5 6 5 7 5 4 5 6 5 4 6 6 5 6 8 5 5 5 6 5 4 5
## [186] 8 4 7 6 6 6 6 6 5 3 6 5 4 5 4 5 7 4 7 5 7 8 7 6 5 6 5 6 7 5 6 4 4 5 7 6 5
## [223] 7 4 4 6 6 4 7 4 6 6 5 4 4 4 6 7 5 6 5 7 7 5 5 4 5 6 5 5 6 7 8 4 4 4 6 6 6
## [260] 8 6 5 4 5 5 6 8 6 5 3 6 4 5 5 4 6 4 4 6 5 7 4 7 5 5 4 6 5 7 7 5 6 6 7 4 4
## [297] 6 6 4 6 5 4 7 4 5 6 4 7 6 7 4 4 6 7 6 8 7 5 5 5 7 7 4 7 4 6 5 6 7 6 5 6 6
## [334] 6 6 7 6 5 6 6 6 6 5 4 5 8 6 6 6 5 7 6 7 6 4 7 6 7 5 7 5 5 5 6 7 7 7 5 5 6
## [371] 6 4 7 6 6 5 4 6 7 4 5 6 4 6 6 6 5 4 7 7 5 7 6 6 4 6 7 7 5 5 7 6 7 4 6 5 6
## [408] 6 5 4 6 7 4 7 4 6 6 6
ggplot(df, aes(x = year_group)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Number of people per decade", x = "Decade group", y = "Count")
To see which features are good candidates for making the prediction, let’s build a correlation matrix and see which are most promising.
plot_corr <- ggcorr(
data = df,
low = "steelblue",
mid = "white",
high = "darkred",
label = TRUE,
label_alpha = 0,
angle = -45,
max_size = 10,
min_size = 2,
size = 3,
hjust = 0.75
)
## Warning in ggcorr(data = df, low = "steelblue", mid = "white", high =
## "darkred", : data in column(s) 'Status', 'Drug', 'Sex', 'Ascites',
## 'Hepatomegaly', 'Spiders', 'Edema' are not numeric and were ignored
plot_corr +
theme(text = element_text(size = 8))
As we can see, a combination of all or some of the variables related to the patients’ medical condition is a good basis for predicting their status.
stage_proportions <- round(prop.table(table(df$Stage)), 2)
pie(stage_proportions, labels = paste(names(stage_proportions), ": ", stage_proportions), col = rainbow(length(stage_proportions)))
legend("topright", legend = names(stage_proportions), fill = rainbow(length(stage_proportions)), cex = 0.8, title = "Stage")
This plot shows that the majority of the patients are in the third stage, followed by the fourth stage.
Given all the previous information, we have decided to subset the data, keeping only the variables related to the health state of the patient and their stage.
stage_df <- subset(df, select = c(Stage, Ascites, Hepatomegaly, Spiders, Cholesterol, Alk_Phos, Bilirubin, Albumin, Copper, SGOT, Tryglicerides, Platelets, Prothrombin))
Now we can see the data that will be used to make the predictions.
summary(stage_df)
## Stage Ascites Hepatomegaly Spiders
## Min. :1.000 Length:418 Length:418 Length:418
## 1st Qu.:2.000 Class :character Class :character Class :character
## Median :3.000 Mode :character Mode :character Mode :character
## Mean :3.024
## 3rd Qu.:4.000
## Max. :4.000
## Cholesterol Alk_Phos Bilirubin Albumin
## Min. : 120.0 Min. : 289 Min. : 0.300 Min. :1.960
## 1st Qu.: 273.0 1st Qu.: 1016 1st Qu.: 0.800 1st Qu.:3.243
## Median : 369.5 Median : 1717 Median : 1.400 Median :3.530
## Mean : 369.5 Mean : 1983 Mean : 3.221 Mean :3.497
## 3rd Qu.: 369.5 3rd Qu.: 1983 3rd Qu.: 3.400 3rd Qu.:3.770
## Max. :1775.0 Max. :13862 Max. :28.000 Max. :4.640
## Copper SGOT Tryglicerides Platelets
## Min. : 4.00 Min. : 26.35 Min. : 33.0 Min. : 62.0
## 1st Qu.: 51.25 1st Qu.: 91.00 1st Qu.: 95.0 1st Qu.:190.0
## Median : 97.65 Median :122.56 Median :124.7 Median :253.0
## Mean : 97.65 Mean :122.56 Mean :124.7 Mean :257.0
## 3rd Qu.:100.75 3rd Qu.:135.75 3rd Qu.:127.8 3rd Qu.:315.5
## Max. :588.00 Max. :457.25 Max. :598.0 Max. :721.0
## Prothrombin
## Min. : 9.00
## 1st Qu.:10.00
## Median :10.60
## Mean :10.73
## 3rd Qu.:11.10
## Max. :18.00
str(stage_df)
## 'data.frame': 418 obs. of 13 variables:
## $ Stage : num 4 3 4 4 3 3 3 3 2 4 ...
## $ Ascites : chr "Y" "N" "N" "N" ...
## $ Hepatomegaly : chr "Y" "Y" "N" "Y" ...
## $ Spiders : chr "Y" "Y" "N" "Y" ...
## $ Cholesterol : num 261 302 176 244 279 248 322 280 562 200 ...
## $ Alk_Phos : num 1718 7395 516 6122 671 ...
## $ Bilirubin : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ Albumin : num 2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
## $ Copper : num 156 54 210 64 143 50 52 52 79 140 ...
## $ SGOT : num 137.9 113.5 96.1 60.6 113.2 ...
## $ Tryglicerides: num 172 88 55 92 72 63 213 189 88 143 ...
## $ Platelets : num 190 221 151 183 136 ...
## $ Prothrombin : num 12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
Another correlation plot lets us see the relations between the selected variables more clearly.
plot_corr <- ggcorr(
data = stage_df,
low = "steelblue",
mid = "white",
high = "darkred",
label = TRUE,
label_alpha = 0,
angle = -45,
max_size = 10,
min_size = 2,
size = 3,
hjust = 0.75
)
## Warning in ggcorr(data = stage_df, low = "steelblue", mid = "white", high =
## "darkred", : data in column(s) 'Ascites', 'Hepatomegaly', 'Spiders' are not
## numeric and were ignored
plot_corr +
theme(text = element_text(size = 8))
Now that we have selected the data we want to use to predict patient survival, we can start the predictions.
First, the Status column has to be transformed into numeric labels so that we can measure the accuracy of the different models.
labels <- as.numeric(as.factor(df$Status))
labels
## [1] 3 1 3 3 2 3 1 3 3 3 3 3 1 3 3 1 3 3 1 3 1 3 3 3 1 3 3 3 1 3 3 1 3 1 3 1 3
## [38] 3 3 1 3 1 1 3 1 3 1 1 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 1 3 3 1 3 1 1 1 1 3
## [75] 3 3 3 3 1 3 3 3 1 1 3 3 3 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 3 3 2 3 1 3 1 3 2
## [112] 3 3 3 1 1 3 3 3 2 3 1 3 1 2 3 1 3 1 3 3 1 3 1 1 1 1 3 1 1 1 3 3 3 1 1 1 3
## [149] 3 1 1 3 1 3 1 3 1 2 3 1 1 3 3 3 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1 1 1 2 3 1
## [186] 3 3 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 1 1 1 1 1 3 3 1 3 1 1 3 1 3
## [223] 3 1 1 1 3 1 3 1 3 1 1 1 1 1 1 1 3 1 2 1 3 3 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1
## [260] 1 1 1 2 2 2 1 3 3 1 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 2 3 1 2 1 1 1 2 1
## [297] 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 3 1 1 3 1 1 1 1 1 3 3 3 3 3 3
## [334] 3 1 1 3 3 1 1 3 1 3 1 2 3 3 1 1 3 3 1 1 3 1 3 1 1 1 3 2 2 1 3 1 3 1 3 3 3
## [371] 3 1 1 1 2 3 1 3 3 2 1 3 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1 3 1
## [408] 1 1 1 1 1 1 3 1 1 1 1
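Because as.factor orders its levels alphabetically, the coding here is C → 1, CL → 2 and D → 3. A self-contained check of that mapping on a toy sample (not the real column):

```r
status <- c("D", "C", "CL", "D")  # toy sample of Status values
f <- as.factor(status)
levels(f)      # "C" "CL" "D" -> codes 1, 2, 3
as.numeric(f)  # 3 1 2 3
```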
This function splits the data, returning the train and test data together with their labels, given a dataset, the labels, the test-set fraction, and the random seed to use.
train_test_split <- function(x_df, y_labs, test_size, seed) {
set.seed(seed)
index <- sample(nrow(x_df), (1 - test_size) * nrow(x_df))
train_data <- x_df[index, ]
train_labels <- as.matrix(y_labs[index])
test_data <- x_df[-index, ]
test_labels <- as.matrix(y_labs[-index])
return(list(
train_data = train_data,
train_labels = train_labels,
test_data = test_data,
test_labels = test_labels
))
}
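As a quick sanity check of the sampling logic, the core sample() call can be exercised on toy data (a 50/50 split is used here so that (1 - test_size) * n is a whole number):

```r
set.seed(42)
n <- 10
test_size <- 0.5                         # chosen so the training size is exact
index <- sample(n, (1 - test_size) * n)  # training row indices
length(index)                # 5 training rows
length(setdiff(1:n, index))  # 5 test rows
```

Rows not drawn into index form the test set, so the two subsets are disjoint and together cover all rows.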
This function creates and trains the given model, predicts on the test data, and returns the resulting confusion matrix:
run_classifier <- function(classifier, train_data, train_labels, test_data, test_labels) {
model <- classifier(training = train_data, labels = train_labels)
model <- model$output_model
output <- classifier(input_model = model, test = test_data)
predictions <- output$predictions
res <- confusionMatrix(
factor(predictions, levels = c(1, 2, 3)),
factor(test_labels)
)
return(res)
}
Once the data is prepared, it is ready to be used in the models. In this section, each model is trained on three random splits of the cleaned dataset and asked to classify the test rows into one of the class labels. The models are: Perceptron, Random Forest, Linear SVM, Softmax Regression, Naive Bayes, Decision Tree, and AdaBoost.
Each run outputs the confusion matrix and the evaluation metrics for that run. These will be studied and commented on in the Results section; the results of each run are also displayed immediately after each model in the following sections.
percep_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
percep_results <- vector("list", length = length(percep_runs))
for (i in seq_along(percep_runs)) {
s_ <- percep_runs[[i]][[1]]
x_ <- percep_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
perceptron,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
percep_results[[i]] <- model_res
}
over_percep_results <- sapply(percep_results, function(result) result$overall)
over_percep_df <- as.data.frame(t(over_percep_results))
print(over_percep_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.5476190 0.1617647 0.4565147 0.6364336 0.4682540 0.04507228
## 2 0.5634921 0.2289720 0.4723096 0.6516209 0.5079365 0.12329026
## 3 0.5793651 0.0000000 0.4881926 0.6667196 0.5793651 0.53781155
## McnemarPValue
## 1 0.0004308605
## 2 0.0241862305
## 3 NaN
randft_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
randft_results <- vector("list", length = length(randft_runs))
for (i in seq_along(randft_runs)) {
s_ <- randft_runs[[i]][[1]]
x_ <- randft_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
random_forest,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
randft_results[[i]] <- model_res
}
over_randft_results <- sapply(randft_results, function(result) result$overall)
over_randft_df <- as.data.frame(t(over_randft_results))
print(over_randft_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6587302 0.3726989 0.5689813 0.7408262 0.4682540 1.271571e-05
## 2 0.6507937 0.3220836 0.5607952 0.7335241 0.5079365 8.411121e-04
## 3 0.6746032 0.3494522 0.5854291 0.7553536 0.5793651 1.801934e-02
## McnemarPValue
## 1 0.003395845
## 2 NaN
## 3 0.233689002
svm_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
svm_results <- vector("list", length = length(svm_runs))
for (i in seq_along(svm_runs)) {
s_ <- svm_runs[[i]][[1]]
x_ <- svm_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
linear_svm,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
svm_results[[i]] <- model_res
}
over_svm_results <- sapply(svm_results, function(result) result$overall)
over_svm_df <- as.data.frame(t(over_svm_results))
print(over_svm_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.5158730 0.1013679 0.4251861 0.6057968 0.4682540 0.163053612
## 2 0.5952381 0.1953418 0.5041649 0.6817283 0.5079365 0.030370130
## 3 0.7063492 0.3813694 0.6186399 0.7840872 0.5793651 0.002219661
## McnemarPValue
## 1 0.0000508402
## 2 NaN
## 3 0.0041104971
soft_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
soft_results <- vector("list", length = length(soft_runs))
for (i in seq_along(soft_runs)) {
s_ <- soft_runs[[i]][[1]]
x_ <- soft_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
softmax_regression,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
soft_results[[i]] <- model_res
}
over_soft_results <- sapply(soft_results, function(result) result$overall)
over_soft_df <- as.data.frame(t(over_soft_results))
print(over_soft_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1 0.4682540 0.000000000 0.3788413 0.5591930 0.4682540
## 2 0.2063492 0.005681818 0.1394428 0.2875473 0.5079365
## 3 0.4920635 -0.068220956 0.4019165 0.5825920 0.5793651
## AccuracyPValue McnemarPValue
## 1 0.5347941 NaN
## 2 1.0000000 3.591953e-19
## 3 0.9805038 1.690918e-10
nbc_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
nbc_results <- vector("list", length = length(nbc_runs))
for (i in seq_along(nbc_runs)) {
s_ <- nbc_runs[[i]][[1]]
x_ <- nbc_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
nbc,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
nbc_results[[i]] <- model_res
}
over_nbc_results <- sapply(nbc_results, function(result) result$overall)
over_nbc_df <- as.data.frame(t(over_nbc_results))
print(over_nbc_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6746032 0.4112821 0.5854291 0.7553536 0.4682540 2.370435e-06
## 2 0.6825397 0.3823529 0.5936916 0.7625780 0.5079365 5.345444e-05
## 3 0.7301587 0.4585440 0.6438416 0.8053373 0.5793651 3.189246e-04
## McnemarPValue
## 1 0.03172225
## 2 NaN
## 3 0.02612204
dec_tree_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
dec_tree_results <- vector("list", length = length(dec_tree_runs))
for (i in seq_along(dec_tree_runs)) {
s_ <- dec_tree_runs[[i]][[1]]
x_ <- dec_tree_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
decision_tree,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
dec_tree_results[[i]] <- model_res
}
over_tree_results <- sapply(dec_tree_results, function(result) result$overall)
over_tree_df <- as.data.frame(t(over_tree_results))
print(over_tree_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6507937 0.3558731 0.5607952 0.7335241 0.4682540 2.799095e-05
## 2 0.6587302 0.3563792 0.5689813 0.7408262 0.5079365 4.440815e-04
## 3 0.6269841 0.2788602 0.5363835 0.7114691 0.5793651 1.604822e-01
## McnemarPValue
## 1 0.005894062
## 2 0.738933269
## 3 0.065931882
adaboost_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
adaboost_results <- vector("list", length = length(adaboost_runs))
for (i in seq_along(adaboost_runs)) {
s_ <- adaboost_runs[[i]][[1]]
x_ <- adaboost_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
adaboost,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
adaboost_results[[i]] <- model_res
}
over_ada_results <- sapply(adaboost_results, function(result) result$overall)
over_ada_df <- as.data.frame(t(over_ada_results))
print(over_ada_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6507937 0.3585561 0.5607952 0.7335241 0.4682540 2.799095e-05
## 2 0.6666667 0.3476331 0.5771925 0.7481028 0.5079365 2.268179e-04
## 3 0.6587302 0.3208824 0.5689813 0.7408262 0.5793651 4.215226e-02
## McnemarPValue
## 1 0.01033895
## 2 NaN
## 3 0.11030268
In this section, we will display and discuss the results obtained from the models.
The following function picks the best model (the one with the highest accuracy) out of a list of results:
get_best_model <- function(model_results_list) {
accuracies <- sapply(model_results_list, function(result) result$overall[1])
best_model_index <- which.max(accuracies)
return(model_results_list[[best_model_index]])
}
confmat_plot <- function(confusion_df, ttl) {
plot <- ggplot(confusion_df, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "red", high = "blue") +
theme_minimal() +
labs(
title = ttl,
x = "Label",
y = "Prediction"
)
plot
}
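confmat_plot() expects the long-format data frame that data.frame() produces from a caret confusion table: one row per (Prediction, Reference) cell with its Freq. A toy frame of that shape (illustrative values only):

```r
# One row per cell of a 3x3 confusion matrix.
toy_conf <- expand.grid(Prediction = factor(1:3), Reference = factor(1:3))
toy_conf$Freq <- c(30, 2, 5, 1, 4, 3, 6, 0, 20)
head(toy_conf, 3)
```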
grid.arrange(
confmat_plot(data.frame(get_best_model(percep_results)$table), "Perceptron"),
confmat_plot(data.frame(get_best_model(randft_results)$table), "Random Forest"),
confmat_plot(data.frame(get_best_model(svm_results)$table), "SVM"),
confmat_plot(data.frame(get_best_model(soft_results)$table), "Softmax"),
confmat_plot(data.frame(get_best_model(nbc_results)$table), "Naive Bayes"),
confmat_plot(data.frame(get_best_model(dec_tree_results)$table), "Decision Tree"),
confmat_plot(data.frame(get_best_model(adaboost_results)$table), "Adaptive Boost."),
ncol = 3
)
models_results <- list(
percep_results, randft_results, svm_results,
soft_results, nbc_results, dec_tree_results, adaboost_results
)
accuracies <- vector("list", length = length(models_results))
idx_1 <- 1
for (model_results in models_results) {
  model_accuracy <- vector("list", length = length(model_results))
  idx_2 <- 1
  for (result in model_results) {
    model_accuracy[[idx_2]] <- result$overall[1]
    idx_2 <- idx_2 + 1
  }
  accuracies[[idx_1]] <- model_accuracy
  idx_1 <- idx_1 + 1
}
accuracy_model1 <- unlist(accuracies[[1]])
accuracy_model2 <- unlist(accuracies[[2]])
accuracy_model3 <- unlist(accuracies[[3]])
accuracy_model4 <- unlist(accuracies[[4]])
accuracy_model5 <- unlist(accuracies[[5]])
accuracy_model6 <- unlist(accuracies[[6]])
accuracy_model7 <- unlist(accuracies[[7]])
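The accuracy collection above can also be written with nested sapply calls, which return the runs-by-models matrix directly; a sketch on a toy list with the same shape as the per-model result lists:

```r
# Two models with two runs each, mimicking the structure of the caret
# results collected above (only $overall["Accuracy"] is kept).
toy_results <- list(
  list(list(overall = c(Accuracy = 0.65)), list(overall = c(Accuracy = 0.70))),
  list(list(overall = c(Accuracy = 0.50)), list(overall = c(Accuracy = 0.55)))
)
acc <- sapply(toy_results, function(runs)
  sapply(runs, function(r) r$overall[["Accuracy"]])
)
acc  # one column per model, one row per run
```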
accuracy_df <- data.frame(
Model = rep(c("Perceptron", "Random Forest", "SVM", "Softmax", "Naive Bayes", "Decision Tree", "AdaBoost"), each = length(accuracy_model1)),
Accuracy = c(accuracy_model1, accuracy_model2, accuracy_model3, accuracy_model4, accuracy_model5, accuracy_model6, accuracy_model7)
)
ggplot(accuracy_df, aes(x = Model, y = Accuracy, fill = Model)) +
geom_boxplot() +
labs(title = "Grouped Boxplot of Accuracy", y = "Accuracy") +
theme_minimal()
As observed, the accuracy varies significantly between models, as expected, because each model treats the data from a different perspective and is trained on a random subset of the cleaned dataset. It is noteworthy that the Naive Bayes model emerges as the best-performing one, with a slight lead in accuracy. Its accuracy values also remain relatively stable, unlike the Softmax model, which shows significant instability. The Adaptive Boosting and Random Forest models also stand out, with some of the highest accuracies and remarkable consistency.
To conclude, it’s important to emphasize that determining the best model goes beyond the accuracy score alone. The choice depends on the model’s intended purpose, and it is essential to consider whether a more stable model is preferable or if one with higher variability and a better score is more suitable.