Between 1974 and 1984, 424 PBC patients referred to the Mayo Clinic met the eligibility criteria for a randomized placebo-controlled trial of the drug D-penicillamine. The first 312 of these patients participated in the trial and have largely complete data. The remaining 112 patients did not join the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of them were lost to follow-up shortly after diagnosis, leaving data on 106 of these individuals in addition to the 312 who took part in the randomized trial.
The objective of this assignment is to predict the survival state of patients with liver cirrhosis.
For this assignment, the following libraries are used:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggcorrplot)
library(mlpack) # Provides the classification models used in this assignment
##
## Attaching package: 'mlpack'
## The following object is masked from 'package:stats':
##
## kmeans
## The following object is masked from 'package:base':
##
## det
library(caret) # Used to obtain all the metrics of the different models
## Loading required package: lattice
This assignment uses the cirrhosis dataset, which can be found in the datasets folder.
dataset <- read.csv("../datasets/cirrhosis.csv")
Let’s examine the dataset to see which variables will be useful to our study. First, let’s convert all the character columns to factors so that the summary also shows the NA counts.
temp_dataset <- dataset
temp_dataset[, c("Status", "Drug", "Sex", "Ascites", "Hepatomegaly", "Spiders", "Edema")] <- lapply(temp_dataset[, c("Status", "Drug", "Sex", "Ascites", "Hepatomegaly", "Spiders", "Edema")], as.factor)
summary(temp_dataset)
## ID N_Days Status Drug Age
## Min. : 1.0 Min. : 41 C :232 D-penicillamine:158 Min. : 9598
## 1st Qu.:105.2 1st Qu.:1093 CL: 25 Placebo :154 1st Qu.:15644
## Median :209.5 Median :1730 D :161 NA's :106 Median :18628
## Mean :209.5 Mean :1918 Mean :18533
## 3rd Qu.:313.8 3rd Qu.:2614 3rd Qu.:21272
## Max. :418.0 Max. :4795 Max. :28650
##
## Sex Ascites Hepatomegaly Spiders Edema Bilirubin
## F:374 N :288 N :152 N :222 N:354 Min. : 0.300
## M: 44 Y : 24 Y :160 Y : 90 S: 44 1st Qu.: 0.800
## NA's:106 NA's:106 NA's:106 Y: 20 Median : 1.400
## Mean : 3.221
## 3rd Qu.: 3.400
## Max. :28.000
##
## Cholesterol Albumin Copper Alk_Phos
## Min. : 120.0 Min. :1.960 Min. : 4.00 Min. : 289.0
## 1st Qu.: 249.5 1st Qu.:3.243 1st Qu.: 41.25 1st Qu.: 871.5
## Median : 309.5 Median :3.530 Median : 73.00 Median : 1259.0
## Mean : 369.5 Mean :3.497 Mean : 97.65 Mean : 1982.7
## 3rd Qu.: 400.0 3rd Qu.:3.770 3rd Qu.:123.00 3rd Qu.: 1980.0
## Max. :1775.0 Max. :4.640 Max. :588.00 Max. :13862.4
## NA's :134 NA's :108 NA's :106
## SGOT Tryglicerides Platelets Prothrombin
## Min. : 26.35 Min. : 33.00 Min. : 62.0 Min. : 9.00
## 1st Qu.: 80.60 1st Qu.: 84.25 1st Qu.:188.5 1st Qu.:10.00
## Median :114.70 Median :108.00 Median :251.0 Median :10.60
## Mean :122.56 Mean :124.70 Mean :257.0 Mean :10.73
## 3rd Qu.:151.90 3rd Qu.:151.00 3rd Qu.:318.0 3rd Qu.:11.10
## Max. :457.25 Max. :598.00 Max. :721.0 Max. :18.00
## NA's :106 NA's :136 NA's :11 NA's :2
## Stage
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :3.024
## 3rd Qu.:4.000
## Max. :4.000
## NA's :6
As we can see, there are a lot of missing values, a problem we will have to address in some way.
Another important step before dealing with the NA’s is checking the type of each variable:
str(dataset)
## 'data.frame': 418 obs. of 20 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ N_Days : int 400 4500 1012 1925 1504 2503 1832 2466 2400 51 ...
## $ Status : chr "D" "C" "D" "D" ...
## $ Drug : chr "D-penicillamine" "D-penicillamine" "D-penicillamine" "D-penicillamine" ...
## $ Age : int 21464 20617 25594 19994 13918 24201 20284 19379 15526 25772 ...
## $ Sex : chr "F" "F" "M" "F" ...
## $ Ascites : chr "Y" "N" "N" "N" ...
## $ Hepatomegaly : chr "Y" "Y" "N" "Y" ...
## $ Spiders : chr "Y" "Y" "N" "Y" ...
## $ Edema : chr "Y" "N" "S" "S" ...
## $ Bilirubin : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ Cholesterol : int 261 302 176 244 279 248 322 280 562 200 ...
## $ Albumin : num 2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
## $ Copper : int 156 54 210 64 143 50 52 52 79 140 ...
## $ Alk_Phos : num 1718 7395 516 6122 671 ...
## $ SGOT : num 137.9 113.5 96.1 60.6 113.2 ...
## $ Tryglicerides: int 172 88 55 92 72 63 213 189 88 143 ...
## $ Platelets : int 190 221 151 183 136 NA 204 373 251 302 ...
## $ Prothrombin : num 12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
## $ Stage : int 4 3 4 4 3 3 3 3 2 4 ...
Let’s rename the dataset to the shorter df, which also keeps the original unmodified.
df <- dataset
We can remove ID from the dataset because it is only used to index the rows and has no other meaningful relation to the rest of the data.
df <- subset(df, select = -ID)
As stated in the introduction, some patients took neither the drug nor the placebo; those values appear as NA’s, so let’s replace them with None.
df <- df %>% mutate(Drug = ifelse(is.na(Drug), "None", Drug))
Now that the NA’s in Drug are handled, we can plot it and confirm that, as stated in the introduction, 106 people took neither treatment.
ggplot(df, aes(x = Drug)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", size = 4) +
labs(title = "Drugs taken", x = "Drug", y = "Count")
So now let’s clean the rest of the data by removing the NA’s. For character columns we replace NA with None; for numeric columns we use the mean of the variable, except for Stage, where we use the median because half a stage does not make sense.
df <- df %>% mutate(Ascites = ifelse(is.na(Ascites), "None", Ascites))
df <- df %>% mutate(Hepatomegaly = ifelse(is.na(Hepatomegaly), "None", Hepatomegaly))
df <- df %>% mutate(Spiders = ifelse(is.na(Spiders), "None", Spiders))
df <- df %>% mutate(Cholesterol = ifelse(is.na(Cholesterol), mean(Cholesterol, na.rm = TRUE), Cholesterol))
df <- df %>% mutate(Alk_Phos = ifelse(is.na(Alk_Phos), mean(Alk_Phos, na.rm = TRUE), Alk_Phos))
df <- df %>% mutate(SGOT = ifelse(is.na(SGOT), mean(SGOT, na.rm = TRUE), SGOT))
df <- df %>% mutate(Copper = ifelse(is.na(Copper), mean(Copper, na.rm = TRUE), Copper))
df <- df %>% mutate(Tryglicerides = ifelse(is.na(Tryglicerides), mean(Tryglicerides, na.rm = TRUE), Tryglicerides))
df <- df %>% mutate(Platelets = ifelse(is.na(Platelets), mean(Platelets, na.rm = TRUE), Platelets))
df <- df %>% mutate(Prothrombin = ifelse(is.na(Prothrombin), mean(Prothrombin, na.rm = TRUE), Prothrombin))
df <- df %>% mutate(Stage = ifelse(is.na(Stage), median(Stage, na.rm = TRUE), Stage))
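As a side note, the repeated mutate calls above can be collapsed into a single step with dplyr’s across(); a minimal sketch (the impute_df name and the toy usage below are illustrative, not part of the assignment code):

```r
library(dplyr)

# Columns that are mean-imputed in the assignment.
num_cols <- c("Cholesterol", "Alk_Phos", "SGOT", "Copper",
              "Tryglicerides", "Platelets", "Prothrombin")

# Replace each column's NA's with its mean, and Stage's NA's with its median.
impute_df <- function(data) {
  data %>%
    mutate(across(all_of(num_cols),
                  ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
    mutate(Stage = ifelse(is.na(Stage), median(Stage, na.rm = TRUE), Stage))
}
```

Applied to df, this produces the same result as the eight separate mutate calls.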
summary(df)
## N_Days Status Drug Age
## Min. : 41 Length:418 Length:418 Min. : 9598
## 1st Qu.:1093 Class :character Class :character 1st Qu.:15644
## Median :1730 Mode :character Mode :character Median :18628
## Mean :1918 Mean :18533
## 3rd Qu.:2614 3rd Qu.:21272
## Max. :4795 Max. :28650
## Sex Ascites Hepatomegaly Spiders
## Length:418 Length:418 Length:418 Length:418
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Edema Bilirubin Cholesterol Albumin
## Length:418 Min. : 0.300 Min. : 120.0 Min. :1.960
## Class :character 1st Qu.: 0.800 1st Qu.: 273.0 1st Qu.:3.243
## Mode :character Median : 1.400 Median : 369.5 Median :3.530
## Mean : 3.221 Mean : 369.5 Mean :3.497
## 3rd Qu.: 3.400 3rd Qu.: 369.5 3rd Qu.:3.770
## Max. :28.000 Max. :1775.0 Max. :4.640
## Copper Alk_Phos SGOT Tryglicerides
## Min. : 4.00 Min. : 289 Min. : 26.35 Min. : 33.0
## 1st Qu.: 51.25 1st Qu.: 1016 1st Qu.: 91.00 1st Qu.: 95.0
## Median : 97.65 Median : 1717 Median :122.56 Median :124.7
## Mean : 97.65 Mean : 1983 Mean :122.56 Mean :124.7
## 3rd Qu.:100.75 3rd Qu.: 1983 3rd Qu.:135.75 3rd Qu.:127.8
## Max. :588.00 Max. :13862 Max. :457.25 Max. :598.0
## Platelets Prothrombin Stage
## Min. : 62.0 Min. : 9.00 Min. :1.000
## 1st Qu.:190.0 1st Qu.:10.00 1st Qu.:2.000
## Median :253.0 Median :10.60 Median :3.000
## Mean :257.0 Mean :10.73 Mean :3.024
## 3rd Qu.:315.5 3rd Qu.:11.10 3rd Qu.:4.000
## Max. :721.0 Max. :18.00 Max. :4.000
Now that all the data is cleaned, we can start to look at other things, for example the Status of the patients, which is the main feature we will use in this study to answer the question raised above.
ggplot(df, aes(x = Status)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black", size = 4) +
labs(title = "Status", x = "Status", y = "Count")
df$AgeInYear <- df$Age / 365.25 # Age is recorded in days; 365.25 accounts for leap years
df$AgeInYear
## [1] 58.76523 56.44627 70.07255 54.74059 38.10541 66.25873 55.53457 53.05681
## [9] 42.50787 70.55989 53.71389 59.13758 45.68925 56.22177 64.64613 40.44353
## [17] 52.18344 53.93018 49.56057 59.95346 64.18891 56.27652 55.96715 44.52019
## [25] 45.07324 52.02464 54.43943 44.94730 63.87680 41.38535 41.55236 53.99589
## [33] 51.28268 52.06023 48.61875 56.41068 61.72758 36.62697 55.39220 46.66940
## [41] 33.63450 33.69473 48.87064 37.58248 41.79329 45.79877 47.42779 49.13621
## [49] 61.15264 53.50856 52.08761 50.54073 67.40862 39.19781 65.76318 33.61807
## [57] 53.57153 44.56947 40.39425 58.38193 43.89870 60.70637 46.62834 62.90760
## [65] 40.20260 46.45311 51.28816 32.61328 49.33881 56.39973 48.84600 32.49281
## [73] 38.49418 51.92060 43.51814 51.94251 49.82615 47.94524 46.51608 67.41136
## [81] 63.26352 67.31006 56.01369 55.83025 47.21697 52.75838 37.27858 41.39357
## [89] 52.44353 33.47570 45.60712 76.70910 36.53388 53.91650 46.39014 48.84600
## [97] 71.89322 28.88433 48.46817 51.46886 44.95003 56.56947 48.96372 43.01711
## [105] 34.03970 68.50924 62.52156 50.35729 44.06297 38.91034 41.15264 55.45791
## [113] 51.23340 52.82683 42.63929 61.07050 49.65640 48.85421 54.25599 35.15127
## [121] 67.90691 55.43600 45.82067 52.88980 47.18138 53.59890 44.10404 41.94935
## [129] 63.61396 44.22724 62.00137 40.55305 62.64476 42.33539 42.96783 55.96167
## [137] 62.86105 51.24983 46.76249 54.07529 47.03628 55.72621 46.10267 52.28747
## [145] 51.20055 33.86448 75.01164 30.86379 61.80424 34.98700 55.04175 69.94114
## [153] 49.60438 69.37714 43.55647 59.40862 48.75838 36.49281 45.76044 57.37166
## [161] 42.74333 58.81725 53.49760 43.41410 53.30595 41.35524 60.95825 47.75359
## [169] 35.49076 48.66256 52.66804 49.86995 30.27515 55.56742 52.15332 41.60986
## [177] 55.45243 70.00411 43.94251 42.56810 44.56947 56.94456 40.26010 37.60712
## [185] 48.36140 70.83641 35.79192 62.62286 50.64750 54.52704 52.69268 52.72005
## [193] 56.77207 44.39699 29.55510 57.04038 44.62697 35.79740 40.71732 32.23272
## [201] 41.09240 61.63997 37.05681 62.57906 48.97741 61.99042 72.77207 61.29500
## [209] 52.62423 49.76318 52.91444 47.26352 50.20397 69.34702 41.16906 59.16496
## [217] 36.07940 34.59548 42.71321 63.63039 56.62971 46.26420 61.24298 38.62012
## [225] 38.77070 56.69541 58.95140 36.92266 62.41478 34.60917 58.33539 50.18207
## [233] 42.68583 34.37919 33.18275 38.38193 59.76181 66.41205 46.78987 56.07940
## [241] 41.37440 64.57221 67.48802 44.82957 45.77139 32.95003 41.22108 55.41684
## [249] 47.98084 40.79124 56.97467 68.46270 78.43943 39.85763 35.31006 31.44422
## [257] 58.26420 51.48802 59.96988 74.52430 52.36413 42.78713 34.87474 44.13963
## [265] 46.38193 56.30938 70.90760 55.39493 45.08419 26.27789 50.47228 38.39836
## [273] 47.41958 47.98084 38.31622 50.10815 35.08830 32.50376 56.15332 46.15469
## [281] 65.88364 33.94387 62.86105 48.56400 46.34908 38.85284 58.64750 48.93634
## [289] 67.57290 65.98494 40.90075 50.24504 57.19644 60.53662 35.35113 31.38125
## [297] 55.98631 52.72553 38.09172 58.17112 45.21013 37.79877 60.65982 35.53457
## [305] 43.06639 56.39151 30.57358 61.18275 58.29979 62.33265 37.99863 33.15264
## [313] 60.00000 64.99932 54.00137 75.00068 62.00137 43.00068 46.00137 44.00000
## [321] 60.99932 64.00000 40.00000 63.00068 34.00137 52.00000 48.99932 54.00137
## [329] 63.00068 54.00137 46.00137 52.99932 56.00000 56.00000 55.00068 64.99932
## [337] 56.00000 47.00068 60.00000 52.99932 54.00137 50.00137 48.00000 36.00000
## [345] 48.00000 70.00137 51.00068 52.00000 54.00137 48.00000 66.00137 52.99932
## [353] 62.00137 59.00068 39.00068 67.00068 58.00137 64.00000 46.00137 64.00000
## [361] 40.99932 48.99932 44.00000 59.00068 63.00068 60.99932 64.00000 48.99932
## [369] 42.00137 50.00137 51.00068 36.99932 62.00137 51.00068 52.00000 44.00000
## [377] 32.99932 60.00000 63.00068 32.99932 40.99932 51.00068 36.99932 59.00068
## [385] 55.00068 54.00137 48.99932 40.00000 67.00068 68.00000 40.99932 68.99932
## [393] 52.00000 56.99932 36.00000 50.00137 64.00000 62.00137 42.00137 44.00000
## [401] 68.99932 52.00000 66.00137 40.00000 52.00000 46.00137 54.00137 51.00068
## [409] 43.00068 39.00068 51.00068 67.00068 35.00068 67.00068 39.00068 56.99932
## [417] 58.00137 52.99932
df$year_group <- cut(df$AgeInYear,
breaks = seq(0, 100, by = 10),
include.lowest = TRUE,
labels = FALSE
)
df$year_group
## [1] 6 6 8 6 4 7 6 6 5 8 6 6 5 6 7 5 6 6 5 6 7 6 6 5 5 6 6 5 7 5 5 6 6 6 5 6 7
## [38] 4 6 5 4 4 5 4 5 5 5 5 7 6 6 6 7 4 7 4 6 5 5 6 5 7 5 7 5 5 6 4 5 6 5 4 4 6
## [75] 5 6 5 5 5 7 7 7 6 6 5 6 4 5 6 4 5 8 4 6 5 5 8 3 5 6 5 6 5 5 4 7 7 6 5 4 5
## [112] 6 6 6 5 7 5 5 6 4 7 6 5 6 5 6 5 5 7 5 7 5 7 5 5 6 7 6 5 6 5 6 5 6 6 4 8 4
## [149] 7 4 6 7 5 7 5 6 5 4 5 6 5 6 6 5 6 5 7 5 4 5 6 5 4 6 6 5 6 8 5 5 5 6 5 4 5
## [186] 8 4 7 6 6 6 6 6 5 3 6 5 4 5 4 5 7 4 7 5 7 8 7 6 5 6 5 6 7 5 6 4 4 5 7 6 5
## [223] 7 4 4 6 6 4 7 4 6 6 5 4 4 4 6 7 5 6 5 7 7 5 5 4 5 6 5 5 6 7 8 4 4 4 6 6 6
## [260] 8 6 5 4 5 5 6 8 6 5 3 6 4 5 5 4 6 4 4 6 5 7 4 7 5 5 4 6 5 7 7 5 6 6 7 4 4
## [297] 6 6 4 6 5 4 7 4 5 6 4 7 6 7 4 4 6 7 6 8 7 5 5 5 7 7 4 7 4 6 5 6 7 6 5 6 6
## [334] 6 6 7 6 5 6 6 6 6 5 4 5 8 6 6 6 5 7 6 7 6 4 7 6 7 5 7 5 5 5 6 7 7 7 5 5 6
## [371] 6 4 7 6 6 5 4 6 7 4 5 6 4 6 6 6 5 4 7 7 5 7 6 6 4 6 7 7 5 5 7 6 7 4 6 5 6
## [408] 6 5 4 6 7 4 7 4 6 6 6
ggplot(df, aes(x = year_group)) +
geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
labs(title = "Number of people per decade", x = "Decade group", y = "Count")
To see which features are good candidates for making the prediction, let’s build a correlation matrix and see which are most promising.
plot_corr <- ggcorr(
data = df,
low = "steelblue",
mid = "white",
high = "darkred",
label = TRUE,
label_alpha = 0,
angle = -45,
max_size = 10,
min_size = 2,
size = 3,
hjust = 0.75
)
## Warning in ggcorr(data = df, low = "steelblue", mid = "white", high =
## "darkred", : data in column(s) 'Status', 'Drug', 'Sex', 'Ascites',
## 'Hepatomegaly', 'Spiders', 'Edema' are not numeric and were ignored
plot_corr +
theme(text = element_text(size = 8))
As we can see, a combination of all or some of the variables related to the patients’ medical condition is a good basis for predicting their status.
stage_proportions <- round(prop.table(table(df$Stage)), 2)
pie(stage_proportions, labels = paste(names(stage_proportions), ": ", stage_proportions), col = rainbow(length(stage_proportions)))
legend("topright", legend = names(stage_proportions), fill = rainbow(length(stage_proportions)), cex = 0.8, title = "Stage")
This plot shows that the majority of the patients are in the third stage, followed by the fourth stage.
Given all the previous information, we have decided to subset the data, keeping only the variables related to the health state of the patient and their stage.
stage_df <- subset(df, select = c(Stage, Ascites, Hepatomegaly, Spiders, Cholesterol, Alk_Phos, Bilirubin, Albumin, Copper, SGOT, Tryglicerides, Platelets, Prothrombin))
Now we can see the data that will be used to make the predictions.
summary(stage_df)
## Stage Ascites Hepatomegaly Spiders
## Min. :1.000 Length:418 Length:418 Length:418
## 1st Qu.:2.000 Class :character Class :character Class :character
## Median :3.000 Mode :character Mode :character Mode :character
## Mean :3.024
## 3rd Qu.:4.000
## Max. :4.000
## Cholesterol Alk_Phos Bilirubin Albumin
## Min. : 120.0 Min. : 289 Min. : 0.300 Min. :1.960
## 1st Qu.: 273.0 1st Qu.: 1016 1st Qu.: 0.800 1st Qu.:3.243
## Median : 369.5 Median : 1717 Median : 1.400 Median :3.530
## Mean : 369.5 Mean : 1983 Mean : 3.221 Mean :3.497
## 3rd Qu.: 369.5 3rd Qu.: 1983 3rd Qu.: 3.400 3rd Qu.:3.770
## Max. :1775.0 Max. :13862 Max. :28.000 Max. :4.640
## Copper SGOT Tryglicerides Platelets
## Min. : 4.00 Min. : 26.35 Min. : 33.0 Min. : 62.0
## 1st Qu.: 51.25 1st Qu.: 91.00 1st Qu.: 95.0 1st Qu.:190.0
## Median : 97.65 Median :122.56 Median :124.7 Median :253.0
## Mean : 97.65 Mean :122.56 Mean :124.7 Mean :257.0
## 3rd Qu.:100.75 3rd Qu.:135.75 3rd Qu.:127.8 3rd Qu.:315.5
## Max. :588.00 Max. :457.25 Max. :598.0 Max. :721.0
## Prothrombin
## Min. : 9.00
## 1st Qu.:10.00
## Median :10.60
## Mean :10.73
## 3rd Qu.:11.10
## Max. :18.00
str(stage_df)
## 'data.frame': 418 obs. of 13 variables:
## $ Stage : num 4 3 4 4 3 3 3 3 2 4 ...
## $ Ascites : chr "Y" "N" "N" "N" ...
## $ Hepatomegaly : chr "Y" "Y" "N" "Y" ...
## $ Spiders : chr "Y" "Y" "N" "Y" ...
## $ Cholesterol : num 261 302 176 244 279 248 322 280 562 200 ...
## $ Alk_Phos : num 1718 7395 516 6122 671 ...
## $ Bilirubin : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ Albumin : num 2.6 4.14 3.48 2.54 3.53 3.98 4.09 4 3.08 2.74 ...
## $ Copper : num 156 54 210 64 143 50 52 52 79 140 ...
## $ SGOT : num 137.9 113.5 96.1 60.6 113.2 ...
## $ Tryglicerides: num 172 88 55 92 72 63 213 189 88 143 ...
## $ Platelets : num 190 221 151 183 136 ...
## $ Prothrombin : num 12.2 10.6 12 10.3 10.9 11 9.7 11 11 11.5 ...
Another correlation plot lets us see the relations between the selected variables more clearly.
plot_corr <- ggcorr(
data = stage_df,
low = "steelblue",
mid = "white",
high = "darkred",
label = TRUE,
label_alpha = 0,
angle = -45,
max_size = 10,
min_size = 2,
size = 3,
hjust = 0.75
)
## Warning in ggcorr(data = stage_df, low = "steelblue", mid = "white", high =
## "darkred", : data in column(s) 'Ascites', 'Hepatomegaly', 'Spiders' are not
## numeric and were ignored
plot_corr +
theme(text = element_text(size = 8))
Now that we have selected the data we want to use to predict patient survival, we can start the predictions.
First, the Status column has to be transformed into numeric labels so that we can measure the accuracy of the different models.
labels <- as.numeric(as.factor(df$Status))
labels
## [1] 3 1 3 3 2 3 1 3 3 3 3 3 1 3 3 1 3 3 1 3 1 3 3 3 1 3 3 3 1 3 3 1 3 1 3 1 3
## [38] 3 3 1 3 1 1 3 1 3 1 1 3 3 3 3 3 3 3 3 3 1 3 1 1 3 3 3 1 3 3 1 3 1 1 1 1 3
## [75] 3 3 3 3 1 3 3 3 1 1 3 3 3 1 3 3 3 3 1 3 3 1 3 1 1 3 1 1 3 3 2 3 1 3 1 3 2
## [112] 3 3 3 1 1 3 3 3 2 3 1 3 1 2 3 1 3 1 3 3 1 3 1 1 1 1 3 1 1 1 3 3 3 1 1 1 3
## [149] 3 1 1 3 1 3 1 3 1 2 3 1 1 3 3 3 3 1 3 1 3 1 1 1 1 1 1 3 1 1 1 1 1 1 2 3 1
## [186] 3 3 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 1 1 1 1 1 3 3 1 3 1 1 3 1 3
## [223] 3 1 1 1 3 1 3 1 3 1 1 1 1 1 1 1 3 1 2 1 3 3 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1
## [260] 1 1 1 2 2 2 1 3 3 1 1 1 1 1 2 1 1 1 1 1 1 3 1 1 1 1 1 1 2 3 1 2 1 1 1 2 1
## [297] 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 3 1 1 3 1 1 1 1 1 3 3 3 3 3 3
## [334] 3 1 1 3 3 1 1 3 1 3 1 2 3 3 1 1 3 3 1 1 3 1 3 1 1 1 3 2 2 1 3 1 3 1 3 3 3
## [371] 3 1 1 1 2 3 1 3 3 2 1 3 2 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1 3 1
## [408] 1 1 1 1 1 1 3 1 1 1 1
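Because as.factor orders its levels alphabetically, the coding here is C → 1, CL → 2 and D → 3. A self-contained check of that mapping on a toy sample (not the real column):

```r
status <- c("D", "C", "CL", "D")  # toy sample of Status values
f <- as.factor(status)
levels(f)      # "C" "CL" "D" -> codes 1, 2, 3
as.numeric(f)  # 3 1 2 3
```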
This function splits the data, returning the train and test data together with their labels, given a dataset, the labels, the test-set fraction, and the random seed to use.
train_test_split <- function(x_df, y_labs, test_size, seed) {
set.seed(seed)
index <- sample(nrow(x_df), (1 - test_size) * nrow(x_df))
train_data <- x_df[index, ]
train_labels <- as.matrix(y_labs[index])
test_data <- x_df[-index, ]
test_labels <- as.matrix(y_labs[-index])
return(list(
train_data = train_data,
train_labels = train_labels,
test_data = test_data,
test_labels = test_labels
))
}
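As a quick sanity check of the sampling logic, the core sample() call can be exercised on toy data (a 50/50 split is used here so that (1 - test_size) * n is a whole number):

```r
set.seed(42)
n <- 10
test_size <- 0.5                         # chosen so the training size is exact
index <- sample(n, (1 - test_size) * n)  # training row indices
length(index)                # 5 training rows
length(setdiff(1:n, index))  # 5 test rows
```

Rows not drawn into index form the test set, so the two subsets are disjoint and together cover all rows.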
This function creates and trains the given model, predicts on the test data, and returns the resulting confusion matrix:
run_classifier <- function(classifier, train_data, train_labels, test_data, test_labels) {
model <- classifier(training = train_data, labels = train_labels)
model <- model$output_model
output <- classifier(input_model = model, test = test_data)
predictions <- output$predictions
res <- confusionMatrix(
factor(predictions, levels = c(1, 2, 3)),
factor(test_labels)
)
return(res)
}
Once the data is prepared, it is ready to be used in the models. In this section, each model is trained on three random splits of the cleaned dataset and asked to classify the test rows into one of the class labels. The models are: Perceptron, Random Forest, Linear SVM, Softmax Regression, Naive Bayes, Decision Tree, and AdaBoost.
Each run outputs the confusion matrix and the evaluation metrics for that run. These will be studied and commented on in the Results section; the results of each run are also displayed immediately after each model in the following sections.
percep_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
percep_results <- vector("list", length = length(percep_runs))
for (i in seq_along(percep_runs)) {
s_ <- percep_runs[[i]][[1]]
x_ <- percep_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
perceptron,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
percep_results[[i]] <- model_res
}
over_percep_results <- sapply(percep_results, function(result) result$overall)
over_percep_df <- as.data.frame(t(over_percep_results))
print(over_percep_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.5476190 0.1617647 0.4565147 0.6364336 0.4682540 0.04507228
## 2 0.5634921 0.2289720 0.4723096 0.6516209 0.5079365 0.12329026
## 3 0.5793651 0.0000000 0.4881926 0.6667196 0.5793651 0.53781155
## McnemarPValue
## 1 0.0004308605
## 2 0.0241862305
## 3 NaN
randft_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
randft_results <- vector("list", length = length(randft_runs))
for (i in seq_along(randft_runs)) {
s_ <- randft_runs[[i]][[1]]
x_ <- randft_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
random_forest,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
randft_results[[i]] <- model_res
}
over_randft_results <- sapply(randft_results, function(result) result$overall)
over_randft_df <- as.data.frame(t(over_randft_results))
print(over_randft_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6587302 0.3726989 0.5689813 0.7408262 0.4682540 1.271571e-05
## 2 0.6507937 0.3220836 0.5607952 0.7335241 0.5079365 8.411121e-04
## 3 0.6746032 0.3494522 0.5854291 0.7553536 0.5793651 1.801934e-02
## McnemarPValue
## 1 0.003395845
## 2 NaN
## 3 0.233689002
svm_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
svm_results <- vector("list", length = length(svm_runs))
for (i in seq_along(svm_runs)) {
s_ <- svm_runs[[i]][[1]]
x_ <- svm_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
linear_svm,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
svm_results[[i]] <- model_res
}
over_svm_results <- sapply(svm_results, function(result) result$overall)
over_svm_df <- as.data.frame(t(over_svm_results))
print(over_svm_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.5158730 0.1013679 0.4251861 0.6057968 0.4682540 0.163053612
## 2 0.5952381 0.1953418 0.5041649 0.6817283 0.5079365 0.030370130
## 3 0.7063492 0.3813694 0.6186399 0.7840872 0.5793651 0.002219661
## McnemarPValue
## 1 0.0000508402
## 2 NaN
## 3 0.0041104971
soft_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
soft_results <- vector("list", length = length(soft_runs))
for (i in seq_along(soft_runs)) {
s_ <- soft_runs[[i]][[1]]
x_ <- soft_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
softmax_regression,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
soft_results[[i]] <- model_res
}
over_soft_results <- sapply(soft_results, function(result) result$overall)
over_soft_df <- as.data.frame(t(over_soft_results))
print(over_soft_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1 0.4682540 0.000000000 0.3788413 0.5591930 0.4682540
## 2 0.2063492 0.005681818 0.1394428 0.2875473 0.5079365
## 3 0.4920635 -0.068220956 0.4019165 0.5825920 0.5793651
## AccuracyPValue McnemarPValue
## 1 0.5347941 NaN
## 2 1.0000000 3.591953e-19
## 3 0.9805038 1.690918e-10
nbc_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
nbc_results <- vector("list", length = length(nbc_runs))
for (i in seq_along(nbc_runs)) {
s_ <- nbc_runs[[i]][[1]]
x_ <- nbc_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
nbc,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
nbc_results[[i]] <- model_res
}
over_nbc_results <- sapply(nbc_results, function(result) result$overall)
over_nbc_df <- as.data.frame(t(over_nbc_results))
print(over_nbc_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6746032 0.4112821 0.5854291 0.7553536 0.4682540 2.370435e-06
## 2 0.6825397 0.3823529 0.5936916 0.7625780 0.5079365 5.345444e-05
## 3 0.7301587 0.4585440 0.6438416 0.8053373 0.5793651 3.189246e-04
## McnemarPValue
## 1 0.03172225
## 2 NaN
## 3 0.02612204
dec_tree_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
dec_tree_results <- vector("list", length = length(dec_tree_runs))
for (i in seq_along(dec_tree_runs)) {
s_ <- dec_tree_runs[[i]][[1]]
x_ <- dec_tree_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
decision_tree,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
dec_tree_results[[i]] <- model_res
}
over_tree_results <- sapply(dec_tree_results, function(result) result$overall)
over_tree_df <- as.data.frame(t(over_tree_results))
print(over_tree_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6507937 0.3558731 0.5607952 0.7335241 0.4682540 2.799095e-05
## 2 0.6587302 0.3563792 0.5689813 0.7408262 0.5079365 4.440815e-04
## 3 0.6269841 0.2788602 0.5363835 0.7114691 0.5793651 1.604822e-01
## McnemarPValue
## 1 0.005894062
## 2 0.738933269
## 3 0.065931882
adaboost_runs <- list(list(1234, stage_df), list(4567, stage_df), list(2805, stage_df))
adaboost_results <- vector("list", length = length(adaboost_runs))
for (i in seq_along(adaboost_runs)) {
s_ <- adaboost_runs[[i]][[1]]
x_ <- adaboost_runs[[i]][[2]]
res <- train_test_split(x_df = x_, y_labs = labels, test_size = 0.3, seed = s_)
model_res <- run_classifier(
adaboost,
res$train_data,
res$train_labels,
res$test_data,
res$test_labels
)
adaboost_results[[i]] <- model_res
}
over_ada_results <- sapply(adaboost_results, function(result) result$overall)
over_ada_df <- as.data.frame(t(over_ada_results))
print(over_ada_df)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
## 1 0.6507937 0.3585561 0.5607952 0.7335241 0.4682540 2.799095e-05
## 2 0.6666667 0.3476331 0.5771925 0.7481028 0.5079365 2.268179e-04
## 3 0.6587302 0.3208824 0.5689813 0.7408262 0.5793651 4.215226e-02
## McnemarPValue
## 1 0.01033895
## 2 NaN
## 3 0.11030268
In this section, we will display and discuss the results obtained from the models.
The following function picks the best model (the one with the highest accuracy) out of a list of results:
get_best_model <- function(model_results_list) {
accuracies <- sapply(model_results_list, function(result) result$overall[1])
best_model_index <- which.max(accuracies)
return(model_results_list[[best_model_index]])
}
confmat_plot <- function(confusion_df, ttl) {
plot <- ggplot(confusion_df, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = Freq), color = "white") +
geom_text(aes(label = Freq), vjust = 1) +
scale_fill_gradient(low = "red", high = "blue") +
theme_minimal() +
labs(
title = ttl,
x = "Label",
y = "Prediction"
)
plot
}
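confmat_plot() expects the long-format data frame that data.frame() produces from a caret confusion table: one row per (Prediction, Reference) cell with its Freq. A toy frame of that shape (illustrative values only):

```r
# One row per cell of a 3x3 confusion matrix.
toy_conf <- expand.grid(Prediction = factor(1:3), Reference = factor(1:3))
toy_conf$Freq <- c(30, 2, 5, 1, 4, 3, 6, 0, 20)
head(toy_conf, 3)
```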
grid.arrange(
confmat_plot(data.frame(get_best_model(percep_results)$table), "Perceptron"),
confmat_plot(data.frame(get_best_model(randft_results)$table), "Random Forest"),
confmat_plot(data.frame(get_best_model(svm_results)$table), "SVM"),
confmat_plot(data.frame(get_best_model(soft_results)$table), "Softmax"),
confmat_plot(data.frame(get_best_model(nbc_results)$table), "Naive Bayes"),
confmat_plot(data.frame(get_best_model(dec_tree_results)$table), "Decision Tree"),
confmat_plot(data.frame(get_best_model(adaboost_results)$table), "Adaptive Boost."),
ncol = 3
)
models_results <- list(
percep_results, randft_results, svm_results,
soft_results, nbc_results, dec_tree_results, adaboost_results
)
accuracies <- vector("list", length = length(models_results))
idx_1 <- 1
for (model_results in models_results) {
  model_accuracy <- vector("list", length = length(model_results))
  idx_2 <- 1
  for (result in model_results) {
    model_accuracy[[idx_2]] <- result$overall[1]
    idx_2 <- idx_2 + 1
  }
  accuracies[[idx_1]] <- model_accuracy
  idx_1 <- idx_1 + 1
}
accuracy_model1 <- unlist(accuracies[[1]])
accuracy_model2 <- unlist(accuracies[[2]])
accuracy_model3 <- unlist(accuracies[[3]])
accuracy_model4 <- unlist(accuracies[[4]])
accuracy_model5 <- unlist(accuracies[[5]])
accuracy_model6 <- unlist(accuracies[[6]])
accuracy_model7 <- unlist(accuracies[[7]])
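The accuracy collection above can also be written with nested sapply calls, which return the runs-by-models matrix directly; a sketch on a toy list with the same shape as the per-model result lists:

```r
# Two models with two runs each, mimicking the structure of the caret
# results collected above (only $overall["Accuracy"] is kept).
toy_results <- list(
  list(list(overall = c(Accuracy = 0.65)), list(overall = c(Accuracy = 0.70))),
  list(list(overall = c(Accuracy = 0.50)), list(overall = c(Accuracy = 0.55)))
)
acc <- sapply(toy_results, function(runs)
  sapply(runs, function(r) r$overall[["Accuracy"]])
)
acc  # one column per model, one row per run
```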
accuracy_df <- data.frame(
Model = rep(c("Perceptron", "Random Forest", "SVM", "Softmax", "Naive Bayes", "Decision Tree", "AdaBoost"), each = length(accuracy_model1)),
Accuracy = c(accuracy_model1, accuracy_model2, accuracy_model3, accuracy_model4, accuracy_model5, accuracy_model6, accuracy_model7)
)
ggplot(accuracy_df, aes(x = Model, y = Accuracy, fill = Model)) +
geom_boxplot() +
labs(title = "Grouped Boxplot of Accuracy", y = "Accuracy") +
theme_minimal()
As observed, the accuracy varies significantly between models, as expected, because each model treats the data from a different perspective and is trained on a random subset of the cleaned dataset. It is noteworthy that the Naive Bayes model emerges as the best-performing one, with a slight lead in accuracy. Its accuracy values also remain relatively stable, unlike the Softmax model, which shows significant instability. The Adaptive Boosting and Random Forest models also stand out, with some of the highest accuracies and remarkable consistency.
To conclude, it’s important to emphasize that determining the best model goes beyond the accuracy score alone. The choice depends on the model’s intended purpose, and it is essential to consider whether a more stable model is preferable or if one with higher variability and a better score is more suitable.