Introduction

The dataset includes measurements taken from penguins in the Palmer Archipelago.

Study Problem Statement

The objective of this assignment is to explore how the penguins in the dataset can be grouped using several clustering techniques.

Libraries

For this assignment, the following libraries are used:

library(GGally)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(dendextend)
library(dbscan)
library(factoextra)
library(mclust)
library(cluster)
library(igraph)

Data

This assignment uses the penguin dataset (penguindata.csv), which can be found in the datasets folder.

dataset <- read.csv("../datasets/penguindata.csv")

Data preparation

Let’s examine the dataset to see which variables will be useful for our study. First, let’s convert the character variable sex to a factor so that the summary also shows its NA’s.

temp_dataset <- dataset
temp_dataset$sex <- as.factor(temp_dataset$sex)
summary(temp_dataset)
##        X          bill_length_mm  bill_depth_mm   flipper_length_mm
##  Min.   :  1.00   Min.   :32.10   Min.   :13.10   Min.   :172.0    
##  1st Qu.: 86.75   1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0    
##  Median :172.50   Median :44.45   Median :17.30   Median :197.0    
##  Mean   :172.50   Mean   :43.92   Mean   :17.15   Mean   :200.9    
##  3rd Qu.:258.25   3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0    
##  Max.   :344.00   Max.   :59.60   Max.   :21.50   Max.   :231.0    
##                   NA's   :2       NA's   :2       NA's   :2        
##   body_mass_g       sex           year     
##  Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :4050   NA's  : 11   Median :2008  
##  Mean   :4202                Mean   :2008  
##  3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :6300                Max.   :2009  
##  NA's   :2

As we can see, there are not many missing values. We could simply drop these rows, but in our case let’s instead impute them with something more useful.

Another important step before dealing with the NA’s is checking the type of each variable:

str(dataset)
## 'data.frame':    344 obs. of  7 variables:
##  $ X                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : chr  "male" "female" "female" NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Having taken a look at the data, we copy the dataset to df so that we can work on it while keeping the original unmodified.

df <- dataset

We can remove the X column from the dataset because it is only a row index and carries no information about the penguins themselves.

df <- subset(df, select = -X)

Now let’s clean the data. Since sex is a character variable, we replace its NA’s with its median value (the middle element of the sorted column); for the numeric variables we use the mean, except for year, where we use the median because a fractional year in which the study was taken would not make sense.

# Numeric measurements: impute NA's with the column mean
df <- df %>% mutate(bill_length_mm = ifelse(is.na(bill_length_mm), mean(bill_length_mm, na.rm = TRUE), bill_length_mm))
df <- df %>% mutate(bill_depth_mm = ifelse(is.na(bill_depth_mm), mean(bill_depth_mm, na.rm = TRUE), bill_depth_mm))
df <- df %>% mutate(flipper_length_mm = ifelse(is.na(flipper_length_mm), mean(flipper_length_mm, na.rm = TRUE), flipper_length_mm))
df <- df %>% mutate(body_mass_g = ifelse(is.na(body_mass_g), mean(body_mass_g, na.rm = TRUE), body_mass_g))
# sex (character) and year: impute NA's with the median value
df <- df %>% mutate(sex = ifelse(is.na(sex), median(sex, na.rm = TRUE), sex))
df <- df %>% mutate(year = ifelse(is.na(year), median(year, na.rm = TRUE), year))
summary(df)
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.27   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.25   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300  
##      sex                 year     
##  Length:344         Min.   :2007  
##  Class :character   1st Qu.:2007  
##  Mode  :character   Median :2008  
##                     Mean   :2008  
##                     3rd Qu.:2009  
##                     Max.   :2009

Now let’s convert sex to a numeric code and recode year as small integers so that both can be used in the clustering.

df$sex <- as.numeric(as.factor(df$sex))
df$year <- as.numeric(as.factor(df$year))

summary(df)
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.27   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.25   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300  
##       sex            year      
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.00   1st Qu.:1.000  
##  Median :2.00   Median :2.000  
##  Mean   :1.52   Mean   :2.029  
##  3rd Qu.:2.00   3rd Qu.:3.000  
##  Max.   :2.00   Max.   :3.000
str(df)
## 'data.frame':    344 obs. of  6 variables:
##  $ bill_length_mm   : num  39.1 39.5 40.3 43.9 36.7 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 17.2 19.3 ...
##  $ flipper_length_mm: num  181 186 195 201 193 ...
##  $ body_mass_g      : num  3750 3800 3250 4202 3450 ...
##  $ sex              : num  2 1 1 2 1 2 1 2 2 2 ...
##  $ year             : num  1 1 1 1 1 1 1 1 1 1 ...

Models

Once the data is prepared, it’s ready to be used in the models. In this case we have:

  • Hierarchical:
    • AGNES (Agglomerative Nesting)
  • Partitional:
    • K-Means Clustering
    • Gaussian Mixture Model
    • PAM
  • Density-Based
    • DBSCAN
  • Ordering-Based
    • OPTICS (Ordering Points To Identify the Clustering Structure)

The output of each model is displayed immediately after it is run, as shown in the following sections, and the results are discussed later in the Conclusion.

Before moving on to the models, it is important to make sure all the variables are of the same type, in this case numeric.

penguins <- df
penguins$bill_length_mm <- as.numeric(penguins$bill_length_mm)
penguins$bill_depth_mm <- as.numeric(penguins$bill_depth_mm)
penguins$flipper_length_mm <- as.numeric(penguins$flipper_length_mm)
penguins$body_mass_g <- as.numeric(penguins$body_mass_g)

summary(penguins)
##  bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g  
##  Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700  
##  1st Qu.:39.27   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550  
##  Median :44.25   Median :17.30   Median :197.0     Median :4050  
##  Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202  
##  3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750  
##  Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300  
##       sex            year      
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.00   1st Qu.:1.000  
##  Median :2.00   Median :2.000  
##  Mean   :1.52   Mean   :2.029  
##  3rd Qu.:2.00   3rd Qu.:3.000  
##  Max.   :2.00   Max.   :3.000
str(penguins)
## 'data.frame':    344 obs. of  6 variables:
##  $ bill_length_mm   : num  39.1 39.5 40.3 43.9 36.7 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 17.2 19.3 ...
##  $ flipper_length_mm: num  181 186 195 201 193 ...
##  $ body_mass_g      : num  3750 3800 3250 4202 3450 ...
##  $ sex              : num  2 1 1 2 1 2 1 2 2 2 ...
##  $ year             : num  1 1 1 1 1 1 1 1 1 1 ...

It is also important to scale the features used for clustering so that all variables contribute on a comparable scale.

features <- penguins[, c("sex", "year", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
scaled_features <- scale(features)
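
As a quick sanity check we can verify that the scaling worked as expected: after scale(), every column should have a mean of approximately zero and a standard deviation of one. A minimal sketch:

# Sanity check: column means should be ~0 and standard deviations 1 after scaling
round(colMeans(scaled_features), 3)
apply(scaled_features, 2, sd)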

Optimal number of clusters

We can now run a quick pre-clustering analysis to estimate the optimal number of clusters, in this case using the PAM method with the Manhattan distance metric to examine the Within-Cluster Sum of Squares (WSS) for different values of k.

fviz_nbclust(
  x = penguins, FUNcluster = pam, method = "wss", k.max = 15,
  diss = dist(penguins, method = "manhattan")
)

In this plot the optimal number appears to be 4: beyond 5 the curve stabilizes, so larger values of k no longer produce significantly better groupings. We will therefore use 4 clusters initially in each method.
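
As an additional check, we could also look at the average silhouette width over a similar range of k; a clear maximum would support the choice made from the WSS curve. A small sketch of that cross-check (here run on the scaled features):

# Optional cross-check of the number of clusters using the average silhouette width
fviz_nbclust(scaled_features, FUNcluster = pam, method = "silhouette", k.max = 10)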

Linkages

Now let’s compute the Euclidean and the Manhattan distance matrices and apply three different linkage methods:

  • Complete Linkage: merges the two clusters whose farthest pair of points is closest (maximum pairwise distance between clusters).
  • Single Linkage: merges the two clusters whose closest pair of points is closest (minimum pairwise distance between clusters).
  • Average Linkage: merges the two clusters with the smallest average pairwise distance between their points.

matriz_distancias <- dist(x = penguins, method = "euclidean")

hc_euclidea_completo <- hclust(d = matriz_distancias, method = "complete")
hc_euclidea_single <- hclust(d = matriz_distancias, method = "single")
hc_euclidea_average <- hclust(d = matriz_distancias, method = "average")

grid.arrange(
  plot(
    x = hc_euclidea_completo, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Complete Linkage"
  ),
  plot(
    x = hc_euclidea_single, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Single Linkage"
  ),
  plot(
    x = hc_euclidea_average, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Average Linkage"
  ),
  nrow = 2
)

As can be seen, there are 4 major clusters using the Euclidean distance.

matriz_distancias <- dist(x = penguins, method = "manhattan")

hc_manhattan_completo <- hclust(d = matriz_distancias, method = "complete")
hc_manhattan_single <- hclust(d = matriz_distancias, method = "single")
hc_manhattan_average <- hclust(d = matriz_distancias, method = "average")

grid.arrange(
  plot(
    x = hc_manhattan_completo, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Linkage complete"
  ),
  plot(
    x = hc_manhattan_single, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Linkage single"
  ),
  plot(
    x = hc_manhattan_average, cex = 0.6, xlab = "", ylab = "", sub = "",
    main = "Linkage average"
  ),
  nrow = 2
)

Using the Manhattan distance, we can likewise identify 4 major clusters.
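
One way to judge which linkage represents the data best is the cophenetic correlation, which measures how faithfully each dendrogram preserves the original pairwise distances (values closer to 1 are better). A short sketch using the Manhattan objects computed above:

# Cophenetic correlation between each dendrogram and the Manhattan distance matrix
cor(cophenetic(hc_manhattan_completo), matriz_distancias)
cor(cophenetic(hc_manhattan_single), matriz_distancias)
cor(cophenetic(hc_manhattan_average), matriz_distancias)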

Hierarchical

AGNES Clustering

The idea behind AGNES clustering is to start with each data point as its own cluster and iteratively merge the closest clusters until only one cluster remains.

agnes_result <- agnes(scaled_features)

fviz_dend(agnes_result,
  main = "AGNES Clustering Dendrogram",
  show_labels = FALSE,
  palette = "jama",
  k = 4,
  color_labels_by_k = TRUE,
  ggtheme = theme_classic()
)

Having tried 4 clusters, we can see that this may not be the optimal number for the AGNES method, because one cluster is formed by a single observation. We can try to “remove” that cluster by decreasing the number of clusters; let’s try 3 and 2 to see the difference.

grid.arrange(
  fviz_dend(agnes_result,
    main = "AGNES Clustering Dendrogram",
    show_labels = FALSE,
    palette = "jama",
    k = 3,
    color_labels_by_k = TRUE,
    ggtheme = theme_classic()
  ),
  fviz_dend(agnes_result,
    main = "AGNES Clustering Dendrogram",
    show_labels = FALSE,
    palette = "jama",
    k = 2,
    color_labels_by_k = TRUE,
    ggtheme = theme_classic()
  ),
  nrow = 2
)

As can be seen, with this clustering method the “best” value would be 2. We can validate this hypothesis later when we look at the final results.
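
To make that comparison concrete, the AGNES tree can be cut into flat cluster labels and its agglomerative coefficient inspected; a small sketch:

# Agglomerative coefficient: values close to 1 indicate a strong clustering structure
agnes_result$ac

# Cut the tree into k = 2 flat clusters and look at the group sizes
agnes_k2 <- cutree(as.hclust(agnes_result), k = 2)
table(agnes_k2)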

Partitional

K-Means Clustering

This method divides the dataset into a fixed number of clusters, aiming to minimize the variance within each cluster.

Now we can try to make 3 and 4 clusters to compare them.

k1 <- 3
kmeans_result <- kmeans(scaled_features, centers = k1, nstart = 2)
penguins$cluster <- kmeans_result$cluster

plot(penguins[, c(1:6)], col = kmeans_result$cluster)
title("Pair Plot: Scatter Plots of Numeric Variables by Cluster")

With 3 clusters the result already looks like a good option for this data, but let’s plot the results in other forms to see it more clearly.

fviz_cluster(
  object = kmeans_result, data = penguins, show.clust.cent = TRUE,
  ellipse.type = "convex", star.plot = TRUE, repel = TRUE, geom = "point"
) +
  labs(title = "K-means clustering results") +
  theme_bw() +
  theme(legend.position = "none")

This looks very promising, but maybe 4 clusters would be even better.

hclust_result <- hclust(dist(scaled_features))

dend_colored <- color_branches(hclust_result, k = k1)

labels(dend_colored) <- NULL

plot(dend_colored, cex = 0.6, main = "Hierarchical Clustering Dendrogram")
rect.hclust(hclust_result, k1, border = "black")

The dendrogram lets us see how the data is divided and the structure it follows.

k2 <- 4
kmeans_result2 <- kmeans(scaled_features, centers = k2, nstart = 2)
penguins$cluster <- kmeans_result2$cluster

plot(penguins[, c(1:6)], col = kmeans_result2$cluster)
title("Pair Plot: Scatter Plots of Numeric Variables by Cluster")

With 4 clusters the result seems better, but let’s plot it in other forms to see it more clearly.

fviz_cluster(
  object = kmeans_result2, data = penguins, show.clust.cent = TRUE,
  ellipse.type = "convex", star.plot = FALSE, repel = FALSE,
  geom = "point"
) +
  labs(title = "Resultados clustering K-means") +
  theme_bw() +
  theme(legend.position = "none")

As we can see, the data divides cleanly into 4 clusters.

plot_fviz_cluster <- function(features, dt, cents) {
  plot <- fviz_cluster(
    object = kmeans(features, centers = cents),
    data = dt,
    show.clust.cent = TRUE,
    ellipse.type = "convex",
    star.plot = TRUE,
    repel = TRUE,
    geom = "point",
  ) +
    labs(title = paste("K-means ", cents)) +
    theme_bw() +
    theme(legend.position = "none")

  return(plot)
}
grid.arrange(
  plot_fviz_cluster(scaled_features, penguins, 5),
  plot_fviz_cluster(scaled_features, penguins, 6),
  ncol = 2
)

grid.arrange(
  plot_fviz_cluster(scaled_features, penguins, 7),
  plot_fviz_cluster(scaled_features, penguins, 8),
  ncol = 2
)

Here we have the results for 5, 6, 7, and 8 clusters. We can see that some data points could just as well belong to a neighbouring cluster, which suggests that 4 is the optimal number of clusters for this data.

hclust_result <- hclust(dist(scaled_features))

dend_colored <- color_branches(hclust_result, k = k2)

labels(dend_colored) <- NULL

plot(dend_colored, cex = 0.6, main = "Hierarchical Clustering Dendrogram")

rect.hclust(hclust_result, k2, border = "black")

The dendrogram now gives a clear representation of the 4 clusters.
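
A simple numeric way to compare the two K-means solutions is the proportion of the total variance explained by the between-cluster separation; a quick sketch:

# Fraction of the total sum of squares explained by the clusters (higher is better)
kmeans_result$betweenss / kmeans_result$totss   # k = 3
kmeans_result2$betweenss / kmeans_result2$totss # k = 4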

Gaussian Mixture Model Clustering

A Gaussian Mixture Model (GMM) is a statistical model that represents a mixture of several Gaussian distributions (also known as normal distributions). The model assumes that the data is generated by the combination of various Gaussian distributions, and each component of the mixture represents one of the clusters in the data. GMM is a probabilistic modeling technique and is commonly used for clustering tasks.

gmm_result <- Mclust(scaled_features)
plot(gmm_result, what = "classification", main = "GMM Clustering")

As can be observed, in this case there are three clusters, compared with the four found by other methods. Moreover, in some panels certain clusters lie close enough together that they could be merged into one.
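
Mclust chooses the number of mixture components and the covariance model by BIC, so the selected solution can be inspected directly; a short sketch:

# Model chosen by BIC: number of components (G), covariance model and the BIC curve
summary(gmm_result)
gmm_result$G
plot(gmm_result, what = "BIC")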

PAM Clustering

PAM (Partitioning Around Medoids) is a clustering algorithm that belongs to the category of partitioning methods. Similar to the k-means algorithm, the goal of PAM is to divide a dataset into groups (clusters) in such a way that elements within a cluster are more similar to each other than to elements in other clusters.

The following plots show the PAM clustering with three and four clusters.

pam_result <- pam(dist(scaled_features, method = "euclidean"), 3, keep.diss = TRUE)
clusplot(pam_result)

pam_result <- pam(dist(scaled_features, method = "euclidean"), 4, keep.diss = TRUE)
clusplot(pam_result)

As can be seen in the plots, the division of the data is better with 4 clusters. In the PAM clustering with k = 3 it is clear that one of the clusters could be split into two.
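
The two PAM solutions can also be compared numerically through the average silhouette width stored in each pam object (higher is better); a small sketch that re-fits both models as above:

# Average silhouette width for PAM with k = 3 and k = 4
d <- dist(scaled_features, method = "euclidean")
pam(d, 3)$silinfo$avg.width
pam(d, 4)$silinfo$avg.width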

Density-Based

DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that focuses on the density of points in a feature space to group data. Unlike partitioning methods such as k-means, DBSCAN does not require specifying the number of clusters beforehand and can discover clusters of arbitrary shapes and sizes. Additionally, DBSCAN has the capability to identify outliers or noise.
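
The eps parameter used below was set by hand; a common way to guide that choice is the k-nearest-neighbour distance plot from the dbscan package, where the “knee” of the sorted distances suggests a reasonable eps. A sketch of that check:

# Sorted 4-NN distances: the bend ("knee") of the curve suggests a value for eps
kNNdistplot(scaled_features, k = 4)
abline(h = 1.5, lty = 2)  # the eps value used below, drawn for reference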

# eps = neighbourhood radius, minPts = minimum points for a core point
dbscan_result <- dbscan(scaled_features,
  eps = 1.5,
  minPts = 4
)
fviz_cluster(
  object = dbscan_result, data = scaled_features, stand = FALSE,
  geom = "point", ellipse = TRUE, show.clust.cent = TRUE,
  pallete = "jco",
  main = "DBSCAN Clustering"
) +
  theme_bw() +
  theme(legend.position = "bottom")

This clustering algorithm clearly separates the data into four clusters. As explained in the DBSCAN description, this algorithm does not require specifying the number of clusters beforehand; it discovers the clusters on its own. Also, as can be seen in the plot, there are two outliers represented by the black dots.
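
To back up that description, the cluster sizes and the number of noise points can be tabulated from the DBSCAN result; a one-line sketch:

# Cluster sizes; cluster 0 contains the points DBSCAN labels as noise/outliers
table(dbscan_result$cluster)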

Ordering-Based

OPTICS Clustering

OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering algorithm used to identify cluster structures in a dataset without the need to specify the number of clusters beforehand. OPTICS belongs to the category of density-based clustering methods and is particularly useful for datasets where clusters have variable shapes and sizes.

optics_result <- optics(scaled_features, minPts = 4)
plot(optics_result, main = "OPTICS Clustering")

In this case, the clusters obtained with this algorithm cannot be identified very clearly, since the valleys in the reachability plot are quite varied and widely scattered. Nevertheless, four peaks stand out in comparison, indicating the presence of four clusters in the data.
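
If flat cluster labels are needed, they can be extracted from the OPTICS reachability ordering, either with a DBSCAN-style threshold or with the xi method; a sketch, where the eps_cl and xi values are only illustrative choices:

# DBSCAN-like extraction at a fixed reachability threshold (reusing eps = 1.5 from above)
optics_db <- extractDBSCAN(optics_result, eps_cl = 1.5)
plot(optics_db)  # reachability plot coloured by the extracted clusters
table(optics_db$cluster)

# Xi method: finds clusters of varying density (xi = 0.05 is an illustrative value)
optics_xi <- extractXi(optics_result, xi = 0.05)
table(optics_xi$cluster)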

Conclusion

We can conclude that in this case, the optimal grouping for the entire dataset is four clusters. This implies that, based on the results obtained with the clustering methods employed (K-means, DBSCAN, OPTICS, among others), the dataset is organized most coherently into four distinct groups. Each cluster represents a collection of points sharing similar characteristics, and the identification of these groups can be valuable for understanding the underlying structure of the data.