Assignment 5 - Data mining for mushroom dataset

Introduction

The dataset contains 8124 descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. In the data this last class is combined with the poisonous one, so the type attribute takes only the values edible and poisonous.

Study problem statement

The objective in this assignment is to mine the data and find association rules that can be used to identify the edibility of a mushroom.

Questions to answer

What are the characteristics of edible mushrooms?
What are the characteristics of poisonous ones?
Are there any redundant rules? Can we remove them?

Libraries

For this assignment, the following libraries are used:

library(GGally)
library(ggplot2)
library(dplyr)
library(arules)
library(arulesViz)
library(gridExtra)

Data

This assignment uses the mushrooms dataset, read from the file mushrooms.csv in the datasets folder.

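# Read the categorical columns as factors so that each column=value pair maps to one item
dataset <- read.csv("../datasets/mushrooms.csv", stringsAsFactors = TRUE)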

Data preparation

To work with the “arules” library, the data has to be in transaction format, so we convert the data frame into an object of class transactions.

df <- as(dataset, "transactions")
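
To illustrate what this coercion does conceptually, the following base R sketch turns each column=value pair of a small data frame into an item and builds the binary incidence matrix that a transactions object stores in sparse form. The toy data frame and the helper to_items are hypothetical and only mimic the encoding; arules performs the real conversion.

```r
# Hypothetical two-row excerpt with two of the 23 columns
toy <- data.frame(
  type = c("poisonous", "edible"),
  odor = c("pungent", "almond"),
  stringsAsFactors = TRUE
)

# Illustrative helper (not part of arules): one "column=value" label per cell
to_items <- function(row) {
  paste0(names(row), "=", vapply(row, as.character, character(1)))
}

items_per_row <- lapply(
  seq_len(nrow(toy)),
  function(i) to_items(toy[i, , drop = FALSE])
)

# Binary incidence matrix: one row per transaction, one column per item
all_items <- sort(unique(unlist(items_per_row)))
incidence <- t(vapply(
  items_per_row,
  function(x) all_items %in% x,
  logical(length(all_items))
))
colnames(incidence) <- all_items

items_per_row[[1]]
rowSums(incidence)  # every transaction holds exactly one item per column
```

Because every column contributes exactly one item, all transactions have the same length, which is why the real dataset has 23 items in each of its transactions.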

Now we can check the structure of each transaction.

inspect(df[1:2, ])

##     items                              transactionID
## [1] {type=poisonous,                                
##      cap_shape=convex,                              
##      cap_surface=smooth,                            
##      cap_color=brown,                               
##      bruises=yes,                                   
##      odor=pungent,                                  
##      gill_attachment=free,                          
##      gill_spacing=close,                            
##      gill_size=narrow,                              
##      gill_color=black,                              
##      stalk_shape=enlarging,                         
##      stalk_root=equal,                              
##      stalk_surface_above_ring=smooth,               
##      stalk_surface_below_ring=smooth,               
##      stalk_color_above_ring=white,                  
##      stalk_color_below_ring=white,                  
##      veil_type=partial,                             
##      veil_color=white,                              
##      ring_number=one,                               
##      ring_type=pendant,                             
##      spore_print_color=black,                       
##      population=scattered,                          
##      habitat=urban}                                1
## [2] {type=edible,                                   
##      cap_shape=convex,                              
##      cap_surface=smooth,                            
##      cap_color=yellow,                              
##      bruises=yes,                                   
##      odor=almond,                                   
##      gill_attachment=free,                          
##      gill_spacing=close,                            
##      gill_size=broad,                               
##      gill_color=black,                              
##      stalk_shape=enlarging,                         
##      stalk_root=club,                               
##      stalk_surface_above_ring=smooth,               
##      stalk_surface_below_ring=smooth,               
##      stalk_color_above_ring=white,                  
##      stalk_color_below_ring=white,                  
##      veil_type=partial,                             
##      veil_color=white,                              
##      ring_number=one,                               
##      ring_type=pendant,                             
##      spore_print_color=brown,                       
##      population=numerous,                           
##      habitat=grasses}                              2

The summary is as follows.

summary(df)

## transactions as itemMatrix in sparse format with
##  8124 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.1932773 
## 
## most frequent items:
##    veil_type=partial     veil_color=white gill_attachment=free 
##                 8124                 7924                 7914 
##      ring_number=one   gill_spacing=close              (Other) 
##                 7488                 6812               148590 
## 
## element (itemset/transaction) length distribution:
## sizes
##   23 
## 8124 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      23      23      23      23      23      23 
## 
## includes extended item information - examples:
##           labels variables    levels
## 1    type=edible      type    edible
## 2 type=poisonous      type poisonous
## 3 cap_shape=bell cap_shape      bell
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

Let’s make sure it’s in the correct format.

str(df)

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:186852] 1 4 11 12 23 31 34 35 38 39 ...
##   .. .. ..@ p       : int [1:8125] 0 23 46 69 92 115 138 161 184 207 ...
##   .. .. ..@ Dim     : int [1:2] 119 8124
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  119 obs. of  3 variables:
##   .. ..$ labels   : chr [1:119] "type=edible" "type=poisonous" "cap_shape=bell" "cap_shape=conical" ...
##   .. ..$ variables: Factor w/ 23 levels "bruises","cap_color",..: 21 21 3 3 3 3 3 3 4 4 ...
##   .. ..$ levels   : Factor w/ 70 levels "abundant","almond",..: 20 50 5 16 17 27 34 62 24 33 ...
##   ..@ itemsetInfo:'data.frame':  8124 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:8124] "1" "2" "3" "4" ...
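
The figures reported by summary and str can be cross-checked with a little arithmetic. The sketch below uses only numbers printed above: since every transaction has exactly 23 items out of 119 possible ones, the density is 23/119, and the number of stored entries is the number of transactions times 23.

```r
n_transactions <- 8124
n_items <- 119
items_per_transaction <- 23

density <- items_per_transaction / n_items
round(density, 7)  # 0.1932773, as reported by summary(df)

stored_entries <- n_transactions * items_per_transaction
stored_entries  # 186852, the length of the @i slot shown by str(df)

# The item frequencies listed in the summary add up to the same total
item_counts <- c(8124, 7924, 7914, 7488, 6812, 148590)
sum(item_counts) == stored_entries
```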

In this plot we can see the distribution of items among a random sample of 100 transactions.

image(sample(df, 100))

Apriori

Apriori was one of the first algorithms developed for the discovery of association rules and continues to be one of the most widely used. It has two stages:

Identify all itemsets that occur with a frequency above a certain threshold (frequent itemsets).
Convert those frequent itemsets into association rules.
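
The two stages can be sketched in base R on a handful of toy transactions. This is a simplified illustration rather than the arules implementation: the transactions, the thresholds and the helper support are assumptions made for the example, and the candidate generation is deliberately naive.

```r
# Hypothetical toy transactions in the same "column=value" style
transactions <- list(
  c("odor=none", "bruises=yes", "type=edible"),
  c("odor=none", "bruises=no",  "type=edible"),
  c("odor=foul", "bruises=no",  "type=poisonous"),
  c("odor=none", "bruises=yes", "type=edible"),
  c("odor=foul", "bruises=no",  "type=poisonous")
)
min_support <- 0.3     # assumed thresholds for the toy data
min_confidence <- 0.9

# Fraction of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Stage 1: level-wise search for frequent itemsets.
# An itemset can only be frequent if all of its subsets are, so each level
# is built by extending the frequent itemsets of the previous level.
items <- sort(unique(unlist(transactions)))
singles <- Filter(function(i) support(i) >= min_support, items)
frequent <- list()
level <- as.list(singles)
while (length(level) > 0) {
  frequent <- c(frequent, level)
  candidates <- list()
  for (s in level) {
    # Extend only with items that sort after the last one to avoid duplicates
    for (i in singles[singles > s[length(s)]]) {
      candidates <- c(candidates, list(c(s, i)))
    }
  }
  level <- Filter(function(s) support(s) >= min_support, candidates)
}

# Stage 2: turn each frequent itemset into rules with a single-item consequent
rules <- data.frame()
for (s in frequent) {
  if (length(s) < 2) next
  for (rhs in s) {
    lhs <- setdiff(s, rhs)
    confidence <- support(s) / support(lhs)
    if (confidence >= min_confidence) {
      rules <- rbind(rules, data.frame(
        lhs = paste(lhs, collapse = ", "),
        rhs = rhs,
        support = support(s),
        confidence = confidence
      ))
    }
  }
}

length(frequent)
rules
```

Even on five transactions the second stage produces more rules than there are transactions, which hints at why the unrestricted run on the full dataset yields so many.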

Testing the algorithm

rules <- apriori(df)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 812 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [56 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [3.78s].
## writing ... [3315185 rule(s)] done [0.34s].
## creating S4 object  ... done [1.17s].

rules

## set of 3315185 rules

As we can see, applying the algorithm with its default parameters (support 0.1, confidence 0.8) produces about 3.3 million rules, far too many to interpret, so we need to reduce them.

Before trying to get better results we can take a look at the summary of these rules:

summary(rules)

## set of 3315185 rules
## 
## rule length distribution (lhs + rhs):sizes
##      1      2      3      4      5      6      7      8      9     10 
##      5    354   5580  35965 132340 325154 579505 781023 809297 645962 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   7.000   8.000   8.091   9.000  10.000 
## 
## summary of quality measures:
##     support         confidence      coverage           lift       
##  Min.   :0.1002   Min.   :0.80   Min.   :0.1002   Min.   :0.8266  
##  1st Qu.:0.1064   1st Qu.:1.00   1st Qu.:0.1064   1st Qu.:1.0849  
##  Median :0.1064   Median :1.00   Median :0.1064   Median :1.7110  
##  Mean   :0.1238   Mean   :0.99   Mean   :0.1253   Mean   :1.8713  
##  3rd Qu.:0.1123   3rd Qu.:1.00   3rd Qu.:0.1182   3rd Qu.:2.1815  
##  Max.   :1.0000   Max.   :1.00   Max.   :1.0000   Max.   :6.8718  
##      count     
##  Min.   : 814  
##  1st Qu.: 864  
##  Median : 864  
##  Mean   :1006  
##  3rd Qu.: 912  
##  Max.   :8124  
## 
## mining info:
##  data ntransactions support confidence               call
##    df          8124     0.1        0.8 apriori(data = df)

Getting results

The function run_apriori_with_loop below tries different combinations of support, confidence and minimum length values. Only rules whose consequent is the mushroom type are mined. For every combination, the function returns one row of a data frame with the number of rules before removing the redundant ones (Redundant_Rules) and after removing them (Non_Redundant_Rules).

run_apriori_with_loop <- function(df, supp_values, conf_values, minlen_values) {
  result_df <- data.frame()

  for (supp_val in supp_values) {
    for (conf_val in conf_values) {
      for (minlen_val in minlen_values) {
        rules <- apriori(
          df,
          parameter = list(
            supp = supp_val,
            conf = conf_val,
            minlen = minlen_val
          ),
          appearance = list(rhs = c("type=edible", "type=poisonous")),
          control = list(verbose = FALSE)
        )
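        # All rules found, before removing the redundant ones
        red_rules <- rules
        # Drop rules for which a more general rule has the same or higher confidence
        not_red_rules <- rules[!is.redundant(rules)]
        rules_conf <- sort(not_red_rules, by = "lift", decreasing = TRUE)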
        result_df <- rbind(
          result_df,
          data.frame(
            Support = supp_val,
            Confidence = conf_val,
            MinLength = minlen_val,
            Redundant_Rules = length(red_rules),
            Non_Redundant_Rules = length(rules_conf)
          )
        )
      }
    }
  }

  return(result_df)
}

supp_values <- c(0.3, 0.4, 0.5)
conf_values <- c(0.6, 0.7, 0.8, 0.9)
minlen_values <- c(1, 2, 3)

result_rules <- run_apriori_with_loop(
  df,
  supp_values,
  conf_values,
  minlen_values
)
result_rules

##    Support Confidence MinLength Redundant_Rules Non_Redundant_Rules
## 1      0.3        0.6         1             786                  73
## 2      0.3        0.6         2             786                  73
## 3      0.3        0.6         3             776                  98
## 4      0.3        0.7         1             686                  68
## 5      0.3        0.7         2             686                  68
## 6      0.3        0.7         3             681                  77
## 7      0.3        0.8         1             576                  53
## 8      0.3        0.8         2             576                  53
## 9      0.3        0.8         3             574                  58
## 10     0.3        0.9         1             312                  29
## 11     0.3        0.9         2             312                  29
## 12     0.3        0.9         3             311                  30
## 13     0.4        0.6         1              38                  10
## 14     0.4        0.6         2              38                  10
## 15     0.4        0.6         3              33                  16
## 16     0.4        0.7         1              16                   7
## 17     0.4        0.7         2              16                   7
## 18     0.4        0.7         3              14                   7
## 19     0.4        0.8         1               4                   2
## 20     0.4        0.8         2               4                   2
## 21     0.4        0.8         3               3                   2
## 22     0.4        0.9         1               4                   2
## 23     0.4        0.9         2               4                   2
## 24     0.4        0.9         3               3                   2
## 25     0.5        0.6         1               0                   0
## 26     0.5        0.6         2               0                   0
## 27     0.5        0.6         3               0                   0
## 28     0.5        0.7         1               0                   0
## 29     0.5        0.7         2               0                   0
## 30     0.5        0.7         3               0                   0
## 31     0.5        0.8         1               0                   0
## 32     0.5        0.8         2               0                   0
## 33     0.5        0.8         3               0                   0
## 34     0.5        0.9         1               0                   0
## 35     0.5        0.9         2               0                   0
## 36     0.5        0.9         3               0                   0
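
As an aside, the same grid of 36 parameter combinations can be enumerated without nested loops. The sketch below uses base R's expand.grid and is only an alternative way to list the combinations; it does not run the mining itself.

```r
supp_values <- c(0.3, 0.4, 0.5)
conf_values <- c(0.6, 0.7, 0.8, 0.9)
minlen_values <- c(1, 2, 3)

# expand.grid varies its first argument fastest, so MinLength goes first
# to reproduce the row order of the table above
grid <- expand.grid(
  MinLength = minlen_values,
  Confidence = conf_values,
  Support = supp_values
)
grid <- grid[, c("Support", "Confidence", "MinLength")]

nrow(grid)  # 36
head(grid, 4)
```

Each row of grid could then be passed to apriori in a single loop, which keeps the bookkeeping of parameter values separate from the mining code.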

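As the results above show, removing the redundant rules reduces the number of rules drastically. We choose the configuration with support 0.4 and confidence 0.7, which leaves 7 non-redundant rules: the 10 rules obtained with confidence 0.6 seem too many to keep in mind when deciding whether a mushroom is poisonous, and the 2 rules obtained with confidence 0.8 or 0.9 are too few. Note that a minimum length of 3 can increase the non-redundant count (for example from 73 to 98), most likely because the shorter rules that made longer ones redundant are no longer generated.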

Let’s take a closer look at the association rules obtained with a support of 0.4, a confidence of 0.7 and a minimum length of 2.

supp_val <- 0.4
conf_val <- 0.7
minlen_val <- 2

rules_2 <- apriori(
  df,
  parameter = list(supp = supp_val, conf = conf_val, minlen = minlen_val)
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.4      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 3249 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.00s].
## writing ... [1454 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules_2)

## set of 1454 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6   7 
## 111 396 538 321  81   7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.922   5.000   7.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.4008   Min.   :0.7032   Min.   :0.4008   Min.   :0.8751  
##  1st Qu.:0.4185   1st Qu.:0.9195   1st Qu.:0.4343   1st Qu.:1.0000  
##  Median :0.4530   Median :0.9765   Median :0.4797   Median :1.0252  
##  Mean   :0.4796   Mean   :0.9470   Mean   :0.5087   Mean   :1.0789  
##  3rd Qu.:0.4914   3rd Qu.:1.0000   3rd Qu.:0.5495   3rd Qu.:1.0541  
##  Max.   :0.9754   Max.   :1.0000   Max.   :1.0000   Max.   :1.8649  
##      count     
##  Min.   :3256  
##  1st Qu.:3400  
##  Median :3680  
##  Mean   :3896  
##  3rd Qu.:3992  
##  Max.   :7924  
## 
## mining info:
##  data ntransactions support confidence
##    df          8124     0.4        0.7
##                                                                                         call
##  apriori(data = df, parameter = list(supp = supp_val, conf = conf_val, minlen = minlen_val))

inspect(rules_2[1:6])

##     lhs                        rhs                    support   confidence
## [1] {bruises=yes}           => {gill_spacing=close}   0.4027573 0.9691943 
## [2] {bruises=yes}           => {gill_attachment=free} 0.4155588 1.0000000 
## [3] {bruises=yes}           => {veil_color=white}     0.4155588 1.0000000 
## [4] {bruises=yes}           => {veil_type=partial}    0.4155588 1.0000000 
## [5] {stalk_shape=enlarging} => {gill_attachment=free} 0.4069424 0.9402730 
## [6] {stalk_shape=enlarging} => {veil_color=white}     0.4081733 0.9431172 
##     coverage  lift      count
## [1] 0.4155588 1.1558624 3272 
## [2] 0.4155588 1.0265353 3376 
## [3] 0.4155588 1.0252398 3376 
## [4] 0.4155588 1.0000000 3376 
## [5] 0.4327917 0.9652234 3306 
## [6] 0.4327917 0.9669212 3316

rules_3 <- apriori(
  df,
  parameter = list(supp = supp_val, conf = conf_val, minlen = minlen_val),
  appearance = list(rhs = c("type=edible", "type=poisonous"))
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.4      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 3249 
## 
## set item appearances ...[2 item(s)] done [0.00s].
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.00s].
## writing ... [16 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules3_conf <- sort(rules_3, by = "lift", decreasing = TRUE)
inspect(rules3_conf)

##      lhs                                   rhs                support confidence  coverage     lift count
## [1]  {odor=none}                        => {type=edible}    0.4194978  0.9659864 0.4342688 1.864941  3408
## [2]  {odor=none,                                                                                         
##       veil_type=partial}                => {type=edible}    0.4194978  0.9659864 0.4342688 1.864941  3408
## [3]  {gill_size=broad,                                                                                   
##       stalk_surface_above_ring=smooth}  => {type=edible}    0.4155588  0.9398664 0.4421467 1.814514  3376
## [4]  {gill_size=broad,                                                                                   
##       stalk_surface_above_ring=smooth,                                                                   
##       veil_type=partial}                => {type=edible}    0.4155588  0.9398664 0.4421467 1.814514  3376
## [5]  {bruises=no,                                                                                        
##       gill_attachment=free,                                                                              
##       ring_number=one}                  => {type=poisonous} 0.4007878  0.7722960 0.5189562 1.602179  3256
## [6]  {bruises=no,                                                                                        
##       gill_attachment=free,                                                                              
##       veil_type=partial,                                                                                 
##       ring_number=one}                  => {type=poisonous} 0.4007878  0.7722960 0.5189562 1.602179  3256
## [7]  {bruises=no,                                                                                        
##       ring_number=one}                  => {type=poisonous} 0.4007878  0.7386570 0.5425899 1.532393  3256
## [8]  {bruises=no,                                                                                        
##       veil_type=partial,                                                                                 
##       ring_number=one}                  => {type=poisonous} 0.4007878  0.7386570 0.5425899 1.532393  3256
## [9]  {bruises=no,                                                                                        
##       veil_color=white}                 => {type=poisonous} 0.4042344  0.7220756 0.5598227 1.497993  3284
## [10] {bruises=no,                                                                                        
##       veil_type=partial,                                                                                 
##       veil_color=white}                 => {type=poisonous} 0.4042344  0.7220756 0.5598227 1.497993  3284
## [11] {bruises=no,                                                                                        
##       gill_attachment=free}             => {type=poisonous} 0.4030034  0.7214632 0.5585918 1.496723  3274
## [12] {bruises=no,                                                                                        
##       gill_attachment=free,                                                                              
##       veil_type=partial}                => {type=poisonous} 0.4030034  0.7214632 0.5585918 1.496723  3274
## [13] {bruises=no,                                                                                        
##       gill_attachment=free,                                                                              
##       veil_color=white}                 => {type=poisonous} 0.4020187  0.7209713 0.5576071 1.495702  3266
## [14] {bruises=no,                                                                                        
##       gill_attachment=free,                                                                              
##       veil_type=partial,                                                                                 
##       veil_color=white}                 => {type=poisonous} 0.4020187  0.7209713 0.5576071 1.495702  3266
## [15] {stalk_surface_above_ring=smooth}  => {type=edible}    0.4480551  0.7032457 0.6371246 1.357692  3640
## [16] {stalk_surface_above_ring=smooth,                                                                   
##       veil_type=partial}                => {type=edible}    0.4480551  0.7032457 0.6371246 1.357692  3640
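
The quality measures in this table are related to each other by simple ratios. As a worked check, the sketch below recomputes them for the first rule, {odor=none} => {type=edible}, using only values printed above and the usual definitions of support, confidence and lift.

```r
n_transactions <- 8124

# Values copied from the first rule, {odor=none} => {type=edible}
count    <- 3408
coverage <- 0.4342688
lift     <- 1.864941

support <- count / n_transactions  # share of transactions matching the whole rule
confidence <- support / coverage   # coverage is the support of the antecedent
rhs_support <- confidence / lift   # lift = confidence / support of the consequent

round(support, 7)      # 0.4194978
round(confidence, 3)   # 0.966
round(rhs_support, 3)  # 0.518
```

The last value is the share of edible samples implied by the table: a lift of 1.86 means that a mushroom without odor is edible about 1.86 times as often as a mushroom picked at random from the dataset.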

pruned_rules <- rules_3[!is.redundant(rules_3)]
summary(pruned_rules)

## set of 7 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 4 
## 2 4 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.500   3.000   2.857   3.000   4.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.4008   Min.   :0.7032   Min.   :0.4343   Min.   :1.358  
##  1st Qu.:0.4019   1st Qu.:0.7218   1st Qu.:0.4806   1st Qu.:1.497  
##  Median :0.4042   Median :0.7387   Median :0.5426   Median :1.532  
##  Mean   :0.4131   Mean   :0.7948   Mean   :0.5276   Mean   :1.595  
##  3rd Qu.:0.4175   3rd Qu.:0.8561   3rd Qu.:0.5592   3rd Qu.:1.708  
##  Max.   :0.4481   Max.   :0.9660   Max.   :0.6371   Max.   :1.865  
##      count     
##  Min.   :3256  
##  1st Qu.:3265  
##  Median :3284  
##  Mean   :3356  
##  3rd Qu.:3392  
##  Max.   :3640  
## 
## mining info:
##  data ntransactions support confidence
##    df          8124     0.4        0.7
##                                                                                                                                                      call
##  apriori(data = df, parameter = list(supp = supp_val, conf = conf_val, minlen = minlen_val), appearance = list(rhs = c("type=edible", "type=poisonous")))

prun_rules3_conf <- sort(pruned_rules, by = "lift", decreasing = TRUE)
inspect(prun_rules3_conf)

##     lhs                                  rhs                support confidence  coverage     lift count
## [1] {odor=none}                       => {type=edible}    0.4194978  0.9659864 0.4342688 1.864941  3408
## [2] {gill_size=broad,                                                                                  
##      stalk_surface_above_ring=smooth} => {type=edible}    0.4155588  0.9398664 0.4421467 1.814514  3376
## [3] {bruises=no,                                                                                       
##      gill_attachment=free,                                                                             
##      ring_number=one}                 => {type=poisonous} 0.4007878  0.7722960 0.5189562 1.602179  3256
## [4] {bruises=no,                                                                                       
##      ring_number=one}                 => {type=poisonous} 0.4007878  0.7386570 0.5425899 1.532393  3256
## [5] {bruises=no,                                                                                       
##      veil_color=white}                => {type=poisonous} 0.4042344  0.7220756 0.5598227 1.497993  3284
## [6] {bruises=no,                                                                                       
##      gill_attachment=free}            => {type=poisonous} 0.4030034  0.7214632 0.5585918 1.496723  3274
## [7] {stalk_surface_above_ring=smooth} => {type=edible}    0.4480551  0.7032457 0.6371246 1.357692  3640
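
To make the pruning criterion concrete, the following base R sketch applies a simplified version of it to six of the sixteen rules listed earlier: a rule is flagged as redundant when a more general rule (same consequent, antecedent a proper subset) has the same or a higher confidence. The function is_redundant_sketch is hypothetical and only approximates what is.redundant does with its default settings.

```r
# Six rules copied from the sorted Apriori output (confidence rounded as printed)
rules <- list(
  list(lhs = c("odor=none"),
       rhs = "type=edible", conf = 0.9659864),
  list(lhs = c("odor=none", "veil_type=partial"),
       rhs = "type=edible", conf = 0.9659864),
  list(lhs = c("bruises=no", "gill_attachment=free", "ring_number=one"),
       rhs = "type=poisonous", conf = 0.7722960),
  list(lhs = c("bruises=no", "gill_attachment=free", "veil_type=partial",
               "ring_number=one"),
       rhs = "type=poisonous", conf = 0.7722960),
  list(lhs = c("bruises=no", "ring_number=one"),
       rhs = "type=poisonous", conf = 0.7386570),
  list(lhs = c("bruises=no", "veil_type=partial", "ring_number=one"),
       rhs = "type=poisonous", conf = 0.7386570)
)

# Hypothetical helper: TRUE when some more general rule is at least as confident
is_redundant_sketch <- function(rules) {
  vapply(seq_along(rules), function(i) {
    r <- rules[[i]]
    any(vapply(rules[-i], function(g) {
      g$rhs == r$rhs &&
        length(g$lhs) < length(r$lhs) &&
        all(g$lhs %in% r$lhs) &&
        g$conf >= r$conf
    }, logical(1)))
  }, logical(1))
}

redundant <- is_redundant_sketch(rules)
redundant  # FALSE TRUE FALSE TRUE FALSE TRUE
```

The three flagged rules only add veil_type=partial to another rule. Since that item occurs in all 8124 transactions, it cannot change the confidence, which explains why such rules disappear from the pruned set.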

In the following plots, we can observe the results obtained, both redundant and non-redundant, presented in scatter plots, graphs, and clusters:

grid.arrange(
  plot(rules_2),
  plot(rules_2, method = "two-key plot"),
  ncol = 2
)

The following plots represent items and rules as vertices connected by directed edges. This representation focuses on how the rules are composed of individual items and shows which rules share items.

plot(rules_3, method = "graph")

plot(pruned_rules, method = "graph")

To visualize the grouped matrix we use a balloon plot with antecedent groups as columns and consequents as rows. The color of each balloon represents the lift and its size shows the support. The columns and rows are reordered so that the most interesting group is placed in the top left corner (lift decreases from left to right).

plot(pruned_rules, method = "grouped")

Eclat

Eclat is an algorithm for mining frequent itemsets from transactional datasets. Its name stands for “Equivalence Class Clustering and bottom-up Lattice Traversal”. Instead of scanning the transactions for each candidate, it stores for every item the list of transaction IDs that contain it (the tidLists requested below) and obtains the support of larger itemsets by intersecting those lists. Unlike Apriori, it returns only frequent itemsets, so the rules have to be induced from them in a separate step.
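
The idea of intersecting transaction ID lists can be sketched in a few lines of base R. This is an illustration on hypothetical toy transactions with an assumed minimum count of 2, not the arules implementation; the function eclat_sketch is made up for the example and extends prefixes more naively than the real algorithm.

```r
# Hypothetical toy transactions in the same "column=value" style
transactions <- list(
  c("odor=none", "bruises=yes", "type=edible"),
  c("odor=none", "bruises=no",  "type=edible"),
  c("odor=foul", "bruises=no",  "type=poisonous"),
  c("odor=none", "bruises=yes", "type=edible"),
  c("odor=foul", "bruises=no",  "type=poisonous")
)
min_count <- 2  # assumed absolute support threshold for the toy data

# Vertical layout: for each item, the IDs of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(
  setNames(items, items),
  function(i) which(vapply(transactions, function(t) i %in% t, logical(1)))
)

# Depth-first search: extend a prefix one item at a time and obtain the
# transaction IDs of the longer itemset by intersecting tid-lists
eclat_sketch <- function(prefix, prefix_tids, candidates) {
  found <- list()
  for (k in seq_along(candidates)) {
    item <- candidates[k]
    tids <- intersect(prefix_tids, tid_lists[[item]])
    if (length(tids) >= min_count) {
      itemset <- c(prefix, item)
      found <- c(found, list(list(items = itemset, count = length(tids))))
      # Only items after the current one are tried, so no itemset repeats
      found <- c(found, eclat_sketch(itemset, tids, candidates[-seq_len(k)]))
    }
  }
  found
}

all_tids <- seq_along(transactions)
itemsets <- eclat_sketch(character(0), all_tids, items)
length(itemsets)
```

The support of an itemset is simply the length of its intersected ID list divided by the number of transactions, so no pass over the original data is needed once the lists are built.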

Getting results

run_rule_induction <- function(df, support_values, min_confidence_values) {
  result_df <- data.frame()

  for (min_support in support_values) {
    # Frequent itemsets depend only on the support, so mine them once per value
    eclad_rules <- eclat(df,
      parameter = list(support = min_support, tidLists = TRUE),
      control = list(verbose = FALSE)
    )

    for (min_confidence in min_confidence_values) {
      # Induce rules from the frequent itemsets, then drop the redundant ones
      rulesf <- ruleInduction(eclad_rules, confidence = min_confidence)
      red_rules <- rulesf
      rulesf <- rulesf[!is.redundant(rulesf)]

      filtered_rules <- subset(
        rulesf,
        rhs %in% c("type=edible", "type=poisonous")
      )

      result_df <- rbind(
        result_df,
        data.frame(
          Support = min_support,
          Confidence = min_confidence,
          Redundant_Rules = length(red_rules),
          Non_Redundant_Rules = length(filtered_rules)
        )
      )
    }
  }

  return(result_df)
}

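Using the function run_rule_induction we can obtain the results for every combination of the following support and confidence values. Note that Redundant_Rules counts all induced rules, whatever their consequent, while Non_Redundant_Rules counts only the non-redundant rules whose consequent is the mushroom type: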

Support: 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Confidence: 0.5, 0.6, 0.7, 0.8, 0.9

support_values <- c(0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
min_confidence <- c(0.5, 0.6, 0.7, 0.8, 0.9)

results <- run_rule_induction(df, support_values, min_confidence)
print(results)

##    Support Confidence Redundant_Rules Non_Redundant_Rules
## 1      0.4        0.5            1810                  23
## 2      0.4        0.6            1625                  10
## 3      0.4        0.7            1454                   7
## 4      0.4        0.8            1370                   2
## 5      0.4        0.9            1137                   2
## 6      0.5        0.5             430                   1
## 7      0.5        0.6             383                   0
## 8      0.5        0.7             318                   0
## 9      0.5        0.8             318                   0
## 10     0.5        0.9             269                   0
## 11     0.6        0.5             120                   0
## 12     0.6        0.6             120                   0
## 13     0.6        0.7             103                   0
## 14     0.6        0.8             103                   0
## 15     0.6        0.9              86                   0
## 16     0.7        0.5              75                   0
## 17     0.7        0.6              75                   0
## 18     0.7        0.7              75                   0
## 19     0.7        0.8              75                   0
## 20     0.7        0.9              60                   0
## 21     0.8        0.5              47                   0
## 22     0.8        0.6              47                   0
## 23     0.8        0.7              47                   0
## 24     0.8        0.8              47                   0
## 25     0.8        0.9              40                   0
## 26     0.9        0.5              11                   0
## 27     0.9        0.6              11                   0
## 28     0.9        0.7              11                   0
## 29     0.9        0.8              11                   0
## 30     0.9        0.9              11                   0

Now we select the configuration with a support of 0.4 and a confidence of 0.7. Running Eclat with this support yields a set of 565 frequent itemsets. After inducing the rules, eliminating the redundant ones and keeping only those whose consequent is the mushroom type, we are left with 7 rules, as shown below:

eclad_rules <- eclat(df, parameter = list(support = 0.4, tidLists = TRUE))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##      TRUE     0.4      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 3249 
## 
## create itemset ... 
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [21 item(s)] done [0.00s].
## creating bit matrix ... [21 row(s), 8124 column(s)] done [0.00s].
## writing  ... [565 set(s)] done [0.01s].
## Creating S4 object  ... done [0.02s].

summary(eclad_rules)

## set of 565 itemsets
## 
## most frequent items:
##    veil_type=partial gill_attachment=free     veil_color=white 
##                  283                  270                  268 
##      ring_number=one   gill_spacing=close              (Other) 
##                  222                  190                  694 
## 
## element (itemset/transaction) length distribution:sizes
##   1   2   3   4   5   6   7 
##  21  97 185 170  76  15   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.411   4.000   7.000 
## 
## summary of quality measures:
##     support           count     
##  Min.   :0.4008   Min.   :3256  
##  1st Qu.:0.4205   1st Qu.:3416  
##  Median :0.4609   Median :3744  
##  Mean   :0.4906   Mean   :3986  
##  3rd Qu.:0.5180   3rd Qu.:4208  
##  Max.   :1.0000   Max.   :8124  
## 
## includes transaction ID lists: TRUE 
## 
## mining info:
##  data ntransactions support
##    df          8124     0.4
##                                                                call
##  eclat(data = df, parameter = list(support = 0.4, tidLists = TRUE))

rulesf <- ruleInduction(eclad_rules, confidence = 0.7)
rulesf <- rulesf[!is.redundant(rulesf)]
filtered_rules <- subset(rulesf, rhs %in% c("type=edible", "type=poisonous"))
inspect(filtered_rules)

##     lhs                                  rhs                support confidence     lift
## [1] {odor=none}                       => {type=edible}    0.4194978  0.9659864 1.864941
## [2] {bruises=no,                                                                       
##      gill_attachment=free,                                                             
##      ring_number=one}                 => {type=poisonous} 0.4007878  0.7722960 1.602179
## [3] {bruises=no,                                                                       
##      veil_color=white}                => {type=poisonous} 0.4042344  0.7220756 1.497993
## [4] {bruises=no,                                                                       
##      gill_attachment=free}            => {type=poisonous} 0.4030034  0.7214632 1.496723
## [5] {bruises=no,                                                                       
##      ring_number=one}                 => {type=poisonous} 0.4007878  0.7386570 1.532393
## [6] {gill_size=broad,                                                                  
##      stalk_surface_above_ring=smooth} => {type=edible}    0.4155588  0.9398664 1.814514
## [7] {stalk_surface_above_ring=smooth} => {type=edible}    0.4480551  0.7032457 1.357692

In the following plots, we can observe the non-redundant rules, first for all consequents (rulesf) and then restricted to the mushroom type (filtered_rules), presented as scatter plots, graphs, and a grouped matrix:

grid.arrange(
  plot(rulesf),
  plot(rulesf, method = "two-key plot"),
  ncol = 2
)

plot(rulesf, method = "graph")

plot(filtered_rules, method = "graph")

plot(filtered_rules, method = "grouped")

Conclusion

To conclude this task, we can compare the two results obtained from the Apriori and Eclat algorithms, as shown below:

invisible(grid.arrange(
  plot(pruned_rules, method = "grouped") + ggtitle("Apriori"),
  plot(filtered_rules, method = "grouped") + ggtitle("Eclat"),
  ncol = 2
))

As we can observe, the two algorithms produce the same seven rules when run with the same support and confidence thresholds.
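
The agreement can also be checked without relying on the plots. The sketch below compares the two rule sets as plain strings copied from the inspect outputs above; the vector names are hypothetical, and the order of the rules is ignored.

```r
# Rules from the pruned Apriori output, in the order printed
apriori_rules <- c(
  "{odor=none} => {type=edible}",
  "{gill_size=broad,stalk_surface_above_ring=smooth} => {type=edible}",
  "{bruises=no,gill_attachment=free,ring_number=one} => {type=poisonous}",
  "{bruises=no,ring_number=one} => {type=poisonous}",
  "{bruises=no,veil_color=white} => {type=poisonous}",
  "{bruises=no,gill_attachment=free} => {type=poisonous}",
  "{stalk_surface_above_ring=smooth} => {type=edible}"
)

# Rules from the filtered Eclat output, in the order printed
eclat_rules <- c(
  "{odor=none} => {type=edible}",
  "{bruises=no,gill_attachment=free,ring_number=one} => {type=poisonous}",
  "{bruises=no,veil_color=white} => {type=poisonous}",
  "{bruises=no,gill_attachment=free} => {type=poisonous}",
  "{bruises=no,ring_number=one} => {type=poisonous}",
  "{gill_size=broad,stalk_surface_above_ring=smooth} => {type=edible}",
  "{stalk_surface_above_ring=smooth} => {type=edible}"
)

setequal(apriori_rules, eclat_rules)  # TRUE
```

With the actual objects, the same comparison could presumably be made on labels(pruned_rules) and labels(filtered_rules) instead of hand-copied strings.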

After the study, we conclude that the following association rules, obtained with a support of 0.4 and a confidence of 0.7, help us decide whether a mushroom is edible or poisonous. Each antecedent is listed with its confidence. Of the 16 rules found for these thresholds, 9 were redundant and could be removed; 8 of them only add veil_type=partial, which is present in every transaction.

Edible:
- {odor=none} (0.97)
- {gill_size=broad, stalk_surface_above_ring=smooth} (0.94)
- {stalk_surface_above_ring=smooth} (0.70)
Poisonous:
- {bruises=no, gill_attachment=free, ring_number=one} (0.77)
- {bruises=no, ring_number=one} (0.74)
- {bruises=no, veil_color=white} (0.72)
- {bruises=no, gill_attachment=free} (0.72)


Sergi Mayol and Toni Garri

2023-12-26
