Libraries

For this assignment, the following libraries are used:

library(car)
## Loading required package: carData
library(ggplot2)
library(gridExtra)

Data

This assignment uses the data frame Prestige included in the “car” package.

dataset <- Prestige

Questions

1.- Give a summary of the Prestige data frame. Indicate the size (number of observations and variables) and type of variables.

  • The summary is the following:
summary(dataset)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517
  • The size (number of observations and variables):
dim(dataset)
## [1] 102   6
  • The type of the variables:
str(dataset)
## 'data.frame':    102 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

2.- The focus in this assignment is the regression model, where “income” will be the dependent variable. Draw different plots to see the relationship between the variables.

To simplify the naming of the dataset, it can be renamed to df:

df <- dataset

Using the ggplot2 and gridExtra libraries, it is possible to plot the relationships between the variables.

grid_layout <- grid.arrange(
  ggplot(df, aes(x=education, y=income )) + geom_point() + ggtitle("Income vs. Education"),
  ggplot(df, aes(x=prestige, y=income )) + geom_point() + ggtitle("Income vs. Prestige"),
  ggplot(df, aes(x=women, y=income )) + geom_point() + ggtitle("Income vs. Women"),
  ggplot(df, aes(x=census, y=income )) + geom_point() + ggtitle("Income vs. Census"),
  ggplot(df, aes(x=type, y=income )) + geom_point() + ggtitle("Income vs. Type"),
  ncol = 3
)

grid_layout
## TableGrob (2 x 3) "arrange": 5 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (2-2,1-1) arrange gtable[layout]
## 5 5 (2-2,2-2) arrange gtable[layout]

3.- Consider only variables “income” and “education”. Draw a histogram for variable “income”. Draw the boxplot for “education”. Draw a scatterplot of both variables to see their relationship. Fit a linear regression and see how well this model fits the observed data. Draw the model over the scatterplot. What do you observe? Do the regression assumptions hold?

  • Histogram for the income variable:
ggplot(df, aes(x = income)) +
  geom_histogram(binwidth = 5000, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Income", x = "Income", y = "Frequency")

  • Boxplot for the education variable:
ggplot(df, aes(y = education)) +
  geom_boxplot() +
  labs(title = "Boxplot of Education", y = "Education")

  • Scatter plot of education vs income:
ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Scatter Plot of Education vs Income", x = "Education", y = "Income")

  • Linear regression on the observed data:
ggplot(df, aes(x=education, y=income)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Regression Plot: Education vs. Income", x="Education", y="Income")
## `geom_smooth()` using formula = 'y ~ x'

lin_model <- lm(income ~ education, data = df)
summary(lin_model)
## 
## Call:
## lm(formula = income ~ education, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2853.6     1407.0  -2.028   0.0452 *  
## education      898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10
# Residuals vs Fitted Values
plot(lin_model, which = 1)

As the results above show, the \(R^2\) of the model is \(\approx 0.33\), which is not good. There are some outliers, and the data are dispersed enough that the regression model struggles to predict values that fall outside the norm. Let's assess the validity of our assumptions (a short numerical check follows the list):

  1. Linearity in the relationship between X and Y: as the scatter plot shows, the relationship is at best weakly linear; income sometimes decreases even as education increases.

  2. Normal distribution of Y at each X value: the income values are strongly right-skewed, with fewer and fewer observations as income increases, so normality is doubtful.

  3. Uniform variance of Y at every X value: this assumption does not hold; the spread of income grows with education, so the residuals fan out rather than staying uniform.

  4. Independence of observations: we can assume independence, since one occupation's income is not influenced by another's.
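
These impressions can also be checked with a few quick diagnostics. The following is a minimal optional sketch (not required by the question) using base R and ncvTest() from the already loaded car package; small p-values suggest that normality or constant variance, respectively, is violated.

# Q-Q plot of the residuals: departures from the line indicate non-normality
plot(lin_model, which = 2)

# Shapiro-Wilk test of normality of the residuals
shapiro.test(residuals(lin_model))

# Breusch-Pagan style test of non-constant variance (car package)
ncvTest(lin_model)

# Cook's distance to flag the most influential observations / outliers
plot(lin_model, which = 4)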

4.- Consider the polynomial regression model. How well does the polynomial model fit the data? Compare the R2 of both linear and polynomial models. What do you conclude?

poly_model <- lm(income ~ poly(education, 2), data = df)
summary(poly_model)
## 
## Call:
## lm(formula = income ~ poly(education, 2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5951.4 -2091.1  -358.2  1762.4 18574.2 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6797.9      333.5  20.381  < 2e-16 ***
## poly(education, 2)1  24645.9     3368.5   7.316 6.77e-11 ***
## poly(education, 2)2   9488.3     3368.5   2.817  0.00586 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3369 on 99 degrees of freedom
## Multiple R-squared:  0.383,  Adjusted R-squared:  0.3706 
## F-statistic: 30.73 on 2 and 99 DF,  p-value: 4.146e-11
scatter_plot <- ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Education vs Income", x = "Education", y = "Income")

scatter_plot + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") +
  labs(title = "Polynomial Regression: Education vs Income")

As can be seen above, the polynomial regression model fits the data better (\(R^2 \approx 0.38\) versus \(\approx 0.33\) for the linear model) and is more sensitive to the outliers and extreme values, tracking the data slightly better as income rises. Overall, the model summary indicates that it is a better model than the linear one.
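
Since the linear model is nested within the polynomial one, the improvement can also be tested formally with an F-test. This is a short supplementary sketch:

# F-test: does adding the quadratic term significantly reduce the
# residual sum of squares compared with the simple linear model?
anova(lin_model, poly_model)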

  • Comparison between the linear regression and polynomial regression models, side by side:
linear_reg_plot <- ggplot(df, aes(x=education, y=income)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="Regression Plot: Education vs. Income", x="Education", y="Income")

scatter_plot <- ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Education vs Income", x = "Education", y = "Income")

poly_reg_plot <- scatter_plot + geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") + labs(title = "Polynomial Regression: Education vs Income")

grid_layout <- grid.arrange(
  linear_reg_plot,
  poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'

grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

It can be concluded that the polynomial model fits the data slightly better, as noted above.
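
The same conclusion can be cross-checked with information criteria, where lower values indicate a better trade-off between fit and model complexity. A brief sketch:

# Lower AIC / BIC favours the polynomial model only if the extra
# flexibility of the quadratic term pays off in fit
AIC(lin_model, poly_model)
BIC(lin_model, poly_model)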

5.- Would some transformation procedure of the variables, such as log or square root, allow for a better model? Try out different transformations and plot the transformed variables. Which transformation is the “best”? For the best transformation, fit a regression model.

  • Some possible transformations of the variables are the following:
# Log transformation
df$income_log <- log(df$income)
df$education_log <- log(df$education)

# Sqrt transformation
df$income_sqrt <- sqrt(df$income)
df$education_sqrt <- sqrt(df$education)

# Square transformation
df$income_sqr <- df$income^2
df$education_sqr <- df$education^2
  • A helper function can be defined to produce the linear regression vs. polynomial regression plots for a given pair of variables:
linear_vs_poly_plot <- function(df, x, y) {
  # Column names are supplied as strings and converted to symbols with
  # rlang::sym() so they can be injected into aes() with !!.
  linear_reg_plot <- ggplot(df, aes(x = !!rlang::sym(x), y = !!rlang::sym(y))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = paste("Regression Plot:", x, "vs.", y), x = x, y = y)

  scatter_plot <- ggplot(df, aes(x = !!rlang::sym(x), y = !!rlang::sym(y))) +
    geom_point() +
    labs(title = paste(x, "vs", y), x = x, y = y)

  poly_reg_plot <- scatter_plot +
    geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") +
    labs(title = paste("Polynomial Regression:", x, "vs.", y))

  # Return both plots so they can be arranged side by side with grid.arrange()
  list(linear_reg_plot = linear_reg_plot, poly_reg_plot = poly_reg_plot)
}

Now the linear_vs_poly_plot function can be applied to the transformed variables:

log_plots <- linear_vs_poly_plot(df, "education_log", "income_log")
sqrt_plots <- linear_vs_poly_plot(df, "education_sqrt", "income_sqrt")
sqr_plots <- linear_vs_poly_plot(df, "education_sqr", "income_sqr")
  • Log transformations:
grid_layout <- grid.arrange(
  log_plots$linear_reg_plot,
  log_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'

grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
  • Square root transformations:
grid_layout <- grid.arrange(
  sqrt_plots$linear_reg_plot,
  sqrt_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'

grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
  • Square transformations:
grid_layout <- grid.arrange(
  sqr_plots$linear_reg_plot,
  sqr_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'

grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Now that we have seen the models, we can check their scores:

lin_model_log  <- lm(income_log ~ education_log, data = df)
lin_model_sqrt <- lm(income_sqrt ~ education_sqrt, data = df)
lin_model_sqr  <- lm(income_sqr ~ education_sqr, data = df)
summary(lin_model_log)
## 
## Call:
## lm(formula = income_log ~ education_log, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.13072 -0.25116  0.04715  0.32647  1.30096 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.8209     0.4638  12.550  < 2e-16 ***
## education_log   1.2127     0.1969   6.158 1.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5061 on 100 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.2677 
## F-statistic: 37.92 on 1 and 100 DF,  p-value: 1.544e-08
summary(lin_model_sqrt)
## 
## Call:
## lm(formula = income_sqrt ~ education_sqrt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.856 -10.978   1.046  10.927  73.623 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -25.201     14.854  -1.697   0.0929 .  
## education_sqrt   32.115      4.533   7.085 1.99e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.89 on 100 degrees of freedom
## Multiple R-squared:  0.3342, Adjusted R-squared:  0.3275 
## F-statistic: 50.19 on 1 and 100 DF,  p-value: 1.987e-10
summary(lin_model_sqr)
## 
## Call:
## lm(formula = income_sqr ~ education_sqr, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -114125462  -43563690   -4266808   16134796  582936148 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -36828821   19929426  -1.848   0.0676 .  
## education_sqr    822416     145398   5.656 1.47e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89780000 on 100 degrees of freedom
## Multiple R-squared:  0.2424, Adjusted R-squared:  0.2348 
## F-statistic: 31.99 on 1 and 100 DF,  p-value: 1.474e-07

The ‘best’ transformation, as seen in the results from the models, is the square root, as it yields the highest Adjusted R-squared among the transformed models (\(\approx 0.3275\)). However, this is essentially the same as the linear model fitted on the non-transformed data (\(\approx 0.3269\)), so the transformation does not bring a real improvement.
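
To summarize the comparison, the adjusted R-squared values of all fitted models can be collected in one place. This is only an illustrative sketch; note that for the transformed models the R-squared is computed on the transformed income scale, so the comparison is indicative rather than exact.

# Collect the adjusted R-squared of every model fitted so far
models <- list(
  linear     = lin_model,
  polynomial = poly_model,
  log        = lin_model_log,
  sqrt       = lin_model_sqrt,
  square     = lin_model_sqr
)
sapply(models, function(m) summary(m)$adj.r.squared)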

Finally, we can try fitting models where only one of the two variables is transformed, using the square transformation:

summary(lm(income ~ education_sqr, data = df))
## 
## Call:
## lm(formula = income ~ education_sqr, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5738.4 -2176.7  -131.8  1579.8 17936.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1717.786    758.300   2.265   0.0257 *  
## education_sqr   41.411      5.532   7.485 2.86e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3416 on 100 degrees of freedom
## Multiple R-squared:  0.3591, Adjusted R-squared:  0.3527 
## F-statistic: 56.03 on 1 and 100 DF,  p-value: 2.855e-11
summary(lm(income_sqr ~ education, data = df))
## 
## Call:
## lm(formula = income_sqr ~ education, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -108608864  -45394399   -6390595   16077490  578754457 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -125767142   36776916  -3.420 0.000909 ***
## education     17678248    3320427   5.324 6.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91050000 on 100 degrees of freedom
## Multiple R-squared:  0.2209, Adjusted R-squared:  0.2131 
## F-statistic: 28.35 on 1 and 100 DF,  p-value: 6.248e-07

As can be seen, the results do not improve significantly, and the model's score is still poor.
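
If one wanted to compare the transformed-response models on the original income scale, the fitted values can be back-transformed first and an R-squared-like measure computed on income itself. A minimal sketch (the exp() back-transformation ignores the retransformation bias of the log model):

# R-squared-like measure computed on the original income scale
r2_on_income <- function(pred) {
  1 - sum((df$income - pred)^2) / sum((df$income - mean(df$income))^2)
}

r2_on_income(fitted(lin_model))            # untransformed response
r2_on_income(exp(fitted(lin_model_log)))   # log response, back-transformed
r2_on_income(fitted(lin_model_sqrt)^2)     # sqrt response, back-transformed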