For this assignment, the following libraries are used:
library(car)
## Loading required package: carData
library(ggplot2)
library(gridExtra)
This assignment uses the data frame Prestige included in the “car” package.
dataset <- Prestige
1.- Give a summary of the Prestige data frame. Indicate the size (number of observations and variables) and type of variables.
summary(dataset)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
dim(dataset)
## [1] 102 6
str(dataset)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
The data frame has 102 observations of 6 variables: education, women and prestige are numeric, income and census are integers, and type is a factor with three levels (bc, prof, wc) and 4 missing values.
2.- The focus in this assignment is the regression model, where “income” will be the dependent variable. Draw different plots to see the relationship between the variables.
To simplify the naming of the dataset, it can be renamed to df:
df <- dataset
Using the ggplot2 and gridExtra libraries, it is possible to plot the relationships between income and the other variables.
grid_layout <- grid.arrange(
  ggplot(df, aes(x = education, y = income)) + geom_point() + ggtitle("Income vs. Education"),
  ggplot(df, aes(x = prestige, y = income)) + geom_point() + ggtitle("Income vs. Prestige"),
  ggplot(df, aes(x = women, y = income)) + geom_point() + ggtitle("Income vs. Women"),
  ggplot(df, aes(x = census, y = income)) + geom_point() + ggtitle("Income vs. Census"),
  ggplot(df, aes(x = type, y = income)) + geom_point() + ggtitle("Income vs. Type"),
  ncol = 3
)
grid_layout
## TableGrob (2 x 3) "arrange": 5 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (2-2,1-1) arrange gtable[layout]
## 5 5 (2-2,2-2) arrange gtable[layout]
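A quick numeric complement to these plots is the matrix of pairwise correlations between the quantitative variables; a minimal sketch using base R:
# Pairwise correlations between the quantitative variables
cor(df[, c("education", "income", "women", "prestige", "census")])
This summarizes each scatterplot relationship with a single number per pair of variables.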
3.- Consider only variables “income” and “education”. Draw a histogram for variable “income”. Draw the boxplot for “education”. Draw a scatterplot of both variables to see their relationship. Fit a linear regression and see how well this model fits the observed data. Draw the model over the scatterplot. What do you observe? Do the regression assumptions hold?
Histogram of the income variable:
ggplot(df, aes(x = income)) +
  geom_histogram(binwidth = 5000, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Histogram of Income", x = "Income", y = "Frequency")
Boxplot of the education variable:
ggplot(df, aes(y = education)) +
  geom_boxplot() +
  labs(title = "Boxplot of Education", y = "Education")
Scatterplot of education vs income:
ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Scatter Plot of Education vs Income", x = "Education", y = "Income")
ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Regression Plot: Education vs. Income", x = "Education", y = "Income")
## `geom_smooth()` using formula = 'y ~ x'
lin_model <- lm(income ~ education, data = df)
summary(lin_model)
##
## Call:
## lm(formula = income ~ education, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
# Residuals vs Fitted Values
plot(lin_model, which = 1)
As the results above show, the \(R^2\) of the model is \(\approx 0.33\), which is not good. There are some outliers, and the dispersion of the data makes the regression model struggle to predict values far from the bulk of the observations. Let's assess the validity of our assumptions (a few formal checks are sketched after this list):
Linearity in the relationship between X and Y: the plots offer little support for a strictly linear relationship; some incomes even decrease as education increases.
Normal distribution of Y at each X value: the income values are right-skewed rather than normal, with fewer and fewer observations as income increases.
Uniform variance of Y at every X value: this assumption does not hold; the spread of the residuals grows with the fitted values, so some deviations are substantially larger than others.
Independence of observations: we can assume independence, since the income of one occupation is not influenced by the incomes of the others.
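These impressions can be checked with standard diagnostics; a minimal sketch using the fitted lin_model, base R, and the car package already loaded above:
# Normal Q-Q plot of the residuals (normality check)
plot(lin_model, which = 2)
# Shapiro-Wilk test on the residuals (H0: the residuals are normal)
shapiro.test(residuals(lin_model))
# Score test for non-constant error variance, from the car package
ncvTest(lin_model)
Small p-values in the last two tests would confirm the non-normality and heteroscedasticity suspected above.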
4.- Consider the polynomial regression model. How well does the polynomial model fit the data? Compare the R2 of both linear and polynomial models. What do you conclude?
poly_model <- lm(income ~ poly(education, 2), data = df)
summary(poly_model)
##
## Call:
## lm(formula = income ~ poly(education, 2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5951.4 -2091.1 -358.2 1762.4 18574.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6797.9 333.5 20.381 < 2e-16 ***
## poly(education, 2)1 24645.9 3368.5 7.316 6.77e-11 ***
## poly(education, 2)2 9488.3 3368.5 2.817 0.00586 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3369 on 99 degrees of freedom
## Multiple R-squared: 0.383, Adjusted R-squared: 0.3706
## F-statistic: 30.73 on 2 and 99 DF, p-value: 4.146e-11
scatter_plot <- ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Education vs Income", x = "Education", y = "Income")
scatter_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") +
  labs(title = "Polynomial Regression: Education vs Income")
As can be seen above, the polynomial regression model fits the data somewhat better (adjusted \(R^2 \approx 0.37\) versus \(\approx 0.33\) for the linear model) and follows the curvature of the data more closely, especially at higher incomes where the outliers lie. The quadratic term is significant in the model summary (p ≈ 0.006), which also points to the polynomial model being the better of the two.
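The two scores can also be pulled out side by side; a small sketch, reusing the lin_model and poly_model objects fitted above:
# Adjusted R-squared of both models, side by side
c(linear = summary(lin_model)$adj.r.squared,
  polynomial = summary(poly_model)$adj.r.squared)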
linear_reg_plot <- ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Regression Plot: Education vs. Income", x = "Education", y = "Income")
scatter_plot <- ggplot(df, aes(x = education, y = income)) +
  geom_point() +
  labs(title = "Education vs Income", x = "Education", y = "Income")
poly_reg_plot <- scatter_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") +
  labs(title = "Polynomial Regression: Education vs Income")
grid_layout <- grid.arrange(
  linear_reg_plot,
  poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'
grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
It can be concluded that the polynomial model fits the data slightly better, as noted above.
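Because the linear model is nested in the quadratic one, this conclusion can also be tested formally with an F-test; a minimal sketch, reusing the fitted models:
# Nested-model F-test: does the quadratic term significantly improve the fit?
anova(lin_model, poly_model)
A small p-value in the second row would indicate that the quadratic term adds real explanatory power, consistent with its significant coefficient in the model summary.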
5.- Would some transformation procedure of the variables, such as log or square root, allow for a better model? Try out different transformations and plot the transformed variables. Which transformation is the “best”? For the best transformation, fit a regression model.
# Log transformation
df$income_log <- log(df$income)
df$education_log <- log(df$education)
# Sqrt transformation
df$income_sqrt <- sqrt(df$income)
df$education_sqrt <- sqrt(df$education)
# Square transformation
df$income_sqr <- df$income^2
df$education_sqr <- df$education^2
linear_vs_poly_plot <- function(df, x, y) {
  # Scatterplot with a fitted simple linear regression line
  linear_reg_plot <- ggplot(df, aes(x = !!rlang::sym(x), y = !!rlang::sym(y))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = paste("Regression Plot:", x, "vs.", y), x = x, y = y)
  # Plain scatterplot, reused as the base for the polynomial fit
  scatter_plot <- ggplot(df, aes(x = !!rlang::sym(x), y = !!rlang::sym(y))) +
    geom_point() +
    labs(title = paste(x, "vs", y), x = x, y = y)
  # Same scatterplot with a quadratic fit overlaid in red
  poly_reg_plot <- scatter_plot +
    geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "red") +
    labs(title = paste("Polynomial Regression:", x, "vs.", y))
  list(linear_reg_plot = linear_reg_plot, poly_reg_plot = poly_reg_plot)
}
Now the linear_vs_poly_plot function can be applied to the transformed variables:
log_plots <- linear_vs_poly_plot(df, "education_log", "income_log")
sqrt_plots <- linear_vs_poly_plot(df, "education_sqrt", "income_sqrt")
sqr_plots <- linear_vs_poly_plot(df, "education_sqr", "income_sqr")
grid_layout <- grid.arrange(
  log_plots$linear_reg_plot,
  log_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'
grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
grid_layout <- grid.arrange(
  sqrt_plots$linear_reg_plot,
  sqrt_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'
grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
grid_layout <- grid.arrange(
  sqr_plots$linear_reg_plot,
  sqr_plots$poly_reg_plot,
  ncol = 2
)
## `geom_smooth()` using formula = 'y ~ x'
grid_layout
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Now that we have seen the plots, we can compare the scores of the corresponding models:
lin_model_log <- lm(income_log ~ education_log, data = df)
lin_model_sqrt <- lm(income_sqrt ~ education_sqrt, data = df)
lin_model_sqr <- lm(income_sqr ~ education_sqr, data = df)
summary(lin_model_log)
##
## Call:
## lm(formula = income_log ~ education_log, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.13072 -0.25116 0.04715 0.32647 1.30096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8209 0.4638 12.550 < 2e-16 ***
## education_log 1.2127 0.1969 6.158 1.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5061 on 100 degrees of freedom
## Multiple R-squared: 0.275, Adjusted R-squared: 0.2677
## F-statistic: 37.92 on 1 and 100 DF, p-value: 1.544e-08
summary(lin_model_sqrt)
##
## Call:
## lm(formula = income_sqrt ~ education_sqrt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.856 -10.978 1.046 10.927 73.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.201 14.854 -1.697 0.0929 .
## education_sqrt 32.115 4.533 7.085 1.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.89 on 100 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.3275
## F-statistic: 50.19 on 1 and 100 DF, p-value: 1.987e-10
summary(lin_model_sqr)
##
## Call:
## lm(formula = income_sqr ~ education_sqr, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -114125462 -43563690 -4266808 16134796 582936148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36828821 19929426 -1.848 0.0676 .
## education_sqr 822416 145398 5.656 1.47e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89780000 on 100 degrees of freedom
## Multiple R-squared: 0.2424, Adjusted R-squared: 0.2348
## F-statistic: 31.99 on 1 and 100 DF, p-value: 1.474e-07
The ‘best’ transformation, as seen in the results from the models, is the square root, as it yields the highest adjusted R-squared (\(\approx 0.3275\)). However, this is essentially indistinguishable from the linear model fitted on the non-transformed data (\(\approx 0.3269\)), and R-squared values computed on different response scales are only roughly comparable in any case.
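To make the comparison explicit, the adjusted R-squared values of the transformed models can be collected in one place; a minimal sketch, reusing the models fitted above:
# Adjusted R-squared for each transformation, side by side
sapply(list(linear = lin_model,
            log = lin_model_log,
            sqrt = lin_model_sqrt,
            square = lin_model_sqr),
       function(m) summary(m)$adj.r.squared)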
Finally, we can try fitting models in which only one of the two variables is squared:
summary(lm(income ~ education_sqr, data = df))
##
## Call:
## lm(formula = income ~ education_sqr, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5738.4 -2176.7 -131.8 1579.8 17936.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1717.786 758.300 2.265 0.0257 *
## education_sqr 41.411 5.532 7.485 2.86e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3416 on 100 degrees of freedom
## Multiple R-squared: 0.3591, Adjusted R-squared: 0.3527
## F-statistic: 56.03 on 1 and 100 DF, p-value: 2.855e-11
summary(lm(income_sqr ~ education, data = df))
##
## Call:
## lm(formula = income_sqr ~ education, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -108608864 -45394399 -6390595 16077490 578754457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -125767142 36776916 -3.420 0.000909 ***
## education 17678248 3320427 5.324 6.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91050000 on 100 degrees of freedom
## Multiple R-squared: 0.2209, Adjusted R-squared: 0.2131
## F-statistic: 28.35 on 1 and 100 DF, p-value: 6.248e-07
As can be seen, the results do not improve substantially: income ~ education_sqr reaches an adjusted R-squared of \(\approx 0.35\), slightly above the untransformed linear model but still below the quadratic polynomial model, and the fit remains poor overall.
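As a final check, the residual plot of the better of these two fits can be inspected in the same way as before; a minimal sketch (best_model is a name introduced here for convenience):
# Refit the model with squared education and show its residuals-vs-fitted plot
best_model <- lm(income ~ education_sqr, data = df)
plot(best_model, which = 1)
A funnel shape in this plot would indicate that the heteroscedasticity observed earlier persists under this transformation as well.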