The data set calif_penn_2011.csv contains information about the housing stock of California and Pennsylvania, as of 2011. Information as aggregated into “Census tracts”, geographic regions of a few thousand people which are supposed to be fairly homogeneous economically and socially.
Loading and cleaning
ca_pa
.ca_pa <- read.csv("data/calif_penn_2011.csv", header=T)[,-1]
nrow(ca_pa)
## [1] 11275
ncol(ca_pa)
## [1] 33
colSums(apply(ca_pa,c(1,2),is.na))
Count the number of missing data in each column.
na.omit()
takes a dataframe and returns a new dataframe, omitting any row containing an NA value. Use it to purge the data set of rows with incomplete data.ca_pa <- na.omit(ca_pa)
nrow(read.csv("data/calif_penn_2011.csv", header=T))-nrow(ca_pa)
## [1] 670
Yes. Some rows may have more than one missing data.
This Very New House
Built_2005_or_later
indicates the percentage of houses in each Census tract built since 2005. Plot median house prices against this variable.ggplot(ca_pa, aes(Built_2005_or_later, Median_house_value)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Or we can calculate the mean of the median house value for each value of variable Built_2005_or_later
. So the number of points can be fewer.
ca_pa %>% group_by(Built_2005_or_later) %>%
summarize(median_house_price = mean(Median_house_value)) %>%
ggplot(aes(Built_2005_or_later, median_house_price)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
STATEFP
variable, with California being state 6 and Pennsylvania state 42.ca_pa$STATEFP <- factor(ca_pa$STATEFP)
ca_pa %>% group_by(Built_2005_or_later, STATEFP) %>%
summarize(median_house_price = mean(Median_house_value)) %>%
ggplot(aes(Built_2005_or_later, median_house_price, color = STATEFP)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Nobody Home
The vacancy rate is the fraction of housing units which are not occupied. The dataframe contains columns giving the total number of housing units for each Census tract, and the number of vacant housing units.
ca_pa <- ca_pa %>% mutate(vacancy_rate = Vacant_units/Total_units)
summary(ca_pa$vacancy_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03846 0.06767 0.08889 0.10921 0.96531
ca_pa %>%
ggplot(aes(Median_house_value, vacancy_rate)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ca_pa %>%
ggplot(aes(Median_house_value, vacancy_rate, shape = STATEFP, color = STATEFP)) +
geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
p1 <- ca_pa %>% filter(STATEFP == 6) %>% ggplot(aes(Median_house_value, vacancy_rate)) +
geom_point() + ggtitle("California")
p2 <- ca_pa %>% filter(STATEFP ==42) %>% ggplot(aes(Median_house_value, vacancy_rate)) +
geom_point() + ggtitle("Pennsylvania")
ggarrange(p1, p2, ncol = 2, nrow = 1)
The prices of the houses are lower in Pennysylvania, so the points are clustered at the left-bottom. And the variance of the vacancy rate in Pennysylvania is also lower.
The column COUNTYFP
contains a numerical code for counties within each state. We are interested in Alameda County (county 1 in California), Santa Clara (county 85 in California), and Allegheny County (county 3 in Pennsylvania).
Object: Calculate the median number of total units in Alameda County, California.
Fisrtly, pick the row number of Alameda County, California and save it as acca
. Then using acca
as index, the information of total units in Alameda County, California is saved in accamhv
. The method is loop trasversal. Lastly the median number can be calculated from accamhv
.
ca_pa %>% filter(STATEFP == 6, COUNTYFP == 1) %>%
select(Total_units) %>% unlist() %>% median()
## [1] 1606
ca_house_2005 <- ca_pa %>%
filter(((STATEFP == 6) & (COUNTYFP %in% c(1,85))) |
((STATEFP == 42) & (COUNTYFP == 3)) ) %>%
group_by(COUNTYFP) %>% summarize(mean_2005 = mean(Built_2005_or_later))
ca_house_2005[,1] <- c('Alameda', 'Santa Clara', 'Allegheny')
ca_house_2005
## # A tibble: 3 x 2
## COUNTYFP mean_2005
## <chr> <dbl>
## 1 Alameda 2.82
## 2 Santa Clara 1.47
## 3 Allegheny 3.20
cor
function calculates the correlation coefficient between two variables. What is the correlation between median house value and the percent of housing built since 2005 in (i) the whole data, (ii) all of California, (iii) all of Pennsylvania, (iv) Alameda County, (v) Santa Clara County and (vi) Allegheny County?mycor <- function(x){
return(cor(x$Median_house_value, x$Built_2005_or_later))
}
mycor(ca_pa)
## [1] -0.01893186
ca_pa %>% filter(STATEFP == 6) %>% mycor()
## [1] -0.1153604
ca_pa %>% filter(STATEFP == 42) %>% mycor()
## [1] 0.2681654
ca_pa %>% filter(STATEFP == 6, COUNTYFP == 1) %>% mycor()
## [1] 0.01303543
ca_pa %>% filter(STATEFP == 6, COUNTYFP == 85) %>% mycor()
## [1] -0.1726203
ca_pa %>% filter(STATEFP == 42, COUNTYFP == 3) %>% mycor()
## [1] 0.1939652
p1 <- ca_pa %>% filter(STATEFP == 6, COUNTYFP == 1) %>%
ggplot(aes(Median_household_income, Median_house_value)) +
geom_point() + ggtitle("Alameda")
p2 <- ca_pa %>% filter(STATEFP == 6, COUNTYFP == 85) %>%
ggplot(aes(Median_household_income, Median_house_value)) +
geom_point() + ggtitle("Santa Clara")
p3 <- ca_pa %>% filter(STATEFP ==42, COUNTYFP == 3) %>%
ggplot(aes(Median_household_income, Median_house_value)) +
geom_point() + ggtitle("Allegheny")
ggarrange(p1, p2, p3, ncol = 3, nrow = 1)
acca <- c()
for (tract in 1:nrow(ca_pa)) {
if (ca_pa$STATEFP[tract] == 6) {
if (ca_pa$COUNTYFP[tract] == 1) {
acca <- c(acca, tract)
}
}
}
accamhv <- c()
for (tract in acca) {
accamhv <- c(accamhv, ca_pa[tract,10])
}
median(accamhv)
MB.Ch1.11. Run the following code:
gender <- factor(c(rep("female", 91), rep("male", 92)))
table(gender)
gender <- factor(gender, levels=c("male", "female"))
table(gender)
gender <- factor(gender, levels=c("Male", "female"))
# Note the mistake: "Male" should be "male"
table(gender)
table(gender, exclude=NULL)
rm(gender) # Remove gender
Explain the output from the successive uses of table().
Table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
gender
a factor with two levels female
, male
.levels
to be “Male” and “female”, but there is no Male in gender.MB.Ch1.12. Write a function that calculates the proportion of values in a vector x that exceed some value cutoff.
func_cutoff <- function(x, thresh){
return(prop.table(table(x>thresh))[2] %>% unname())
}
x = 1:100
thresh = 25
func_cutoff(x,thresh)
## [1] 0.75
# The package has been removed
MB.Ch1.18. The Rabbit data frame in the MASS library contains blood pressure change measurements on five rabbits (labeled as R1, R2, . . . ,R5) under various control and treatment conditions. Read the help file for more information. Use the unstack() function (three times) to convert Rabbit to the following form:
Treatment Dose R1 R2 R3 R4 R5
1 Control 6.25 0.50 1.00 0.75 1.25 1.5
2 Control 12.50 4.50 1.25 3.00 1.50 1.5
….
# method 1: Use spread()
# as.data.frame(MASS::Rabbit)[,-3] %>% spread(Animal,BPchange) %>% arrange(Treatment)
# method 2: Use the unstack() function (three times)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:DAAG':
##
## hills
## The following object is masked from 'package:dplyr':
##
## select
Dose <- unstack(Rabbit, Dose ~ Animal)[,1]
Treatment <- unstack(Rabbit, Treatment ~ Animal)[,1]
BPchange <- unstack(Rabbit, BPchange ~ Animal)
data.frame(Treatment, Dose, BPchange)
## Treatment Dose R1 R2 R3 R4 R5
## 1 Control 6.25 0.50 1.00 0.75 1.25 1.5
## 2 Control 12.50 4.50 1.25 3.00 1.50 1.5
## 3 Control 25.00 10.00 4.00 3.00 6.00 5.0
## 4 Control 50.00 26.00 12.00 14.00 19.00 16.0
## 5 Control 100.00 37.00 27.00 22.00 33.00 20.0
## 6 Control 200.00 32.00 29.00 24.00 33.00 18.0
## 7 MDL 6.25 1.25 1.40 0.75 2.60 2.4
## 8 MDL 12.50 0.75 1.70 2.30 1.20 2.5
## 9 MDL 25.00 4.00 1.00 3.00 2.00 1.5
## 10 MDL 50.00 9.00 2.00 5.00 3.00 2.0
## 11 MDL 100.00 25.00 15.00 26.00 11.00 9.0
## 12 MDL 200.00 37.00 28.00 25.00 22.00 19.0