Background: In the previous lectures and lab, we began to look at user-written functions. For this assignment we will continue with a look at fitting models by optimizing error functions, and at making user-written functions part of larger pieces of code.
In lecture, we saw how to estimate the parameter \(a\) in a nonlinear model,
\[ Y = y_0 N^a + \mathrm{noise} \] by minimizing the mean squared error \[ \frac{1}{n}\sum_{i=1}^{n}{(Y_i - y_0 N_i^a)^2}. \]
We did this by approximating the derivative of the MSE, and adjusting \(a\) by an amount proportional to that, stopping when the derivative became small. Our procedure assumed we knew \(y_0\). In this assignment, we will use a built-in R function to estimate both parameters at once; it uses a fancier version of the same idea.
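As a reminder, a minimal sketch of that derivative-based procedure might look like the following. The step size, finite-difference width, and stopping tolerance here are illustrative guesses rather than the lecture's exact values (the very small step size reflects the large scale of the MSE on these data), and the defaults assume the gmp data frame loaded below.

estimate.a <- function(a, y0 = 6611, N = gmp$pop, Y = gmp$pcgmp,
                       step = 1e-12, h = 1e-5, tol = 100, maxit = 1000) {
  mse.a <- function(a) { mean((Y - y0*N^a)^2) }   # MSE with y0 held fixed
  for (iter in 1:maxit) {
    deriv <- (mse.a(a + h) - mse.a(a - h)) / (2*h) # central-difference approximation to dMSE/da
    if (abs(deriv) < tol) break                    # stop when the derivative becomes small
    a <- a - step*deriv                            # adjust a by an amount proportional to the derivative
  }
  return(a)
}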
Because the model is nonlinear, there is no simple formula for the parameter estimates in terms of the data. Also unlike linear models, there is no simple formula for the standard errors of the parameter estimates. We will therefore use a technique called the jackknife to get approximate standard errors.
Here is how the jackknife works: for each city i, hold that city out and re-estimate the parameters on the remaining \(n-1\) cities. The variance of these hold-one-out estimates, scaled up by \((n-1)^2/n\), approximates the variance of the original estimate, so its square root approximates the standard error.
You will estimate the power-law scaling model, and its uncertainty, using the data alluded to in lecture, available in the file gmp.dat, which contains data for 2006.
library(tidyverse)  # provides %>%, mutate(), gather(), and ggplot()

gmp <- read.table("./data/gmp.dat")
gmp$pop <- round(gmp$gmp/gmp$pcgmp)  # population N = total GMP / per-capita GMP
Plot per-capita GMP against population, as in lecture, and add the curves \(y_0 N^a\) with \(y_0 = 6611\) for \(a = 1/8\), \(a = 0.1\), and \(a = 0.15\); use the col option to give each curve a different color (of your choice).

gmp <- gmp %>% mutate(a_0.125 = 6611*(gmp/pcgmp)^(1/8),
                      a_0.1 = 6611*(gmp/pcgmp)^0.1,
                      a_0.15 = 6611*(gmp/pcgmp)^0.15)  # one curve column per candidate a
gmp_tidy <- gmp[,3:7] %>% gather(para_a_value, nlmfit, -pcgmp, -pop)  # long format for plotting
gmp_tidy %>% ggplot() + geom_point(aes(x = pop, y = pcgmp))+
labs(x = "Population", y = "Per-Capita Economic Output ($/person-year)",
title = "US Metropolitan Areas, 2006")+
geom_line(aes(x = pop, y = nlmfit, color = para_a_value), size = 1.5) +
scale_x_continuous(trans = 'log10') + # log-10 scale on the population axis
scale_color_manual(values = c("red", "blue", "green")) # one color per value of a
Write a function, mse(), which calculates the mean squared error of the model on a given data set. mse() should take three arguments: a numeric vector of length two, the first component standing for \(y_0\) and the second for \(a\); a numeric vector containing the values of \(N\); and a numeric vector containing the values of \(Y\). The function should return a single numeric value. The latter two arguments should have as their default values the columns pop and pcgmp (respectively) from the gmp data frame from lecture. Your function may not use for() or any other loop. Check that, with the default data, you get the following values:

> mse(c(6611,0.15))
[1] 207057513
> mse(c(5000,0.10))
[1] 298459915
mse <- function(para, N = gmp$pop, Y = gmp$pcgmp){
  # para[1] is y0 and para[2] is a; the arithmetic is vectorized, so no loop is needed
  return(mean((Y - para[1]*N^para[2])^2))
}
mse(c(6611,0.15))
## [1] 207057513
mse(c(5000,0.10))
## [1] 298459914
R has a built-in optimization function, nlm(), for non-linear minimization. nlm() takes two required arguments: a function, and a starting value for that function. Run nlm() three times with your function mse() and three starting value pairs for \(y_0\) and \(a\), as in

nlm(mse, c(y0=6611,a=1/8))
## $minimum
## [1] 61857060
##
## $estimate
## [1] 6611.0000000 0.1263177
##
## $gradient
## [1] 50.048639 -9.983778
##
## $code
## [1] 2
##
## $iterations
## [1] 3
What do the quantities minimum and estimate represent? What values does it return for these?
t1 <- nlm(mse, c(y0=6611,a=1/8))
t2 <- nlm(mse, c(y0=6600,a=0.1))
t3 <- nlm(mse, c(y0=6620,a=0.15))
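The three runs can be compared side by side; a small summary table like this sketch (not part of the assignment) makes the answer concrete.

rbind(t1 = c(t1$estimate, MSE = t1$minimum),
      t2 = c(t2$estimate, MSE = t2$minimum),
      t3 = c(t3$estimate, MSE = t3$minimum))  # estimated (y0, a) and minimized MSE per run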
minimum represents the value of the estimated minimum of f, here the smallest MSE found. estimate represents the point at which that minimum is attained, here the estimated \((y_0, a)\).

Using nlm(), and the mse() function you wrote, write a function, plm(), which estimates the parameters \(y_0\) and \(a\) of the model by minimizing the mean squared error. It should take the following arguments: an initial guess for \(y_0\); an initial guess for \(a\); a vector containing the \(N\) values; a vector containing the \(Y\) values. All arguments except the initial guesses should have suitable default values. It should return a list with the following components: the final guess for \(y_0\); the final guess for \(a\); the final value of the MSE. Your function must call those you wrote in earlier questions (it should not repeat their code), and the appropriate arguments to plm() should be passed on to them.
plm <- function(para, N = gmp$pop, Y = gmp$pcgmp){
  # nlm() forwards the extra arguments N and Y on to mse() through its '...'
  t <- nlm(mse, para, N, Y)
  return(list(parameters = t$estimate, MSE = t$minimum))
}
plm(c(6611,0.15))
## $parameters
## [1] 6610.9999997 0.1263182
##
## $MSE
## [1] 61857060
plm(c(5000,0.10))
## $parameters
## [1] 5000.0000008 0.1475913
##
## $MSE
## [1] 62521484
There are multiple local optima, so the result depends on the starting values; the first pair of starting values leads to the lower MSE.
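To see the local optima directly, one could run plm() from a small grid of starting values and compare the minima reached; the grid below is an arbitrary illustrative choice, not part of the assignment.

starts <- expand.grid(y0 = c(5000, 6611, 8000), a = c(0.10, 0.125, 0.15))
mses <- apply(starts, 1, function(s) plm(c(s["y0"], s["a"]))$MSE)  # MSE reached from each start
cbind(starts, MSE = mses)  # which basin of attraction each start falls into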
Convince yourself that the jackknife can work.

Calculate the mean per-capita GMP across cities, and the standard error of this mean, using the built-in functions mean() and sd(), and the formula for the standard error of the mean you learned in your intro stats class (or looked up on Wikipedia…).

mean(gmp$pcgmp)
## [1] 32922.53
sd(gmp$pcgmp)/sqrt(nrow(gmp)) # standard error = sd/sqrt(n)
## [1] 481.9195
The standard error of the mean is \(\sigma_{\bar x} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar x)^2}{n(n-1)}}\).
Write a function which takes in an integer i, and calculates the mean per-capita GMP for every city except city number i.

mean.jackknife <- function(i, pc = gmp$pcgmp){
  return(mean(pc[-i]))  # the negative index drops city i
}
# example: mean.jackknife(1) is the mean with city 1 held out
Using this function, make a vector, jackknifed.means, which has the mean per-capita GMP where every city is held out in turn. (You may use a for loop or sapply().)

n <- nrow(gmp)
jackknifed.means <- sapply(1:n, mean.jackknife)
Using the vector jackknifed.means, calculate the jack-knife approximation to the standard error of the mean. How well does it match your answer from part (a)?

sqrt((n-1)^2/n*var(jackknifed.means))  # jackknife standard error of the mean
## [1] 481.9195
They are the same. Furthermore, the two formulas are algebraically equivalent: \[\frac{n-1}{n} \sum_{i=1}^{n}\left(\hat{\theta}_{(-i)}-\hat{\theta}_{(\cdot)}\right)^{2}=\frac{1}{n(n-1)} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\] when \(\hat\theta_{(-i)} = \displaystyle\frac{\sum_{j=1}^n x_j - x_i}{n-1}\) and \(\hat{\theta}_{(\cdot)}=\frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(-i)}\).
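For completeness, the derivation is short:
\[ \hat{\theta}_{(-i)} = \frac{n\bar{x} - x_i}{n-1}, \qquad \hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\frac{n\bar{x} - x_i}{n-1} = \bar{x}, \qquad \hat{\theta}_{(-i)} - \hat{\theta}_{(\cdot)} = \frac{\bar{x} - x_i}{n-1}, \]
so
\[ \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(-i)}-\hat{\theta}_{(\cdot)}\right)^{2} = \frac{n-1}{n}\cdot\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}}{(n-1)^{2}} = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}. \]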
Write a function, plm.jackknife(), to calculate jackknife standard errors for the parameters \(y_0\) and \(a\). It should take the same arguments as plm(), and return standard errors for both parameters. This function should call your plm() function repeatedly. What standard errors do you get for the two parameters?
plm.jackknife <- function(para, N = gmp$pop, Y = gmp$pcgmp){
  jackknife <- function(i){
    return(plm(para, N[-i], Y[-i])$parameters)  # re-fit with city i held out
  }
  n <- length(N)
  jackknifed.para <- sapply(1:n, jackknife)  # 2 x n matrix: row 1 is y0, row 2 is a
  se <- c(y.sd = sqrt((n-1)^2/n*var(jackknifed.para[1,])),
          a.sd = sqrt((n-1)^2/n*var(jackknifed.para[2,])))
  return(se)
}
plm.jackknife(c(6611, 0.125))
## y.sd a.sd
## 1.136653e-08 9.901003e-04
The file gmp-2013.dat contains measurements for 2013. Load it, and use plm() and plm.jackknife() to estimate the parameters of the model for 2013, and their standard errors. Have the parameters of the model changed significantly?

gmp_2013 <- read.csv("data/gmp-2013.dat", header = TRUE, sep = " ") %>%
  mutate(pop = round(gmp/pcgmp))  # recover population as before
plm(c(6611,1/8),N = gmp_2013$pop, Y = gmp_2013$pcgmp)
## $parameters
## [1] 6611.0000002 0.1433688
##
## $MSE
## [1] 135210524
plm.jackknife(c(6611, 0.125),N = gmp_2013$pop, Y = gmp_2013$pcgmp)
## y.sd a.sd
## 2.692652e-08 1.098548e-03
We can see that the parameters do not change significantly; a check of this claim is sketched below.
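One way to make "significantly" precise (a sketch, not required by the assignment, and assuming the two years' estimates are independent) is to compare the change in each parameter to its combined jackknife standard error:

fit06 <- plm(c(6611, 1/8))
fit13 <- plm(c(6611, 1/8), N = gmp_2013$pop, Y = gmp_2013$pcgmp)
se06 <- plm.jackknife(c(6611, 1/8))
se13 <- plm.jackknife(c(6611, 1/8), N = gmp_2013$pop, Y = gmp_2013$pcgmp)
(fit13$parameters - fit06$parameters) / sqrt(se06^2 + se13^2)  # approximate z-scores for the change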