We continue examining the diffusion of tetracycline among doctors in Illinois in the early 1950s, building on our work in lab 6. You will need the data sets ckm_nodes.csv
and ckm_network.dat
from the labs.
library(tidyverse)   # dplyr, tidyr, and ggplot2 are used throughout
ckm_nodes <- read.csv("data/ckm_nodes.csv", header = TRUE, sep = ",")
ckm_net <- read.table("./data/ckm_network.dat")
# Keep only the doctors with a recorded adoption date, and the matching
# rows/columns of the adjacency matrix
idx <- !is.na(ckm_nodes$adoption_date)
ckm_nodes <- ckm_nodes %>% filter(!is.na(adoption_date))
ckm_net <- ckm_net[idx, idx]
There are 6 variables, so the dataframe has 6 columns.
There is one row for every doctor in every month \(\Rightarrow\) 125 doctors \(\times\) 17 months = 2125, so the dataframe has 2125 rows.
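A quick sanity check of those dimensions (a sketch; it assumes the filtering above has already reduced ckm_nodes to the doctors with a recorded adoption date):
stopifnot(nrow(ckm_net) == nrow(ckm_nodes))  # network rows should match the remaining doctors
nrow(ckm_nodes)        # expect 125
nrow(ckm_nodes) * 17   # expect 2125 rows in the doctor-month dataframe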
# One row per doctor per month (125 doctors, 17 months)
df <- data.frame(doctor = rownames(ckm_nodes)) %>%
slice(rep(1:n(), each = 17)) %>%
mutate(month = rep(1:17,length.out=n()))
# Column 2 of ckm_nodes holds the adoption date (month of first prescription)
df <- df %>% mutate(prescribe_that_month = as.numeric(ckm_nodes[doctor,2]==month),
prescribe_before = as.numeric(ckm_nodes[doctor,2]<month))
# df <- mutate(df, contacts_before =
# sum(ckm_nodes[unlist(ckm_net[as.numeric(doctor),]) == 1, 2]
# < as.numeric(month)))
# Number of this doctor's contacts who adopted strictly before the given month
f <- function(x){
return(sum(ckm_nodes[ckm_net[as.numeric(x[1]),] == 1, 2] < as.numeric(x[2])))
}
df <- df %>% mutate(contacts_str_before = apply(df, 1, f))
# Number of this doctor's contacts who adopted before or during the given month
f <- function(x){
return(sum(ckm_nodes[ckm_net[as.numeric(x[1]),] == 1, 2] <= as.numeric(x[2])))
}
df <- df %>% mutate(contacts_in_before = apply(df, 1, f))
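As a small consistency check (a sketch, using only the columns created above), the inclusive count can never be smaller than the strict one:
# counting "<= month" can never give fewer contacts than counting "< month"
stopifnot(all(df$contacts_in_before >= df$contacts_str_before))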
Let \[ p_k = \Pr(\text{A doctor starts prescribing tetracycline this month} \mid \\ \text{Number of doctor's contacts prescribing before this month}=k) \] and \[ q_k = \Pr(\text{A doctor starts prescribing tetracycline this month} \mid \\ \text{Number of doctor's contacts prescribing this month}=k) \] We suppose that \(p_k\) and \(q_k\) are the same for all months.
max(apply(ckm_net, 1, sum))
## [1] 20
Because the maximum number of contacts any doctor has is 20, the possible values of \(k\) run from 0 to 20, so there are no more than \(21\) of them.
Attention: the expression "the number of prior-adoptee contacts \(k\)" can be read in different ways. Intuitively, it could just mean \(k\) running from 0 to 20, and many classmates treated it that way. But part (c) speaks of "the number of prior-or-contemporary-adoptee contacts \(k\)", whereas in \(q_k\) the \(k\) should stand only for contemporary-adoptee contacts. So another sensible interpretation is to count, for each \(k\), the number of doctor-months whose prior-adoptee (or prior-or-contemporary-adoptee) contact count equals \(k\). I prefer the latter interpretation, and besides, that count is exactly what the last problem uses. Either way, I plot both readings.
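Concretely, the loop below estimates each \(p_k\) as a simple proportion over doctor-month records:
\[ \hat{p}_k = \frac{\#\{\text{doctor-months with } k \text{ prior-adoptee contacts that start prescribing that month}\}}{\#\{\text{doctor-months with } k \text{ prior-adoptee contacts}\}} \]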
p.vec <- vector(mode = "numeric",length = 21)  # estimated p_k for k = 0, ..., 20
k.vec <- p.vec                                 # n_k: doctor-months with exactly k prior-adoptee contacts
for(k in 0:20){
idx <- df$contacts_str_before == k
k.vec[k+1] <- sum(idx)
if(k.vec[k+1] == 0){
p.vec[k+1] <- NA
next
}
dfk <- df[idx,]
p1 <- sum(dfk$prescribe_that_month == 1)
p.vec[k+1] <- p1/k.vec[k+1]
}
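An equivalent dplyr summary (just a sketch; it reproduces the same \(\hat p_k\), but only for the values of \(k\) that actually occur in the data):
# group doctor-months by their prior-adoptee contact count and take the adoption rate
p.tbl <- df %>%
group_by(k = contacts_str_before) %>%
summarise(n_k = n(), p_k = mean(prescribe_that_month))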
k <- c(0:20)
par(mfrow=c(1,2))
plot(k.vec,p.vec, xlab="num k", ylab="p")
plot(k,p.vec, xlab="k", ylab="p")
p.vec2 <- vector(mode = "numeric",length = 21)  # estimated q_k for k = 0, ..., 20
k.vec2 <- p.vec2                                # doctor-months with exactly k contacts adopting that same month
for(k in 0:20){
idx <- (df$contacts_in_before - df$contacts_str_before) == k
k.vec2[k+1] <- sum(idx)
if(k.vec2[k+1] == 0){
p.vec2[k+1] <- NA
next
}
dfk <- df[idx,]
p1 <- sum(dfk$prescribe_that_month == 1)
p.vec2[k+1] <- p1/k.vec2[k+1]
}
k <- c(0:20)
par(mfrow=c(1,2))
plot(k.vec2 + k.vec, p.vec2, xlab="num k", ylab="q")
plot(k, p.vec2, xlab="k", ylab="q")
Because it conditions only on information from earlier months, \(p_k\) is a little easier to interpret than \(q_k\). It is the probability per month that a doctor adopts tetracycline, given that they have exactly \(k\) contacts who had already adopted tetracycline.
df.p <- data.frame(k = 0:20, p = p.vec)
m.1 <- lm(p ~ k, data = df.p)
summary(m.1)
##
## Call:
## lm(formula = p ~ k, data = df.p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.030334 -0.014584 -0.002344 0.005534 0.048694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0569324 0.0090507 6.290 1.45e-05 ***
## k -0.0037997 0.0009184 -4.137 0.000877 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02015 on 15 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.533, Adjusted R-squared: 0.5018
## F-statistic: 17.12 on 1 and 15 DF, p-value: 0.0008773
So the estimated coefficients are:

Parameter | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
a (intercept) | 0.0569324 | 0.0090507 | 6.290 | 1.45e-05 |
b (slope on k) | -0.0037997 | 0.0009184 | -4.137 | 0.000877 |
It is a logistic curve. Suppose \(b>0\). As \(k\) grows, the initial stage of growth is approximately exponential (geometric); then, as saturation begins, the growth slows to roughly linear (arithmetic); and near maturity the curve levels off at its upper asymptote.
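In formula form, this is the same function fitted with nls() below; inverting it gives the log-odds relation used for the lm-based fit m.3:
\[ p_k = \frac{e^{a + b k}}{1 + e^{a + b k}} \qquad\Longleftrightarrow\qquad \log\frac{p_k}{1-p_k} = a + b k. \]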
# logistic.nls: fit the logistic curve directly by nonlinear least squares
f <- function(k,a,b){
return(exp(a+b*k)/(1+exp(a+b*k)))
}
m.2 <- nls(p ~ f(k, a, b), data = df.p, start = list(a = 0, b = -0.2))
# logistic.lm: fit on the log-odds scale, i.e. log(p/(1-p)) = a + b*k,
# replacing p = 0 by a small value so the log-odds stay finite
m.3 <- lm(p.log ~ k, df.p %>%
mutate(p.log = ifelse(p==0, log(0.0001/(1-0.0001)), log(p/(1-p)))))
# m.4 <- glm(p ~ k, df.p, family = "binomial")
# maybe type = 'response' in prediction would work for glm
summary(m.2)
##
## Formula: p ~ f(k, a, b)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a -2.56508 0.20610 -12.446 2.62e-09 ***
## b -0.17051 0.05371 -3.174 0.00628 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01957 on 15 degrees of freedom
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 1.449e-07
## (4 observations deleted due to missingness)
m1 <- predict(m.1, newdata=data.frame(k=c(0:20)))
m2 <- predict(m.2, newdata=data.frame(k=c(0:20)))
t <- predict(m.3, newdata=data.frame(k=c(0:20)))  # predictions on the log-odds scale
m3 <- exp(t)/(1+exp(t))                           # back-transform to the probability scale
df.p <- mutate(df.p, linear = m1, logistic = m2, logistic.2 = m3)
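# Aside: exp(t)/(1+exp(t)) is the inverse-logit, which base R also provides as
# plogis(), so m3 <- plogis(t) would give the same values.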
# na.omit(df.p) %>% ggplot(aes(x = k)) + geom_line(aes(y = p)) +
# geom_line(aes(y = predict.a)) + geom_line(aes(y = predict.b))
df.tidy <- df.p %>% gather(model, res, -k) %>% na.omit()
df.tidy %>% ggplot() + geom_point(aes(x = k, y = res, color = model), size = 3) +
geom_line(aes(x = k, y = res, color = model), size = 1) +
labs(y = "Probability",
title = "Prediction of linear and logistic models")
The linear model is the worst. I fit the logistic model in two ways, and I think the green logistic curve, fitted with nls(), looks better on the whole, because logistic.2 seems to overfit a little in the tail.
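A rough numerical check of that impression (a sketch; it computes the in-sample mean squared error of each fitted curve, using only the \(k\) values with an observed \(p_k\)):
# smaller MSE = closer fit to the observed p_k
df.p %>%
na.omit() %>%
summarise(mse.linear = mean((p - linear)^2),
mse.logistic = mean((p - logistic)^2),
mse.logistic.2 = mean((p - logistic.2)^2))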
For quibblers, pedants, and idle hands itching for work to do: The \(p_k\) values from problem 3 aren’t all equally precise, because they come from different numbers of observations. Also, if each doctor with \(k\) adoptee contacts is independently deciding whether or not to adopt with probability \(p_k\), then the variance in the number of adoptees will depend on \(p_k\). Say that the actual proportion who decide to adopt is \(\hat{p}_k\). A little probability (exercise!) shows that in this situation, \(\mathbb{E}[\hat{p}_k] = p_k\), but that \(\mathrm{Var}[\hat{p}_k] = p_k(1-p_k)/n_k\), where \(n_k\) is the number of doctors in that situation. (We estimate probabilities more precisely when they’re really extreme [close to 0 or 1], and/or we have lots of observations.) We can estimate that variance as \(\hat{V}_k = \hat{p}_k(1-\hat{p}_k)/n_k\). Find the \(\hat{V}_k\), and then re-do the estimation in (4a) and (4b) where the squared error for \(p_k\) is divided by \(\hat{V}_k\). How much do the parameter estimates change? How much do the plotted curves in (4c) change?
Probability exercise: Assume there are \(n_k\) observations \(x_1,x_2,...,x_{n_k}\), where \(x_i = 1\) if the \(i\)-th doctor with \(k\) adoptee contacts adopts that month and \(x_i = 0\) otherwise, so \(x_i\in\lbrace 0,1\rbrace\). The distribution is determined by \(P(x_i=1)=p_k\) and \(P(x_i=0)=1-p_k\): each \(x_i\) is a Bernoulli variable (so their sum is binomial). Hence \(E(x_i) = p_k\) and \(Var(x_i) = p_k(1-p_k)\).
We can see that the proportion is \(\hat p_k = \displaystyle\frac{x_1+x_2+...+x_{n_k}}{n_k}\). Therefore, since the \(x_i\) are independent, \(\mathbb E(\hat p_k) = \displaystyle\frac{n_k\cdot E(x_i)}{n_k} = p_k\) and \(Var(\hat p_k) = \displaystyle\frac{1}{n_k^2}\cdot n_k\cdot Var(x_i) = p_k(1-p_k)/n_k\).
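A tiny simulation check of this claim (just a sketch; the values \(p_k = 0.1\) and \(n_k = 30\) are arbitrary choices for illustration):
set.seed(1)
sims <- replicate(1e4, mean(rbinom(30, size = 1, prob = 0.1)))  # 10,000 simulated proportions
mean(sims)   # should be close to p_k = 0.1
var(sims)    # should be close to p_k(1-p_k)/n_k = 0.1*0.9/30 = 0.003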
idx <- !(is.na(p.vec) | p.vec == 0)
wt <- p.vec*(1-p.vec)/k.vec   # estimated variance V_k = p_k(1-p_k)/n_k (k.vec holds the n_k)
# Drop the k with p = 0: there the estimated variance is 0, so the weight 1/V_k
# would be infinite, and p = 0 cannot be estimated any more precisely anyway.
m.w1 <- lm(p ~ k, data = df.p[idx,], weights = 1/wt[idx])
summary(m.w1)
##
## Call:
## lm(formula = p ~ k, data = df.p[idx, ], weights = 1/wt[idx])
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.67043 -0.30912 0.01392 0.83562 1.15775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.068669 0.006248 10.991 1.14e-05 ***
## k -0.008548 0.001922 -4.448 0.00298 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.772 on 7 degrees of freedom
## Multiple R-squared: 0.7387, Adjusted R-squared: 0.7013
## F-statistic: 19.79 on 1 and 7 DF, p-value: 0.002978
m.w2 <- nls(p ~ f(k, a, b), data = df.p[idx,],
start = list(a = 0, b = -0.2), weights = 1/wt[idx])
summary(m.w2)
##
## Formula: p ~ f(k, a, b)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a -2.48943 0.07804 -31.901 7.69e-09 ***
## b -0.23754 0.03743 -6.346 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5513 on 7 degrees of freedom
##
## Number of iterations to convergence: 8
## Achieved convergence tolerance: 7.384e-07
The change in the parameter estimates:

Parameter | Linear | Linear (weighted) | Logistic | Logistic (weighted) |
---|---|---|---|---|
a | 0.0569324 | 0.068669 | -2.56508 | -2.48943 |
b | -0.0037997 | -0.008548 | -0.17051 | -0.23754 |
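The same comparison can be pulled straight from the fitted objects (a sketch; for the linear fits, "a" is the intercept and "b" is the slope on \(k\)):
# collect intercept/slope pairs from the four models fitted above
data.frame(parameter = c("a", "b"),
linear = coef(m.1), linear.weighted = coef(m.w1),
logistic = coef(m.2), logistic.weighted = coef(m.w2))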
w1 = predict(m.w1, newdata=data.frame(k=c(0:20)))
w2 = predict(m.w2, newdata=data.frame(k=c(0:20)))
df.w <- data.frame(k = 0:20, p = p.vec)
df.w <- mutate(df.w, linear = w1, logistic = w2)
dft <- df.w %>% gather(model, res, -k) %>% na.omit()
dft %>% ggplot() + geom_point(aes(x = k, y = res, color = model), size = 3) +
geom_line(aes(x = k, y = res, color = model), size = 1) +
labs(y = "Probability",
title = "Prediction of linear and logistic models with weight")
The change in the plots: both weighted curves fit more closely where \(k\) is small and seem to largely ignore the points near the top middle. My explanation is that those points come from small samples, so their estimated variances are large and their weights small, while small \(k\) has many observations, which increases the weight there.
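The weights themselves can be inspected directly (a sketch using wt from above and keeping only the finite entries):
# weights 1/V_k against k: large where n_k is big and/or p_k is extreme
w <- 1/wt
plot((0:20)[is.finite(w)], w[is.finite(w)], xlab = "k", ylab = "weight 1/V_k")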