Oct 21
I wasn't totally sure of the best thing to do this week, so I worked on a few different things. I went back and looked at the sample variance estimator that I had been using and I'm now a lot more confident in it. It doesn't seem to be an extremely stable estimator for the true population variance but I think with large enough populations it is workable. If I have time I may explore variance estimation more thoroughly, but the verification of the method is a much higher priority for me, and one I'm a lot less sure how to do.
I also did some experiments with the estimates. From what I know of ratio estimation, the most precise auxiliary variables (cluster estimates) will have a correlation close to 1 with the actual values and a line of best fit with a y-intercept of 0. I looked at changing the upper and lower precision between .5 and .95 in .1 increments, and looked at how variance, correlation of estimates and actual populations, and y-intercept changed with these. I made plots of all of these (to be honest these experiments were largely to get practice making 3-d plots in R). I found that changing the lower limit affects variance significantly more than changing the upper limit. It seems somewhat odd that one of the two would have a larger effect, since I would expect them to impact variance in the same way, but I haven't explored it too closely.
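A rough sketch of this kind of experiment, in Python rather than R, might look like the following. It is not the code I actually ran; it rests on several assumptions: each cluster estimate is drawn uniformly between `lower * x` and `(2 - upper) * x` of the true value `x` (so a precision of 0.5 on both sides gives x/2 to 1.5*x), the true cluster sizes are made-up random integers, and the grid values, sample size and number of repetitions are placeholders. The function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(lower, upper, n_clusters=200, n_sample=30, n_reps=500, seed=0):
    """Return (correlation, y-intercept, variance of ratio-estimated totals)."""
    rng = np.random.default_rng(seed)
    # Made-up true cluster populations (assumption, not real data).
    true = rng.integers(100, 1000, size=n_clusters).astype(float)
    # Assumed error model: estimate lies uniformly in [lower*x, (2-upper)*x].
    est = true * rng.uniform(lower, 2.0 - upper, size=n_clusters)

    # Correlation between estimates and actual populations.
    corr = np.corrcoef(est, true)[0, 1]
    # Line of best fit, actual = slope * estimate + intercept.
    slope, intercept = np.polyfit(est, true, 1)

    # Repeatedly sample clusters and form the ratio estimate of the total.
    totals = np.empty(n_reps)
    for r in range(n_reps):
        idx = rng.choice(n_clusters, size=n_sample, replace=False)
        ratio = true[idx].sum() / est[idx].sum()
        totals[r] = ratio * est.sum()
    return corr, intercept, totals.var(ddof=1)

# Placeholder grid of lower/upper precision values.
levels = [0.5, 0.6, 0.7, 0.8, 0.9]
grid = {(lo, hi): simulate(lo, hi) for lo in levels for hi in levels}
```

Each entry of `grid` holds the three quantities for one (lower, upper) pair, which is the shape of data needed for a 3-d surface plot over the two limits.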
The two main things I'm hoping to look at from here on out are verifying the method and how estimation affects sampling, with verification being the higher priority.
with Jim in class
- "valid" isn't well defined, particularly because the method depends on the specifics of the estimation.
- "useful" is probably a better notion to try to get hold of.
- Rather than using 0.5 below to 0.5 above (which is really x/2 to 1.5*x), it's probably better, since the method relies on scaling, to use a constant scaling above and below some uniformly biased number ... particularly since your method assumes that there is such a uniform bias.
- A next-more-complicated but perhaps more realistic approach would be to scale the unmeasured estimated values by a population of scaling factors rather than just one, with that population generated from the (mean, std_dev) of the sampled values. That would give not one estimated set of values but a population of such estimates, from which you could estimate high/low/variance etc. of the result.
- One place you could look at a specific estimation method, if this does use census data as an example, would be to pull old US census data from a number of geographic regions, try to predict (say) 1960 from 1950, and investigate how well the scheme works.
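
The "population of scaling factors" idea from the notes above could be sketched roughly as follows, in Python. This is only one reading of the suggestion and assumes several things not settled in the notes: the scaling factors are the per-cluster ratios of actual to estimated values in the sample, the population of factors is drawn from a normal distribution with that mean and standard deviation, and each drawn factor is applied to all unmeasured clusters at once. The function name `scaled_total_population` and the example numbers are hypothetical.

```python
import numpy as np

def scaled_total_population(est_sampled, true_sampled, est_unmeasured,
                            n_draws=1000, seed=0):
    """Return a population of estimated totals, one per drawn scaling factor."""
    rng = np.random.default_rng(seed)
    est_sampled = np.asarray(est_sampled, dtype=float)
    true_sampled = np.asarray(true_sampled, dtype=float)
    est_unmeasured = np.asarray(est_unmeasured, dtype=float)

    # Per-cluster scaling factors observed in the sample.
    ratios = true_sampled / est_sampled
    mean, std = ratios.mean(), ratios.std(ddof=1)

    # Assumption: factors are normally distributed around the sample mean.
    factors = rng.normal(mean, std, size=n_draws)

    # Measured clusters contribute their actual values; unmeasured clusters
    # contribute their estimates scaled by each drawn factor.
    return true_sampled.sum() + factors * est_unmeasured.sum()

# Made-up example values, for illustration only.
totals = scaled_total_population(
    est_sampled=[100, 220, 310, 150],
    true_sampled=[120, 200, 330, 170],
    est_unmeasured=[90, 400, 260],
)
summary = {
    "low": np.percentile(totals, 2.5),
    "high": np.percentile(totals, 97.5),
    "variance": totals.var(ddof=1),
}
```

The `summary` dictionary gives the high/low/variance figures for the resulting population of estimates; the 2.5 and 97.5 percentiles are an arbitrary choice of "low" and "high".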