Complex surveys II: stratified random sampling
In the next couple of posts, I will introduce the general concepts of stratified and cluster random sampling.
Stratified random sampling
Rather than begin with the difference in definitions (in fact, I have never seen a generalist textbook that draws this distinction precisely; e.g., the sloppy Bryman 2012), let me start with the problems that these approaches solve.
Stratifying vs. clustering: two very different approaches
Suppose that we have one of the following goals.
- To ensure that we have large enough subsamples for small groups whose representation we want to guarantee, or to equalize standard errors across groups (requiring that we strategically take non-proportional samples from subsets of our population).
- To ensure that we sample every group with the same probability, so that the sample looks like the population (requiring that we take exactly proportional numbers from subsets of our population into the sample, rather than leaving this up to chance).
- To reduce variance generally: we could use, as our subsets, groups defined by some variable which also predicts the outcome of interest. In principle, this could be done so exactly that there is no variance within subsets (and thus no variance across samples), leaving only bias to worry about. More realistically, the within-subset variance might simply be very small.
If any of these is our aim, we should stratify our sample in advance (or after it is taken). Simply put, a stratum is an outcome on some grouping variable that is relevant to a particular sample. For example, the variable may be "one's state of residence"; a single stratum is then a specific state.
Suppose, instead, that we have the goal of making a sample easy to administer. We don't want to stratify a sample by state, randomly dial people in North Carolina, and end up with a 1,000-person sample with someone in every single ZIP code. Instead, we might hope to visit a small number of large city blocks and a few rural population centers, perhaps including a handful of people from truly isolated areas. In this case, we would instead try to achieve a certain amount of randomness in our sample by choosing which of these clusters to include at random. Thus, we reduce the administrative burden of interviewing people across wide geographic areas, but we still retain the extremely important random aspect. Clustering tends to increase sampling variance because we don't observe every cluster, so any given sample will be more unlike other samples than in the case of simple or stratified random sampling.
Strata are often confused with clusters, in part because nothing intrinsically makes something a cluster or a stratum except how it is treated in sampling. At least some people from every stratum are sampled; not every cluster is included. They share the property that we have some kind of sampling procedure at a level more abstract than the actual elements. The difference is that we have essentially a census of strata (and random selection within), but a random sample of clusters. Within a cluster, we can have random sampling or a census; these are referred to, respectively, as two-stage or one-stage cluster sampling, for obvious reasons. I'll say more about clustering later; for now, we'll focus on stratified sampling.
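To make the operational difference concrete, here is a minimal Python sketch (with made-up group counts and sample sizes, purely for illustration) of the two selection schemes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame: 6 groups ("counties") of 50 people each.
frame = np.repeat(np.arange(6), 50)   # group label of each element
ids = np.arange(frame.size)

# Stratified sampling: every group is represented, with a random draw inside each.
stratified = np.concatenate(
    [rng.choice(ids[frame == g], size=5, replace=False) for g in range(6)]
)

# Cluster sampling: only some groups are selected; taking a census within
# each selected cluster makes this one-stage cluster sampling.
chosen = rng.choice(6, size=2, replace=False)
clustered = ids[np.isin(frame, chosen)]

print(np.unique(frame[stratified]))   # [0 1 2 3 4 5]: all strata appear
print(np.unique(frame[clustered]))    # only the 2 sampled clusters appear
```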
Stratified random sampling: proportional and optimal allocation
Generally, there are two methods of stratifying which we'll consider here. The first is that of proportional allocation, which means to use the same sampling fraction $f = n_h/N_h = n/N$ in every stratum $h$, so that each stratum's share of the sample equals its share of the population. The second, optimal allocation, chooses the stratum sample sizes to minimize variance; I return to it at the end.
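As a quick illustration (the stratum sizes and total sample size below are invented), proportional allocation is just a one-line computation:

```python
import numpy as np

# Hypothetical stratum sizes and total sample size (illustrative numbers).
N_h = np.array([6000, 3000, 1000])   # population counts per stratum
n = 500                              # total sample size

f = n / N_h.sum()                    # common sampling fraction n/N
n_h = np.rint(f * N_h).astype(int)   # proportional allocation: n_h = f * N_h
print(f, n_h)                        # 0.05 -> [300 150  50]
```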
How do we actually use strata in the analysis? To find the equation for a stratified sample mean, we need to recall what a mean really is, at least according to one approach.
We can introduce the notation for a stratified mean and prove that it is unbiased if the method of choosing subsample means is unbiased. The proof itself is somewhat trivial, but the first few steps, though not strictly necessary if the result is already known, help show what the stratified mean really is.
In one sense, a sample mean is an estimate of the expectation of a variable. This is not obvious because we often work with continuous random variables (whose expectation is given by $E[Y] = \int y \, f(y) \, dy$, not by a weighted sum of outcomes). But compare the discrete expectation $E[Y] = \sum_j y_j \, P(Y = y_j)$ with the sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$: the mean is the same kind of sum, with each observation implicitly assigned probability $1/n$.
This becomes less trivial, and more closely resembles the expectation formula, if we have repeat values. For example, if we have a sample of size $n = 4$ with values $(2, 2, 2, 6)$, then $\bar{y} = \frac{3}{4} \cdot 2 + \frac{1}{4} \cdot 6 = 3$: each distinct value is weighted by its relative frequency, exactly as outcomes are weighted by probabilities in $E[Y]$.
However, applying these probabilities is extremely important if we know that we have sampled people with different probabilities, depending on their group membership. In general, the simple sample mean formula (in pseudo-algebra, the sum of observed values divided by the number of observations) treats every observation as an equally probable draw from the population; if some groups were sampled at higher rates, that implicit assumption fails, and the estimate is biased toward those groups.
To fix this, we apply weights, which are the inverse of the biased probability. The biased probability is simply the known probability of inclusion in the sample; "biased" here need not imply "bad" or "unintended" so much as simply "unnatural".
In nuce, we want the observed values in the stratified sample to be treatable as equally probable realizations of the random variable; they are not treatable this way because, for every person from an over- (under-)sampled group, the probability of their outcome on $y$ appearing in the sample is higher (lower) than it would be under simple random sampling.
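A small simulation makes the point; the population values, group shares, and sample sizes below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: group A (90%) has lower y than group B (10%).
y_A = rng.normal(10, 2, size=9000)
y_B = rng.normal(20, 2, size=1000)

# Deliberately oversample group B: 250 from each group.
s_A = rng.choice(y_A, size=250, replace=False)
s_B = rng.choice(y_B, size=250, replace=False)
sample = np.concatenate([s_A, s_B])

# The naive mean over-represents group B and is badly biased upward.
print(sample.mean())                  # ~15, far from the true ~11

# Inverse-probability weights: w = 1 / P(selection).
w = np.concatenate([np.full(250, 9000 / 250),   # group A: 1/(250/9000)
                    np.full(250, 1000 / 250)])  # group B: 1/(250/1000)
print((w * sample).sum() / w.sum())   # ~11, close to the true mean
```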
Another way to understand weights, one that will lead us more directly to their use, is the following rewrite of the sample mean, which will seem pretty silly at first:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{N} \sum_{i=1}^{n} \frac{N}{n} y_i$$

We begin with the mechanical definition of the mean, insert two $N$s (multiplying and dividing by the population size), and regroup. Written this way, each observation is scaled by $\frac{N}{n}$, the inverse of its probability of selection under simple random sampling, and the scaled values are summed over the population size.
But what if we have multiple subpopulations of different sizes, sampled at different rates? Then the rewrite above fails, because $\frac{N}{n}$ is no longer the inverse of every observation's selection probability: each stratum $h$ has its own sampling fraction $\frac{n_h}{N_h}$.
Thus, we arrive at the formula for the unbiased estimate of the stratified random sample mean, for $H$ strata:

$$\bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h$$

where $W_h = \frac{N_h}{N}$ is the population share of stratum $h$ and $\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$ is the sample mean within stratum $h$.
This definition of weights is consistent with Cochran (1977). Some authors prefer to write the weights as the inverse of the probability of selection, $\frac{N_h}{n_h}$; this is simply a re-partitioning of one and the same formula:

$$\bar{y}_{st} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = \frac{1}{N} \sum_{h=1}^{H} \sum_{i=1}^{n_h} \frac{N_h}{n_h} y_{hi}$$
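The following sketch (with invented strata and a deliberately disproportionate sample) confirms numerically that the two formulations are the same estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented strata: population sizes and disproportionate sample sizes.
N_h = np.array([8000, 2000])
n_h = np.array([100, 100])
samples = [rng.normal(10, 2, 100), rng.normal(20, 2, 100)]
N = N_h.sum()

# Cochran-style stratum weights W_h = N_h / N applied to stratum means.
W_h = N_h / N
mean_shares = sum(W * s.mean() for W, s in zip(W_h, samples))

# Inverse-selection-probability weights N_h / n_h applied to raw values.
mean_invprob = sum((Nh / nh) * s.sum() for Nh, nh, s in zip(N_h, n_h, samples)) / N

print(np.isclose(mean_shares, mean_invprob))  # True: one and the same estimator
```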
This alternative definition of weights is somewhat clumsy in the formulation of the sample estimate of the mean. Fortunately, it is simple for the estimate of the population total: we simply omit the denominator $N$:

$$\hat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \frac{N_h}{n_h} y_{hi} = \sum_{h=1}^{H} N_h \bar{y}_h$$
Correspondingly, the formula involving the "expected value" weights $W_h$ need only be multiplied by the population size: $\hat{Y} = N \sum_{h} W_h \bar{y}_h = N \bar{y}_{st}$.
Note an important fact: when the sampling fraction is the same in all groups (i.e., we use the stratification method simply to ensure an exactly proportional sample), we have a self-weighting sample. The standard error, to be derived momentarily, should still be calculated with weights, but the point estimate is the simple random sample mean. The basic idea is that with proportional sampling, the overall sampling fraction $\frac{n}{N}$ equals every stratum's fraction $\frac{n_h}{N_h}$, so $W_h = \frac{N_h}{N} = \frac{n_h}{n}$ and the weighted mean collapses algebraically into the simple sample mean.
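A quick check of the self-weighting claim, again with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(3)

# Proportional allocation: the same fraction (5%) from each stratum.
N_h = np.array([6000, 4000])
n_h = np.array([300, 200])             # n_h / N_h = 0.05 everywhere
samples = [rng.normal(10, 2, 300), rng.normal(20, 2, 200)]

# Inverse-probability weights are all equal (here, 20.0).
w = np.concatenate([np.full(n, N / n) for N, n in zip(N_h, n_h)])
y = np.concatenate(samples)

# With equal weights, the weighted and simple means agree exactly.
print(np.isclose((w * y).sum() / w.sum(), y.mean()))  # True
```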
The variance of the stratified random sample and its improvement on the simple random sample
The sampling variance of a stratified random sample mean is, fortunately, easy to find from the above. Because the stratum means are estimated from independent samples, the variance of their weighted sum is the weighted sum of their variances:

$$V(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \frac{S_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right)$$

where $S_h^2$ is the population variance within stratum $h$ and the final factor is the familiar finite population correction.
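In code, the estimator (with sample variances $s_h^2$ standing in for the unknown $S_h^2$, and invented strata) looks like this:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented strata; the second is smaller but far more variable.
N_h = np.array([8000, 2000])
n_h = np.array([200, 200])
samples = [rng.normal(10, 2, 200), rng.normal(20, 5, 200)]

W_h = N_h / N_h.sum()
s2_h = np.array([s.var(ddof=1) for s in samples])   # within-stratum variances

# V(y_st) = sum_h W_h^2 * s_h^2 / n_h * (1 - n_h/N_h), with the fpc term.
var_st = np.sum(W_h**2 * s2_h / n_h * (1 - n_h / N_h))
print(var_st, np.sqrt(var_st))   # estimated variance and standard error
```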
The ANOVA proof
Before going further, it is important to have the analysis of variance (ANOVA) decomposition results handy. What the ANOVA decomposition tells us is essentially that the variance of a variable can be partitioned cleanly into variation within groups and variation between groups. This is true at both the population and sample level; here, I show the decomposition at the population level. Expanding the total sum of squares around the group means:

$$\sum_{h=1}^{H} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y})^2 = \sum_{h=1}^{H} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2 + \sum_{h=1}^{H} \sum_{i=1}^{N_h} (\bar{Y}_h - \bar{Y})^2 + 2 \sum_{h=1}^{H} \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)(\bar{Y}_h - \bar{Y})$$
The rightmost term disappears: within a given group, the group deviation $(\bar{Y}_h - \bar{Y})$ is constant, so it factors out of the inner sum, and the within-deviations from the group-level mean sum to zero for any group, as proved earlier. Then:

$$\sum_{h} \sum_{i} (y_{hi} - \bar{Y})^2 = \sum_{h} \sum_{i} (y_{hi} - \bar{Y}_h)^2 + \sum_{h} \sum_{i} (\bar{Y}_h - \bar{Y})^2$$
We often see this simplified to the following equation:

$$\sum_{h} \sum_{i} (y_{hi} - \bar{Y})^2 = \sum_{h} \sum_{i} (y_{hi} - \bar{Y}_h)^2 + \sum_{h} N_h (\bar{Y}_h - \bar{Y})^2$$

since the sum over $i$ of the constant $(\bar{Y}_h - \bar{Y})^2$ within stratum $h$ is just $N_h$ copies of it.
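The identity is exact, not asymptotic, and is easy to verify numerically on an invented population:

```python
import numpy as np

rng = np.random.default_rng(5)

# A toy population split into 3 groups with different means.
groups = [rng.normal(m, 3, size=n) for m, n in [(5, 400), (10, 300), (20, 300)]]
y = np.concatenate(groups)

total_ss = ((y - y.mean())**2).sum()
within_ss = sum(((g - g.mean())**2).sum() for g in groups)
between_ss = sum(len(g) * (g.mean() - y.mean())**2 for g in groups)

# The cross term vanishes, so total SS = within SS + between SS exactly.
print(np.isclose(total_ss, within_ss + between_ss))  # True
```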
Stratified random sampling as an improvement on simple random sampling
Now we can move to the demonstration that this is an improvement on simple random sampling. We'll start by writing the variance of the simple random sample mean, and then re-expressing the total population variance $S^2$ through the ANOVA decomposition:

$$V_{srs}(\bar{y}) = \frac{1 - f}{n} S^2, \qquad (N - 1) S^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 + \sum_{h=1}^{H} N_h (\bar{Y}_h - \bar{Y})^2$$

Dividing through by $N$ and ignoring the distinction between $N_h$ and $N_h - 1$ (harmless when the strata are large), we get $S^2 \approx \sum_h W_h S_h^2 + \sum_h W_h (\bar{Y}_h - \bar{Y})^2$.
Now, finally, note that when we have proportional allocation, $n_h = n W_h$ and $f_h = f$, so the following holds true:

$$V_{prop}(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \frac{S_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) = \frac{1 - f}{n} \sum_{h=1}^{H} W_h S_h^2$$
Thus, from the ANOVA decomposition above, we have:

$$V_{srs}(\bar{y}) - V_{prop}(\bar{y}_{st}) \approx \frac{1 - f}{n} \sum_{h=1}^{H} W_h (\bar{Y}_h - \bar{Y})^2 \geq 0$$

So simple random sampling can never beat proportional stratification (to this order of approximation), and the gain is exactly the between-stratum component of the variance: the more the strata differ in their means, the more stratification helps.
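A simulation bears this out: in a population whose strata have very different means, the stratified design's sampling variance is a fraction of the simple design's. The population below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Population where the stratifier predicts y: large between-group spread.
pop = np.concatenate([rng.normal(5, 2, 5000), rng.normal(25, 2, 5000)])
labels = np.repeat([0, 1], 5000)

def srs_mean():
    return rng.choice(pop, size=100, replace=False).mean()

def strat_mean():
    # Proportional allocation: 50 from each half; self-weighting mean.
    a = rng.choice(pop[labels == 0], size=50, replace=False)
    b = rng.choice(pop[labels == 1], size=50, replace=False)
    return np.concatenate([a, b]).mean()

srs = [srs_mean() for _ in range(2000)]
strat = [strat_mean() for _ in range(2000)]
print(np.var(srs), np.var(strat))  # stratified variance is far smaller
```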
Optimal allocation
I will probably post a proof using Lagrange multipliers at some point in the future. For now, I simply state the well-known result, Neyman allocation: for a fixed total sample size $n$, the variance is minimized by sampling each stratum in proportion to $N_h S_h$, i.e. $n_h = n \frac{N_h S_h}{\sum_k N_k S_k}$, taking more observations from strata that are larger or more internally variable.
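In the meantime, here is a sketch of the allocation itself, assuming the Neyman result stated above and using invented stratum sizes and standard deviations:

```python
import numpy as np

# Neyman (optimal) allocation for fixed total n: n_h proportional to N_h * S_h.
# Stated without proof; the Lagrange-multiplier derivation is the one
# alluded to above. All numbers are illustrative.
N_h = np.array([6000, 3000, 1000])   # stratum population sizes
S_h = np.array([2.0, 8.0, 20.0])     # within-stratum standard deviations
n = 500

shares = N_h * S_h / (N_h * S_h).sum()
n_h = np.rint(n * shares).astype(int)
print(n_h)   # [107 214 179]: variable strata get more than their proportional share
```

Compare this with proportional allocation's `[300, 150, 50]` for the same stratum sizes: the small but highly variable third stratum is sampled far more heavily.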