Power Considerations

There are five general considerations that should be made when planning an experiment:

  • Hypothesis
  • Significance level and power
  • Effect size and variability
  • Data to be collected and statistical analysis
  • Sample size

A discussion of these considerations is followed by a few examples that apply these considerations to the design of experiments.

Hypothesis

The first and most important step in planning an experiment is to define the hypothesis. The hypothesis is a statement that can be statistically tested and is usually formed using a counterfactual. A clear hypothesis is essential for the design of the experiment and the accompanying power analysis, as it will guide sample selection, data collection, and the statistical analysis.

Counterfactual

A counterfactual is a statement that describes what would have happened in the absence of a specific event or exposure. It is a fundamental concept in the design and analysis of experiments, as it allows us to compare what actually happened to what would have happened in the absence of the event or exposure. In other words, proper experimental design allows us to compare the observed outcome to the counterfactual outcome.

Some examples of hypotheses include:

  • Cell death increases with exposure to X. In this case, the counterfactual is not explicitly stated. Rather, it is inferred that cell death would not increase in the absence of X. This hypothesis is easily tested by splitting a population of cells into two groups and exposing one group to X. We would then compare the mean cell death in the group exposed to X to the mean cell death in the group that was not exposed to X.

  • The mean concentration of analyte, A, is increased in individuals with disease, D. In this case, the counterfactual consists of the same individuals (the cases) in a world where they had not been exposed to disease, D. In other words, if they hadn’t been exposed to D, the mean concentration of analyte, A, would have remained at a lower level. While we will never be able to observe an individual in the counterfactual state (because it never happened), this hypothesis can be tested by comparing the mean concentration of analyte, A, in individuals with D, to the mean concentration of A in a control population of individuals without D. Ideally, the control population should be chosen to be as similar as possible to the population of individuals with D, except for the presence of D.

Significance level and Power

Significance level and power can be thought of in terms of errors. There are two types of errors that can be made in hypothesis testing:

  1. Type I error: Rejecting the null hypothesis when it is true. This is controlled by the significance level, α.
  2. Type II error: Failing to reject the null hypothesis when it is false. The Type II error rate is β, which is controlled by the power, 1 - β.
Table 1. Type I and Type II errors. The significance level, α, controls the Type I error rate, while the power, 1 - β, controls the Type II error rate.

                            Significant                      Not Significant
  Biologically Relevant     True Positive                    False Negative (Type II error)
  Biologically Irrelevant   False Positive (Type I error)    True Negative
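
The relationship between these two error rates can be seen in a quick simulation. The sketch below assumes a two-sample t-test with hypothetical values (10 samples per group, α = 0.05, and a true shift of one standard deviation when the null is false); it is an illustration, not a prescribed analysis.

```python
# A quick simulation of Type I and Type II errors for a two-sample t-test.
# Hypothetical values: n = 10 per group, alpha = 0.05, and a true shift of
# 1 SD when the null hypothesis is false.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 10, 10_000

type_i = type_ii = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    null_group = rng.normal(0.0, 1.0, n)    # null is true: no real difference
    effect_group = rng.normal(1.0, 1.0, n)  # null is false: true shift of 1 SD
    if ttest_ind(control, null_group).pvalue < alpha:
        type_i += 1   # false positive
    if ttest_ind(control, effect_group).pvalue >= alpha:
        type_ii += 1  # false negative

print(f"Type I error rate:  {type_i / n_sims:.3f} (expected ~ alpha = {alpha})")
print(f"Type II error rate: {type_ii / n_sims:.3f} (power = {1 - type_ii / n_sims:.3f})")
```
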
Power

Statistical power is a function of the significance level, the effect size, the variability in the data, and the sample size. It is the probability of rejecting the null hypothesis when it is false. In other words, it can be thought of as the probability of correctly identifying a true effect given the biology and experimental design (assuming you know the true effect size and variability of the population you are sampling).
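
To make this concrete, here is a minimal sketch of a power calculation for a two-sample t-test using Python's statsmodels package. The effect size is Cohen's d (mean difference divided by the standard deviation), and the specific values of d and n are illustrative assumptions, not values from the text.

```python
# Power of a two-sample t-test as a function of effect size (Cohen's d)
# and per-group sample size, at a fixed significance level.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):    # small, medium, and large effects
    for n in (10, 30, 100):  # per-group sample sizes
        power = analysis.power(effect_size=d, nobs1=n, alpha=0.05)
        print(f"d = {d:.1f}, n = {n:3d} per group -> power = {power:.2f}")
```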

Number of hypotheses

The number of hypothesis tests you run will affect the significance level and power. The more hypothesis tests you run, the more likely you are to make a Type I error. This is known as the multiple comparisons problem. There are several methods for controlling the Type I error rate when running multiple hypothesis tests, such as the Bonferroni correction. These methods control the family-wise error rate, which is the probability of making at least one Type I error when running multiple hypothesis tests. For example:

  • Your favorite gene is differentially expressed in response to treatment A. This is a focused hypothesis and would require one hypothesis test. The family-wise error rate will be the same as the significance level, α.
  • What genes are differentially expressed in response to treatment A? This is typical of a hypothesis generating study and would require multiple hypothesis tests. The family-wise error rate will be significantly higher than the significance level, α, and will need to be controlled using a multiple comparisons correction.
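
The sketch below illustrates how quickly the family-wise error rate grows with the number of tests, and what the Bonferroni correction does about it, assuming independent tests at level α.

```python
# Family-wise error rate (FWER) for m independent tests at level alpha:
# FWER = 1 - (1 - alpha)^m. The Bonferroni correction tests each hypothesis
# at alpha / m, which caps the FWER at alpha.
alpha = 0.05
for m in (1, 10, 100, 20_000):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:6d}: uncorrected FWER = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {alpha / m:.2e}")
```
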
Tip

Limiting the number of hypothesis tests you run is a good way to control the family-wise error rate and increase power. This can be done by focusing on a single hypothesis or perhaps a small number of related hypotheses. For example, focusing on a specific pathway or family of pathways can result in much higher power than a hypothesis generating study that tests 20,000 individual genes.

Effect size and Variability

Effect size and variability are inherent in the biological systems we study.

Effect size is the magnitude of the difference between the observed population and the counterfactual population. In practice this is the true change that results from the exposure we are studying. In terms of our example hypotheses above, this would be the change in proportion of cells that die after exposure to X, compared to cells not exposed to X, or the change in mean concentration of A as a result of exposure to D.

Variability is the spread of the data, usually measured by the standard deviation of the population. In the context of our first hypothesis above, this gives us a measure of how much the observed proportion of surviving cells will vary from one assay to the next. For example, if the true effect size is a rate of 20% cell death and we expose batches of 100 cells to X, we won’t always observe 80 surviving cells. If the variability is low, then we’ll see something close to 80 surviving cells from one assay to the next, perhaps ranging between 75 and 85 across many assays. If the variability is high, however, we will see more variable results, perhaps ranging between 50 and 100 surviving cells over many assays.
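
A toy simulation of this example, assuming (purely for illustration) that assay-to-assay variability acts as normal noise on the observed survival proportion:

```python
# Toy model: batches of 100 cells with a true survival rate of 80%, where
# assay-to-assay variability is normal noise on the observed survival
# proportion (an assumption made purely for illustration).
import numpy as np

rng = np.random.default_rng(1)
n_cells, true_survival, n_assays = 100, 0.80, 10

for sd in (0.02, 0.15):  # low vs high assay variability
    props = np.clip(rng.normal(true_survival, sd, n_assays), 0, 1)
    survivors = np.round(props * n_cells).astype(int)
    print(f"SD = {sd:.2f}: surviving cells across {n_assays} assays -> {survivors}")
```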

Estimates for effect size and variability can be obtained from the literature, pilot studies, or historical data. If these are not available, then the sample size will need to be large enough to detect a biologically relevant effect size with a reasonable degree of variability.

Tip

Effect size and variability stem from the biology, but there are still ways to increase power by reducing variability or increasing the effect size. For example:

  • Variability can be reduced by using a more precise assay or by careful selection of samples (e.g. matched cases and controls)
  • Effect size can be increased by sampling the extremes of the population or by using a more potent treatment.

Data to be collected and statistical analysis

The data to be collected and the statistical analysis used to test the hypothesis will depend on the hypothesis being tested. For example, in the case of the hypothesis, Cell death increases with exposure to X, the data to be collected would be the number of cells that die after exposure to X and the number of cells that die without exposure to X.

The statistical analysis used to test the hypothesis will depend on the type of data collected. For example, if the data is continuous (e.g. concentration of a protein), then a t-test could be used. Care should be taken to collect the data that will be needed to complete the desired statistical analysis.
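
For example, a minimal sketch of a two-sample t-test on continuous data using scipy; the concentrations below are made-up numbers for illustration only:

```python
# Two-sample t-test on a continuous outcome. The concentrations are
# made-up numbers used only to show the mechanics of the test.
from scipy.stats import ttest_ind

control = [4.1, 3.8, 4.5, 4.0, 3.9, 4.2]  # analyte A, individuals without D
disease = [5.0, 4.8, 5.6, 4.9, 5.3, 5.1]  # analyte A, individuals with D

# One-sided test: is A increased in individuals with D?
result = ttest_ind(disease, control, alternative="greater")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```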

Sample size

The sample size is critical in hypothesis testing, in that it affects the standard error of the statistics used to perform the statistical test. The larger the sample size, the smaller the standard error and the more precise the estimate of the effect size. A larger sample will increase the power of the test. The sample size can be calculated using a power analysis. Power analysis is a method for determining the sample size needed to detect an effect of a given size with a given degree of variability and significance level.
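
A minimal sketch of such a power analysis, solving for the per-group sample size with statsmodels; the effect size (Cohen's d), significance level, and target power are illustrative assumptions:

```python
# Solve for the per-group sample size needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at alpha = 0.05.
import math

from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required sample size per group: {n:.1f} (round up to {math.ceil(n)})")
```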

The sample size can also be determined by the resources available. For example, if the resources available only allow for a small sample size, then the effect size will need to be large and/or the variability will need to be small in order to detect a significant effect. If the resources available allow for a large sample size, then the effect size can be smaller and the variability greater while still being able to detect a significant effect.

Examples

Example 1: Visually exploring power

Given the hypothesis, Cell death increases with exposure to X, we will analyze the effect of these factors on power.

  • Significance level, α: The significance level is the probability of making a Type I error. It is the probability of rejecting the null hypothesis when it is true. α is usually set at 0.05, but can be set to any value between 0 and 1. The smaller the significance level, the smaller the probability of making a Type I error, but the larger the probability of making a Type II error.
  • Cell death rates (effect size): In the context of our experiment, the effect size is the magnitude of the effect of X on cell death: the biological difference between cells exposed to X and those same cells that were counterfactually not exposed to X. In practice, it is the expected difference in cell death rates between the exposed and unexposed groups. Set the left slider to the cell death rate of the unexposed group and the right slider to the cell death rate of the exposed group. The difference between the two sliders is the effect size.
  • Variability: The variability in our experiment is a measure of the standard deviation of the observed proportion of cells that die over many assays. We will assume both the exposed and unexposed groups have the same variability. In this example, the variability can be decreased by using a more precise assay (perhaps by including more cells in each assay).
  • Sample size (per group): In the context of our hypothesis, the statistics we are calculating are the observed difference in the proportion of cell death in exposed and in unexposed cells. The more assays we run (the sample size), the smaller the standard error of our statistics and the larger the power.

The default values above let us explore the scenario where the variability is low (0.1), and the sample size is low (5). We see the distribution of cell death rates and the distribution of cell death rate statistics. If the assumptions we have made hold true, this results in 61% power to detect the effect size of 0.2. How many samples would we need to achieve at least 80% power?

What if the variability is high (0.5)? This assumes that the variability of our assay is high. We can balance this by increasing the sample size. What sample size would we need in this case to achieve at least 80% power?

In the case where assay variability is high (0.5) and the sample size is low (5), we may be able to increase the effect size by increasing the dose of the exposure. What effect size would we need to achieve at least 80% power?
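
For readers who prefer code to sliders, here is a Monte Carlo sketch of the default scenario, assuming death rates of 0.2 (unexposed) and 0.4 (exposed), an assay SD of 0.1, 5 assays per group, and a two-sample t-test. The interactive tool’s exact model isn’t specified here, so this estimate is illustrative and may not match its numbers exactly.

```python
# Monte Carlo power estimate: death rates of 0.2 (unexposed) vs 0.4
# (exposed), assay SD of 0.1, 5 assays per group, two-sample t-test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
alpha, n_per_group, n_sims = 0.05, 5, 5_000

rejections = 0
for _ in range(n_sims):
    unexposed = rng.normal(0.2, 0.1, n_per_group)  # observed death rates, no X
    exposed = rng.normal(0.4, 0.1, n_per_group)    # observed death rates, with X
    if ttest_ind(exposed, unexposed).pvalue < alpha:
        rejections += 1

print(f"Estimated power: {rejections / n_sims:.2f}")
```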

Example 2: Exploring the effect of multiple hypotheses

In this example we will explore the effect of multiple hypotheses on the power of an experiment, ranging from a focused hypothesis (e.g. Your favorite gene is differentially expressed in response to treatment A.) to a hypothesis generating experiment (e.g. What genes are differentially expressed in response to treatment A?).

  • FW-Error Rate: Rather than setting the Type I error for a single test as in the previous example, we will focus on controlling the family-wise error rate. This is the probability of rejecting any of the many null hypotheses being tested when they are true (i.e. the probability of at least one false positive finding).
  • Effect size: In this example, we will allow for a range of effect sizes (fold-change). Three power curves will be presented, covering the range of effect sizes specified.
  • SD: Standard deviation of the log fold change. This is a measure of the variability in the log fold change. We will assume that the variability is the same for all genes.
  • Sample size: The number of samples in each group. In order to visualize how power changes with sample size, we will look at power as a function of sample size.
  • Number of analytes: The number of hypotheses being tested. This can range anywhere from the number of analytes in a 7K SomaScan, to all analytes associated with a particular system or gene pathway, down to a single analyte (1).

The default values let us explore the scenario where we want to test many hypotheses under varying assumptions of effect size. The minimum effect size is small, a 1.5-fold change. The maximum effect size is a fold-change of 4, which is a large effect. We see that, with this wide range of effect sizes, a wide range of sample sizes would be needed to achieve at least 80% power. What effect does the standard deviation have on power?

Reset the defaults and set the range of effect sizes to [1.5, 2.5] and the sample size range maximum to 50. If the exposure or disease we are studying has a small to moderate effect and we are limited to a moderate sample size, we would not have much power to detect the analytes that are actually modified by the exposure. What if we were to focus on a small number of hypotheses? How small would the number of hypotheses need to be to achieve at least 80% power? Some large changes in the number of hypotheses (say 7,000 to 5,000) don’t have much effect on power in this scenario, but a much smaller change at the lower end of the spectrum (say 1 to 50) has a large effect on power.
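
The same pattern can be sketched in code, assuming the family-wise error rate is controlled with a Bonferroni correction (per-test α = FWER / m); the effect size and sample size below are illustrative assumptions:

```python
# Power of a single two-sample t-test when the per-test alpha is set to
# FWER / m (Bonferroni). Effect size and sample size are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
fwer, d, n = 0.05, 1.0, 20

for m in (1, 50, 1_000, 5_000, 7_000):
    power = analysis.power(effect_size=d, nobs1=n, alpha=fwer / m)
    print(f"{m:5d} hypotheses -> per-test alpha = {fwer / m:.1e}, power = {power:.2f}")
```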