MathCalculate Editorial

This article is part of the public quantitative reference library.

Learning Center pages stay public only when they define the concept clearly, explain why it matters, and support the topic with references or inspectable reasoning.

Maintained by: MathCalculate Editorial, an independent single-developer project
Reviewed: March 27, 2026
Method: Reference-first article maintenance focused on definitions, modeling context, and practical interpretation.
Limits: Articles are educational, not professional advice, and should be paired with primary sources for critical work.

Review the editorial policy, authorship notes, disclaimer, and contact details.

Statistics Reference

Hypothesis Testing

Hypothesis testing is a structured way to ask whether a pattern in data is strong enough to count as evidence against a baseline explanation. It does not prove a claim with certainty, but it helps you judge whether observed differences are more likely to reflect signal than random variation.

What It Answers

Most tests begin with a comparison between what you observed and what would be expected if nothing interesting were happening. In statistics, that default position is written as the null hypothesis, often called H0.

The competing claim is the alternative hypothesis, or H1. A test then asks whether the data are sufficiently incompatible with H0 that you would rather treat the alternative as the better working explanation.

Core Pieces

Test statistic: A number such as a z-value, t-value, or chi-square value that summarizes how far the data sit from what the null model would expect.

P-value: The probability of seeing a result at least this extreme if the null hypothesis were true. Small p-values suggest that the observed result would be unusual under the null model.

Significance threshold: A decision rule such as 0.05 or 0.01 used before looking at the result. It is a policy choice, not a law of nature.

Conclusion: Either reject the null hypothesis or fail to reject it. Failing to reject is not the same as proving no effect exists.
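The pieces above can be seen working together in a minimal one-sample z-test sketch. The numbers below are purely illustrative (a sample mean of 103 against a null mean of 100, with a population standard deviation assumed known), which is the textbook setting where the z framework applies; real data usually call for a t-test instead.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-sided one-sample z-test (population SD assumed known)."""
    se = pop_sd / math.sqrt(n)             # standard error of the mean
    z = (sample_mean - pop_mean) / se      # test statistic: distance in SE units
    p = 2 * NormalDist().cdf(-abs(z))      # two-sided p-value
    return z, p

z, p = one_sample_z_test(sample_mean=103.0, pop_mean=100.0, pop_sd=15.0, n=100)
print(f"z = {z:.3f}, p = {p:.4f}")  # z = 2.000, p = 0.0455
```

With these inputs the result sits just under a 0.05 threshold, which is exactly the kind of borderline case where reporting the interval and effect size matters more than the significance label.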

How To Interpret a P-Value

A p-value does not tell you the probability that the null hypothesis is true. It tells you how surprising the data would be if the null hypothesis were true.

Useful translation

If p = 0.03, the data would be fairly unusual under the null model, assuming the model and test assumptions are appropriate.

That does not mean there is a 97% chance the alternative is correct. It means the observed result is not very consistent with the null model.
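One way to internalize this is to simulate data from the null model itself and watch how often small p-values appear. The simulation below (illustrative parameters: mean 50, SD 10, samples of 30) tests a true null many times; roughly 5% of p-values fall below 0.05, by construction, even though no effect exists.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(42)

def z_p_value(sample, mu0, sigma):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    return 2 * NormalDist().cdf(-abs(z))

# Draw many samples from the null model (mu = 50) and test H0: mu = 50.
trials = 2000
pvals = [z_p_value([random.gauss(50, 10) for _ in range(30)], 50, 10)
         for _ in range(trials)]
frac = sum(p < 0.05 for p in pvals) / trials
print(f"Fraction of p < 0.05 under a true null: {frac:.3f}")  # near 0.05
```

This is the sense in which the significance level controls a long-run error rate rather than the probability that any single hypothesis is true.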

Confidence Intervals Matter Too

P-values answer a decision-style question. Confidence intervals add the part many readers actually need: the plausible range of effect sizes consistent with the data and model.

  • Use the p-value to judge compatibility with the null model.
  • Use the interval to judge size, direction, and practical relevance.
  • Report both when possible instead of using a significance label alone.
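A short sketch of the interval side, using only the standard library. This uses a normal approximation; for small samples a t interval is more appropriate, and the data values are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean.
    Reasonable for larger n; prefer a t interval for small samples."""
    m = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. ~1.96 for 95%
    return m - z * se, m + z * se

data = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.2, 12.4, 12.1, 12.0]
lo, hi = mean_ci(data)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```

The width of the interval carries information a bare p-value does not: a significant result with a wide interval may still be too imprecise to act on.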

Choosing the Right Test

T-Test

Use when comparing means for one sample, paired measurements, or two groups under assumptions that make the t framework reasonable.

ANOVA

Use when comparing mean differences across more than two groups while separating within-group and between-group variation.

Chi-Square

Use for categorical counts, goodness-of-fit checks, or independence questions in contingency tables.

Confidence Interval Workflows

Use when estimating a range for a mean, proportion, or effect size, especially when practical magnitude matters as much as significance.

Test choice depends on variable type, study design, independence, approximate distribution assumptions, and whether you are comparing means, proportions, or counts.
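As one concrete instance of the categorical case, a 2x2 contingency table has one degree of freedom, so its chi-square p-value can be computed in closed form with `math.erfc` (since a chi-square variate with df = 1 is a squared standard normal). The counts below are illustrative, not from any real study.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]],
    without continuity correction. df = 1, so the p-value
    follows from the normal tail via erfc."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))   # P(chi-square_1 >= chi2)
    return chi2, p

chi2, p = chi2_2x2(30, 20, 15, 35)   # e.g. group vs. outcome counts
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

For larger tables the degrees of freedom grow and a statistics library's chi-square distribution is the practical route.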

Errors, Power, and Sample Size

Type I error: Rejecting the null hypothesis when it is actually true. The significance level controls the long-run rate of this error under the testing model.

Type II error: Failing to reject the null hypothesis when a real effect exists.

Power: The probability that your test detects an effect of a given size. Low-powered studies can miss meaningful effects and also produce unstable estimates.

This is why sample size planning belongs before data collection, not after a disappointing p-value.
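The planning step can be made concrete with the standard normal-approximation formula for a two-sample comparison of means: the per-group sample size scales with the squared sum of the alpha and power quantiles, and inversely with the squared effect size. The delta and sigma below are placeholder planning inputs.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a mean difference `delta`
    between two groups with common SD `sigma` (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.960 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # ~0.842 for 80% power
    n = 2 * ((z_a + z_b) * sigma / delta) ** 2
    return math.ceil(n)

# Detecting a difference of 5 units when the SD is 10 (effect size 0.5):
print(n_per_group(delta=5.0, sigma=10.0))  # 63 per group
```

Note how the required n quadruples when the detectable difference is halved, which is why underpowered studies are so common: the cost of precision grows quickly.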

A Practical Workflow

  1. State the null and alternative hypotheses in plain language.
  2. Choose the test before looking for a favorable result.
  3. Check whether the assumptions are reasonable for your data.
  4. Compute the test statistic, p-value, and confidence interval.
  5. Interpret statistical evidence and practical importance together.
  6. Report limits, sample size, and any data exclusions transparently.

Common Mistakes

  • Treating a p-value just below 0.05 as automatic proof.
  • Ignoring effect size because the test was significant.
  • Running many tests and only reporting the favorable ones.
  • Choosing a test after seeing the data pattern you want to support.
  • Calling a non-significant result proof of no difference.

Try the P-Value Calculator

Check p-values for common test scenarios, then pair the result with interval estimates and study context.

References

  • Casella and Berger. Statistical Inference.
  • Moore, McCabe, and Craig. Introduction to the Practice of Statistics.
  • Wasserstein and Lazar. The ASA Statement on p-Values: Context, Process, and Purpose.

