Effect Size Calculator (Cohen's d)
Quantify how large the difference between two independent groups really is. Compute Cohen's d, the bias-corrected Hedges' g, and the pooled standard deviation from summary statistics.
What You Need Before You Start
- The mean, standard deviation, and sample size of each of the two groups.
- Groups must be independent (different participants), not paired or repeated measurements.
- Standard deviations should be sample statistics (computed with the n − 1 denominator).
What Cohen's d Measures
A raw mean difference is hard to judge on its own: is a 7-point gap between two classes large? It depends entirely on how much scores vary in the first place. Cohen's d resolves this by expressing the gap in units of standard deviation — a d of 0.7 means the group means sit 0.7 pooled standard deviations apart, whatever the original measurement scale was.
Because it is scale-free, d serves two purposes raw differences cannot. It lets you compare effects measured on different instruments (a reading intervention scored on one test against a math intervention scored on another), and it is the common currency of meta-analysis, where results from many studies are pooled into a single estimate.
The Formulas: Pooled SD, d, and Hedges' g
The calculator runs three computations in sequence:
Effect Size Formulas:
sₚ = √(((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2))
d = (m₁ − m₂) / sₚ
g = d × (1 − 3 / (4(n₁ + n₂) − 9))
Symbols:
- m₁, m₂ = group means
- s₁, s₂ = group standard deviations
- n₁, n₂ = group sample sizes
- sₚ = pooled standard deviation
The pooled standard deviation is a weighted average of the two group variances, with each group weighted by its degrees of freedom (n − 1). This gives larger groups proportionally more influence over the yardstick used to standardize the difference.
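The three formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual source: the function name `effect_sizes`, the `EffectSizes` tuple, and the input checks are assumptions made for the example, and the unequal-group numbers in the usage lines are made up purely to show how the degrees-of-freedom weighting behaves.

```python
import math
from typing import NamedTuple


class EffectSizes(NamedTuple):
    pooled_sd: float
    d: float
    g: float


def effect_sizes(m1: float, s1: float, n1: int,
                 m2: float, s2: float, n2: int) -> EffectSizes:
    """Pooled SD, Cohen's d, and Hedges' g from summary statistics."""
    # Assumed validation: each group needs n >= 2 so its n - 1 weight is positive.
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 observations")
    if s1 < 0 or s2 < 0:
        raise ValueError("standard deviations cannot be negative")

    # Weight each sample variance by its degrees of freedom (n - 1).
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    if pooled_sd == 0:
        raise ValueError("pooled SD is zero, so d is undefined")

    d = (m1 - m2) / pooled_sd
    g = d * (1 - 3 / (4 * (n1 + n2) - 9))
    return EffectSizes(pooled_sd, d, g)


# Hypothetical unequal groups: the larger group (n = 90, SD = 20) dominates,
# so the pooled SD lands much closer to 20 than to 10.
example = effect_sizes(m1=50, s1=10, n1=10, m2=45, s2=20, n2=90)
print(round(example.pooled_sd, 2))
```

Swapping the two groups in the call flips the sign of d and g but leaves the pooled SD unchanged.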
Interpreting the Magnitude
The interpretation labels follow the benchmarks Jacob Cohen proposed in his 1988 book Statistical Power Analysis for the Behavioral Sciences: |d| below 0.2 is negligible, 0.2 to 0.5 small, 0.5 to 0.8 medium, and 0.8 or above large. A helpful intuition: at d = 0.8, the average member of the higher group scores above roughly 79% of the lower group; at d = 0.2, above only about 58%.
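The labels and the percentile intuition can both be reproduced in a few lines. The sketch below is illustrative only: the function names are hypothetical, the choice to put each boundary value in the higher band is an assumption (the text does not say which side 0.2 or 0.5 falls on), and the percentile figure assumes both groups are normally distributed with equal variance (a quantity sometimes called Cohen's U3).

```python
from statistics import NormalDist


def magnitude_label(d: float) -> str:
    """Map |d| onto the benchmark labels quoted above."""
    size = abs(d)
    # Assumption: each boundary value belongs to the higher band.
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


def share_of_lower_group_exceeded(d: float) -> float:
    """Fraction of the lower group scoring below the higher group's mean.

    Assumes both groups are normal with equal variance.
    """
    return NormalDist().cdf(abs(d))


for d in (0.2, 0.8):
    print(d, magnitude_label(d), round(share_of_lower_group_exceeded(d), 2))
```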
Cohen himself warned that these are conventions of last resort, not laws. In fields where interventions are cheap and outcomes matter at scale — education, public health — a “small” d of 0.2 can be practically important, while in tightly controlled laboratory work a d of 0.5 might be unremarkable. Always interpret the number against effects typical of your own field.
Cohen's d vs. Hedges' g
Cohen's d has a known flaw: with small samples it systematically overestimates the population effect size, because the pooled standard deviation is itself a noisy estimate. Hedges' g applies a correction factor — always slightly below 1 — that removes most of this bias. The two converge as samples grow: with 60 total participants the correction is about 1.3%, and with 300 it drops below 0.3%.
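To see how quickly the correction fades, the factor can be tabulated for a few total sample sizes. This is a small sketch using the same correction formula the calculator states; the function name `hedges_correction` is hypothetical, and the totals of 10 and 20 are added only for contrast with the 60 and 300 mentioned above.

```python
def hedges_correction(total_n: int) -> float:
    """Correction factor 1 - 3 / (4N - 9), where N = n1 + n2."""
    if total_n < 4:
        raise ValueError("need at least 2 observations per group")
    return 1 - 3 / (4 * total_n - 9)


for total_n in (10, 20, 60, 300):
    shrink_pct = (1 - hedges_correction(total_n)) * 100
    print(f"N = {total_n:>3}: g is {shrink_pct:.2f}% smaller than d")
```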
A sensible reporting rule: quote g when either group has fewer than about 20 participants or when contributing to a meta-analysis. Many journals now request g by default, and since g is never larger in magnitude than d, reporting g is also the conservative choice.
Effect Size and Statistical Significance
A t-test answers a yes-or-no question — is the difference distinguishable from zero? — and its p-value depends heavily on sample size. With thousands of participants, a trivial d of 0.05 becomes “highly significant”; with ten participants, a substantial d of 0.7 may fail to reach significance. Neither p-value tells you what you usually want to know: how big the effect is.
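One way to see the sample-size dependence is the identity t = d × √(n₁n₂ / (n₁ + n₂)), which follows from dividing the mean difference by its pooled standard error. The sketch below applies it to the two scenarios above. The specific group sizes (10,000 per group and 5 per group) are assumptions chosen to match "thousands" and "ten", and the critical values mentioned in the comments (about 1.96 for very large samples, about 2.306 for 8 degrees of freedom) are standard two-sided 5% table values, not something the code computes.

```python
import math


def t_from_d(d: float, n1: int, n2: int) -> float:
    """Independent-samples t statistic implied by Cohen's d and group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Assumed scenario 1: trivial effect, 10,000 participants per group.
t_large_study = t_from_d(0.05, 10_000, 10_000)
# Assumed scenario 2: substantial effect, 5 participants per group (df = 8).
t_small_study = t_from_d(0.70, 5, 5)

print(f"d = 0.05, n = 10,000 per group: t = {t_large_study:.2f}")  # above ~1.96
print(f"d = 0.70, n = 5 per group:      t = {t_small_study:.2f}")  # below ~2.306
```

The same d produces a larger t as the groups grow, which is why the p-value alone says little about magnitude.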
Effect size and significance therefore answer complementary questions, and modern reporting standards (including APA style) expect both. The pairing also powers study planning: a power analysis takes the effect size you hope to detect and returns the sample size required to detect it reliably.
Worked Example: Comparing Two Teaching Methods
An education researcher compares final exam scores. The new method group has m₁ = 85, s₁ = 10, n₁ = 30; the traditional group has m₂ = 78, s₂ = 12, n₂ = 30. The calculator proceeds step by step:
- Weight each variance: (30 − 1) × 10² = 29 × 100 = 2900 and (30 − 1) × 12² = 29 × 144 = 4176.
- Pooled variance: (2900 + 4176) / (30 + 30 − 2) = 7076 / 58 = 122.
- Pooled SD: √122 ≈ 11.0454.
- Cohen's d: (85 − 78) / √122 ≈ 7 / 11.045361 ≈ 0.63375, displayed as 0.6338, a medium effect.
- Correction factor: 1 − 3 / (4 × 60 − 9) = 1 − 3/231 ≈ 0.987013.
- Hedges' g: 0.63375 × 0.987013 ≈ 0.6255.
Interpretation: the new method's average sits about 0.63 pooled standard deviations above the traditional method's — a medium effect by Cohen's benchmarks, meaning the typical student under the new method outscored roughly 74% of students under the old one. The bias-corrected g of 0.6255 is the number to carry into a meta-analysis.
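The arithmetic in this example can be replayed line by line. The snippet below is only a check of the numbers quoted above, with illustrative variable names; it is not the calculator's own code.

```python
import math

m1, s1, n1 = 85.0, 10.0, 30   # new method group
m2, s2, n2 = 78.0, 12.0, 30   # traditional group

# Step 1: weight each variance by its degrees of freedom.
w1 = (n1 - 1) * s1**2
w2 = (n2 - 1) * s2**2

# Steps 2 and 3: pooled variance, then pooled SD.
pooled_var = (w1 + w2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

# Step 4: Cohen's d.
d = (m1 - m2) / pooled_sd

# Steps 5 and 6: small-sample correction factor, then Hedges' g.
correction = 1 - 3 / (4 * (n1 + n2) - 9)
g = d * correction

print(f"weighted variances: {w1:.0f}, {w2:.0f}")
print(f"pooled variance:    {pooled_var:.0f}")
print(f"pooled SD:          {pooled_sd:.4f}")
print(f"Cohen's d:          {d:.4f}")
print(f"correction factor:  {correction:.6f}")
print(f"Hedges' g:          {g:.4f}")
```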
Frequently Asked Questions
What is the difference between Cohen's d and Hedges' g?
Both express the mean difference in pooled standard deviation units. Cohen's d is the raw ratio, which slightly overestimates the true effect in small samples. Hedges' g multiplies d by a correction factor, 1 − 3 / (4(n₁ + n₂) − 9), that removes most of that bias. With large samples the two are nearly identical; with small samples g is the better estimate to report.
Can Cohen's d be negative, and what would that mean?
Yes. The sign simply reflects the direction of subtraction: d is negative whenever group 2 has the larger mean. Magnitude is judged on the absolute value, so d = −0.6 is a medium effect favoring group 2. Report the sign together with a clear statement of which group was entered as group 1.
Are the 0.2, 0.5, and 0.8 thresholds universal rules?
No. They are conventions Cohen proposed in 1988 for situations where no better context exists, and he explicitly cautioned against treating them as absolute. Typical effect sizes differ across disciplines, so the most meaningful comparison is against effects previously observed in your own research area.
Can I use this calculator for paired or repeated-measures data?
Not directly. This tool computes d for two independent groups using the pooled standard deviation. Paired designs (the same participants measured twice) call for a different denominator, typically the standard deviation of the difference scores or the average of the two measurements' standard deviations. Applying the independent-groups formula to paired data misstates the effect.
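For contrast, here is a rough sketch of the first alternative mentioned above: the mean of the difference scores divided by their standard deviation, often written d_z. This calculator does not compute it; the function name `paired_dz` is hypothetical and the before/after scores are made-up numbers used only to exercise the code.

```python
from statistics import mean, stdev


def paired_dz(before: list[float], after: list[float]) -> float:
    """Standardized mean change: mean(diff) / SD(diff), with n - 1 in the SD."""
    if len(before) != len(after):
        raise ValueError("paired data need equal-length lists")
    if len(before) < 2:
        raise ValueError("need at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        raise ValueError("all differences are identical, so d_z is undefined")
    return mean(diffs) / sd_diff


# Made-up scores for four participants measured twice.
before_scores = [10, 12, 14, 16]
after_scores = [11, 14, 15, 18]
print(round(paired_dz(before_scores, after_scores), 3))
```

Because the denominator here reflects how consistently individuals change rather than how much scores vary between people, d_z is not on the same scale as the independent-groups d.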
How does effect size relate to the sample size I need?
Inversely: smaller expected effects require larger samples to detect reliably. As a rough guide from power analysis, detecting d = 0.8 at 80% power with a two-sided 5% test needs about 26 participants per group, d = 0.5 needs about 64 per group, and d = 0.2 needs about 394 per group.
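These figures can be approximated without specialist software. The sketch below uses the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d² plus a small additive term (z₁₋α/₂² / 4) that is commonly used to adjust for the t distribution. It is an approximation under those assumptions, not the exact noncentral-t calculation that dedicated power-analysis tools perform, and the function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-sample t-test."""
    if d == 0:
        raise ValueError("d must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Normal-approximation term plus an assumed small correction
    # (z_alpha^2 / 4) for using the t distribution instead of the normal.
    n = 2 * (z_alpha + z_power) ** 2 / d**2 + z_alpha**2 / 4
    return math.ceil(n)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Halving the expected effect roughly quadruples the required sample, because d enters the formula squared.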