Non-Significant ≠ No Effect

A Permutation-Based Demonstration

Published

February 3, 2026

Show code
library(tidyverse)
library(infer)
library(patchwork)
library(scales)

# Set seed for reproducibility
set.seed(42)

# Define consistent styling (matches ghrbook.com)
theme_set(theme_minimal(base_size = 14))
treatment_color <- "#0284c7"   # sky-600
control_color <- "#E69F00"     # orange
null_color <- "#94a3b8"        # slate-400

The Setup

Imagine a clinical trial testing whether a brief counseling intervention improves patient outcomes on a 0-10 scale. We’ll create four hypothetical trials—all with nearly identical point estimates—but with different sample sizes: 50, 150, 500, and 1,000 participants.

The key lesson: a “non-significant” result doesn’t mean there’s no effect—it often means we didn’t have enough data to detect it.

This demonstration uses permutation-based hypothesis testing, following the approach in ModernDive Chapter 9.

Creating the Datasets

We’ll generate data where the treatment group scores, on average, about 0.5 points higher than the control group on a 0-10 outcome scale. This is a modest but potentially meaningful clinical effect.

Show code
# Parameters
effect_target <- 0.5
outcome_sd <- 2.5
control_mean <- 5.0

# Function to generate trial data
generate_trial <- function(n, seed) {
  set.seed(seed)
  n_per_arm <- n / 2

  tibble(
    id = 1:n,
    arm = rep(c("Control", "Treatment"), each = n_per_arm),
    outcome = c(
      pmin(10, pmax(0, rnorm(n_per_arm, mean = control_mean, sd = outcome_sd))),
      pmin(10, pmax(0, rnorm(n_per_arm, mean = control_mean + effect_target, sd = outcome_sd)))
    )
  )
}

# Seeds selected to produce nearly identical point estimates (~0.47)
# across all sample sizes (range < 0.001)
seeds <- c(345, 1848, 763, 184)

datasets <- list(
  n50 = generate_trial(50, seeds[1]),
  n150 = generate_trial(150, seeds[2]),
  n500 = generate_trial(500, seeds[3]),
  n1000 = generate_trial(1000, seeds[4])
)

# Calculate observed point estimates
observed_effects <- map_dfr(datasets, function(df) {
  df %>%
    group_by(arm) %>%
    summarise(mean = mean(outcome), .groups = "drop") %>%
    summarise(effect = mean[arm == "Treatment"] - mean[arm == "Control"])
}, .id = "sample_size")

observed_effects <- observed_effects %>%
  mutate(
    n = c(50, 150, 500, 1000),
    sample_size = factor(sample_size,
                         levels = c("n50", "n150", "n500", "n1000"),
                         labels = c("n = 50", "n = 150", "n = 500", "n = 1,000"))
  )

knitr::kable(observed_effects %>%
               select(`Sample Size` = sample_size,
                      `Point Estimate` = effect) %>%
               mutate(`Point Estimate` = round(`Point Estimate`, 3)),
             caption = "Point estimates (treatment effect) across the four datasets")
Point estimates (treatment effect) across the four datasets
Sample Size   Point Estimate
n = 50        0.471
n = 150       0.471
n = 500       0.471
n = 1,000     0.471

Visualizing the Raw Data

Before we test anything, let’s look at the data. Each dot represents one participant’s outcome score.

Show code
# Combine all datasets for plotting
all_data <- bind_rows(
  datasets$n50 %>% mutate(sample_size = "n = 50"),
  datasets$n150 %>% mutate(sample_size = "n = 150"),
  datasets$n500 %>% mutate(sample_size = "n = 500"),
  datasets$n1000 %>% mutate(sample_size = "n = 1,000")
) %>%
  mutate(sample_size = factor(sample_size,
                              levels = c("n = 50", "n = 150", "n = 500", "n = 1,000")))

# Calculate means for annotation
means_data <- all_data %>%
  group_by(sample_size, arm) %>%
  summarise(mean_outcome = mean(outcome), .groups = "drop")

ggplot(all_data, aes(x = arm, y = outcome, fill = arm, color = arm)) +
  geom_violin(alpha = 0.3, color = NA) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1) +
  geom_point(data = means_data, aes(y = mean_outcome),
             size = 4, shape = 18, color = "black") +
  geom_label(data = means_data,
             aes(y = mean_outcome, label = round(mean_outcome, 2)),
             size = 3.5, fontface = "bold",
             fill = "white", color = "black",
             label.padding = unit(0.2, "lines"),
             nudge_y = 1,
             show.legend = FALSE) +
  scale_fill_manual(values = c("Control" = control_color, "Treatment" = treatment_color)) +
  scale_color_manual(values = c("Control" = control_color, "Treatment" = treatment_color)) +
  facet_wrap(~sample_size, nrow = 2) +
  labs(
    title = "Outcome Scores by Treatment Arm",
    subtitle = "Violins show distributions; diamonds mark group means. The point estimate is nearly identical across all sample sizes.",
    x = NULL,
    y = "Outcome (0-10 scale)"
  ) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold", size = 12))

Look at the data. In each panel, the treatment group (blue) tends to score slightly higher than the control group (orange) on average. The point estimate is nearly identical across all four sample sizes. Yet as we’ll see, only the larger samples will yield “statistically significant” results.

The Permutation Test

Here’s the logic: If the treatment has no effect, then it shouldn’t matter which participants are labeled “Treatment” vs. “Control”. We can shuffle the labels to mimic the null hypothesis.

We’ll shuffle the labels 1,000 times for each dataset to build a null distribution—what effect sizes we’d expect to see if there were truly no difference. Then we’ll see where our actual observed effects fall in those distributions.
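Before handing this to infer, the shuffle logic is worth seeing in base R. This is an illustrative sketch on toy data (the toy scores and the helper `diff_in_means` are made up here, not taken from the trial datasets above): shuffling the arm labels breaks any real link between treatment and outcome, and repeating the shuffle many times traces out the null distribution.

```r
set.seed(1)
# Toy data: 25 control and 25 treatment scores (illustrative values only)
outcome <- c(rnorm(25, mean = 5.0, sd = 2.5), rnorm(25, mean = 5.5, sd = 2.5))
arm     <- rep(c("Control", "Treatment"), each = 25)

# Difference in group means for a given labeling of the same outcomes
diff_in_means <- function(labels) {
  mean(outcome[labels == "Treatment"]) - mean(outcome[labels == "Control"])
}

observed  <- diff_in_means(arm)                           # the real difference
null_dist <- replicate(1000, diff_in_means(sample(arm)))  # 1,000 shuffled differences
p_value   <- mean(abs(null_dist) >= abs(observed))        # two-sided p-value
```

The infer pipeline does the same thing with a tidier, pipeable interface.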

Show code
# Function to run permutation test and extract results
run_permutation <- function(df, n_reps = 1000) {

  # Calculate observed statistic
  obs_stat <- df %>%
    specify(outcome ~ arm) %>%
    calculate(stat = "diff in means", order = c("Treatment", "Control"))

  # Generate null distribution via permutation
  set.seed(123) # For reproducible null distributions
  null_dist <- df %>%
    specify(outcome ~ arm) %>%
    hypothesize(null = "independence") %>%
    generate(reps = n_reps, type = "permute") %>%
    calculate(stat = "diff in means", order = c("Treatment", "Control"))

  # Get p-value (two-tailed)
  p_val <- null_dist %>%
    get_p_value(obs_stat = obs_stat, direction = "two-sided")

  list(
    observed = obs_stat$stat,
    null_distribution = null_dist,
    p_value = p_val$p_value
  )
}

# Run permutation tests for all datasets
results <- map(datasets, run_permutation)
names(results) <- c("n = 50", "n = 150", "n = 500", "n = 1,000")

# Extract summary
results_summary <- tibble(
  sample_size = names(results),
  point_estimate = map_dbl(results, "observed"),
  p_value = map_dbl(results, "p_value")
) %>%
  mutate(
    significant = ifelse(p_value < 0.05, "Yes (p < 0.05)", "No (p ≥ 0.05)"),
    sample_size = factor(sample_size, levels = c("n = 50", "n = 150", "n = 500", "n = 1,000"))
  )

knitr::kable(
  results_summary %>%
    mutate(
      point_estimate = round(point_estimate, 3),
      p_value = ifelse(p_value < 0.001, "< 0.001", as.character(round(p_value, 3)))
    ) %>%
    select(
      `Sample Size` = sample_size,
      `Point Estimate` = point_estimate,
      `P-value` = p_value,
      `Statistically Significant?` = significant
    ),
  caption = "Permutation test results: Same point estimate, different conclusions"
)
Permutation test results: Same point estimate, different conclusions
Sample Size   Point Estimate   P-value   Statistically Significant?
n = 50        0.471            0.46      No (p ≥ 0.05)
n = 150       0.471            0.218     No (p ≥ 0.05)
n = 500       0.471            0.036     Yes (p < 0.05)
n = 1,000     0.471            0.002     Yes (p < 0.05)

Visualizing the Null Distributions

Now let’s visualize where our observed effects (the dashed lines) fall within the null distributions. Each dot represents the effect size from one of 1,000 random shuffles—what we’d expect if the treatment had no effect.

Notice how the null distributions get narrower as sample size increases. With only 50 participants, random shuffling can easily produce large between-group differences by chance alone, so a difference of 0.47 is unremarkable under the null. With 1,000 participants, the shuffled differences cluster tightly around zero, and the same 0.47 becomes noteworthy. This is why an identical point estimate can be "significant" in a large study but not in a small one: what changes is not the size of the effect, but how surprising that effect is given the noise we expect under the null.
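The narrowing can be checked directly. This base-R sketch (toy data; the helper `null_sd` is made up for illustration) estimates the spread of the shuffled differences at two sample sizes; under the null, the standard error of a difference in means scales like 1/sqrt(n).

```r
set.seed(99)
# Spread of the null distribution of shuffled mean differences for n participants
null_sd <- function(n, reps = 1000) {
  outcome <- rnorm(n, mean = 5, sd = 2.5)
  arm <- rep(c("Control", "Treatment"), each = n / 2)
  sd(replicate(reps, {
    shuffled <- sample(arm)
    mean(outcome[shuffled == "Treatment"]) - mean(outcome[shuffled == "Control"])
  }))
}

sds <- sapply(c(50, 1000), null_sd)
# The n = 1,000 spread should be roughly 1/sqrt(20) (about 0.22) times the n = 50 spread
```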

Show code
# Create dot-based visualization of null distributions
# Fixed x-axis range for all plots
x_range <- c(-2.25, 2.25)

plot_null_dots <- function(result, label, obs_effect, p_val) {

  null_df <- result$null_distribution %>%
    arrange(stat) %>%
    mutate(
      # Use fixed breaks based on common x-axis range
      bin = cut(stat, breaks = seq(x_range[1], x_range[2], length.out = 41), labels = FALSE)
    ) %>%
    filter(!is.na(bin)) %>%
    group_by(bin) %>%
    mutate(
      y_pos = row_number(),
      bin_center = mean(stat)
    ) %>%
    ungroup()

  # Determine if result crosses threshold
  p_display <- ifelse(p_val < 0.001, "< 0.001", round(p_val, 3))
  sig_label <- paste0("p = ", p_display)
  sig_color <- ifelse(p_val < 0.05, treatment_color, "#666666")

  ggplot(null_df, aes(x = bin_center, y = y_pos)) +
    geom_point(color = null_color, size = 1.5, alpha = 0.7) +
    geom_vline(xintercept = obs_effect, linetype = "dashed",
               color = treatment_color, linewidth = 1.2) +
    geom_vline(xintercept = 0, color = "black", linewidth = 0.5) +
    annotate("text", x = obs_effect, y = max(null_df$y_pos) * 0.95,
             label = paste0("Point estimate\n= ", round(obs_effect, 2)),
             hjust = -0.1,
             color = treatment_color, fontface = "bold", size = 3.5) +
    annotate("label", x = x_range[1] + 0.1, y = max(null_df$y_pos) * 0.9,
             label = sig_label, hjust = 0, fill = "white",
             color = sig_color, fontface = "bold", size = 4) +
    scale_x_continuous(limits = x_range) +
    labs(
      title = label,
      x = "Effect size (Treatment - Control)",
      y = "Count"
    ) +
    theme(
      plot.title = element_text(face = "bold"),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )
}

# Create individual plots
p1 <- plot_null_dots(results[["n = 50"]], "n = 50",
                     results[["n = 50"]]$observed,
                     results[["n = 50"]]$p_value)
p2 <- plot_null_dots(results[["n = 150"]], "n = 150",
                     results[["n = 150"]]$observed,
                     results[["n = 150"]]$p_value)
p3 <- plot_null_dots(results[["n = 500"]], "n = 500",
                     results[["n = 500"]]$observed,
                     results[["n = 500"]]$p_value)
p4 <- plot_null_dots(results[["n = 1,000"]], "n = 1,000",
                     results[["n = 1,000"]]$observed,
                     results[["n = 1,000"]]$p_value)

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Null Distributions from Permutation Tests",
    subtitle = "Each dot = one shuffled result. Dashed line = observed point estimate. Same effect, different p-values.",
    theme = theme(
      plot.title = element_text(face = "bold", size = 16),
      plot.subtitle = element_text(size = 12, color = "gray40")
    )
  )

The Key Insight

Look at what just happened:

  • Same point estimate (~0.47 points) across all four trials
  • Different p-values depending on sample size
  • Different conclusions about “statistical significance”

With n = 50 and n = 150, we fail to reject the null hypothesis. A naive interpretation would be: "The treatment doesn't work." But that's wrong. The point estimate shows an effect; we simply don't have enough data to rule out a true difference of zero.

With n = 500 and n = 1,000, we reject the null hypothesis. The treatment effect is now “statistically significant.” But the point estimate hasn’t changed—only our ability to distinguish it from noise.

Why This Matters

Show code
# Get the mean point estimate for reference line
mean_effect <- mean(results_summary$point_estimate)

ggplot(results_summary, aes(x = sample_size, y = point_estimate)) +
  geom_hline(yintercept = mean_effect, linetype = "dashed", color = "gray50") +
  geom_segment(aes(xend = sample_size, y = 0, yend = point_estimate),
               linewidth = 1.5, color = "gray70") +
  geom_point(aes(color = significant), size = 8) +
  geom_text(aes(label = paste0("p = ",
                               ifelse(p_value < 0.001, "<.001", round(p_value, 3)))),
            vjust = -1.5, size = 3.5) +
  scale_color_manual(
    values = c("Yes (p < 0.05)" = treatment_color, "No (p ≥ 0.05)" = control_color),
    name = "Statistically Significant?"
  ) +
  annotate("text", x = 4.4, y = mean_effect,
           label = paste0("Point estimate ≈ ", round(mean_effect, 2)),
           hjust = 0, color = "gray50", fontface = "italic") +
  labs(
    title = "The Same Effect Can Be 'Significant' or Not",
    subtitle = "It depends on sample size, not whether the effect is real",
    x = "Sample Size",
    y = "Point Estimate (Treatment - Control)"
  ) +
  coord_cartesian(ylim = c(0, 0.8), clip = "off") +
  theme(
    legend.position = "bottom",
    plot.margin = margin(10, 80, 10, 10)
  )

Take-Home Messages

  1. “Non-significant” ≠ “No effect”: A high p-value means we can’t distinguish the signal from noise—not that there’s no signal.

  2. Sample size determines power: With small samples, even real effects often go undetected. This is why underpowered studies are problematic—they’re set up to fail.

  3. Report effect sizes, not just p-values: The point estimate (~0.47 points) is the same across all four trials. That’s the scientifically meaningful quantity. The p-value just tells you how surprised you should be to observe your study result IF the null were true.

  4. Absence of evidence ≠ Evidence of absence: Failing to reject the null doesn’t prove the null is true. It just means your data weren’t conclusive.

  5. When CAN we conclude “no effect”?: If a well-powered study finds a point estimate near zero with a narrow confidence interval that rules out meaningful effect sizes, that’s different. A precise null—where the estimate is centered on zero and the CI excludes clinically important effects—is genuine evidence of no effect. The key is having enough precision to rule out effects that would matter.
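Point 5 can be made concrete with infer's bootstrap confidence interval. A minimal sketch on toy data (the dataset here and the 0.5-point threshold for a "clinically important" effect are assumptions for illustration): a CI centered near zero that excludes ±0.5 would be genuine evidence of no meaningful effect, while a wide CI spanning zero and 0.5 is simply inconclusive.

```r
library(tidyverse)
library(infer)

set.seed(7)
# Toy trial data (illustrative only, not the datasets from above)
df <- tibble(
  arm = rep(c("Control", "Treatment"), each = 250),
  outcome = c(rnorm(250, mean = 5.0, sd = 2.5), rnorm(250, mean = 5.5, sd = 2.5))
)

# Bootstrap distribution of the difference in means
boot_dist <- df %>%
  specify(outcome ~ arm) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("Treatment", "Control"))

# 95% percentile confidence interval
ci <- get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
ci  # compare the interval to a minimal clinically important difference, e.g. ±0.5
```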

The next time you see a study conclude “no effect” based on p > 0.05, ask yourself: Was the study powered to detect a meaningful effect? How big was the observed effect, even if not “significant”? Is the confidence interval narrow enough to rule out important effects? The answers might change your interpretation entirely.