👋 Hi! Imagine this: you’re at work, and your boss drops a new dataset on your desk.
It’s straightforward—just a list of how long 400,000 people took to run a marathon.
The ask? Plot the distribution of those finishing times.
Too easy!
In French, we’d say "finger in the nose" (though that doesn’t quite translate well into English).
Anyway, I’m sure you know how to make a histogram, and if not, I’ve written plenty of tutorials using R, Python, or JavaScript to guide you 🎉.
A few minutes later, you end up with a neat histogram that looks like an almost normal distribution:
Distribution of time spent running a marathon for 400,000 people
Job done!
Yes, but!
You might think the job is done already.
But there’s a big problem here.
Can you spot it?
Take a moment to think before scrolling! Here is an image to make sure you do not see the answer right away.
I used to go surfing there when I lived in Brisbane, Australia. (It's Snapper Rocks, one of the most famous waves on the planet 🙂)
Did you miss the key story?
By changing the bin size, a whole new story emerges.
Suddenly, distinct spikes appear around 3:00, 3:30, and 4:00 hours.
Why?
Because these are popular time goals for marathon runners, and they push hard to finish just under them. So it's much more common for someone to finish around 4:00 than at, say, 4:05.
With a smaller bin size, three spikes are revealed in the histogram!
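If you'd like to see the effect in code, here is a minimal Python sketch. Careful: it does not use the real marathon dataset. The finishing times are simulated, and everything about the simulation is my assumption for illustration (a bell-shaped base centered on 4:15, plus roughly 10% of runners squeezing in within 3 minutes under a 3:00, 3:30, or 4:00 goal). The helper names `simulate_finish_times` and `bin_counts` are made up too. The point is only to show how the same data tells two stories depending on the bin width.

```python
import random
from collections import Counter


def simulate_finish_times(n, seed=0):
    """Return n synthetic finishing times in minutes (NOT real race data)."""
    rng = random.Random(seed)
    goals = [180, 210, 240]  # 3:00, 3:30 and 4:00, expressed in minutes
    times = []
    for _ in range(n):
        # Assumption: bell-shaped base, mean 4:15, standard deviation 45 min
        t = rng.gauss(255, 45)
        # Assumption: ~10% of runners finish up to 3 minutes under a goal
        if rng.random() < 0.10:
            t = rng.choice(goals) - rng.uniform(0, 3)
        times.append(t)
    return times


def bin_counts(times, bin_width):
    """Count runners per bin; each key is the left edge of a bin, in minutes."""
    return Counter(int(t // bin_width) * bin_width for t in times)


times = simulate_finish_times(400_000)
coarse = bin_counts(times, 30)  # 30-minute bins: the spikes get averaged away
fine = bin_counts(times, 1)     # 1-minute bins: the spikes become visible

print("30-minute bins:")
for left_edge in sorted(coarse):
    print(f"  {left_edge:>4} min: {coarse[left_edge]}")

print("1-minute bins around the 4:00 goal:")
for left_edge in range(235, 246):
    print(f"  {left_edge:>4} min: {fine[left_edge]}")
```

With wide bins, the runners chasing a goal are diluted among everyone else in the same 30-minute window. With 1-minute bins, the minutes just before 4:00 stand out clearly from the minutes just after.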
If you want to explore this yourself, I’ve created an interactive version of the chart where you can adjust the bin size with a slider.