Where Does the ‘Random’ in Random Forest Come From?

Yukio
2 min read · Oct 27, 2024


Image source: https://www.templeton.org/news/the-ineffable-purpose-of-randomness

The other day, I was reading about randomness, and the author brought up some algorithms that make clever use of it. This got me thinking about how we often oversimplify Random Forests, describing them as "just a bunch of decision trees," when there's actually more going on under the hood. So, why not dive a bit deeper into how randomness really shapes the Random Forest and improves its predictions?

Image source: https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d

Randomness plays two key roles in making Random Forests effective: during the bootstrapping of training data and in feature selection. Each of these uses of randomness helps solve a core challenge in machine learning — improving model accuracy and generalization without overfitting.

First, let’s talk about the “forest” of decision trees. While it’s true that a Random Forest essentially aggregates the results of many trees, each tree is grown through its own slightly different process, which injects variation into the ensemble. For each tree, we don’t use the entire dataset. Instead, we draw a sample of the training data using a method called sampling with replacement, or bootstrapping.

This means some data points may appear multiple times in one tree’s training data, while others might be left out entirely. By giving each tree its own slightly different sample, bootstrapping helps reduce the correlation between trees. The result? A diverse set of trees that, together, boost the performance of the overall model.
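The bootstrapping idea above is easy to see in a few lines of NumPy. This is a minimal sketch (not the internals of any particular library): we draw as many indices as there are samples, with replacement, and check which points ended up "in the bag" and which were left out.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy training set of 10 points; indices stand in for the full rows.
n_samples = 10
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Points drawn at least once vs. points never drawn (out-of-bag).
in_bag = set(indices.tolist())
out_of_bag = set(range(n_samples)) - in_bag

print("bootstrap sample:", sorted(indices.tolist()))
print("out-of-bag points:", sorted(out_of_bag))
```

Run it a few times with different seeds and you'll see the same pattern: some indices repeat, and on average roughly a third of the points are left out of any given bootstrap sample, which is exactly the per-tree variation the forest relies on.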

But the randomness doesn’t stop there. Beyond sampling data, a Random Forest also randomizes feature selection: at each split inside a tree, only a random subset of the available features is considered as a candidate for that split, rather than all of them. This means different splits, and different trees, end up relying on different mixes of the data’s characteristics, encouraging each tree to capture different patterns and relationships. When all these trees, each with their own “view” of the data, are combined, they produce a stronger and more flexible model.
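In scikit-learn, both sources of randomness are exposed as parameters of `RandomForestClassifier`: `bootstrap` toggles the resampling of training data, and `max_features` controls how many features each split may consider. A small sketch on a synthetic dataset (the dataset sizes here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem: 20 features, 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# max_features="sqrt": each split draws a random subset of
# sqrt(20) ≈ 4 candidate features; bootstrap=True resamples the
# training data independently for every tree.
forest = RandomForestClassifier(n_estimators=100,
                                max_features="sqrt",
                                bootstrap=True,
                                random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```

Setting `max_features=None` would let every split see all 20 features, removing the second source of randomness and making the trees more correlated with one another.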

Ultimately, randomness increases diversity among the trees in a Random Forest, helping it overcome the limitations of a single Decision Tree, like overfitting and lack of generalization. So, now you can see how the Random Forest model is more than just a collection of trees; it’s a carefully crafted system where randomness plays a crucial role in making it a powerful tool for machine learning!
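To see the overfitting point concretely, we can compare a single decision tree against a forest on the same data with cross-validation. This is an illustrative sketch on a synthetic dataset, not a benchmark; the exact numbers will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same synthetic setup: 20 features, only 5 carry signal.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=1)

# 5-fold cross-validated accuracy for one tree vs. 100 trees.
tree_scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=1), X, y, cv=5)

print(f"single tree:   {tree_scores.mean():.3f}")
print(f"random forest: {forest_scores.mean():.3f}")
```

Typically the forest's held-out accuracy comes out ahead of the single tree's: the individual trees still overfit their bootstrap samples, but their errors are decorrelated enough that averaging washes much of the overfitting out.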


Written by Yukio

Mathematician with a master’s degree in Economics. Working as a Data Scientist for the last 10 years.
