How to Implement Random Forest Regression in PySpark

Introduction
PySpark is a powerful data processing engine built on top of Apache Spark and designed for large-scale data processing. It provides scalability, speed, versatility, integration with other tools, ease of use, built-in Machine Learning libraries, and real-time processing capabilities. It is an ideal choice for handling large-scale data processing tasks efficiently and effectively, and its user-friendly interface allows for easy code writing in Python.
Using the diamonds dataset from ggplot2 (source, license), we will walk through how to implement a random forest regression model and analyze the results with PySpark. If you'd like to see how linear regression is applied to the same dataset in PySpark, you can check it out here!
This tutorial will cover the following steps:
- Load and prepare the data into a vectorized input
- Train the model using RandomForestRegressor from MLlib
- Evaluate model performance using RegressionEvaluator from MLlib
- Plot and analyze feature importance for model transparency

Prepare the Data
The diamonds dataset contains features such as carat, color, cut, clarity, and more, all listed in the dataset documentation. The target variable that we are trying to predict is price.
df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
display(df)
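Since we rely on inferSchema, it is worth a quick, optional check that the column types came in as expected:
df.printSchema()  # numeric columns (carat, price, x, y, z, ...) should be numeric; cut, color, clarity should be strings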

Just like the linear regression tutorial, we need to preprocess our data so that we have a resulting vector of numerical features to use as our model input. We need to encode our categorical variables into numerical features and then combine them with our numerical variables to make one final vector.
Here are the steps to achieve this result:
- Separate numerical and categorical features.
- Preprocess categorical features with StringIndexer: This assigns a numeric index to each category (e.g., Fair: 0, Ideal: 1, Good: 2, Very Good: 3, Premium: 4).
- Preprocess categorical features with OneHotEncoder: This converts the category indices into binary vectors. The result is a SparseVector that indicates which index from StringIndexer has the one-hot value of 1 (a short illustration follows this list).
- Combine your numerical features with your new categorical encoded features.
- Assemble feature vector with VectorAssembler.
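For intuition, here is a tiny standalone illustration of what StringIndexer and OneHotEncoder produce. It uses a made-up DataFrame, not the diamonds data, and assumes an active SparkSession named spark:
# Toy example: index a string column, then one-hot encode the index
from pyspark.ml.feature import StringIndexer, OneHotEncoder
toy = spark.createDataFrame([("Ideal",), ("Premium",), ("Good",), ("Ideal",)], ["cut"])
indexed = StringIndexer(inputCol="cut", outputCol="cut_index").fit(toy).transform(toy)
encoded = OneHotEncoder(inputCols=["cut_index"], outputCols=["cut_vec"]).fit(indexed).transform(indexed)
encoded.show(truncate=False)  # cut_vec is a SparseVector, e.g. (2,[0],[1.0]) meaning a 1.0 at index 0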
You can use the following code to index and one-hot encode your categorical features. This will complete steps 1–3.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
cat_cols = ["cut", "color", "clarity"]
stages = []  # Stages in Pipeline
for c in cat_cols:
    # Map each category to a numeric index, then one-hot encode that index
    stringIndexer = StringIndexer(inputCol=c, outputCol=c + "_index")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                            outputCols=[c + "_vec"])
    stages += [stringIndexer, encoder]  # Stages will be run later on
You can use the following code to assemble your final feature vector. This will complete steps 4–5. Then, we can run the stages as a pipeline. This runs the data through all the feature transformations we've defined so far.
from pyspark.ml.feature import VectorAssembler
# Transform all features into a vector
num_cols = ["carat", "depth", "table", "x", "y", "z"]
assemblerInputs = [c + "_vec" for c in cat_cols] + num_cols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
# Create pipeline and use on dataset
pipeline = Pipeline(stages=stages)
df = pipeline.fit(df).transform(df)
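If you want to sanity-check the result, you can peek at one of the encoded columns alongside the assembled vector (optional):
# Inspect the encoded columns and the assembled feature vector
df.select("cut", "cut_vec", "features").show(5, truncate=False)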
Train the Model
First, we need to split our dataset into train and test sets.
train, test = df.randomSplit([0.90, 0.1], seed=123)
print('Train dataset count:', train.count())
print('Test dataset count:', test.count())
Next, let's import the Random Forest Regressor (pyspark.ml.regression.RandomForestRegressor) model from MLlib. Some default parameter values to pay attention to are maxDepth=5 and numTrees=20. You can adjust these manually or tune them with cross-validation (a sketch follows the training code below) to get the best set of parameters for your problem. Then call fit() to fit the model to the training data. Once the model is fit, we can start predicting: call transform() to get predictions.
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(featuresCol='features', labelCol='price')
rf_model = rf.fit(train)
train_predictions = rf_model.transform(train)
test_predictions = rf_model.transform(test)
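As mentioned above, maxDepth and numTrees can be tuned with MLlib's cross-validation utilities. The following is a minimal sketch; the grid values are illustrative rather than recommendations:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
# Illustrative parameter grid; widen or narrow it for your own problem
param_grid = (ParamGridBuilder()
              .addGrid(rf.maxDepth, [5, 10])
              .addGrid(rf.numTrees, [20, 50])
              .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=param_grid,
                    evaluator=RegressionEvaluator(labelCol="price", metricName="rmse"),
                    numFolds=3,
                    seed=123)
cv_model = cv.fit(train)            # fits one model per parameter combination per fold
best_rf_model = cv_model.bestModel  # the RandomForestRegressionModel with the best average RMSE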
Finally, you can use MLlib's RegressionEvaluator to evaluate the model. There are multiple evaluation metrics to choose from for your use case:
- RMSE – Root mean squared error (Default)
- MSE – Mean squared error
- R2 – R-squared
- MAE – Mean absolute error
- Var – Explained variance
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol="prediction",
                                labelCol="price", metricName="r2")
print("Train R2:", evaluator.evaluate(train_predictions))
print("Test R2:", evaluator.evaluate(test_predictions))
Train R2: 0.9092 | Test R2: 0.9069
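To report more than one of the metrics listed above, you can reuse the same evaluator and switch the metric name (a quick sketch):
# Evaluate the test predictions with several metrics
for metric in ["rmse", "mae", "r2"]:
    evaluator.setMetricName(metric)
    print(metric, "->", evaluator.evaluate(test_predictions))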
Analyze Feature Importance
While building each tree, the Random Forest algorithm tries to minimize impurity at every split. For classification trees the impurity is typically entropy or Gini; entropy is a measure of uncertainty that is maximized for a uniform distribution (e.g., we don't know whether a coin will land heads or tails), and more uncertainty requires more information. For regression trees, like the ones used here, the impurity is the variance of the target values in a node. In both cases the algorithm computes the Information Gain (IG) of a candidate split: the impurity before the split minus the impurity of the child nodes, weighted by the number of examples in each. The split with the largest gain decides which feature is used to split the data, as the toy example below shows.
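As a toy illustration in plain Python (made-up numbers, not the diamonds data): a split that separates cheap from expensive diamonds leaves each child node with far less variance than the parent, so its gain is large.
from statistics import pvariance
# Candidate split of six target prices into a "cheap" and an "expensive" group
prices = [300, 320, 310, 900, 950, 920]
left, right = prices[:3], prices[3:]
impurity_before = pvariance(prices)
impurity_after = (len(left) / len(prices)) * pvariance(left) \
               + (len(right) / len(prices)) * pvariance(right)
gain = impurity_before - impurity_after  # a large drop in variance signals a good split
print(impurity_before, impurity_after, gain)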
One thing you'll want to look into is which features are most relevant in your model. The Random Forest algorithm has built-in feature importance which can be calculated in different ways. PySpark Random Forest follows the scikit-learn implementation that uses Gini importance (or mean decrease impurity). Scikit-learn also provides an implementation of permutation-based feature importance, but this is not built into PySpark.
As described in the documentation, the feature importance is calculated by:
- importance(feature j) = sum (over nodes which split on feature j) of the gain, where the gain is scaled by the number of instances passing through the node
- Normalize feature importances to sum to 1.
We can extract the feature importances from a fitted Random Forest model using rf_model.featureImportances. Then we match each importance score to its feature name so the result can be viewed or plotted. This view is of particular interest to stakeholders who want to understand the key drivers of the model.
import pandas as pd
def extract_feature_imp(feature_imp, dataset, features_col):
    # Match each importance score to its feature name using the vector column's metadata
    list_extract = []
    for i in dataset.schema[features_col].metadata["ml_attr"]["attrs"]:
        list_extract = list_extract + dataset.schema[features_col].metadata["ml_attr"]["attrs"][i]
    feature_list = pd.DataFrame(list_extract)
    feature_list['score'] = feature_list['idx'].apply(lambda x: feature_imp[x])
    return feature_list.sort_values('score', ascending=False)
feature_list = extract_feature_imp(rf_model.featureImportances, train, "features")
top_20_features = feature_list.sort_values('score', ascending=False).head(20)
# Then make your desired plot function to visualize feature importance
plot_feature_importance(top_20_features['score'], top_20_features['name'])
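The plotting helper itself is left open above. One possible matplotlib implementation of plot_feature_importance, a minimal sketch assuming a horizontal bar chart is what you want, could be defined before the call:
import matplotlib.pyplot as plt
def plot_feature_importance(scores, names):
    # Horizontal bar chart with the most important feature on top
    scores, names = list(scores)[::-1], list(names)[::-1]
    plt.figure(figsize=(8, 6))
    plt.barh(names, scores)
    plt.xlabel("Feature importance score")
    plt.title("Random Forest feature importance")
    plt.tight_layout()
    plt.show()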

Summary
In this tutorial, we covered the following steps for implementing Random Forest with PySpark.
- One-hot encode categorical features using StringIndexer and OneHotEncoder
- Create input feature vector column using VectorAssembler
- Split data into train and test
- Initialize and fit the RandomForestRegressor model on train data
- Transform the test data with the fitted model to make predictions
- Evaluate the model with RegressionEvaluator
- Analyze feature importance to understand and improve the model
These steps allow us to prepare the data into a vectorized input so that we can implement a random forest model in PySpark with Apache Spark's scalable machine learning library, MLlib. We can evaluate the model performance and plot and analyze the random forest feature importance to understand our model better. This allows us to take advantage of the scalability, speed, and versatility that PySpark provides.