Custom Loss Functions for Gradient Boosting
Gradient boosting is widely used in industry and has won many Kaggle competitions. The internet already has many good explanations of gradient boosting (we've even shared some selected links in the references), but we've noticed a lack of information about custom loss functions: the why, when, and how. This post is our attempt to summarize the importance of custom loss functions in many real-world problems—and how to implement them with the LightGBM gradient boosting package.
Machine learning algorithms are trained to minimize a loss function on the training data. There are a number of commonly used loss functions that are readily available in common ML libraries. If you want to learn more about some of these, read this post, which Prince wrote while doing his Masters in Data Science. Out in the real world, these “off-the-shelf” loss functions are often not well-tuned to the business problem we are trying to solve. Enter custom loss functions.
Custom Loss Functions
One example where a custom loss function is handy is the asymmetric risk of airport punctuality. The problem is to decide when to leave the house so you get to the airport at the right time. We don't want to leave too early and wait for hours at the gate. At the same time, we don't want to miss our flight. The losses on either side are very different: if we get to the gate early, it’s really not so bad; if we arrive too late and miss the flight, it really sucks. If we use machine learning to decide when to leave the house, we might want to take care of this risk asymmetry directly in our model, by using a custom loss function that penalizes late errors much more than early errors.
Another common example occurs in classification problems. For example, for disease detection, we may consider false negatives to be much worse than false positives, as giving medication to a healthy person is usually less harmful than failing to treat an ill person. In such cases, we might want to optimize the F-beta score, where beta controls how much more weight recall gets than precision: a beta above 1 penalizes false negatives more heavily than false positives. This is sometimes called a Neyman-Pearson criterion.
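To make the role of beta concrete, here is a minimal sketch that computes the F-beta score from confusion-matrix counts. The function name and the example counts are ours, chosen purely for illustration.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score from confusion-matrix counts.

    beta > 1 weights recall (and hence false negatives) more heavily;
    beta < 1 weights precision (and hence false positives) more heavily.
    """
    beta_sq = beta ** 2
    numerator = (1 + beta_sq) * true_positives
    denominator = numerator + beta_sq * false_negatives + false_positives
    # Guard against the degenerate case with no positives at all.
    return numerator / denominator if denominator else 0.0


# Two hypothetical classifiers with the same number of total errors:
# one misses 4 sick patients, the other raises 4 false alarms.
misses_cases = f_beta(true_positives=8, false_positives=2, false_negatives=4, beta=2.0)
false_alarms = f_beta(true_positives=8, false_positives=4, false_negatives=2, beta=2.0)
```

With beta set to 2, the classifier that misses cases scores lower than the one that raises false alarms, even though both make six mistakes.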
At Manifold, we recently ran into a problem that required a custom loss function. One of our clients, Cortex Building Intelligence, provides an app that helps engineers operate a building's heating, ventilation, and air conditioning (HVAC) systems more precisely. Most commercial buildings have a “lease obligation” to condition the building's indoor temperature in a “comfortable” temperature range during working hours on operating days, e.g., between 70 and 74 degrees during the hours of 9am to 6pm. At the same time, HVAC is the single biggest operational cost of a building. The key to efficient HVAC operation is to turn off the system when it is not needed, like at night, and turn it on again early in the morning to meet the “lease obligation”. To that end, Manifold helped Cortex build a predictive model to recommend the exact time when HVAC systems should be turned on in a building.
The penalty of incorrect prediction is not symmetric, though. If our predicted start time is earlier than the actual required start time, then the building will come to a comfortable temperature too early and some energy will be wasted. But if the predicted time is later than the actual required start time, then the building will come to a comfortable temperature too late and the tenants will not be happy — no one wants to work, shop, or learn in a freezing/boiling building. So being late is much worse than being early, because we don’t want tenants (who pay $$$ in rent) to be unhappy. We encoded this business knowledge in our model by creating a custom asymmetric Huber loss function that has a higher error when the residual is positive vs. negative. More details about this problem can be found in this post.
Takeaway: Find a loss function that closely matches your business objectives. Often, these loss functions don't have a default implementation in the popular machine learning libraries. That's ok: it's not hard to define your own loss function and use it to crush your problem.
Custom Training Loss and Validation Loss
Before moving further, let's be clear in our definitions. Many terms are used in the ML literature to refer to different things. We will choose one set of definitions that we think is the most clear:
- Training loss. This is the function that is optimized on the training data. For example, in a neural network binary classifier, this is usually the binary cross entropy. For the random forest classifier, this is the Gini impurity. The training loss is often called the “objective function” as well.
- Validation loss. This is the function that we use to evaluate the performance of our trained model on unseen data. This is often not the same as the training loss. For example, in the case of a classifier, this is often the area under the curve of the receiver operating characteristic (ROC)—though this is never directly optimized, because it is not differentiable. This is often called the “performance or evaluation metric”.
In many cases, customizing these losses can be really effective in building a better model. This is particularly simple for gradient boosting, as we show below.
Training loss
The training loss is optimized during training. It is hard to customize for certain algorithms, like the random forest (see here), but relatively easy for others, like gradient boosting and neural nets. Because some variant of gradient descent is usually the optimization method, the training loss typically needs to be a convex function with a well-defined gradient (first derivative) and hessian (second derivative). It is also preferably continuous, finite, and non-zero. The last one is important because sections where the function is flat at zero give a zero gradient, which can freeze gradient descent.
In the context of gradient boosting, the training loss is the function that is optimized using gradient descent, i.e., the “gradient” part of gradient boosting models. Specifically, the gradient of the training loss is used to change the target variables for each successive tree. (If you're interested in more details, see this post.) Note that even though the training loss defines the “gradient”, each tree is still grown using a greedy split algorithm that is not tied to this custom loss function.
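As a rough illustration of how the gradient changes the targets for each successive tree, here is a toy boosting loop in plain NumPy. It is a sketch under simplifying assumptions, not how LightGBM is implemented: the loss is half the squared error (so the negative gradient is just the residual), the "trees" are one-split stumps on a single feature, and the names `fit_stump` and `boost` are ours.

```python
import numpy as np


def fit_stump(x, target):
    """Fit a one-split regression stump on 1-D x by minimizing squared error.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = target[left].mean()
        right_value = target[~left].mean()
        sse = ((target[left] - left_value) ** 2).sum()
        sse += ((target[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x_new: np.where(x_new <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Toy gradient boosting for the loss 0.5 * (y - pred)^2."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        # The negative gradient of the loss w.r.t. the current prediction
        # becomes the target for the next stump.
        pseudo_target = y - pred
        stump = fit_stump(x, pseudo_target)
        pred += learning_rate * stump(x)
    return pred
```

Swapping in a different training loss only changes the line that computes `pseudo_target`; the stump fitting itself stays the same.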
Defining a custom training loss usually requires us to do some calculus to find the gradient and hessian. As we'll see next, often it is easier to first change the validation loss, as it doesn't require as much overhead.
Validation loss
The validation loss is used to tune hyperparameters. It is often easier to customize, as it doesn't have as many functional requirements as the training loss does. The validation loss can be non-convex, non-differentiable, and discontinuous. For this reason, it is often an easier place to start with customization.
For example, in LightGBM, an important hyperparameter is the number of boosting rounds. The validation loss can be used to find the optimal number of boosting rounds. In LightGBM, this validation loss is called eval_metric. We can either use one of the validation losses available in the library or define our own custom function. Since it is so easy, you should definitely customize it if it is important to your business problem.
Concretely, instead of directly optimizing the number of boosting rounds, we usually use the early_stopping_rounds variable. It stops boosting when the validation loss has not improved for the given number of early stopping rounds. Effectively, it prevents overfitting by monitoring the validation loss on the out-of-sample validation set. As shown in the figure below, setting the number of stopping rounds higher leads to the model running for more boosting rounds.
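The stopping rule itself is simple enough to sketch in a few lines of plain Python. This is our own illustration of the idea, not LightGBM's code: the function name and the list of validation losses are made up, and a real library tracks the loss as it trains rather than reading it from a list.

```python
def best_iteration(val_losses, early_stopping_rounds):
    """Return (best_round, best_loss) under a simple early-stopping rule.

    Boosting stops once the validation loss has failed to improve for
    `early_stopping_rounds` consecutive rounds. Rounds are 1-indexed.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            break  # no improvement for long enough: stop boosting
    return best_round, best_loss


# Hypothetical validation losses with a temporary bump after round 3.
losses = [5.0, 4.0, 3.0, 3.5, 3.2, 2.9, 3.1, 3.3, 3.4]
impatient = best_iteration(losses, early_stopping_rounds=2)
patient = best_iteration(losses, early_stopping_rounds=3)
```

With a patience of 2 the loop gives up during the bump and keeps round 3; with a patience of 3 it rides through the bump and finds the lower loss at round 6.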
Blue: Training loss. Orange: Validation loss. Both training and validation are using the same custom loss function.
Bear in mind that the validation strategy is extremely important as well. The train/validation split above is one of many possible validation strategies. It may not be right for your problem. Others include k-fold cross validation and nested cross validation, which we used on our HVAC start time modeling problem.
If appropriate for the business problem, we want to use a custom function for both our training and validation loss. In some situations, because of the functional form of the custom loss, it may not be possible to use it as the training loss. In that case, it may make sense to just update the validation loss and use a default training loss like the MSE. You will still benefit, because the hyperparameters will be tuned using the desired custom loss.
Implementing Custom Loss Functions in LightGBM
Let's examine what this looks like in practice and do some experiments on simulated data. First, let’s assume that overestimates are much worse than underestimates. In addition, let's assume that squared loss is a good model for our error in either direction. To encode that, we defined a custom MSE function that gives 10 times more penalty to positive residuals (overestimates) than to negative residuals (underestimates). The plot below illustrates how our custom loss function looks vs. the standard MSE loss function.
The asymmetric MSE, as defined, is nice because it has an easy to compute gradient and hessian, which are plotted below. Note that the hessian is constant at two different values, 2 on the left and 20 on the right, though that is hard to see on the plot below.
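For reference, the loss and its derivatives can be written out as follows. We assume here that the residual is defined as predicted minus actual, $r = \hat{y} - y$, so that overestimates are positive; that sign convention is our reading of the description above rather than something taken from the notebook.

```latex
L(r) =
\begin{cases}
  r^2     & \text{if } r \le 0 \\
  10\,r^2 & \text{if } r > 0
\end{cases}
\qquad
\frac{\partial L}{\partial \hat{y}} =
\begin{cases}
  2r  & \text{if } r \le 0 \\
  20r & \text{if } r > 0
\end{cases}
\qquad
\frac{\partial^2 L}{\partial \hat{y}^2} =
\begin{cases}
  2  & \text{if } r \le 0 \\
  20 & \text{if } r > 0
\end{cases}
```

The gradient is continuous at $r = 0$, while the hessian jumps from 2 to 20 there.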
LightGBM offers a straightforward way to implement custom training and validation losses. Other gradient boosting packages, including XGBoost and CatBoost, also offer this option. Here is a Jupyter notebook that shows how to implement a custom training and validation loss function. The details are in the notebook, but at a high level, the implementations are slightly different:
- Training loss: Customizing the training loss in LightGBM requires defining a function that takes in two arrays, the targets and their predictions. In turn, the function should return two arrays of the gradient and hessian of each observation. As noted above, we need to use calculus to derive gradient and hessian and then implement it in Python.
- Validation loss: Customizing the validation loss in LightGBM requires defining a function that takes in the same two arrays, but returns three values: a string with the name of the metric to print, the loss itself, and a boolean indicating whether higher is better.
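The two function shapes described above might look roughly like this for the asymmetric MSE. Treat it as a sketch rather than the notebook's code: the function names, the `PENALTY` constant, and the sign convention (error is prediction minus target, so overestimates are positive) are our assumptions.

```python
import numpy as np

PENALTY = 10.0  # assumed: overestimates cost 10x more than underestimates


def asymmetric_mse_objective(y_true, y_pred):
    """Training loss: return the per-observation gradient and hessian."""
    error = y_pred - y_true  # assumed sign convention: > 0 is an overestimate
    weight = np.where(error > 0, PENALTY, 1.0)
    grad = 2.0 * weight * error  # first derivative w.r.t. y_pred
    hess = 2.0 * weight          # second derivative w.r.t. y_pred
    return grad, hess


def asymmetric_mse_metric(y_true, y_pred):
    """Validation loss: return (metric name, value, is_higher_better)."""
    error = y_pred - y_true
    weight = np.where(error > 0, PENALTY, 1.0)
    loss = float(np.mean(weight * error ** 2))
    return "asymmetric_mse", loss, False
```

With LightGBM's scikit-learn wrapper, functions of this shape are typically passed as `objective=` when constructing `LGBMRegressor` and as `eval_metric=` when calling `fit`; the exact arguments can vary between versions, so check the documentation for the one you have installed.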
Experiments with Custom Loss Functions
The Jupyter notebook also does an in-depth comparison of a default Random Forest, default LightGBM with MSE, and LightGBM with custom training and validation loss functions. We work with the Friedman 1 synthetic dataset, with 8,000 training observations, 2,000 validation observations, and 5,000 test observations. The validation set is used to find the best set of hyperparameters that optimize our validation loss. The scores reported below are evaluated on the test observations to assess the generalizability of our models.
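If you want to reproduce a similar setup without the notebook, the Friedman 1 data can be generated in a few lines of NumPy. The split sizes below come from the text; the number of features, the noise level, and the random seed are our assumptions and may differ from what the notebook uses.

```python
import numpy as np


def make_friedman1(n_samples, n_features=10, noise=1.0, seed=0):
    """Generate the Friedman 1 regression problem.

    Only the first five features affect the target; the rest are noise.
    n_features, noise, and seed are illustrative defaults (assumptions).
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_samples, n_features))
    y = (
        10 * np.sin(np.pi * X[:, 0] * X[:, 1])
        + 20 * (X[:, 2] - 0.5) ** 2
        + 10 * X[:, 3]
        + 5 * X[:, 4]
        + noise * rng.normal(size=n_samples)
    )
    return X, y


# 8,000 training, 2,000 validation, and 5,000 test observations.
X, y = make_friedman1(15_000)
X_train, y_train = X[:8_000], y[:8_000]
X_valid, y_valid = X[8_000:10_000], y[8_000:10_000]
X_test, y_test = X[10_000:], y[10_000:]
```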
We have done a sequence of experiments summarized in the table below. Note that the most important score we care about is asymmetric MSE as it specifically defines our problem of asymmetric penalty.
Let's look at some comparisons in detail.
- Random Forest → LightGBM
Using default settings, LightGBM performs better than Random Forest on this dataset. With more trees and a better combination of hyperparameters, Random Forest may also give good results, but that’s not the point here.
- LightGBM → LightGBM with customized training loss
This shows that we can make our model optimize what we care about. The default LightGBM is optimizing MSE, hence it gives lower MSE loss (0.24 vs. 0.33). The LightGBM with custom training loss is optimizing asymmetric MSE and hence it performs better for asymmetric MSE (1.31 vs. 0.81).
- LightGBM → LightGBM with tuned early stopping rounds using MSE
Both LightGBM models are optimizing MSE. We see a big improvement in default MSE scores with just a small tweak of using early stopping rounds (MSE: 0.24 vs 0.14). So, rather than limiting the number of boosting rounds to the default (i.e., 100), we should let the model decide the optimal number of boosting rounds using the early stopping hyperparameter. Hyperparameter optimization matters!
- LightGBM with tuned early stopping rounds using MSE → LightGBM with tuned early stopping using custom MSE
The scores from both of these models are very close with no material difference. This is because the validation loss is only used to decide when to stop boosting. The gradient is optimizing default MSE in both cases. Each subsequent tree produces the same output for both models. The only difference is that the model with custom validation loss stops at 742 boosting iterations while the other runs for a few more.
- LightGBM with tuned early stopping using custom MSE → LightGBM trained on custom loss and tuned early stopping with MSE
Only customizing the training loss without changing the validation loss hurts the model performance. The model with only custom training loss boosts for more rounds (1848) than other cases. If we observe carefully, this model has really low training loss (0.013) and is highly overfitting on the training set. Each gradient boosting iteration makes a new tree using training errors as target variables, but the boosting stops only when the loss on validation data starts increasing. The validation loss usually starts increasing when the model starts overfitting, which is the signal to stop building more trees. In this case, since validation and training loss are not aligned with each other, the model doesn't seem to “get the message”, which leads to overfitting. This configuration was just included for completeness and is not something one should use in practice.
- LightGBM with tuned early stopping rounds with MSE → LightGBM trained on custom training loss and tuned early stopping rounds with customized validation loss
The final model uses both custom training and validation losses. It gives the best asymmetric MSE score with a relatively small number of boosting iterations. The losses are aligned with what we care about!
Let's take a closer look at the residual histograms for some more detail.
Histogram of residuals on predictions from different models.
Note that with LightGBM (even with default hyperparameters), the prediction performance improves compared to the Random Forest model. The final model with custom validation loss appears to make more predictions on the right side of the histogram, i.e., the actual values are greater than the predicted values. This is expected because of the asymmetric custom loss function. This right-sided shift of the residuals can be better visualized using a kernel density plot of the residuals.
Comparison of predictions from LightGBM models with symmetric and asymmetric evaluation
Conclusion
All models have error, but many business problems do not treat underestimates and overestimates equally. Sometimes, we intentionally want our model to bias our errors in a certain direction, depending on which errors are more costly. Hence, we should not restrict ourselves to “off-the-shelf” symmetric loss functions from common ML libraries.
LightGBM offers a simple interface to incorporate custom training and validation loss functions. When appropriate, we should utilize this functionality to make better predictions. At the same time, you should not immediately jump to using custom loss functions. It's always best to take a lean, iterative approach and first start with a simple baseline model like a Random Forest. In the next iteration, you can move to more complex models like LightGBM and do hyperparameter optimization. Only once these baselines are stabilized does it make sense to move on to customizing validation and training losses.
Hopefully that was useful! Happy customizing!
If you are unclear about how general gradient boosting works, we recommend reading How to explain gradient boosting by Terence Parr, and Gradient boosting from scratch by Prince.
There are plenty of articles out there on how to tune hyper-parameters in different GBM frameworks. If you want to use one of these packages, you can spend some time on understanding which range of hyperparameters to search for. This LightGBM GitHub issue gives a rough idea about the range of values to use. Aarshay Jain has written a nice blog about tuning XGBoost and sklearn gradient boosting. I think there is room for a good blogpost about tuning LightGBM.
To get some intuition about which gradient boosting package is right for your situation, read Alvira Swalin’s CatBoost vs. Light GBM vs. XGBoost, and Pranjan Khandelwal’s Which algorithm takes the crown: Light GBM vs XGBOOST?.
Editor's Note: This post also appears on Medium at Towards Data Science.