Learn Data Science from Data School 📊

Tuesday Tip #43: Should you discretize features for Machine Learning? 🤖

Published 20 days ago • 2 min read

Hi Reader,

Today's tip is drawn directly from my upcoming course, Master Machine Learning with scikit-learn. You can read the tip below or watch it as a video!

If you're interested in receiving more free lessons from the course (which won't be included in Tuesday Tips), you can join the waitlist.

👉 Tip #43: Should you discretize continuous features for Machine Learning?

Let's say that you're working on a supervised Machine Learning problem, and you're deciding how to encode the features in your training data.

With a categorical feature, you might consider using one-hot encoding or ordinal encoding. But with a continuous numeric feature, you would normally pass that feature directly to your model. (Makes sense, right?)

However, one alternative strategy that is sometimes used with continuous features is to "discretize" or "bin" them into categorical features before passing them to the model.

First, I'll show you how to do this in scikit-learn. Then, I'll explain whether I think it's a good idea!

👩‍💻 How to discretize in scikit-learn

In scikit-learn, we can discretize using the KBinsDiscretizer class:
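A minimal sketch of the import, assuming a reasonably recent scikit-learn installation:

```python
# KBinsDiscretizer lives in scikit-learn's preprocessing module
from sklearn.preprocessing import KBinsDiscretizer
```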

When creating an instance of KBinsDiscretizer, you define the number of bins, the binning strategy, and the method used to encode the result:
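As a sketch, an instance with 3 bins might be created like this. The specific values chosen for `strategy` and `encode` are assumptions for illustration:

```python
from sklearn.preprocessing import KBinsDiscretizer

kb = KBinsDiscretizer(
    n_bins=3,             # number of bins to create
    strategy='quantile',  # place bin edges at quantiles of the training data
    encode='ordinal',     # output the bin number as a single column
)
```

The other available strategies are `'uniform'` (equal-width bins) and `'kmeans'`, and the other encodings are `'onehot'` (the default, which returns a sparse matrix) and `'onehot-dense'`.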

As an example, here's a numeric feature from the famous Titanic dataset:
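The sketch below uses a handful of hypothetical stand-in values in the style of the Titanic "Fare" column. Both the choice of feature and the values themselves are assumptions for illustration, not data loaded from the real dataset:

```python
import numpy as np

# Hypothetical fare-like values (illustration only, not the real dataset)
fare = np.array([7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07])

# scikit-learn transformers expect 2D input:
# one row per sample, one column per feature
X = fare.reshape(-1, 1)
print(X[:5])
```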

And here's the output when we use KBinsDiscretizer to transform that feature:
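A minimal sketch of that transformation, assuming a quantile strategy with ordinal encoding and using hypothetical fare-like values rather than the real Titanic data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical fare-like values (illustration only), one column, ten samples
X = np.array([[7.25], [71.28], [7.93], [53.10], [8.05],
              [8.46], [51.86], [21.08], [11.13], [30.07]])

# Parameter values are assumptions for illustration
kb = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')

# fit_transform learns the bin edges from X, then replaces each value
# with the number of the bin it falls into
X_binned = kb.fit_transform(X)
print(X_binned.ravel())
```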

Because we specified 3 bins, every sample has been assigned to bin 0 or 1 or 2. The smallest values were assigned to bin 0, the largest values were assigned to bin 2, and the values in between were assigned to bin 1.
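To see where the boundaries between bins fall, you can inspect the fitted transformer's `bin_edges_` attribute. A short sketch, again using hypothetical fare-like values and assumed parameter settings:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical fare-like values (illustration only)
X = np.array([[7.25], [71.28], [7.93], [53.10], [8.05],
              [8.46], [51.86], [21.08], [11.13], [30.07]])

kb = KBinsDiscretizer(n_bins=3, strategy='quantile', encode='ordinal')
kb.fit(X)

# bin_edges_ holds one array of edges per feature; we have a single feature.
# 3 bins are defined by 4 edges, and the outer edges are the minimum and
# maximum of the training data.
edges = kb.bin_edges_[0]
print(edges)
```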

Thus, we've taken a continuous numeric feature and encoded it as an ordinal feature (meaning an ordered categorical feature), and this ordinal feature could be passed to the model in place of the numeric feature.

🤔 Is discretization a good idea?

Now that you know how to discretize, the obvious follow-up question is: Should you discretize your continuous features?

Theoretically, discretization can benefit linear models by helping them to learn non-linear trends. However, my general recommendation is to not use discretization, for three main reasons:

(1) Discretization removes much of the nuance from the data (all values within a bin are treated as identical), which makes it harder for a model to learn the actual trends that are present in the data.

(2) Discretization reduces the variation in the data, which makes it easier to find trends that don't actually exist.

(3) Any possible benefits of discretization are highly dependent on the parameters used with KBinsDiscretizer. Making those decisions by hand creates a risk of overfitting the training data, and making those decisions during a tuning process adds both complexity and processing time. As such, neither option is attractive to me!
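To give a sense of what tuning those parameters involves, here is a rough sketch of a grid search over `n_bins` and `strategy`. The synthetic data, the choice of model, and the grid values are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (illustration only): one feature with a non-linear trend
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

pipe = make_pipeline(
    KBinsDiscretizer(encode='onehot-dense'),
    LinearRegression(),
)

# make_pipeline names each step after its lowercased class name
param_grid = {
    'kbinsdiscretizer__n_bins': [3, 5, 10],
    'kbinsdiscretizer__strategy': ['uniform', 'quantile', 'kmeans'],
}

# 9 parameter combinations x 5 folds = 45 model fits, plus one final refit
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Even this small grid multiplies the number of model fits, and the cost grows quickly once you also tune the model's own hyperparameters.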

For all of those reasons, I don't recommend discretizing your continuous features unless you can demonstrate, through a proper model evaluation process, that it's providing a meaningful benefit to your model.
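One way to run that kind of check is to cross-validate the same model with and without discretization and compare the scores. The sketch below uses synthetic data that was deliberately chosen so that binning helps a linear model. The data, model, and parameter values are all assumptions for illustration, and on real data the comparison could easily go the other way:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic data (illustration only): a U-shaped trend that a straight
# line cannot capture
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Baseline: pass the continuous feature directly to the model
raw_model = LinearRegression()

# Alternative: discretize inside a pipeline so that the bin edges are
# learned only from the training portion of each fold
binned_model = make_pipeline(
    KBinsDiscretizer(n_bins=10, strategy='uniform', encode='onehot-dense'),
    LinearRegression(),
)

# For regressors, cross_val_score reports R^2 by default
raw_score = cross_val_score(raw_model, X, y, cv=5).mean()
binned_score = cross_val_score(binned_model, X, y, cv=5).mean()
print(f"raw: {raw_score:.3f}, binned: {binned_score:.3f}")
```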

📚 Further reading

🔗 Discretization in the scikit-learn User Guide

🔗 Discretize Predictors as a Last Resort from Feature Engineering and Selection (section 6.2.2)

👋 See you next Tuesday!

Did you like this week's tip? Don't forget to join the waitlist for my new ML course and get more free lessons!

- Kevin

P.S. My hobby: Extrapolating

Kevin Markham
