|
Hi Reader,

I'm really proud of this week's tip because it covers a topic (data leakage) that took me years to fully understand. It's one of those times when I feel like I'm truly contributing to the collective wisdom by distilling complex ideas into an approachable format.

You can read the tip below or on my blog.

Link of the week

Building an AI Coach to Help Tame My Monkey Mind (Eugene Yan)

In this short post, Eugene describes his experiences calling an LLM on the phone for coaching:

"As I walk to work, I share my anxieties, use it as a sounding board for ideas, and clarify new concepts I come across in papers and podcasts. My AI coach helps me prep for difficult scenarios, such as giving tough feedback, so the actual conversation goes better. It also acts as a mirror by clarifying and reflecting my emotions back at me, and provides an external perspective."

I love this idea. I called up his AI coach for a few minutes (for free; see the post for details), and while I definitely felt awkward, I could see how it would be useful once you've tuned the coach to your own needs (and gotten used to having a conversation with a bot!)

Tip #46: How to prevent data leakage in pandas & scikit-learn

Let's pretend you're working on a supervised Machine Learning problem using Python's scikit-learn library. Your training data is in a pandas DataFrame, and you discover missing values in a column that you were planning to use as a feature.

After considering your options, you decide to impute the missing values, which means that you're going to fill in the missing values with reasonable values.

How should you perform the imputation?
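Broadly, there are two options: (1) impute the missing values in pandas and then pass the transformed data to scikit-learn, or (2) pass the original data to scikit-learn and perform the imputation there.

Option 1 will cause data leakage, whereas option 2 will prevent data leakage. Here are questions you might be asking: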
Answers below!

What is data leakage?

Data leakage occurs when you inadvertently include knowledge from testing data when training a Machine Learning model.

Why is data leakage problematic?

Data leakage is problematic because it will cause your model evaluation scores to be less reliable. This may lead you to make bad decisions when tuning hyperparameters, and it will lead you to overestimate how well your model will perform on new data.

It's hard to know whether data leakage will skew your evaluation scores by a negligible amount or a huge amount, so it's best to just avoid data leakage entirely.

Why would data leakage result from missing value imputation in pandas?

Your model evaluation procedure (such as cross-validation) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data.

But if you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality. That's because the imputation values will be based on your entire dataset (meaning both the training portion and the testing portion), whereas the imputation values should just be based on the training portion.

In other words, imputation based on the entire dataset is like peeking into the future and then using what you learned from the future during model training, which is definitely not allowed.

How can we avoid this in pandas?

You might think that one way around this problem would be to split your dataset into training and testing sets and then impute missing values using pandas. (Specifically, you would need to learn the imputation value from the training set and then use it to fill in both the training and testing sets.)

That would work if you're only ever planning to use train/test split for model evaluation, but it would not work if you're planning to use cross-validation.
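Here is a minimal sketch of that split-then-impute approach. The DataFrame is made up for illustration, the column names `age` and `target` are hypothetical, and the mean is just one possible choice of imputation value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data for illustration; "age" and "target" are hypothetical column names
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 41.0, 19.0, 64.0],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Split first, so the testing rows play no part in choosing the imputation value
train, test = train_test_split(df, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Learn the imputation value from the training rows only
# (the mean is an assumption; any statistic learned from the training rows would do)
train_mean = train["age"].mean()

# Use that same training-derived value to fill in both portions
train["age"] = train["age"].fillna(train_mean)
test["age"] = test["age"].fillna(train_mean)
```

This handles one fixed train/test split only; it does not carry over to cross-validation.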
That's because during 5-fold cross-validation (for example), the rows contained in the training set will change 5 times, and thus it's quite impractical to avoid data leakage if you use pandas for imputation while using cross-validation!

How else can data leakage arise?

So far, I've only mentioned data leakage in the context of missing value imputation. But there are other transformations that if done in pandas on the full dataset will also cause data leakage.

For example, feature scaling in pandas would lead to data leakage, and even one-hot encoding (or "dummy encoding") in pandas would lead to data leakage unless there's a known, fixed set of categories.

More generally, any transformation which incorporates information about other rows when transforming a row will lead to data leakage if done in pandas.

How does scikit-learn prevent data leakage?

Now that you've learned how data transformations in pandas can cause data leakage, I'll briefly mention three ways in which scikit-learn prevents data leakage:
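As one illustration, here is a minimal sketch of keeping the imputation inside scikit-learn. It assumes a made-up dataset with hypothetical `age` and `fare` columns, mean imputation, and a logistic regression model, none of which come from the tip itself. Because the imputer is a step in the pipeline, `cross_val_score` refits it on the training folds of each split, so the imputation value is never learned from the rows being used for evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up data for illustration; "age" and "fare" are hypothetical column names
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.normal(40, 10, 50),
    "fare": rng.normal(30, 5, 50),
})
y = (X["age"] + rng.normal(0, 5, 50) > 40).astype(int)

# Introduce missing values in the feature (every 7th row)
X.loc[X.index % 7 == 0, "age"] = np.nan

# The imputer is part of the pipeline, so it is fit only on the
# training portion of each cross-validation split
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validation on the original (un-imputed) data
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Note that `X` is passed in with its missing values intact; nothing is filled in with pandas beforehand.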
Conclusion

When working on a Machine Learning problem in Python, I recommend performing all of your data transformations in scikit-learn, rather than performing some of them in pandas and then passing the transformed data to scikit-learn.

Besides helping you to prevent data leakage, this enables you to tune the transformer and model hyperparameters simultaneously, which can lead to a better performing model!

See you next Tuesday!

Did you like this week's tip? Please forward it to a friend or share this tip with your favorite Machine Learning community. It really helps me out!

- Kevin

P.S. Bits of wisdom from 73 years of living

Did someone AWESOME forward you this email? Sign up here to receive Data Science tips every week! |