Hi Reader,

I'm really proud of this week's tip because it covers a topic (data leakage) that took me years to fully understand. 🧠

It's one of those times when I feel like I'm truly contributing to the collective wisdom by distilling complex ideas into an approachable format. 💡

You can read the tip below 👇 or on my blog.

🔗 Link of the week

Building an AI Coach to Help Tame My Monkey Mind (Eugene Yan)

In this short post, Eugene describes his experiences calling an LLM on the phone for coaching:

"As I walk to work, I share my anxieties, use it as a sounding board for ideas, and clarify new concepts I come across in papers and podcasts. My AI coach helps me prep for difficult scenarios, such as giving tough feedback, so the actual conversation goes better. It also acts as a mirror by clarifying and reflecting my emotions back at me, and provides an external perspective."

I love this idea. I called up his AI coach for a few minutes (for free; see the post for details), and while I definitely felt awkward, I could see how it would be useful once you've tuned the coach to your own needs (and gotten used to having a conversation with a bot!)

👉 Tip #46: How to prevent data leakage in pandas & scikit-learn

Let's pretend you're working on a supervised Machine Learning problem using Python's scikit-learn library. Your training data is in a pandas DataFrame, and you discover missing values in a column that you were planning to use as a feature.

After considering your options, you decide to impute the missing values, which means that you're going to fill in the missing values with reasonable values.

How should you perform the imputation?
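Option 1: Fill in the missing values in pandas, and then pass the transformed data to scikit-learn.
Option 2: Pass the original data to scikit-learn, and then perform all data transformations (including missing value imputation) within scikit-learn.

Option 1 will cause data leakage, whereas option 2 will prevent data leakage.

Here are questions you might be asking:

What is data leakage?
Why is data leakage problematic?
Why would data leakage result from missing value imputation in pandas?
How can we avoid this in pandas?
How else can data leakage arise?
How does scikit-learn prevent data leakage?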
Answers below! 👇

What is data leakage?

Data leakage occurs when you inadvertently include knowledge from testing data when training a Machine Learning model.

Why is data leakage problematic?

Data leakage is problematic because it will cause your model evaluation scores to be less reliable. This may lead you to make bad decisions when tuning hyperparameters, and it will lead you to overestimate how well your model will perform on new data.

It's hard to know whether data leakage will skew your evaluation scores by a negligible amount or a huge amount, so it's best to just avoid data leakage entirely.

Why would data leakage result from missing value imputation in pandas?

Your model evaluation procedure (such as cross-validation) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data.

But if you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality. That's because the imputation values will be based on your entire dataset (meaning both the training portion and the testing portion), whereas the imputation values should just be based on the training portion.

In other words, imputation based on the entire dataset is like peeking into the future and then using what you learned from the future during model training, which is definitely not allowed.

How can we avoid this in pandas?

You might think that one way around this problem would be to split your dataset into training and testing sets and then impute missing values using pandas. (Specifically, you would need to learn the imputation value from the training set and then use it to fill in both the training and testing sets.)

That would work if you're only ever planning to use train/test split for model evaluation, but it would not work if you're planning to use cross-validation.
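As a rough illustration of the train/test split approach just described, here is a minimal sketch. The DataFrame, the column names ("age", "label"), and the choice of the median as the imputation value are all made up for this example and are not from the original tip.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: "age" is a feature with missing values
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 41.0, 19.0, 64.0],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df[["age"]]
y = df["label"]

# Split first, so the testing rows play no part in choosing the imputation value
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train = X_train.copy()
X_test = X_test.copy()

# Learn the imputation value from the training portion only
# (the median is an assumed choice; the mean or a constant would work the same way)
train_median = X_train["age"].median()

# Use that same training-based value to fill both portions
X_train["age"] = X_train["age"].fillna(train_median)
X_test["age"] = X_test["age"].fillna(train_median)
```

Notice that train_median is tied to one particular split. Under cross-validation, the imputation value would have to be learned again by hand for every fold.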
That's because during 5-fold cross-validation (for example), the rows contained in the training set will change 5 times, and thus it's quite impractical to avoid data leakage if you use pandas for imputation while using cross-validation!

How else can data leakage arise?

So far, I've only mentioned data leakage in the context of missing value imputation. But there are other transformations that if done in pandas on the full dataset will also cause data leakage.

For example, feature scaling in pandas would lead to data leakage, and even one-hot encoding (or "dummy encoding") in pandas would lead to data leakage unless there's a known, fixed set of categories.

More generally, any transformation which incorporates information about other rows when transforming a row will lead to data leakage if done in pandas.

How does scikit-learn prevent data leakage?

Now that you've learned how data transformations in pandas can cause data leakage, I'll briefly mention three ways in which scikit-learn prevents data leakage:
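To make this concrete, here is a minimal sketch of doing imputation and scaling inside scikit-learn instead of pandas. The data is randomly generated, and the column name ("age"), the median strategy, and the LogisticRegression model are assumptions chosen only for illustration. The sketch relies on two things: the transformers sit inside a Pipeline, and cross_val_score splits the rows before the pipeline is fit, so each fold's imputation and scaling values are learned from that fold's training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numeric feature with some missing values
rng = np.random.default_rng(0)
age = rng.normal(40, 12, size=100)
label = (age + rng.normal(0, 8, size=100) > 40).astype(int)
age[rng.choice(100, size=15, replace=False)] = np.nan
X = pd.DataFrame({"age": age})
y = pd.Series(label)

# The raw data (missing values included) goes straight into the pipeline
pipe = make_pipeline(
    SimpleImputer(strategy="median"),  # assumed strategy for this sketch
    StandardScaler(),
    LogisticRegression(),
)

# The pipeline is refit on each fold's training rows,
# so the testing rows never influence the imputation or scaling values
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())

# Fitting on all the data, then predicting for new rows:
# the new rows are only transformed, using values learned during fit
pipe.fit(X, y)
X_new = pd.DataFrame({"age": [np.nan, 30.0]})
preds = pipe.predict(X_new)
```

The same pattern should extend to one-hot encoding and other transformers, since anything placed in the pipeline is fit on training rows only.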
Conclusion

When working on a Machine Learning problem in Python, I recommend performing all of your data transformations in scikit-learn, rather than performing some of them in pandas and then passing the transformed data to scikit-learn.

Besides helping you to prevent data leakage, this enables you to tune the transformer and model hyperparameters simultaneously, which can lead to a better performing model!

👋 See you next Tuesday!

Did you like this week’s tip? Please forward it to a friend or share this tip with your favorite Machine Learning community. It really helps me out!

- Kevin

P.S. Bits of wisdom from 73 years of living

Did someone AWESOME forward you this email? Sign up here to receive Data Science tips every week!