Hi Reader,

I'm really proud of this week's tip because it covers a topic (data leakage) that took me years to fully understand. It's one of those times when I feel like I'm truly contributing to the collective wisdom by distilling complex ideas into an approachable format. You can read the tip below or on my blog.

Link of the week

Building an AI Coach to Help Tame My Monkey Mind (Eugene Yan)

In this short post, Eugene describes his experiences calling an LLM on the phone for coaching:

"As I walk to work, I share my anxieties, use it as a sounding board for ideas, and clarify new concepts I come across in papers and podcasts. My AI coach helps me prep for difficult scenarios, such as giving tough feedback, so the actual conversation goes better. It also acts as a mirror by clarifying and reflecting my emotions back at me, and provides an external perspective."

I love this idea. I called up his AI coach for a few minutes (for free; see the post for details), and while I definitely felt awkward, I could see how it would be useful once you've tuned the coach to your own needs (and gotten used to having a conversation with a bot!).

Tip #46: How to prevent data leakage in pandas & scikit-learn

Let's pretend you're working on a supervised Machine Learning problem using Python's scikit-learn library. Your training data is in a pandas DataFrame, and you discover missing values in a column that you were planning to use as a feature. After considering your options, you decide to impute the missing values, which means that you're going to fill in the missing values with reasonable values. How should you perform the imputation?
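Option 1 (imputing the missing values in pandas and then passing the transformed data to scikit-learn) will cause data leakage, whereas option 2 (performing the imputation within scikit-learn) will prevent data leakage. Here are questions you might be asking: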
Answers below!

What is data leakage?

Data leakage occurs when you inadvertently include knowledge from testing data when training a Machine Learning model.

Why is data leakage problematic?

Data leakage is problematic because it will cause your model evaluation scores to be less reliable. This may lead you to make bad decisions when tuning hyperparameters, and it will lead you to overestimate how well your model will perform on new data. It's hard to know whether data leakage will skew your evaluation scores by a negligible amount or a huge amount, so it's best to just avoid data leakage entirely.

Why would data leakage result from missing value imputation in pandas?

Your model evaluation procedure (such as cross-validation) is supposed to simulate the future, so that you can accurately estimate right now how well your model will perform on new data. But if you impute missing values on your whole dataset in pandas and then pass your dataset to scikit-learn, your model evaluation procedure will no longer be an accurate simulation of reality. That's because the imputation values will be based on your entire dataset (meaning both the training portion and the testing portion), whereas the imputation values should just be based on the training portion. In other words, imputation based on the entire dataset is like peeking into the future and then using what you learned from the future during model training, which is definitely not allowed.

How can we avoid this in pandas?

You might think that one way around this problem would be to split your dataset into training and testing sets and then impute missing values using pandas. (Specifically, you would need to learn the imputation value from the training set and then use it to fill in both the training and testing sets.) That would work if you're only ever planning to use train/test split for model evaluation, but it would not work if you're planning to use cross-validation.
That's because during 5-fold cross-validation (for example), the rows contained in the training set will change 5 times, and thus it's quite impractical to avoid data leakage if you use pandas for imputation while using cross-validation!

How else can data leakage arise?

So far, I've only mentioned data leakage in the context of missing value imputation. But there are other transformations that, if done in pandas on the full dataset, will also cause data leakage. For example, feature scaling in pandas would lead to data leakage, and even one-hot encoding (or "dummy encoding") in pandas would lead to data leakage unless there's a known, fixed set of categories. More generally, any transformation that incorporates information about other rows when transforming a row will lead to data leakage if done in pandas on the full dataset.

How does scikit-learn prevent data leakage?

Now that you've learned how data transformations in pandas can cause data leakage, I'll briefly mention three ways in which scikit-learn prevents data leakage:
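To make the idea concrete, here is a minimal sketch of leak-free imputation during cross-validation. It is an illustration under stated assumptions rather than a reproduction of the original tip: the DataFrame, its column names (`age`, `fare`), and the target are made up, and the choice of mean imputation with logistic regression is arbitrary. The scikit-learn pieces it uses (`SimpleImputer`, `make_pipeline`, `cross_val_score`) are real. Because the imputer sits inside the pipeline, it is re-fit on the training portion of each fold, so the imputation value never sees that fold's testing rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (not from the original tip): two numeric features,
# with missing values injected into the "age" column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=100),
    "fare": rng.normal(30, 5, size=100),
})
missing_rows = rng.choice(100, size=15, replace=False)
df.loc[missing_rows, "age"] = np.nan

X = df[["age", "fare"]]
y = (df["fare"] > 30).astype(int)  # made-up binary target

# The imputer is a pipeline step, so during cross-validation it learns the
# mean of "age" from each fold's training rows only, then applies that value
# to both the training rows and the held-out rows of that fold.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=1000),
)

# X still contains NaN here: no imputation was done in pandas beforehand.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The leaky alternative would be something like calling `df["age"].fillna(df["age"].mean())` before cross-validation, which computes the mean from all 100 rows, including rows that later serve as testing data.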
Conclusion

When working on a Machine Learning problem in Python, I recommend performing all of your data transformations in scikit-learn, rather than performing some of them in pandas and then passing the transformed data to scikit-learn. Besides helping you to prevent data leakage, this enables you to tune the transformer and model hyperparameters simultaneously, which can lead to a better performing model!

See you next Tuesday!

Did you like this week's tip? Please forward it to a friend or share this tip with your favorite Machine Learning community. It really helps me out!

- Kevin

P.S. Bits of wisdom from 73 years of living