profile

Learn Data Science from Data School 📊

Tuesday Tip #31: Data cleaning with pandas 🧹

Published 6 months ago • 2 min read

Hi Reader,

In two weeks, I’ll be offering a Black Friday sale on ALL of my courses.

I’ll send you the details next week!


🔗 Link of the week

Easier data analysis with pandas (free)

This is my video series about pandas, the most popular Python library for data analysis. There are 30+ videos, most of which are designed for pandas beginners, though intermediate users can skip to the “bonus” videos at the bottom.

Normally I won’t link to my own tutorials in this section, but it seemed appropriate since the next few tips will be about pandas! 🐼


👉 Tip #31: Basic data cleaning operations with pandas

In last week’s tip, I showed you three visualizations that I created from the World Happiness Report data: a world map, a line chart, and an arrow chart.

I created those visualizations using Datawrapper, a no-code tool that I highly recommend checking out!

As is the case with many Data Science projects, I spent most of my time on the data cleaning step. Once the data was clean, Datawrapper made the visualization step quite easy!

In this week’s tip, I’m going to show you how I transformed the raw data into the world map data using pandas. Here are the specific steps I’ll demonstrate:

  1. Read in the data from Excel
  2. Change column names & round data
  3. Fix country names
  4. Filter data by year
  5. Write data to CSV file

If you want to follow along with the code, you can run it online using Google Colab.

Here’s the end result:


Step 1: Read in the data from Excel

The dataset is stored online in an Excel sheet, so I read it directly from the URL using the read_excel function. (Note that I split the URL string into two lines by adding parentheses around it.)

To avoid storing unnecessary data, I only kept the three columns that were relevant for the analysis. Life Ladder is the average “life evaluation” reported by residents, on a scale from 0 to 10, and is the report’s primary measure of happiness.


Step 2: Change column names & round data

I renamed the columns so that they are lowercase, have no spaces, and are more descriptive. Not only does this create consistency, but it also allows me to use dot notation instead of bracket notation for column selection.

I also rounded the happiness column to 2 digits using the round method (since that’s enough precision for the purpose of this visualization), and I overwrote the existing happiness column with an assignment statement.


Step 3: Fix country names

First, I used the unique method to display all of the country names in the dataset.

However, not all of the country names used by the World Happiness Report (WHR) are the same as the ones used by Datawrapper. For example, the WHR uses Congo (Brazzaville) and Congo (Kinshasa), whereas Datawrapper uses Congo and Democratic Republic of Congo.

Second, I created a dictionary to map the WHR names to the Datawrapper names.

Finally, I used the replace method to update the country names, overwrote the existing column, and used the unique method to check that the operation worked.


Step 4: Filter data by year

I filtered the DataFrame using a condition so that only the rows from 2022 remained.


Step 5: Write data to CSV file

I wrote the contents of the DataFrame to a file using the to_csv method, and excluded the index from the CSV since it’s not meaningful.


Publish with Datawrapper

Once I uploaded the CSV into Datawrapper, it was easy to customize the map’s appearance and publish it online.

Again, here’s the end result, which you can click on and interact with:


If you enjoyed this week’s tip, please forward it to a friend! Takes only a few seconds, and it really helps me reach more people! 🙏

See you next Tuesday!

- Kevin

P.S. Ace your next job interview with zero knowledge

Did someone awesome forward you this email? Sign up here to receive Data Science tips every week!

Learn Data Science from Data School 📊

Kevin Markham

Join 25,000+ aspiring Data Scientists and receive Python & Data Science tips every Tuesday!

Read more from Learn Data Science from Data School 📊

Hi Reader, Last week, I recorded the FINAL 28 LESSONS 🎉 for my upcoming course, Master Machine Learning with scikit-learn. That's why you didn't hear from me last week! 😅 I edited one of those 28 videos and posted it on YouTube. That video is today's tip, which I'll tell you about below! 👉 Tip #45: How to read the scikit-learn documentation In order to become truly proficient with scikit-learn, you need to be able to read the documentation. In this video lesson, I’ll walk you through the five...

4 days ago • 1 min read

Hi Reader, happy Tuesday! My recent tips have been rather lengthy, so I'm going to mix it up with some shorter tips (like today's). Let me know what you think! 💬 🔗 Link of the week A stealth attack came close to compromising the world's computers (The Economist) If you haven't heard about the recent "xz Utils backdoor", it's an absolutely fascinating/terrifying story! In short, a hacker (or team of hackers) spent years gaining the trust of an open-source project by making helpful...

18 days ago • 1 min read

Hi Reader, Today's tip is drawn directly from my upcoming course, Master Machine Learning with scikit-learn. You can read the tip below or watch it as a video! If you're interested in receiving more free lessons from the course (which won't be included in Tuesday Tips), you can join the waitlist by clicking here: Yes, I want more free lessons! 👉 Tip #43: Should you discretize continuous features for Machine Learning? Let's say that you're working on a supervised Machine Learning problem, and...

25 days ago • 2 min read
Share this post