
Learn Data Science from Data School 📊

Tuesday Tip #8: Build a dataset with regular expressions 👷‍♂️

Published about 1 year ago • 2 min read

Hi Reader,

Next week, I’m launching a NEW course, Become a Regex Superhero!

Learning regular expressions (also known as “regex”) will help you become a more versatile and valuable data scientist. My course will help you go from Zero to Hero! 💪

Stay tuned for more details about the course. (As a newsletter subscriber, you'll get a significant launch discount! 💸)

In today’s tip, I’m going to show you one of the many use cases for regular expressions. Please enjoy!


👉 Tip #8: Build a dataset with regex

Let’s say that you wanted to build a dataset of every Python version and its release date. This python.org web page has all of the data you need:

But how would you turn this into a structured dataset?

We can start by reading the source of the web page (meaning the HTML) into Python using the requests library:
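Here's a minimal sketch of that step. The helper name fetch_html and the 10-second timeout are illustrative assumptions, and you would pass in the address of the page yourself:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of the page at `url` as a string."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an error for 4xx/5xx responses
    return r.text         # the response body, decoded to a string


# Usage (needs network access), with `page_url` set to the page's address:
# html = fetch_html(page_url)
```

Wrapping the call in a function is just a convenience. Calling requests.get directly and reading the text attribute of the response works the same way.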

Here’s a small portion of the HTML, which is stored in r.text:

In order to parse the HTML into something useful, we’ll use regular expressions!

Let me be clear: What I’m about to show you is NOT enough to “teach you” regular expressions. (That’s why I created a course!)

Instead, what I’m trying to show you is that regular expressions are not as scary as you might think! 👻


Extracting the dates

Here’s how we can use regex to extract the Python release dates:

We imported the re module, and then used the findall function to search the r.text string and find all occurrences of a regex pattern.

This is the pattern we searched for: \d+ \w+ \d{4}

Here’s how to decode the pattern:

  • \d means “digit character” (0 through 9)
  • \w means “word character” (letter, digit, or underscore)
  • + means “one or more”
  • {4} means “exactly 4”

Thus the pattern \d+ \w+ \d{4} can be read as “1 or more digits, then space, then 1 or more word characters, then space, then 4 digits”. And that’s how it found the dates!
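If you want to try the date pattern without fetching anything, here's a small sketch that runs it on a made-up snippet. The sample_html string is an assumption about the general shape of the page, and its dates are placeholders rather than real release dates:

```python
import re

# Made-up sample in the general shape of a version list.
# The dates are placeholders, not real release dates.
sample_html = """
<li><a href="#">Python 1.5.1p1</a>, released on 7 July 2003.</li>
<li><a href="#">Python 1.5.1</a>, released on 15 March 2002.</li>
<li><a href="#">Python 1.5</a>, released on 1 January 2001.</li>
"""

# digits, space, word characters, space, exactly 4 digits
dates = re.findall(r'\d+ \w+ \d{4}', sample_html)
print(dates)  # ['7 July 2003', '15 March 2002', '1 January 2001']
```

Notice that the version numbers are not picked up as dates: a digit in "1.5.1" is followed by a period rather than a space, so the pattern fails there and moves on.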


Extracting the version numbers

It’s a bit more complicated to extract the version numbers because some have 2 parts (1.5), some have 3 parts (1.5.1), and some have a letter (1.5.1p1):

Here’s what we’ll do:

This is the pattern we searched for: Python (\d.+?)<

Here’s how to decode this:

  • \d means “digit character”
  • . means “any character except newline”
  • + means “one or more”
  • ? means “make the plus sign lazy” (this is a tricky concept, but it basically means "make the match as short as possible")
  • () means “capture this part of the match” (when the pattern contains parentheses, findall returns only the captured part)

Thus the pattern Python (\d.+?)< can be read as “Python, then space, then 1 digit character, then 1 or more of any character (lazy behavior), then <, and only return the part in parentheses”.
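To see what "lazy" changes, here's a small sketch that runs the pattern with and without the question mark. The line of HTML is made up for illustration:

```python
import re

# A made-up line of HTML with more than one "<" after the version number
line = '<li><a href="#">Python 1.5.1p1</a>, some more text.</li>'

# Lazy: stop at the first "<" that allows a match
lazy = re.findall(r'Python (\d.+?)<', line)

# Greedy: grab as much as possible, so stop at the last "<" on the line
greedy = re.findall(r'Python (\d.+)<', line)

print(lazy)    # ['1.5.1p1']
print(greedy)  # ['1.5.1p1</a>, some more text.']
```

Without the question mark, the plus sign keeps going to the final < on the line, which is why the greedy result drags along the closing tag and the text after it.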

In case you were wondering, the angle bracket (meaning the <) helps us to find the version number since it's always right before the </a> tag in the HTML.
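Here's a sketch of the version pattern running on a made-up snippet. As before, sample_html is an assumption about the general shape of the page, and its dates are placeholders:

```python
import re

# Made-up sample covering the three version styles:
# 2 parts, 3 parts, and a trailing letter. Dates are placeholders.
sample_html = """
<li><a href="#">Python 1.5.1p1</a>, released on 7 July 2003.</li>
<li><a href="#">Python 1.5.1</a>, released on 15 March 2002.</li>
<li><a href="#">Python 1.5</a>, released on 1 January 2001.</li>
"""

# "Python ", then a digit, then as few characters as possible before "<".
# Because of the parentheses, findall returns only the version part.
versions = re.findall(r'Python (\d.+?)<', sample_html)
print(versions)  # ['1.5.1p1', '1.5.1', '1.5']
```

One pattern handles all three styles because it doesn't try to describe the version number itself. It only describes what comes right before it and right after it.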


Creating the dataset

At this point, we can create a pandas DataFrame simply by zipping the two lists together:
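Here's a minimal sketch of that step, using short hand-typed lists in place of the real findall results. The column names and the placeholder dates are assumptions for illustration:

```python
import pandas as pd

# Stand-ins for the two lists returned by re.findall
# (the dates are placeholders, not real release dates)
versions = ['1.5.1p1', '1.5.1', '1.5']
dates = ['7 July 2003', '15 March 2002', '1 January 2001']

# zip pairs the items by position, so each pair becomes one row
df = pd.DataFrame(list(zip(versions, dates)), columns=['version', 'date'])
print(df)
```

Keep in mind that zip stops at the end of the shorter list without any warning, so it's worth checking that len(versions) == len(dates) before trusting the result.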

Pretty cool, right? 😎

Here's the code from today's tip, in case you want to play around with it!



This is just a tiny preview of the power of regular expressions!

There’s SO MUCH you can do with regex, so I hope you’ll consider joining Become a Regex Superhero when it launches next week! 🚀

- Kevin

P.S. Can you decode this tweet?

Did someone awesome forward you this email? Sign up here to receive data science tips every week!


Kevin Markham
