Tuesday Tip #8: Build a dataset with regular expressions 👷‍♂️


Hi Reader,

Next week, I’m launching a NEW course, Become a Regex Superhero!

Learning regular expressions (also known as “regex”) will help you become a more versatile and valuable data scientist. My course will help you go from Zero to Hero! 💪

Stay tuned for more details about the course. (As a newsletter subscriber, you'll get a significant launch discount! 💸)

In today’s tip, I’m going to show you one of the many use cases for regular expressions. Please enjoy!


👉 Tip #8: Build a dataset with regex

Let’s say that you wanted to build a dataset of every Python version and its release date. This python.org web page has all of the data you need:

But how would you turn this into a structured dataset?

We can start by reading the source of the web page (meaning the HTML) into Python using the requests library:

Here’s a small portion of the HTML, which is stored in r.text:

In order to parse the HTML into something useful, we’ll use regular expressions!

Let me be clear: What I’m about to show you is NOT enough to “teach you” regular expressions. (That’s why I created a course!)

Instead, what I’m trying to show you is that regular expressions are not as scary as you might think! 👻


Extracting the dates

Here’s how we can use regex to extract the Python release dates:

We imported the re module, and then used the findall function to search the r.text string and find all occurrences of a regex pattern.

This is the pattern we searched for: \d+ \w+ \d{4}

Here’s how to decode the pattern:

  • \d means “digit character” (0 through 9)
  • \w means “word character” (letter, digit, or underscore)
  • + means “one or more”
  • {4} means “exactly 4”

Thus the pattern \d+ \w+ \d{4} can be read as “1 or more digits, then space, then 1 or more word characters, then space, then 4 digits”. And that’s how it found the dates!
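If you'd like to try the date pattern without fetching the page, here's a minimal sketch. The html string is an illustrative fragment I wrote to stand in for r.text (its markup and values are assumptions, not a copy of the real page), so treat it as a toy input only:

```python
import re

# Illustrative stand-in for r.text (assumed markup, not the real page source)
html = (
    '<li><a href="/release/1.5.1p1/">Python 1.5.1p1</a>, released on 6 August 1998.</li>\n'
    '<li><a href="/release/1.5.1/">Python 1.5.1</a>, released on 14 April 1998.</li>\n'
    '<li><a href="/release/1.5/">Python 1.5</a>, released on 17 February 1998.</li>\n'
)

# 1+ digits, space, 1+ word characters, space, exactly 4 digits
dates = re.findall(r'\d+ \w+ \d{4}', html)
print(dates)  # ['6 August 1998', '14 April 1998', '17 February 1998']
```

Notice the r before the pattern string: a raw string keeps Python from interpreting the backslashes before the regex engine sees them.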


Extracting the version numbers

It’s a bit more complicated to extract the version numbers because some have 2 parts (1.5), some have 3 parts (1.5.1), and some have a letter (1.5.1p1):

Here’s what we’ll do:

This is the pattern we searched for: Python (\d.+?)<

Here’s how to decode this:

  • \d means “digit character”
  • . means “any character except newline”
  • + means “one or more”
  • ? means “make the plus sign lazy” (this is a tricky concept, but it basically means "make the match as short as possible")
  • () means “only return this part of the match” (this is called a capture group, and findall returns just the captured part)

Thus the pattern Python (\d.+?)< can be read as “Python, then space, then 1 digit character, then 1 or more of any character (lazy behavior), then <, and only return the part in parentheses”.

In case you were wondering, the angle bracket (meaning the <) tells the pattern where the version number ends, since the version number is always immediately followed by the </a> tag in the HTML.
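Here's a small sketch of the version pattern in action, plus a comparison that shows why the lazy ? matters. As before, the html string is an illustrative fragment standing in for r.text (assumed markup, not the real page source), and the greedy pattern is only there for contrast:

```python
import re

# Illustrative stand-in for r.text (assumed markup, not the real page source)
html = (
    '<li><a href="/release/1.5.1p1/">Python 1.5.1p1</a>, released on 6 August 1998.</li>\n'
    '<li><a href="/release/1.5.1/">Python 1.5.1</a>, released on 14 April 1998.</li>\n'
    '<li><a href="/release/1.5/">Python 1.5</a>, released on 17 February 1998.</li>\n'
)

# Lazy: stop at the FIRST < after the version number
versions = re.findall(r'Python (\d.+?)<', html)
print(versions)  # ['1.5.1p1', '1.5.1', '1.5']

# Greedy (no ?): runs on to the LAST < on each line, so it grabs too much
greedy = re.findall(r'Python (\d.+)<', html)
print(greedy[0])
```

Without the ?, the plus sign keeps matching as far as it can, so each result swallows the closing </a> tag and the text after it.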


Creating the dataset

At this point, we can create a pandas DataFrame simply by zipping the two lists together:
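As a rough sketch of that last step, here's one way the zipping could look. The two lists are typed in by hand to keep the example self-contained, and the column names 'version' and 'date' are my own choice for illustration:

```python
import pandas as pd

# Hand-typed stand-ins for the two lists returned by re.findall
versions = ['1.5.1p1', '1.5.1', '1.5']
dates = ['6 August 1998', '14 April 1998', '17 February 1998']

# zip pairs up the items by position; each pair becomes one row
df = pd.DataFrame(list(zip(versions, dates)), columns=['version', 'date'])
print(df)
```

One thing to keep in mind: zip silently stops at the shorter list, so it's worth checking that both lists have the same length before you trust the result.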

Pretty cool, right? 😎

Here's the code from today's tip, in case you want to play around with it!


This is just a tiny preview of the power of regular expressions!

There’s SO MUCH you can do with regex, so I hope you’ll consider joining Become a Regex Superhero when it launches next week! 🚀

- Kevin

P.S. Can you decode this tweet?

Learn Data Science from Data School 📊
