Tuesday Tip #8: Build a dataset with regular expressions 👷‍♂️


Hi Reader,

Next week, I’m launching a NEW course, Become a Regex Superhero!

Learning regular expressions (also known as “regex”) will help you become a more versatile and valuable data scientist. My course will help you go from Zero to Hero! 💪

Stay tuned for more details about the course. (As a newsletter subscriber, you'll get a significant launch discount! 💸)

In today’s tip, I’m going to show you one of the many use cases for regular expressions. Please enjoy!


👉 Tip #8: Build a dataset with regex

Let’s say that you wanted to build a dataset of every Python version and its release date. This python.org web page has all of the data you need:

But how would you turn this into a structured dataset?

We can start by reading the source of the web page (meaning the HTML) into Python using the requests library:
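If you'd like to follow along, here's a minimal sketch of that step. The URL is my assumption about which python.org page is meant, and fetch_html is just a hypothetical helper name, so adjust both as needed:

```python
import requests

# Assumption: this is the python.org page that lists each version's
# documentation release date. Swap in a different URL if needed.
URL = "https://www.python.org/doc/versions/"


def fetch_html(url: str = URL, timeout: float = 10.0) -> str:
    """Download a web page and return its HTML source as a string."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an error on a 4xx/5xx response
    return r.text         # the HTML source, as one big string


# Usage (requires an internet connection):
# html = fetch_html()
# print(html[:500])
```

Inside the function, r.text is the same string we'll be searching with regex below.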

Here’s a small portion of the HTML, which is stored in r.text:

In order to parse the HTML into something useful, we’ll use regular expressions!

Let me be clear: What I’m about to show you is NOT enough to “teach you” regular expressions. (That’s why I created a course!)

Instead, what I’m trying to show you is that regular expressions are not as scary as you might think! 👻


Extracting the dates

Here’s how we can use regex to extract the Python release dates:

We imported the re module, and then used the findall function to search the r.text string and find all occurrences of a regex pattern.

This is the pattern we searched for: \d+ \w+ \d{4}

Here’s how to decode the pattern:

  • \d means “digit character” (0 through 9)
  • \w means “word character” (letter, digit, or underscore)
  • + means “one or more”
  • {4} means “exactly 4”

Thus the pattern \d+ \w+ \d{4} can be read as “1 or more digits, then space, then 1 or more word characters, then space, then 4 digits”. And that’s how it found the dates!
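Here's a small sketch you can run without downloading anything. The sample_html string is a hypothetical stand-in for r.text: I wrote it to resemble the kind of HTML described above, and its dates are placeholders, not verified release dates.

```python
import re

# Hypothetical stand-in for r.text (placeholder dates, simplified HTML)
sample_html = """\
<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>
<li><a href="#">Python 1.5.1</a>, documentation released on 14 April 1998.</li>
<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>
"""

# 1+ digits, space, 1+ word characters, space, exactly 4 digits
dates = re.findall(r'\d+ \w+ \d{4}', sample_html)
print(dates)
# ['6 August 1998', '14 April 1998', '17 February 1998']
```

Notice the r before the pattern: a raw string keeps Python from treating the backslashes as escape characters.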


Extracting the version numbers

It’s a bit more complicated to extract the version numbers because some have 2 parts (1.5), some have 3 parts (1.5.1), and some have a letter (1.5.1p1):

Here’s what we’ll do:

This is the pattern we searched for: Python (\d.+?)<

Here’s how to decode this:

  • \d means “digit character”
  • . means “any character except newline”
  • + means “one or more”
  • ? means “make the plus sign lazy” (this is a tricky concept, but it basically means "make the match as short as possible")
  • () means “only return this part of the match” (the parentheses create a "capture group", and findall returns just the captured part)

Thus the pattern Python (\d.+?)< can be read as “Python, then space, then 1 digit character, then 1 or more of any character (lazy behavior), then <, and only return the part in parentheses”.

In case you were wondering, the angle bracket (meaning the <) marks where the version number ends, since the version number is always immediately followed by a tag in the HTML.
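To see the lazy behavior in action, here's a sketch using a hypothetical sample string in place of r.text (simplified HTML with placeholder dates). It runs the pattern as written, and then a greedy variant without the ? purely for comparison:

```python
import re

# Hypothetical stand-in for r.text (placeholder dates, simplified HTML)
sample_html = """\
<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>
<li><a href="#">Python 1.5.1</a>, documentation released on 14 April 1998.</li>
<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>
"""

# Lazy: stop at the FIRST < after the version number
versions = re.findall(r'Python (\d.+?)<', sample_html)
print(versions)
# ['1.5.1p1', '1.5.1', '1.5']

# Greedy (no ?): runs to the LAST < on each line, which grabs too much
greedy = re.findall(r'Python (\d.+)<', sample_html)
print(greedy[0])
# 1.5.1p1</a>, documentation released on 6 August 1998.
```

Without the ?, the plus sign matches as much as it can, so the match runs past the version number to the final < on the line.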


Creating the dataset

At this point, we can create a pandas DataFrame simply by zipping the two lists together:
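Here's a minimal sketch of the zipping step. The two lists are hard-coded placeholders standing in for the findall results, and the column names are my own choice:

```python
import pandas as pd

# Placeholder lists standing in for the two findall results
versions = ['1.5.1p1', '1.5.1', '1.5']
dates = ['6 August 1998', '14 April 1998', '17 February 1998']

# zip pairs the first version with the first date, and so on,
# which only works if both lists have the same length and order
assert len(versions) == len(dates)

# Column names are an assumption; pick whatever suits you
df = pd.DataFrame(list(zip(versions, dates)), columns=['Version', 'Date'])
print(df)
```

The length check is worth keeping: if one pattern finds extra or missing matches, the rows would silently pair up the wrong values.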

Pretty cool, right? 😎

Here's the code from today's tip, in case you want to play around with it!


This is just a tiny preview of the power of regular expressions!

There’s SO MUCH you can do with regex, so I hope you’ll consider joining Become a Regex Superhero when it launches next week! 🚀

- Kevin

P.S. Can you decode this tweet?
