Join 25,000+ aspiring Data Scientists and receive Python & Data Science tips every Tuesday!
Hi Reader,
Next week, I’m launching a NEW course, Become a Regex Superhero!
Learning regular expressions (also known as “regex”) will help you become a more versatile and valuable data scientist. My course will help you go from Zero to Hero! 💪
Stay tuned for more details about the course. (As a newsletter subscriber, you'll get a significant launch discount! 💸)
In today’s tip, I’m going to show you one of the many use cases for regular expressions. Please enjoy!
Let’s say that you wanted to build a dataset of every Python version and its release date. This python.org web page has all of the data you need:
But how would you turn this into a structured dataset?
We can start by reading the source of the web page (meaning the HTML) into Python using the requests library:
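As a minimal sketch of that step (the helper name fetch_response and the timeout value are my own choices for illustration, not necessarily what the original code used), fetching a page with requests might look like this:

```python
import requests


def fetch_response(url, timeout=10):
    """Download a web page and return the requests Response object.

    The HTML source is then available as a string via the .text attribute.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an error on 4xx/5xx responses
    return r
```

You would call it as r = fetch_response(url), passing the address of the python.org page mentioned above, and then work with r.text.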
Here’s a small portion of the HTML, which is stored in r.text:
In order to parse the HTML into something useful, we’ll use regular expressions!
Let me be clear: What I’m about to show you is NOT enough to “teach you” regular expressions. (That’s why I created a course!)
Instead, what I’m trying to show you is that regular expressions are not as scary as you might think! 👻
Here’s how we can use regex to extract the Python release dates:
We imported the re module, and then used the findall function to search the r.text string and find all occurrences of a regex pattern.
This is the pattern we searched for: \d+ \w+ \d{4}
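Here’s how to decode the pattern: \d matches a digit, \w matches a word character (a letter, digit, or underscore), + means “1 or more” of whatever comes right before it, and {4} means “exactly 4” of whatever comes right before it.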
Thus the pattern \d+ \w+ \d{4} can be read as “1 or more digits, then space, then 1 or more word characters, then space, then 4 digits”. And that’s how it found the dates!
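To try the date pattern without downloading anything, here is a small sketch that runs it against a hand-written HTML string. The name sample_html and its contents are placeholders I made up to resemble the page's structure (they are not real release data); in the actual workflow you would search r.text instead.

```python
import re

# Hypothetical stand-in for r.text: versions and dates are placeholders
sample_html = (
    '<li><a href="#">Python 3.9.1</a>, documentation released on 8 December 2020.</li>\n'
    '<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>\n'
    '<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>\n'
)

# 1 or more digits, a space, 1 or more word characters, a space, 4 digits
dates = re.findall(r'\d+ \w+ \d{4}', sample_html)

print(dates)
# ['8 December 2020', '6 August 1998', '17 February 1998']
```

The version numbers are not picked up by this pattern because their digits are followed by a period or a <, never by a space.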
It’s a bit more complicated to extract the version numbers because some have 2 parts (1.5), some have 3 parts (1.5.1), and some have a letter (1.5.1p1):
Here’s what we’ll do:
This is the pattern we searched for: Python (\d.+?)<
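Here’s how to decode this: “Python ” (including the space) and < match those literal characters, \d matches a digit, the period matches any character except a newline, +? means “1 or more, but as few as possible” (lazy behavior), and the parentheses mark the part of the match that findall should return.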
Thus the pattern Python (\d.+?)< can be read as “Python, then space, then 1 digit character, then 1 or more of any character (lazy behavior), then <, and only return the part in parentheses”.
In case you were wondering, the angle bracket (meaning the <) tells us where the version number ends, since the version number always appears right before the </a> tag in the HTML.
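Here is a sketch of the version pattern in action, again on a made-up sample_html string (a placeholder, not the real page source). It also runs a greedy variant of the pattern, which I added only to show why the lazy +? matters.

```python
import re

# Hypothetical stand-in for r.text: versions and dates are placeholders
sample_html = (
    '<li><a href="#">Python 3.9.1</a>, documentation released on 8 December 2020.</li>\n'
    '<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>\n'
    '<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>\n'
)

# Lazy: stop at the first < after the version, return only the group
versions = re.findall(r'Python (\d.+?)<', sample_html)
print(versions)
# ['3.9.1', '1.5.1p1', '1.5']

# Greedy variant (for comparison only): runs on to the last < on each line
greedy = re.findall(r'Python (\d.+)<', sample_html)
print(greedy[0])
# 3.9.1</a>, documentation released on 8 December 2020.
```

Because the pattern only requires one digit followed by at least one more character, it handles 2-part, 3-part, and lettered versions alike.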
At this point, we can create a pandas DataFrame simply by zipping the two lists together:
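A minimal sketch of the zipping step, assuming the two lists have the same length and are in the same order. The list contents are placeholders, and the column names 'version' and 'date' are my own choice rather than something taken from the original code.

```python
import pandas as pd

# Placeholder lists standing in for the two findall results
versions = ['3.9.1', '1.5.1p1', '1.5']
dates = ['8 December 2020', '6 August 1998', '17 February 1998']

# zip pairs each version with its date; each pair becomes one row
df = pd.DataFrame(list(zip(versions, dates)), columns=['version', 'date'])

print(df.shape)
# (3, 2)
```

Since zip silently stops at the shorter list, it is worth checking len(versions) == len(dates) before building the DataFrame.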
Pretty cool, right? 😎
Here's the code from today's tip, in case you want to play around with it!
This is just a tiny preview of the power of regular expressions!
There’s SO MUCH you can do with regex, so I hope you’ll consider joining Become a Regex Superhero when it launches next week! 🚀
- Kevin
P.S. Can you decode this tweet?
Kevin Markham