Hi Reader,
Next week, I’m launching a NEW course, Become a Regex Superhero!
Learning regular expressions (also known as “regex”) will help you become a more versatile and valuable data scientist. My course will help you go from Zero to Hero! 💪
Stay tuned for more details about the course. (As a newsletter subscriber, you'll get a significant launch discount! 💸)
In today’s tip, I’m going to show you one of the many use cases for regular expressions. Please enjoy!
Let’s say that you wanted to build a dataset of every Python version and its release date. A web page on python.org has all of the data you need.
But how would you turn this into a structured dataset?
We can start by reading the source of the web page (meaning the HTML) into Python using the requests library:
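As a rough sketch, the download step might look like this. The helper name fetch_page and the timeout value are assumptions made for illustration, and the URL is left as a parameter because the page address is not reproduced here:

```python
import requests


def fetch_page(url, timeout=10):
    """Download a web page and return the response object.

    fetch_page and the timeout default are illustrative assumptions,
    not names from the original tip.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an error for 4xx/5xx responses
    return r


# Usage (pass the address of the python.org page from the tip):
# r = fetch_page(url)
# html = r.text  # the page source as a single string
```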
The HTML of the page is now stored as a string in r.text.
In order to parse the HTML into something useful, we’ll use regular expressions!
Let me be clear: What I’m about to show you is NOT enough to “teach you” regular expressions. (That’s why I created a course!)
Instead, what I’m trying to show you is that regular expressions are not as scary as you might think! 👻
Here’s how we can use regex to extract the Python release dates:
We imported the re module, and then used the findall function to search the r.text string and find all occurrences of a regex pattern.
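Here is a minimal sketch of that findall call. Since the page's HTML is not reproduced here, it runs on a small made-up string, sample_html, standing in for r.text. The markup and the dates in it are illustrative assumptions:

```python
import re

# Made-up stand-in for r.text (illustrative markup and dates only)
sample_html = """
<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>
<li><a href="#">Python 1.5.1</a>, documentation released on 14 April 1998.</li>
<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>
"""

# Find every occurrence of "digits, space, word characters, space, 4 digits"
dates = re.findall(r"\d+ \w+ \d{4}", sample_html)
print(dates)
# ['6 August 1998', '14 April 1998', '17 February 1998']
```

The pattern is written as a raw string (the r prefix) so that Python passes the backslashes through to the regex engine unchanged.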
This is the pattern we searched for: \d+ \w+ \d{4}
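Here’s how to decode the pattern: \d+ means “1 or more digits”, \w+ means “1 or more word characters”, \d{4} means “exactly 4 digits”, and each space in the pattern matches a literal space.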
Thus the pattern \d+ \w+ \d{4} can be read as “1 or more digits, then space, then 1 or more word characters, then space, then 4 digits”. And that’s how it found the dates!
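If it helps, the same pattern can be written piece by piece using the re.VERBOSE flag, which lets you add whitespace and comments inside a pattern. This is just one way to annotate it, and the name date_pattern is an assumption for illustration. Because verbose mode ignores plain spaces, each literal space is written as [ ]:

```python
import re

date_pattern = re.compile(r"""
    \d+     # 1 or more digits (the day)
    [ ]     # a literal space
    \w+     # 1 or more word characters (the month name)
    [ ]     # a literal space
    \d{4}   # exactly 4 digits (the year)
""", re.VERBOSE)

text = "documentation released on 6 August 1998."
print(date_pattern.findall(text))
# ['6 August 1998']
```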
It’s a bit more complicated to extract the version numbers because some have 2 parts (1.5), some have 3 parts (1.5.1), and some have a letter (1.5.1p1).
Here’s what we’ll do:
This is the pattern we searched for: Python (\d.+?)<
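Here’s how to decode this: Python and the space after it match that literal text, \d matches 1 digit, .+? matches 1 or more of any character but as few as possible (that’s the “lazy” part), < matches a literal <, and the parentheses mark the part of the match that findall should return.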
Thus the pattern Python (\d.+?)< can be read as “Python, then space, then 1 digit character, then 1 or more of any character (lazy behavior), then <, and only return the part in parentheses”.
In case you were wondering, the angle bracket (meaning the <) helps us to find the version number since it's always right before the </a> tag in the HTML.
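Here is a small sketch of the version extraction, again run on a made-up string (sample_html, an assumption standing in for r.text). It also shows what happens without the ? that makes the match lazy:

```python
import re

# Made-up stand-in for r.text (illustrative markup and dates only)
sample_html = """
<li><a href="#">Python 1.5.1p1</a>, documentation released on 6 August 1998.</li>
<li><a href="#">Python 1.5.1</a>, documentation released on 14 April 1998.</li>
<li><a href="#">Python 1.5</a>, documentation released on 17 February 1998.</li>
"""

# With one pair of parentheses, findall returns only the captured part
versions = re.findall(r"Python (\d.+?)<", sample_html)
print(versions)
# ['1.5.1p1', '1.5.1', '1.5']

# Without the ?, .+ is greedy and runs to the LAST < on the line
line = '<a href="#">Python 1.5</a>, released.</li>'
greedy = re.findall(r"Python (\d.+)<", line)
print(greedy)
# ['1.5</a>, released.']
```

The lazy version stops at the first <, which is the start of the </a> tag, so only the version number is captured.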
At this point, we can create a pandas DataFrame simply by zipping the two lists together:
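As a sketch, the zipping step might look like the code below. The two short lists are illustrative stand-ins for the findall results, and the column names version and date are assumptions:

```python
import pandas as pd

# Illustrative stand-ins for the two lists returned by re.findall
versions = ['1.5.1p1', '1.5.1', '1.5']
dates = ['6 August 1998', '14 April 1998', '17 February 1998']

# zip pairs up the first version with the first date, and so on
df = pd.DataFrame(list(zip(versions, dates)), columns=['version', 'date'])
print(df)
```

This only works because both lists are in the same order and have the same length. zip silently stops at the shorter list, so it's worth checking len(versions) == len(dates) first.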
Pretty cool, right? 😎
Here's the code from today's tip, in case you want to play around with it!
This is just a tiny preview of the power of regular expressions!
There’s SO MUCH you can do with regex, so I hope you’ll consider joining Become a Regex Superhero when it launches next week! 🚀
- Kevin
P.S. Can you decode this tweet?