Hi Reader,
Do you like a good mystery? I’ve got one for you below! 👇
But first, I thought it would be fun to include a “link of the week” at the top of each issue. I'll share something that I think is worth checking out!
Large language models, explained with a minimum of math and jargon
This is a LONG read (it took me 30+ minutes to read it carefully), but it's an excellent place to start if you want to understand the inner workings of LLMs such as ChatGPT. (Make sure to read the footnotes, and also the comments if you have time!)
Throughout your life (and especially in the era of COVID! 🦠), you’ve probably taken a bunch of diagnostic tests to detect the presence or absence of a condition.
Have you ever wondered how much you should trust the test result? Since you’re reading a newsletter about Data Science, I’m going to guess the answer is “yes”!
Below, I’m going to pose a made-up testing scenario, and together we’ll figure out how to use a confusion matrix to solve this mystery!
Let’s pretend that there's a disease that is present in 2% of the population. (That's known as the "prevalence" of the disease.)
The disease can be detected by a nasal swab, and the test is 95% accurate for both positive and negative cases. (More formally, the test’s "True Positive Rate" and "True Negative Rate" are both 95%.)
You were tested for the disease and received a positive test result. What is the probability that you actually have the disease?
There are many methods you could use to solve this problem, but I’m going to create a confusion matrix since it helps me to visualize the problem.
If you’re not familiar with the confusion matrix, it’s a table that describes the performance of a classifier on a set of test data for which the true values are known.
In this case, the nasal swab is the classifier, and we will fill in the test data for which the true values are known based on the True Positive and True Negative Rates stated above.
(Side note: These rates would have been determined by the company that developed the nasal swab based on clinical trials they conducted prior to releasing the test.)
Let’s pretend there are 1000 people in the population. Because the stated prevalence of the disease is 2%, we believe that the disease is present in 2% of those 1000 people, or 20 people.
We can now begin to fill in the confusion matrix. There are two rows, “Actual YES” and “Actual NO”, indicating whether each person actually has the disease: 20 people belong in the “Actual YES” row, and the remaining 980 people belong in the “Actual NO” row.
As stated above, when someone actually has the disease, the nasal swab will return a positive result 95% of the time.
In other words, it will predict YES for 95% of those 20 people (19 True Positives, the correct result), and it will predict NO for the other 5% (1 False Negative, the incorrect result).
As stated above, when someone does not actually have the disease, the nasal swab will return a negative result 95% of the time.
In other words, it will predict NO for 95% of those 980 people (931 True Negatives, the correct result), and it will predict YES for the other 5% (49 False Positives, the incorrect result).
Now that we have filled in the confusion matrix, we can also calculate the column totals for Predicted NO (931 + 1 = 932) and Predicted YES (49 + 19 = 68).
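If you'd like to see that arithmetic as code, here's a minimal Python sketch of the calculation above. (The confusion_counts function and its variable names are just my own for illustration, not from any library.)

```python
def confusion_counts(population, prevalence, tpr, tnr):
    """Split a population into the four confusion matrix cells."""
    actual_yes = population * prevalence    # people who have the disease (20)
    actual_no = population - actual_yes     # people who don't (980)

    tp = actual_yes * tpr    # Actual YES, Predicted YES (19)
    fn = actual_yes - tp     # Actual YES, Predicted NO  (1)
    tn = actual_no * tnr     # Actual NO,  Predicted NO  (931)
    fp = actual_no - tn      # Actual NO,  Predicted YES (49)
    return tp, fn, tn, fp

tp, fn, tn, fp = confusion_counts(1000, 0.02, 0.95, 0.95)
print("Predicted NO total:", fn + tn)    # 932.0
print("Predicted YES total:", tp + fp)   # 68.0
```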
Finally, we’re ready to answer the question!
Given that you received a positive result from the nasal swab, what is the probability that you actually have the disease?
In our population of 1000 people, 68 tested positive (indicated by the Predicted YES column total). You represent one of those 68 people.
Of those 68 people, 19 actually have the disease (True Positives) and 49 do not (False Positives).
Thus if you test positive, the probability that you actually have the disease is 19 / 68 = 27.9%.
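Continuing the sketch above, that final step is a single line, and you can confirm it matches Bayes' theorem applied directly to the prevalence and accuracy rates:

```python
ppv = tp / (tp + fp)    # True Positives / all Predicted YES = 19 / 68
print(f"{ppv:.1%}")     # 27.9%

# Same answer straight from Bayes' theorem, without building the matrix:
p, tpr, tnr = 0.02, 0.95, 0.95
print(f"{p * tpr / (p * tpr + (1 - p) * (1 - tnr)):.1%}")  # 27.9%
```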
27.9% might “feel” like the wrong answer, given that the test is 95% accurate. But here’s the key insight:
For a condition with low prevalence, most of the positive test results will be False Positives (even for a highly accurate test), simply because far more of the people being tested don’t have the condition than do.
One way to reduce the number of False Positives is to only test people who are symptomatic for the condition, since that significantly increases the prevalence within the population being tested.
To see an example of this, change the prevalence in the scenario above from 2% to 30% and then recreate the confusion matrix. In this new scenario, what is the probability that you actually have the disease, given that you tested positive?
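(If you'd like to check your work programmatically, you can rerun the sketch from earlier with the new prevalence. I'll leave the output for you to discover!)

```python
tp, fn, tn, fp = confusion_counts(1000, 0.30, 0.95, 0.95)
print(f"{tp / (tp + fp):.1%}")
```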
Reply with your answer and I’ll let you know if you’re correct! ✅
The confusion matrix is widely used in Machine Learning and many other fields. If you’re interested in learning more, I highly recommend reading my blog post:
🔗 Simple guide to confusion matrix terminology
I also created a 35-minute video that goes deeper into this topic:
🔗 Making sense of the confusion matrix
Finally, the Wikipedia page is also quite good:
🔗 Wikipedia: Confusion matrix
Keep in mind that confusion matrices are not always formatted the same way! For example, the Wikipedia page places the YES values instead of the NO values in the upper left, so it’s important to pay attention to the labels when interpreting a confusion matrix! 💡
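As a concrete illustration (purely optional): if you use scikit-learn, its confusion_matrix follows the same orientation as this email, with the NO values in the upper left.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]   # 0 = NO (healthy), 1 = YES (diseased)
y_pred = [0, 1, 1, 1, 0]

# Rows are actual, columns are predicted, labels sorted ascending,
# so the layout is [[TN, FP], [FN, TP]] with NO values in the upper left
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]
```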
If you enjoyed this week’s tip, please forward it to a friend! Takes only a few seconds, and it really helps me out! 🙌
See you next Tuesday!
- Kevin
P.S. You have a lot of measurements. Quite a few variables.
Did someone awesome forward you this email? Sign up here to receive data science tips every week!