Join 25,000+ aspiring Data Scientists and receive Python & Data Science tips every Tuesday!

Published 7 months ago • 4 min read

Hi Reader,

**Do you like a good mystery?** I've got one for you below! 👇

But first, I thought it would be fun to include a "link of the week" at the top of each issue. I'll share something that I think is worth checking out!

🔗 Large language models, explained with a minimum of math and jargon

This is a LONG read (it took me 30+ minutes to read carefully), but it's an excellent place to start if you want to understand the inner workings of LLMs such as ChatGPT. *(Make sure to read the footnotes, and also the comments if you have time!)*

Throughout your life (and especially in the era of COVID! 🦠), you've probably taken a bunch of diagnostic tests to detect the presence or absence of a condition.

**Have you ever wondered how much you should trust the test result?** Since you're reading a newsletter about Data Science, I'm going to guess the answer is "yes"!

Below, I'm going to pose a made-up testing scenario, and together we'll figure out how to use a confusion matrix to solve this mystery!

Let's pretend that there's a disease that is present in **2%** of the population. (That's known as the "prevalence" of the disease.)

The disease can be detected by a nasal swab, and the test is **95% accurate** for both positive and negative cases. (More formally, the test's "True Positive Rate" and "True Negative Rate" are both 95%.)

**You were tested for the disease and received a positive test result. What is the probability that you actually have the disease?**

There are many methods you could use to solve this problem, but I'm going to create a confusion matrix since it helps me to visualize the problem.

If you're not familiar with the confusion matrix, **it's a table that describes the performance of a classifier on a set of test data for which the true values are known.**

In this case, the nasal swab is the **classifier**, and we will fill in the matrix using the test's True Positive and True Negative Rates.

(Side note: These rates would have been determined by the company that developed the nasal swab based on clinical trials they conducted prior to releasing the test.)

Let's pretend there are 1000 people in the population. Because the stated prevalence of the disease is 2%, **we believe that the disease is present in 2% of those 1000 people, or 20 people**.

We can now begin to fill in the confusion matrix. There are two rows, "Actual YES" and "Actual NO", indicating whether each person actually has the disease:

- The Actual YES row contains a total of **20** people.
- The Actual NO row contains the remaining **980** people.

As stated above, **when someone actually has the disease, the nasal swab will return a positive result 95% of the time.**

In other words, it will predict YES for 95% of those people (which is the correct result), and it will predict NO for 5% of those people (which is the incorrect result).

- The Actual YES / Predicted YES box contains **19** people, since 0.95 x 20 = 19. These are called **True Positives** (TP).
- The Actual YES / Predicted NO box contains **1** person, since 0.05 x 20 = 1. These are called **False Negatives** (FN).

As stated above, **when someone does not actually have the disease, the nasal swab will return a negative result 95% of the time.**

In other words, it will predict NO for 95% of those people (which is the correct result), and it will predict YES for 5% of those people (which is the incorrect result).

- The Actual NO / Predicted NO box contains **931** people, since 0.95 x 980 = 931. These are called **True Negatives** (TN).
- The Actual NO / Predicted YES box contains **49** people, since 0.05 x 980 = 49. These are called **False Positives** (FP).

Now that we have filled in the confusion matrix, we can also calculate the column totals for Predicted NO (931 + 1 = **932**) and Predicted YES (49 + 19 = **68**).
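The calculations above can be sketched in a few lines of Python (the variable names are my own, not from any library):

```python
# Fill in the confusion matrix for the scenario above:
# 1,000 people, 2% prevalence, 95% True Positive and True Negative Rates.
population = 1000
prevalence = 0.02
tpr = 0.95  # True Positive Rate
tnr = 0.95  # True Negative Rate

actual_yes = round(population * prevalence)  # 20 people have the disease
actual_no = population - actual_yes          # 980 people do not

tp = round(actual_yes * tpr)  # 19 True Positives
fn = actual_yes - tp          # 1 False Negative
tn = round(actual_no * tnr)   # 931 True Negatives
fp = actual_no - tn           # 49 False Positives

predicted_yes = tp + fp  # 68 people test positive
predicted_no = tn + fn   # 932 people test negative
```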

Finally, we're ready to answer the question!

**Given that you received a positive result from the nasal swab, what is the probability that you actually have the disease?**

In our population of 1000 people, **68** tested positive (indicated by the Predicted YES column total). You represent one of those 68 people.

Of those 68 people, **19** actually have the disease (True Positives) and **49** do not (False Positives).

Thus if you test positive, the probability that you actually have the disease is 19 / 68 = **27.9%**.

27.9% might "feel" like the wrong answer, given that the test is 95% accurate. **But here's the key insight:**

For a condition with low prevalence, most of the positive test results will be False Positives (even for a highly accurate test), since there are far more people taking the test who donâ€™t have the condition than do have the condition.
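To see how prevalence drives this result, here's a small sketch. The `ppv` helper is my own name for this calculation (it's often called the "positive predictive value"):

```python
# Probability of actually having the disease, given a positive test,
# as a function of prevalence (with 95% TPR and TNR, as in the scenario).
def ppv(prevalence, tpr=0.95, tnr=0.95):
    tp_fraction = prevalence * tpr              # fraction of population: True Positives
    fp_fraction = (1 - prevalence) * (1 - tnr)  # fraction of population: False Positives
    return tp_fraction / (tp_fraction + fp_fraction)

print(f"{ppv(0.02):.1%}")  # 27.9% -- the low-prevalence scenario above
print(f"{ppv(0.20):.1%}")  # 82.6% -- much higher once prevalence rises
```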

One way to reduce the number of False Positives is to only test people who are symptomatic for the condition, since that significantly increases the prevalence within the population being tested.

To see an example of this, change the prevalence in the scenario above from **2%** to **30%** and then recreate the confusion matrix. **In this new scenario, what is the probability that you actually have the disease, given that you tested positive?**

Reply with your answer and I'll let you know if you're correct! ✅

The confusion matrix is widely used in Machine Learning and many other fields. If youâ€™re interested in learning more, I highly recommend reading my blog post:

🔗 Simple guide to confusion matrix terminology

I also created a 35-minute video that goes deeper into this topic:

🔗 Making sense of the confusion matrix

Finally, the Wikipedia page is also quite good:

🔗 Wikipedia: Confusion matrix

**Keep in mind that confusion matrices are not always formatted the same way!** For example, the Wikipedia page places the YES values instead of the NO values in the upper left, so it's important to pay attention to the labels when interpreting a confusion matrix! 💡

**If you enjoyed this week's tip, please forward it to a friend!** Takes only a few seconds, and it really helps me out! 🙌

See you next Tuesday!

- Kevin

P.S. You have a lot of measurements. Quite a few variables.

*Did someone awesome forward you this email? **Sign up here to receive data science tips every week**!*
