Hi Reader,

**Do you like a good mystery?** I've got one for you below! 👇

But first, I thought it would be fun to include a "link of the week" at the top of each issue. I'll share something that I think is worth checking out!

## 🔗 Link of the week

Large language models, explained with a minimum of math and jargon

This is a LONG read (I spent 30+ minutes reading it carefully), but it's an excellent place to start if you want to understand the inner workings of LLMs such as ChatGPT. *(Make sure to read the footnotes, and also the comments if you have time!)*

## 👉 Tip #23: Solve a medical mystery with a confusion matrix

Throughout your life (and especially in the era of COVID! 🦠), you've probably taken a bunch of diagnostic tests to detect the presence or absence of a condition.

**Have you ever wondered how much you should trust the test result?** Since you're reading a newsletter about Data Science, I'm going to guess the answer is "yes"!

Below, I'm going to pose a made-up testing scenario, and together we'll figure out how to use a confusion matrix to solve this mystery!

## Here's the scenario:

Let's pretend that there's a disease that is present in **2%** of the population. (That's known as the "prevalence" of the disease.)

The disease can be detected by a nasal swab, and the test is **95% accurate** for both positive and negative cases. (More formally, the test's "True Positive Rate" and "True Negative Rate" are both 95%.)

**You were tested for the disease and received a positive test result. What is the probability that you actually have the disease?**

## Why use a confusion matrix?

There are many methods you could use to solve this problem, but I'm going to create a confusion matrix since it helps me to visualize the problem.

If you're not familiar with the confusion matrix, **it's a table that describes the performance of a classifier on a set of test data for which the true values are known.**

In this case, the nasal swab is the **classifier**, and we will fill in the *test data for which the true values are known* based on the True Positive and True Negative Rates stated above.

(Side note: These rates would have been determined by the company that developed the nasal swab based on clinical trials they conducted prior to releasing the test.)

## Step 1:

Let's pretend there are 1000 people in the population. Because the stated prevalence of the disease is 2%, **we believe that the disease is present in 2% of those 1000 people, or 20 people**.

We can now begin to fill in the confusion matrix. There are two rows, "Actual YES" and "Actual NO", indicating how many people actually have the disease:

- The Actual YES row contains a total of **20** people.
- The Actual NO row contains the remaining **980** people.

## Step 2:

As stated above, **when someone actually has the disease, the nasal swab will return a positive result 95% of the time.**

In other words, it will predict YES for 95% of those people (which is the correct result), and it will predict NO for 5% of those people (which is the incorrect result).

- The Actual YES / Predicted YES box contains **19** people, since 0.95 x 20 = 19. These are called **True Positives** (TP).
- The Actual YES / Predicted NO box contains **1** person, since 0.05 x 20 = 1. These are called **False Negatives** (FN).

## Step 3:

As stated above, **when someone does not actually have the disease, the nasal swab will return a negative result 95% of the time.**

In other words, it will predict NO for 95% of those people (which is the correct result), and it will predict YES for 5% of those people (which is the incorrect result).

- The Actual NO / Predicted NO box contains **931** people, since 0.95 x 980 = 931. These are called **True Negatives** (TN).
- The Actual NO / Predicted YES box contains **49** people, since 0.05 x 980 = 49. These are called **False Positives** (FP).

Now that we have filled in the confusion matrix, we can also calculate the column totals for Predicted NO (931 + 1 = **932**) and Predicted YES (49 + 19 = **68**).
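If you'd like to double-check the arithmetic from Steps 1 through 3 yourself, here's a quick sketch (I'm assuming Python, since this is a data science newsletter; the variable names are my own):

```python
# Rebuild the confusion matrix from the scenario's numbers.
population = 1000
prevalence = 0.02   # 2% of the population has the disease
accuracy = 0.95     # both the True Positive Rate and the True Negative Rate

actual_yes = prevalence * population   # 20 people have the disease
actual_no = population - actual_yes    # 980 people do not

tp = accuracy * actual_yes   # True Positives:  0.95 x 20  = 19
fn = actual_yes - tp         # False Negatives: 0.05 x 20  = 1
tn = accuracy * actual_no    # True Negatives:  0.95 x 980 = 931
fp = actual_no - tn          # False Positives: 0.05 x 980 = 49

predicted_yes = tp + fp      # column total: 68
predicted_no = tn + fn       # column total: 932

print(round(tp), round(fn), round(tn), round(fp))
print(round(predicted_yes), round(predicted_no))
```

(The `round()` calls just tidy up floating-point arithmetic so the counts print as whole people.)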

## Calculate the answer:

Finally, we're ready to answer the question!

**Given that you received a positive result from the nasal swab, what is the probability that you actually have the disease?**

In our population of 1000 people, **68** tested positive (indicated by the Predicted YES column total). You represent one of those 68 people.

Of those 68 people, **19** actually have the disease (True Positives) and **49** do not (False Positives).

Thus if you test positive, the probability that you actually have the disease is 19 / 68 = **27.9%**.
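Here's that final calculation as a one-division Python sketch (Python is my assumption; the newsletter doesn't prescribe a language):

```python
# Of the 68 people who tested positive, only the 19 True Positives
# actually have the disease.
true_positives = 19
false_positives = 49
predicted_yes = true_positives + false_positives   # 68

probability = true_positives / predicted_yes
print(f"{probability:.1%}")   # prints "27.9%"
```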

## Wait, what???

27.9% might "feel" like the wrong answer, given that the test is 95% accurate. **But here's the key insight:**

For a condition with low prevalence, most of the positive test results will be False Positives (even for a highly accurate test), since far more of the people being tested don't have the condition than do.
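One way to see this insight concretely is to compute the probability for several prevalence values while holding the test's accuracy fixed at 95%. Here's a small Python sketch (the helper function is my own illustration, not something from this newsletter):

```python
def prob_disease_given_positive(prevalence, accuracy=0.95):
    """Probability of having the disease given a positive test,
    assuming equal True Positive and True Negative Rates."""
    tp = accuracy * prevalence               # True Positive fraction
    fp = (1 - accuracy) * (1 - prevalence)   # False Positive fraction
    return tp / (tp + fp)

for p in [0.001, 0.02, 0.10, 0.50]:
    print(f"prevalence {p:6.1%} -> P(disease | positive) = "
          f"{prob_disease_given_positive(p):.1%}")
```

Even a 95% accurate test produces mostly False Positives when the condition is rare; as prevalence rises, the probability climbs quickly.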

## Here's my challenge to you:

One way to reduce the number of False Positives is to only test people who are symptomatic for the condition, since that significantly increases the prevalence within the population being tested.

To see an example of this, change the prevalence in the scenario above from **2%** to **30%** and then recreate the confusion matrix. **In this new scenario, what is the probability that you actually have the disease, given that you tested positive?**

Reply with your answer and I'll let you know if you're correct! ✅

## Confusion matrix resources:

The confusion matrix is widely used in Machine Learning and many other fields. If you're interested in learning more, I highly recommend reading my blog post:

🔗 Simple guide to confusion matrix terminology

I also created a 35-minute video that goes deeper into this topic:

🔗 Making sense of the confusion matrix

Finally, the Wikipedia page is also quite good:

🔗 Wikipedia: Confusion matrix

**Keep in mind that confusion matrices are not always formatted the same way!** For example, the Wikipedia page places the YES values instead of the NO values in the upper left, so it's important to pay attention to the labels when interpreting a confusion matrix! 💡

**If you enjoyed this week's tip, please forward it to a friend!** Takes only a few seconds, and it really helps me out! 🙌

See you next Tuesday!

- Kevin

P.S. You have a lot of measurements. Quite a few variables.

*Did someone awesome forward you this email? **Sign up here to receive data science tips every week**!*