Learn Data Science from Data School 📊

Tuesday Tip #17: Make your own *private* GPT 🔒

Published 11 months ago • 4 min read

Hi Reader,

Can you believe that it's almost June?

If you set any Data Science goals for 2023, let me know how they're going so far! I'd love to hear from you 💬


👉 Tip #17: Ask questions to your private documents

ChatGPT is amazing, but its knowledge is limited to the data on which it was trained.

Wouldn't it be great if you could use the power of Large Language Models (LLMs) to interact with your own private documents, without uploading them to the web?

The great news is that you can do this TODAY! Let me show you how...

privateGPT is an open source project that allows you to parse your own documents and interact with them using an LLM. You ask it questions, and the LLM will generate answers from your documents.

All using Python, all 100% private, all 100% free!

Below, I'll walk you through how to set it up. (Note that this will require some familiarity with the command line.)


1️⃣ Clone or download the repository

If git is installed on your computer, then navigate to an appropriate folder (perhaps "Documents") and clone the repository (git clone https://github.com/imartinez/privateGPT.git). That will create a "privateGPT" folder, so change into that folder (cd privateGPT).

Alternatively, you could download the repository as a zip file (using the green "Code" button), move the zip file to an appropriate folder, and then unzip it. It will create a folder called "privateGPT-main", which you should rename to "privateGPT". You'll then need to navigate to that folder using the command line.
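
If you went the zip route, the rename step can also be scripted. Here's a minimal Python sketch of that idea. It is not part of privateGPT: the function name normalize_repo_folder is made up for illustration, and it assumes you unzipped the archive into the folder you pass in.

```python
from pathlib import Path


def normalize_repo_folder(parent: Path) -> Path:
    """Rename 'privateGPT-main' (the unzipped name) to 'privateGPT' inside `parent`.

    Hypothetical helper for illustration; it is not part of privateGPT.
    """
    unzipped = parent / "privateGPT-main"
    target = parent / "privateGPT"
    # Only rename when the unzipped folder exists and the target name is free.
    if unzipped.is_dir() and not target.exists():
        unzipped.rename(target)
    return target
```

For example, normalize_repo_folder(Path.home() / "Documents") would cover the case where you unzipped into "Documents".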


2️⃣ Create and activate a new environment

I highly recommend setting up a virtual environment for this project. My tool of choice is conda, which is available through Anaconda (the full distribution) or Miniconda (a minimal installer), though many other tools are available.

If you're using conda, create an environment called "gpt" that includes the latest version of Python using conda create -n gpt python. Then, activate the environment using conda activate gpt. Use conda list to see which packages are installed in this environment.

(Note: privateGPT requires Python 3.10 or later.)
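
To confirm that the environment really has a new enough Python, you could run a quick check like the one below from inside the activated environment. This is just a sketch (the helper name meets_minimum is my own). It compares version tuples rather than strings, since "3.9" sorts after "3.10" as text.

```python
import sys

MINIMUM = (3, 10)  # from the note above: privateGPT requires Python 3.10 or later


def meets_minimum(version=None, minimum=MINIMUM) -> bool:
    """Return True if `version` (a tuple such as (3, 11, 4)) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare (major, minor) as integers, not as text.
    return tuple(version[:2]) >= tuple(minimum)


print("Python", sys.version.split()[0], "is OK" if meets_minimum() else "is too old")
```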


3️⃣ Install the packages listed in requirements.txt

First, make sure that "privateGPT" is your working directory using pwd. Then, make sure that "gpt" is your active environment using conda info.

Once you've done that, use pip3 install -r requirements.txt to install all of the packages listed in that file into the "gpt" environment. This will take at least a few minutes.

Use conda list to see the updated list of which packages are installed.

(Note: The System Requirements section of the README may be helpful if you run into an installation error.)
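
If you'd rather check the working directory and the active environment in one go, here's a small sketch. It assumes the folder and environment names used in this tip ("privateGPT" and "gpt"), and it relies on the CONDA_DEFAULT_ENV variable that conda sets when you activate an environment. The function name setup_problems is hypothetical.

```python
import os
from pathlib import Path


def setup_problems(cwd: Path, active_env, folder="privateGPT", env_name="gpt"):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    if cwd.name != folder:
        problems.append(f"working directory is '{cwd.name}', expected '{folder}'")
    if not (cwd / "requirements.txt").is_file():
        problems.append("requirements.txt not found in the working directory")
    if active_env != env_name:
        problems.append(f"active conda environment is {active_env!r}, expected '{env_name}'")
    return problems


# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
for problem in setup_problems(Path.cwd(), os.environ.get("CONDA_DEFAULT_ENV")):
    print("Check:", problem)
```

If nothing is printed, you're in the right folder with the right environment active.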


4️⃣ Download the LLM

In the Environment Setup section of the README, there's a link to an LLM. Currently, that LLM is ggml-gpt4all-j-v1.3-groovy.bin. Download that file (3.5 GB).

Then, create a subfolder of the "privateGPT" folder called "models", and move the downloaded LLM file to "models".
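
Creating the subfolder and moving the file can be done by hand, but here's a sketch of the same step in Python. The helper name place_model is my own, and the path to your downloaded file is whatever your browser used, so treat both as assumptions.

```python
import shutil
from pathlib import Path

MODEL_NAME = "ggml-gpt4all-j-v1.3-groovy.bin"  # the file named in this step


def place_model(downloaded: Path, project_dir: Path) -> Path:
    """Create '<project_dir>/models' if needed and move the downloaded file into it."""
    models_dir = project_dir / "models"
    models_dir.mkdir(exist_ok=True)
    destination = models_dir / downloaded.name
    # Leave an existing copy alone rather than overwriting it.
    if not destination.exists():
        shutil.move(str(downloaded), str(destination))
    return destination
```

For example: place_model(Path.home() / "Downloads" / MODEL_NAME, Path.cwd()), run from the "privateGPT" folder.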


5️⃣ Copy the environment file

In the "privateGPT" folder, there's a file named "example.env". Make a copy of that file named ".env" using cp example.env .env. Use ls -a to check that it worked.

(Note: This file has nothing to do with your virtual environment.)
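
If you're on a system without cp, or you want to avoid overwriting a ".env" file you've already edited, the copy can be sketched in Python like this. The function name copy_env_file is hypothetical.

```python
import shutil
from pathlib import Path


def copy_env_file(project_dir: Path) -> Path:
    """Copy 'example.env' to '.env' in `project_dir` unless '.env' already exists."""
    source = project_dir / "example.env"
    target = project_dir / ".env"
    # Skip the copy if '.env' is already there, so earlier edits are preserved.
    if not target.exists():
        shutil.copyfile(source, target)
    return target
```

Calling copy_env_file(Path.cwd()) from the "privateGPT" folder does the same job as the cp command above.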


6️⃣ Add your documents

Add your private documents to the "source_documents" folder, which is a subfolder of the "privateGPT" folder. Here's a list of the supported file types.

I recommend starting with a small number of documents so that you can quickly verify that the entire process works. (The "source_documents" folder already contains a sample document, "state_of_the_union.txt", so you can actually just start with this document if you like.)
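
Before ingesting, it can help to see what you've actually put in the folder. Here's a minimal sketch that counts files by extension so you can compare against the supported file types. The helper name is my own, and searching subfolders is my choice rather than a statement about how privateGPT scans the folder.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder: Path) -> Counter:
    """Count the files in `folder` (including subfolders) by lowercase extension."""
    return Counter(p.suffix.lower() for p in folder.rglob("*") if p.is_file())
```

For example, print(count_by_extension(Path("source_documents"))) from the "privateGPT" folder.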


7️⃣ Ingest your documents

Once again, make sure that "privateGPT" is your working directory using pwd.

Then, run python ingest.py to parse the documents. This may run quickly (< 1 minute) if you only added a few small documents, but it can take a very long time with larger documents.

Once this process is done, you'll notice that there's a new subfolder of "privateGPT" called "db".
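
If you're curious how long ingestion takes for your documents, you could wrap the command in a small timer. This is only a sketch: run_script_timed is a hypothetical helper that runs any Python script with the current interpreter and reports the exit code and elapsed time.

```python
import subprocess
import sys
import time


def run_script_timed(script, python=sys.executable):
    """Run `script` with the given Python and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run([python, script])
    return completed.returncode, time.perf_counter() - start
```

From the "privateGPT" folder, code, seconds = run_script_timed("ingest.py") would run the ingestion. An exit code of 0 plus a new "db" folder suggests that it worked.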


8️⃣ Interact with your documents

Run python privateGPT.py to start querying your documents! Once it has loaded, you will see the text Enter a query:

Type in your question and hit enter. After a minute, it will answer your question, followed by a list of source documents that it used for context.

(Keep in mind that the LLM has "knowledge" far outside your documents, so it can answer questions that have nothing to do with the documents you provided to it.)

When you're done asking questions, just type exit.


🐞 Troubleshooting

This project is only a few weeks old, and it depends on other libraries that are also quite new! Thus, it's highly likely that you will run into bugs, unexplained errors, and crashes.

For example, if you get an "unknown token" error after asking a question, my experience has been that you can ignore the error and you will still get an answer to your question.

On the other hand, if you get a memory-related error, you will need to end the process by hitting "Ctrl + C" on your keyboard. (Then, just restart it by running python privateGPT.py.)

You might be able to find a workaround to a particular problem by searching the Issues in the privateGPT repository.

If you post your own GitHub issue, please be kind! This is an open source project being run by one person in his spare time (for free)!


🎯 Usage tips

Want to query more documents? Add them to the "source_documents" folder and re-run python ingest.py.

Want to start over? Delete the "db" folder, and a new "db" folder will be created the next time you ingest documents.

Want to hide the source documents for each answer? Run python privateGPT.py -S instead of python privateGPT.py.

Want to try a different LLM? Download a different LLM to the "models" folder and reference it in the ".env" file.

Want to use the latest version of the code? Given its popularity, it's likely that this project will evolve rapidly. If you want to use the latest version of the code, run git pull origin main.
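
The "start over" tip is easy to script if you find yourself doing it often. Here's a minimal sketch, assuming the "db" folder sits directly inside the "privateGPT" folder as described above. The function name reset_index is hypothetical.

```python
import shutil
from pathlib import Path


def reset_index(project_dir: Path) -> bool:
    """Delete '<project_dir>/db' if it exists. Return True if something was removed."""
    db_dir = project_dir / "db"
    if db_dir.is_dir():
        shutil.rmtree(db_dir)
        return True
    return False
```

After reset_index(Path.cwd()), re-run python ingest.py to build a fresh "db" folder.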


🛑 Disclaimer

I think it's worth repeating the disclaimer listed at the bottom of the repository:

This is a test project to validate the feasibility of a fully private solution for question answering using LLMs and Vector embeddings. It is not production-ready, and it is not meant to be used in production. The models selection is not optimized for performance, but for privacy; but it is possible to use different models and vector stores to improve performance.

If you enjoyed this week's tip, please forward it to a friend! This one took me many hours to write, but it only takes a few seconds to forward! 🙌

See you next Tuesday!

- Kevin

P.S. How to write a prompt in 2028

Did someone awesome forward you this email? Sign up here to receive data science tips every week!

Kevin Markham