Hi Reader,
Can you believe that it's almost June?
If you set any Data Science goals for 2023, let me know how they're going so far! I'd love to hear from you.
ChatGPT is amazing, but its knowledge is limited to the data on which it was trained.
Wouldn't it be great if you could use the power of Large Language Models (LLMs) to interact with your own private documents, without uploading them to the web?
The great news is that you can do this TODAY! Let me show you how...
privateGPT is an open source project that allows you to parse your own documents and interact with them using an LLM. You ask it questions, and the LLM will generate answers from your documents.
All using Python, all 100% private, all 100% free!
Below, I'll walk you through how to set it up. (Note that this will require some familiarity with the command line.)
If git is installed on your computer, then navigate to an appropriate folder (perhaps "Documents") and clone the repository (git clone https://github.com/imartinez/privateGPT.git). That will create a "privateGPT" folder, so change into that folder (cd privateGPT).
Alternatively, you could download the repository as a zip file (using the green "Code" button), move the zip file to an appropriate folder, and then unzip it. It will create a folder called "privateGPT-main", which you should rename to "privateGPT". You'll then need to navigate to that folder using the command line.
I highly recommend setting up a virtual environment for this project. My tool of choice is conda, which is available through Anaconda (the full distribution) or Miniconda (a minimal installer), though many other tools are available.
If you're using conda, create an environment called "gpt" that includes the latest version of Python using conda create -n gpt python. Then, activate the environment using conda activate gpt. Use conda list to see which packages are installed in this environment.
(Note: privateGPT requires Python 3.10 or later.)
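If you'd like to double-check that requirement from inside the "gpt" environment, here's a minimal sketch. The helper name meets_minimum is my own invention for illustration and is not part of privateGPT.

```python
import sys

# Minimum version taken from the note above (Python 3.10 or later).
MIN_VERSION = (3, 10)


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if `version` is at least `minimum`.

    `version` defaults to the interpreter that is running this script.
    Only the major and minor numbers are compared.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


# Report on the interpreter that is currently active.
print("Running Python", sys.version.split()[0])
print("Meets the 3.10+ requirement:", meets_minimum())
```

Run it with python from the activated environment, so that you're checking the same interpreter that pip3 will install packages for.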
First, make sure that "privateGPT" is your working directory using pwd. Then, make sure that "gpt" is your active environment using conda info.
Once you've done that, use pip3 install -r requirements.txt to install all of the packages listed in that file into the "gpt" environment. This will take at least a few minutes.
Use conda list to see the updated list of which packages are installed.
(Note: The System Requirements section of the README may be helpful if you run into an installation error.)
In the Environment Setup section of the README, there's a link to an LLM. Currently, that LLM is ggml-gpt4all-j-v1.3-groovy.bin. Download that file (3.5 GB).
Then, create a subfolder of the "privateGPT" folder called "models", and move the downloaded LLM file to "models".
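If you'd rather script that step than drag files around, the sketch below does it in Python. The function name move_model is hypothetical, and the "Downloads" location in the usage comment is only an assumption about where your browser saved the file.

```python
from pathlib import Path
import shutil


def move_model(downloaded_file, project_dir="."):
    """Create <project_dir>/models if needed and move the model file into it.

    Returns the new path of the model file.
    """
    models_dir = Path(project_dir) / "models"
    models_dir.mkdir(exist_ok=True)  # no error if the folder already exists
    target = models_dir / Path(downloaded_file).name
    shutil.move(str(downloaded_file), str(target))
    return target


# Example usage, run from inside the "privateGPT" folder.
# The Downloads path is an assumption, so adjust it to match your machine:
# move_model(Path.home() / "Downloads" / "ggml-gpt4all-j-v1.3-groovy.bin")
```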
In the "privateGPT" folder, there's a file named "example.env". Make a copy of that file named ".env" using cp example.env .env. Use ls -a to check that it worked.
(Note: This file has nothing to do with your virtual environment.)
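If cp isn't available in your shell (for example, in the Windows Command Prompt), the copy step above can be sketched in Python instead. The function name create_env_file is my own, and leaving an existing ".env" untouched is a design choice I made for safety, not something the project requires.

```python
from pathlib import Path
import shutil


def create_env_file(project_dir="."):
    """Copy example.env to .env inside project_dir and return the .env path.

    An existing .env is left untouched so that earlier edits are not lost.
    """
    project = Path(project_dir)
    source = project / "example.env"
    target = project / ".env"
    if not target.exists():
        shutil.copyfile(source, target)
    return target


# Example usage, run from inside the "privateGPT" folder:
# print(create_env_file().exists())
```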
Add your private documents to the "source_documents" folder, which is a subfolder of the "privateGPT" folder. Here's a list of the supported file types.
I recommend starting with a small number of documents so that you can quickly verify that the entire process works. (The "source_documents" folder already contains a sample document, "state_of_the_union.txt", so you can actually just start with this document if you like.)
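Before ingesting, it can help to see what you've actually put in that folder. Here's a small sketch that counts files by extension so you can compare them against the list of supported file types. The function name count_by_extension is hypothetical, and searching subfolders is my own choice here rather than a claim about how privateGPT scans the folder.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder="source_documents"):
    """Count the files under `folder` (including subfolders) by extension.

    Extensions are lowercased, so ".TXT" and ".txt" are counted together.
    """
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


# Example usage, run from inside the "privateGPT" folder:
# for extension, n in sorted(count_by_extension().items()):
#     print(extension, n)
```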
Once again, make sure that "privateGPT" is your working directory using pwd.
Then, run python ingest.py to parse the documents. This may run quickly (< 1 minute) if you only added a few small documents, but it can take a very long time with larger documents.
Once this process is done, you'll notice that there's a new subfolder of "privateGPT" called "db".
Run python privateGPT.py to start querying your documents! Once it has loaded, you will see the text Enter a query:
Type in your question and hit enter. After a minute or so, it will answer your question, followed by a list of source documents that it used for context.
(Keep in mind that the LLM has "knowledge" far outside your documents, so it can answer questions that have nothing to do with the documents you provided to it.)
When you're done asking questions, just type exit.
This project is only a few weeks old, and it depends on other libraries which are also quite new! Thus, it's highly likely that you will run into bugs, unexplained errors, and crashes.
For example, if you get an "unknown token" error after asking a question, my experience has been that you can ignore the error and you will still get an answer to your question.
On the other hand, if you get a memory-related error, you will need to end the process by hitting "Ctrl + C" on your keyboard. (Then, just restart it by running python privateGPT.py.)
You might be able to find a workaround to a particular problem by searching the Issues in the privateGPT repository.
If you post your own GitHub issue, please be kind! This is an open source project being run by one person in his spare time (for free)!
Want to query more documents? Add them to the "source_documents" folder and re-run python ingest.py.
Want to start over? Delete the "db" folder, and a new "db" folder will be created the next time you ingest documents.
Want to hide the source documents for each answer? Run python privateGPT.py -S instead of python privateGPT.py.
Want to try a different LLM? Download a different LLM to the "models" folder and reference it in the ".env" file.
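Want to use the latest version of the code? Given its popularity, this project will likely evolve rapidly. To update, run git pull origin main from the "privateGPT" folder. (This only works if you cloned the repository with git. If you downloaded the zip file, you'll need to download it again.)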
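If you're planning to swap in a different LLM, it helps to see what the ".env" file currently contains before you edit it. Here's a minimal sketch that reads it, assuming the file uses plain KEY=VALUE lines with no quoting. Both function names are hypothetical, and I'm deliberately not hard-coding any setting names, since those may change between versions of the project.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict.

    Blank lines, comment lines starting with "#", and lines without "="
    are skipped. Quoted values are NOT handled (an assumption of this sketch).
    """
    settings = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def settings_mentioning(settings, text):
    """Return only the settings whose value contains `text`."""
    return {key: value for key, value in settings.items() if text in value}


# Example usage, run from inside the "privateGPT" folder:
# print(settings_mentioning(read_env_file(), "models/"))
```

Whichever setting shows up with a path into "models" is the one to point at your newly downloaded file.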
I think it's worth repeating the disclaimer listed at the bottom of the repository:
If you enjoyed this week's tip, please forward it to a friend! This one took me many hours to write, but it only takes a few seconds to forward!
See you next Tuesday!
- Kevin
P.S. How to write a prompt in 2028
Did someone awesome forward you this email? Sign up here to receive data science tips every week!