PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables

by CognitiveTech, January 15th, 2024

Too Long; Didn't Read

There are many variables when implementing large language models. Let's test and refine our processes for summarizing books using PrivateGPT, running on an NVIDIA RTX 3060 12GB.

I started to summarize a dozen books by hand and found it was going to take me weeks for each summary. Then I remembered the AI revolution happening all around me and decided I was long overdue to jump into these waters.

When I began exploring the use of large language models (LLMs) for summarizing large texts, I found no clear direction on how to do so.

Some pages give example prompts to give GPT4 with the idea that it will magically know the contents of whatever book you want summarized. (NOT)
Some people suggested I need to find a model with a large context that can process my whole text in one go. (Not Yet)
Some open source tools are available that allow you to upload documents to a database and answer questions based on the contents of that database. (Getting Closer)
Others have suggested that you must first divide the book into sections and feed them into the LLM for summarization one at a time. (Now we’re talking)

Beyond that determination, there are numerous variables which must be accounted for when implementing a given LLM.

I quickly realized that, despite any recommendations or model rankings available, I was getting different results from what others reported.

Whether it's my use case, the model format, quantization, compression, prompt styles, or something else, I don’t know. All I know is, do your own model rankings under your own working conditions. Don’t just believe some chart you read online.

This guide gives some specifics on how I tested and settled each of the variables mentioned above.

Background

Key Terms

Some of these terms are used in different ways, depending on the context (no pun intended).

Large Language Model (LLM): (AKA Model) A type of Artificial Intelligence that has been trained upon massive datasets to understand and generate human language.

Example: OpenAI’s GPT3.5 and GPT4 which have taken the world by storm. (In our case we are choosing among open source and\or freely downloadable models found on .)
Retrieval Augmented Generation (RAG): A technique, , of storing documents in a database that the LLM searches among to find an answer for a given user query (Document Q/A).
User Instructions: (AKA Prompt, or Context) is the query provided by the user.

Example: “Summarize the following text : { text }”
System Prompt: Special instructions given before the user prompt, that helps shape the personality of your assistant.
Example: “You are a helpful AI Assistant.”
Context: User instructions, and possibly a system prompt, and possibly previous rounds of question\answer pairs. (Previous Q/A pairs are also referred to simply as context).
Prompt Style: These are special character combinations that an LLM is trained with to recognize the difference between user instructions, system prompt and context from previous questions.

Example: <s>[INST] {systemPrompt} [/INST] [INST] {previousQuestion} [/INST] {answer} </s> [INST] {userInstructions} [/INST]
7B: Indicates the number of parameters in a given model, here 7 billion (higher is generally better). Parameters are the internal variables that the model learns during training and are used to make predictions. For my purposes, 7B models are likely to fit on my GPU with 12GB VRAM.
GGUF: This is a specific format for LLM designed for consumer hardware (CPU/GPU). Whatever model you are interested in, for use in PrivateGPT, you must find its GGUF version (commonly made by ).
Q2-Q8 0, K_M or K_S: When browsing the files of a GGUF repository you will see different versions of the same model. A higher number means less compressed, and better quality. The M in K_M means “Medium” and the S in K_S means “Small”.
VRAM: This is the memory capacity of your GPU. To load a model completely onto the GPU, you will want a model file smaller than your available VRAM.
Tokens: These are the units an LLM uses to measure text. Each token is roughly 4 characters.
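To make the 4-characters-per-token rule of thumb concrete, here is a minimal sketch of a token estimator. The function names are made up for illustration, and the ratio is only an approximation; a real tokenizer will give different counts depending on the model and the text.

```python
# Rule of thumb from the key terms above: ~4 characters per token.
# This is an approximation, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token count, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_budget(text: str, max_tokens: int) -> bool:
    """Check whether a text is likely to fit within a token budget."""
    return estimate_tokens(text) <= max_tokens


if __name__ == "__main__":
    sample = "a" * 11000
    print(estimate_tokens(sample))  # 2750
```

By this estimate, an 11,000-character section comes to about 2,750 tokens.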

What is PrivateGPT?

PrivateGPT (pgpt) is an open source project that provides a user interface and programmable API, enabling users to run LLMs on their own hardware, at home. It allows you to upload documents to your own local database for RAG-supported Document Q/A.

PrivateGPT provides an API containing all the building blocks required to build private, context-aware AI applications. The API follows and extends the OpenAI API standard, and supports both normal and streaming responses. That means that, if you can use the OpenAI API in one of your tools, you can use your own PrivateGPT API instead, with no code changes, and for free if you are running PrivateGPT in local mode.
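Since the API follows the OpenAI standard, a request can be sketched with nothing but the Python standard library. This is a rough illustration, not taken from my scripts: the base URL and port are assumptions about a local install, the helper names are hypothetical, and the request and response shapes are the usual OpenAI-style chat completion format. Check your own PrivateGPT settings and API docs before relying on it.

```python
import json
import urllib.request

# Assumption: a PrivateGPT server running locally; adjust host and port.
BASE_URL = "http://localhost:8001"


def build_chat_payload(user_text, system_prompt=None):
    """Build an OpenAI-style chat completion request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages, "stream": False}


def chat(user_text, system_prompt=None, base_url=BASE_URL):
    """Send one chat request and return the generated text."""
    payload = json.dumps(build_chat_payload(user_text, system_prompt))
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the text in the first choice.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Summarize the following text: ...")` would return the model's reply as a string.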

Overview

I began by asking questions of book chapters, using the UI\RAG. Then I tried pre-selecting chunks of text for summarization. This inspired the Round 1 rankings: Q/A vs Summarization.
Next I wanted to find which models would do the best with this task, which led to Round 2 rankings, where was the clear winner.
Then I wanted to get the best results from this model by ranking prompt styles, and writing code to get the exact prompt style expected.
After that, of course, I had to test out various system prompts to see which would perform the best.
Next, I tried a few user prompts to determine which one generates summaries that require the least post-processing by me.

Only once each model has been tuned to its ideal conditions can the models be properly ranked against each other.

Rankings

When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M.gguf came as part of PrivateGPT's default setup (made to run on your CPU). Here, I've preferred the Q8_0 variants.

I've tried 50+ different Q8 GGUF models for this same task and haven’t found any better than .

Round 1 - Q/A vs Summary

I quickly discovered that, when doing Q/A, I get much better results when uploading smaller chunks of data into the database and starting with a clean slate each time. So I began splitting PDFs into chapters for Q/A purposes.

For my first analysis I tested 5 different LLMs on the following tasks:

Asking the same 30 questions to a 70 page book chapter.
Summarizing that same 70 page book chapter, divided into 30 chunks.

Question / Answer Ranking

- My favorite, during these tests, but when actually editing the summaries I decided it was too verbose.
- Became my favorite of models tested in this round.
- Not as good as I’d like.
A lot of filler, and it took the longest of them all. It scored a bit higher than mistral on quality\usefulness, but the amount of filler just made it less enjoyable to read.
The answers were too short, which made its BS stand out a little more. A good model, but not for detailed book summaries.

Shown, for each model

Number of seconds required to generate the answer
Sum of Subjective Usefulness\Quality Ratings
How many characters were generated?
Sum of context chunks found in target range.
Number of qualities listed below found in text generated:
- Filler (Extra words with less value)
- Short (Too short, not enough to work with.)
- BS (Not from this book and not helpful.)
- Good BS (Not from the targeted section but valid.)

Model	Rating	Search Accuracy	Characters	Seconds	BS	Filler	Short	Good BS
hermes-trismegistus-mistral-7b	68	56	62141	298	3	4	0	6
synthia-7b-v2.0	63	59	28087	188	1	7	7	0
mistral-7b-instruct-v0.1	51	56	21131	144	3	0	17	1
collectivecognition-v1.1-mistral-7b	56	57	59453	377	3	10	0	0
kai-7b-instruct	44	56	21480	117	5	0	18	0

Summary Ranking

For this first round I split the chapter contents into sections with a range of 900-14000 characters each (or 225-3500 tokens).

NOTE: Despite the numerous large context models being released, for now, I still believe smaller context results in better summaries. I don’t prefer any more than 2750 tokens (11000 characters) per summarization task.
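To illustrate keeping each summarization task under that size, here is a minimal sketch that groups paragraphs into chunks of at most 11,000 characters. It is only one possible approach, under stated assumptions: the function name is hypothetical, paragraphs are assumed to be separated by blank lines, and splitting on paragraph boundaries is my stand-in for however you prefer to divide a chapter (by hand, by heading, and so on).

```python
MAX_CHARS = 11000  # about 2750 tokens at ~4 characters per token


def split_into_chunks(text, max_chars=MAX_CHARS):
    """Group blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single oversized paragraph is hard-split at max_chars.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        if not paragraph:
            continue
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate  # still fits, keep growing this chunk
        else:
            chunks.append(current)  # close the chunk, start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent to the model as its own summarization task.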

- Still in the lead. It's verbose, with some filler. I can use these results.
- Pretty good, but too concise. Many of the answers were perfect, but 7 were too short\incomplete for use.
- Just too short.
- Just too short.
- Lots of garbage. Some of the summaries were super detailed and perfect, but over half of the responses were a set of questions based on the text, not a summary.

Not surprisingly, summaries performed much better than Q/A, since each one had a precisely targeted context.

Name	Score	Characters Generated	% Diff from OG	Seconds to Generate	Short	Garbage	BS	Fill	Questions	Detailed
hermes-trismegistus-mistral-7b	74	45870	-61	274	0	1	1	3	0	0
synthia-7b-v2.0	60	26849	-77	171	7	1	0	0	0	1
mistral-7b-instruct-v0.1	58	25797	-78	174	7	2	0	0	0	0
kai-7b-instruct	59	25057	-79	168	5	1	0	0	0	0
collectivecognition-v1.1-mistral-7b	31	29509	-75	214	0	1	1	2	17	8
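The "% Diff from OG" column compares the length of the generated summaries with the length of the original text. A minimal sketch of that calculation follows, assuming the column is the relative change in character count; the function name and the example numbers are illustrative, not taken from my data.

```python
def percent_difference(original_chars, summary_chars):
    """Relative change in length, as a percentage of the original."""
    if original_chars <= 0:
        raise ValueError("original_chars must be positive")
    return (summary_chars - original_chars) / original_chars * 100


if __name__ == "__main__":
    # Illustrative numbers: a 10,000-character text summarized in 2,500.
    print(round(percent_difference(10000, 2500)))  # -75
```

A value of -75 means the summary is a quarter of the length of the original.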

Find the full data and rankings on or on GitHub: , .

Round 2: Summarization - Model Ranking

Again, I prefer Q8 versions of 7B models.

Finding that had been released was worth a new round of testing. I also decided to test the prompt style. PrivateGPT didn’t come packaged with the Mistral prompt, so I tried both of the defaults (llama2 and llama-index).

- This model had become my favorite, so I used it as a benchmark.
(Llama-index Prompt) Star of the show here, quite impressive.
(Llama2 Prompt) Still good, but not as good as using llama-index prompt
- Another by the same creator as Synthia v2. Good, but not as good.
- worked ok, but slowly, with llama-index prompt. Just bad with llama2 prompt. (Should test again with Llama2 "Instruct Only" style)

Summary Ranking

Only summaries this round, since Q/A is just less efficient for book summarization.

Model	% Difference	Score	Comment
Synthia 7b V2	-64.43790093	28	Good
Mistral 7b Instruct v0.2 (Default Prompt)	-60.81878508	33	VGood
Mistral 7b Instruct v0.2 (Llama2 Prompt)	-64.5871483	28	Good
Tess 7b v1.4	-62.12938978	29	Less Structured
Llama 2 7b 32k Instruct (Default)	-61.39890553	27	Less Structured. Slow

Find the full data and rankings on or on .

Round 3: Prompt Style

In the previous round, I noticed was performing much better with default prompt than llama2.

Well, actually, the mistral prompt is quite similar to llama2, but not exactly the same.

llama_index (default)

system: {{systemPrompt}}
user: {{userInstructions}}
assistant: {{assistantResponse}}

llama2:

<s> [INST] <<SYS>>
 {{systemPrompt}}
<</SYS>>

 {{userInstructions}} [/INST]

mistral:

<s>[INST] {{systemPrompt}} [/INST]</s>[INST] {{userInstructions}} [/INST]
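The three templates above can be written as small formatting functions. This is a simplified sketch for a single-turn prompt, not the code that runs inside PrivateGPT: the function names are hypothetical, previous question/answer rounds are ignored, and for llama_index the assistant slot is left empty for the model to fill in.

```python
def format_llama_index(system_prompt, user_instructions):
    """Default style: plain role labels, assistant slot left open."""
    return f"system: {system_prompt}\nuser: {user_instructions}\nassistant: "


def format_llama2(system_prompt, user_instructions):
    """Llama2 style: system prompt wrapped in <<SYS>> inside the first [INST]."""
    return (
        f"<s> [INST] <<SYS>>\n {system_prompt}\n<</SYS>>\n\n"
        f" {user_instructions} [/INST]"
    )


def format_mistral(system_prompt, user_instructions):
    """Mistral style as listed above: system prompt in its own [INST] block."""
    return f"<s>[INST] {system_prompt} [/INST]</s>[INST] {user_instructions} [/INST]"


PROMPT_STYLES = {
    "llama_index": format_llama_index,
    "llama2": format_llama2,
    "mistral": format_mistral,
}

if __name__ == "__main__":
    for name, formatter in PROMPT_STYLES.items():
        print(f"--- {name} ---")
        print(formatter("You are a helpful AI Assistant.", "Summarize the following text: ..."))
```

Printing all three side by side makes the small differences between llama2 and mistral easy to spot.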

I began testing output with the default, then llama2 prompt styles. Next I went to work .

The results of that ranking gave me confidence that I coded correctly.

Prompt Style	% Difference	Score	Note
Mistral	-50%	51	Perfect!
Default (llama-index)	-42%	43	Bad headings
Llama2	-47%	48	No Structure

Find the full data and rankings on or on .

Round 4: System Prompts

Once I got the prompt style dialed in, I tried a few different system prompts, and was surprised by the result!

Name	System Prompt	Change	Score	Comment
None		-49.8	51	Perfect
Default Prompt	"You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context."	-58.5	39	Less Nice
MyPrompt1	"You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical."	-54.4	44	Less Nice
Simple	"You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output."	-52.5	42	Less Nice

In the end, I find that works best for my summaries without any system prompt.

A different task, or better prompting, might give different results, but this works well so I'm not messing with it.

Find the full data and rankings on or on .

Round 5: User Prompt

What I had already begun to suspect is that I’m getting better results with fewer words in the prompt. Since I found the best system prompt, for , I also tested which user prompt suits it best.

	Prompt	vs OG	score	note
Prompt0	Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information.	43%	11
Prompt1	Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information.	46%	11	Extra Notes
Prompt2	Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	58%	15
Prompt3	Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition.	43%	10
Prompt4	Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	41%	14
Prompt5	Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	52%	14	Extra Notes

Find the full data and rankings on or on .

Perhaps with more powerful hardware that can support 11b or 30b models I would get better results with more descriptive prompting. Even with Mistral 7b Instruct v0.2 I’m still open to trying some creative instructions, but for now I’m just happy to refine my existing process.

Prompt2: Wins!

Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.

In this case, "comprehensive" performs better than "concise", or even than "comprehensive, but concise".

However, I do caution that this will depend on your use case. What I'm looking for is highly condensed, readable notes covering the important knowledge.

Essentially, if I didn't read the original, I should still know what information it conveys, if not every specific detail. Even if I did read the original, I’m not going to remember the majority, later on. These notes are a quick reference to the main topics.
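As a small sketch of how the winning prompt can be paired with each chunk of text, the snippet below builds one user message per chunk. The layout (prompt, blank line, then the text) and the function names are assumptions for illustration; only the prompt wording comes from the test above.

```python
# Prompt2 from the ranking above.
WINNING_PROMPT = (
    "Write comprehensive notes summarizing the following text. "
    "Use nested bullet points: with headings, terms, and key concepts in bold."
)


def build_user_instructions(chunk, prompt=WINNING_PROMPT):
    """Combine the instruction with one chunk of book text."""
    # Assumed layout: instruction first, blank line, then the text.
    return f"{prompt}\n\n{chunk.strip()}"


def build_all_instructions(chunks, prompt=WINNING_PROMPT):
    """Build one user message per non-empty chunk, in order."""
    return [build_user_instructions(c, prompt) for c in chunks if c.strip()]
```

Each resulting message is then wrapped in the model's prompt style and sent as its own summarization task.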

Result

Using knowledge gained from these tests, I summarized my first complete book, 539 pages in 5-6 hours!!! Incredible!

Instead of spending weeks per summary, I completed my first 9 book summaries in only 10 days.

Plagiarism

You can see the results from below for each of the texts published, here. Especially considering that this is not for profit, but for educational purposes, I believe these numbers are acceptable.

Book	Models	Character Difference	Identical	Minor changes	Paraphrased	Total Matched
Eastern Body Western Mind	Synthia 7Bv2	-75%	3.5%	1.1%	0.8%	5.4%
Healing Power Vagus Nerve	Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0	-81%	1.2%	0.8%	2.5%	4.5%
Ayurveda and the Mind	Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0	-77%	0.5%	0.3%	1.2%	2%
Healing the Fragmented Selves of Trauma Survivors	Mistral-7B-Instruct-v0.2	-75%				2%
A Secure Base	Mistral-7B-Instruct-v0.2	-84%	0.3%	0.1%	0.3%	0.7%
The Body Keeps the Score	Mistral-7B-Instruct-v0.2	-74%	0.1%	0.2%	0.3%	0.5%
Complete Book of Chakras	Mistral-7B-Instruct-v0.2	-70%	0.3%	0.3%	0.4%	1.1%
50 Years of Attachment Theory	Mistral-7B-Instruct-v0.2	-70%	1.1%	0.4%	2.1%	3.7%
Attachment Disturbances in Adults	Mistral-7B-Instruct-v0.2	-62%	1.1%	1.2%	0.7%	3.1%
Psychology Major's Companion	Mistral-7B-Instruct-v0.2	-62%	1.3%	1.2%	0.4%	2.9%
Psychology in Your Life	Mistral-7B-Instruct-v0.2	-74%	0.6%	0.4%	0.5%	1.6%

Completed Book Summaries

In parentheses is the page count of the original.

Anodea Judith (436 pages)
Stanley Rosenberg (335 Pages)
Dr. David Frawley (181 Pages)
Janina Fisher (367 Pages)
John Bowlby (133 Pages)
Bessel van der Kolk (454 Pages)
Stephen Porges (37 pages)
Llewellyn's Complete Book of Chakras Cyndi Dale (999 pages)
- s
(54 pages)
(477 Pages)
Dana S. Dunn, Jane S. Halonen (308 Pages)
Walter Wink (5 Pages)
Sarah Grison and Michael S. Gazzaniga (1072 Pages)

Walkthrough

If you are interested to follow my steps more closely, check out the containing scripts and examples.

Conclusion

Now that I have my processes refined, and feel confident working with prompt formats, I will conduct further tests. In fact, I have already conducted further tests and rankings (I will publish those next), but of course I will do more tests and continue learning!

I still believe if you want to get the best results for whatever task you perform with AI, you ought to run your own experiments and see what works best. Don’t rely solely on popular model rankings, but use them to guide your own research.