Developers are the architects. They construct the framework of the application, ensuring the RAG + LLM chain is seamlessly integrated and can navigate through different scenarios effortlessly.
Prompt Engineers are the creatives. They devise scenarios and prompts that emulate real-world user interactions. They ponder the "what ifs" and push the system to deal with a broad spectrum of topics and questions.
Data Scientists are the strategists. They analyze the responses, delve into the data, and wield their statistical expertise to assess whether the AI's performance meets the mark.
Generating a Relevant Dataset: Start by creating a dataset that reflects the nuances of your domain. This dataset could be curated by experts or synthesized with the help of GPT-4 to save time, ensuring it matches your gold standard.
Defining Metrics for Success: Leverage the strengths of the master LLM to assist in defining your metrics. You have the liberty to choose whichever metrics best fit your goals, since the master LLM can handle the more complex evaluation tasks. For community standards, you may also want to look at evaluation libraries such as Ragas, which provide metrics like faithfulness, context recall, context precision, and answer similarity (a minimal usage sketch follows this list).
Automating Your Evaluation Pipeline: To keep pace with rapid development cycles, establish an automated pipeline that assesses the application's performance against your predefined metrics after every update or change. By automating the process, you ensure that your evaluation is not only thorough but also efficiently iterative, allowing for swift optimization and refinement (a minimal sketch of such an automated check appears after the evaluation code below).
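If you go with those community-standard metrics, a library such as Ragas can compute them directly over your evaluation set. The following is a minimal sketch rather than a definitive recipe: the example rows are illustrative, and the exact column names (for instance, ground_truth versus ground_truths) and metric imports depend on the Ragas version you install, so check its documentation before running it.
# A minimal sketch of scoring a RAG evaluation set with Ragas (assumptions noted above).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_recall, context_precision, answer_similarity

eval_dataset = Dataset.from_dict({
    "question": ["When was the University of California, Berkeley established?"],
    "answer": ["UC Berkeley was established in 1868."],  # the RAG application's answer
    "contexts": [["The University of California, Berkeley ... established in 1868 ..."]],  # retrieved chunks
    "ground_truth": ["The University of California, Berkeley was established in 1868."],  # gold answer
})

# Ragas uses an LLM judge under the hood (OpenAI by default), so the API key must be set.
scores = evaluate(
    eval_dataset,
    metrics=[faithfulness, context_recall, context_precision, answer_similarity],
)
print(scores)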
import os
import json
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
QA_DATASET_GENERATION_PROMPT = PromptTemplate.from_template(
"You are an expert on generate question-and-answer dataset based on a given context. You are given a context. "
"Your task is to generate a question and answer based on the context. The generated question should be able to"
" to answer by leverage the given context. And the generated question-and-answer pair must be grammatically "
"and semantically correct. Your response must be in a json format with 2 keys: question, answer. For example,"
"\n\n"
"Context: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower."
"\n\n"
"Response: {{"
"\n"
" \"question\": \"Where is France and what is it’s capital?\","
"\n"
" \"answer\": \"France is in Western Europe and it’s capital is Paris.\""
"\n"
"}}"
"\n\n"
"Context: The University of California, Berkeley is a public land-grant research university in Berkeley, California. Established in 1868 as the state's first land-grant university, it was the first campus of the University of California system and a founding member of the Association of American Universities."
"\n\n"
"Response: {{"
"\n"
" \"question\": \"When was the University of California, Berkeley established?\","
"\n"
" \"answer\": \"The University of California, Berkeley was established in 1868.\""
"\n"
"}}"
"\n\n"
"Now your task is to generate a question-and-answer dataset based on the following context:"
"\n\n"
"Context: {context}"
"\n\n"
"Response: ",
)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if OPENAI_API_KEY is None:
    raise ValueError("OPENAI_API_KEY is not set")

# The master LLM (GPT-4 Turbo preview) generates the synthetic Q&A dataset, forced into JSON mode.
llm = ChatOpenAI(
    model="gpt-4-1106-preview",
    api_key=OPENAI_API_KEY,
    temperature=0.7,
    response_format={
        "type": "json_object"
    },
)
chain = LLMChain(
    prompt=QA_DATASET_GENERATION_PROMPT,
    llm=llm
)
# Load the source PDF and split it into ~1000-character chunks.
file_loader = PyPDFLoader("./data/cidr_lakehouse.pdf")
text_splitter = CharacterTextSplitter(chunk_size=1000)
chunks = text_splitter.split_documents(file_loader.load())

# Generate two question-and-answer pairs per chunk.
questions, answers = [], []
for chunk in chunks:
    for _ in range(2):
        response = chain.invoke({
            "context": chunk
        })
        obj = json.loads(response["text"])
        questions.append(obj["question"])
        answers.append(obj["answer"])

df = pd.DataFrame({
    "question": questions,
    "answer": answers
})
df.to_csv("./data/cidr_lakehouse_qa.csv", index=False)
from tqdm import tqdm
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOllama
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
# Index the chunks with HuggingFace sentence embeddings and expose a retriever.
vector_store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())
retriever = vector_store.as_retriever()
def test_local_retrieval_qa(model: str):
    """Run the RAG chain with a local Ollama model over every question in df."""
    chain = RetrievalQA.from_llm(
        llm=ChatOllama(model=model),
        retriever=retriever,
    )
    predictions = []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        resp = chain.invoke({
            "query": row["question"]
        })
        predictions.append(resp["result"])
    # Store each model's answers in a dedicated column, e.g. "mistral_result".
    df[f"{model}_result"] = predictions
test_local_retrieval_qa("mistral")
test_local_retrieval_qa("llama2")
test_local_retrieval_qa("zephyr")
test_local_retrieval_qa("orca-mini")
test_local_retrieval_qa("phi")
df.to_csv("./data/cidr_lakehouse_qa_retrieval_prediction.csv", index=False)
import os
import json
import numpy as np
import pandas as pd
from tqdm import tqdm
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if OPENAI_API_KEY is None:
    raise ValueError("OPENAI_API_KEY is not set")
CORRECTNESS_PROMPT = PromptTemplate.from_template(
"""
Extract the following from the given question, answer, and ground truth. Your response must be in a JSON format with 3 keys, which do not need to appear in any specific order:
- statements that are present in both the answer and the ground truth
- statements present in the answer but not found in the ground truth
- relevant statements found in the ground truth but omitted in the answer
Please be concise and do not include any unnecessary information. Classify the statements (claims, facts, or opinions) using semantic matching; exact word-by-word matching is not required.
Question: What powers the sun and what is its primary function?
Answer: The sun is powered by nuclear fission, similar to nuclear reactors on Earth, and its primary function is to provide light to the solar system.
Ground truth: The sun is actually powered by nuclear fusion, not fission. In its core, hydrogen atoms fuse to form helium, releasing a tremendous amount of energy. This energy is what lights up the sun and provides heat and light, essential for life on Earth. The sun's light also plays a critical role in Earth's climate system and helps to drive the weather and ocean currents.
Extracted statements:
[
{{
"statements that are present in both the answer and the ground truth": ["The sun's primary function is to provide light"],
"statements present in the answer but not found in the ground truth": ["The sun is powered by nuclear fission", "similar to nuclear reactors on Earth"],
"relevant statements found in the ground truth but omitted in the answer": ["The sun is powered by nuclear fusion, not fission", "In its core, hydrogen atoms fuse to form helium, releasing a tremendous amount of energy", "This energy provides heat and light, essential for life on Earth", "The sun's light plays a critical role in Earth's climate system", "The sun helps to drive the weather and ocean currents"]
}}
]
Question: What is the boiling point of water?
Answer: The boiling point of water is 100 degrees Celsius at sea level.
Ground truth: The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit) at sea level, but it can change with altitude.
Extracted statements:
[
{{
"statements that are present in both the answer and the ground truth": ["The boiling point of water is 100 degrees Celsius at sea level"],
"statements present in the answer but not found in the ground truth": [],
"relevant statements found in the ground truth but omitted in the answer": ["The boiling point can change with altitude", "The boiling point of water is 212 degrees Fahrenheit at sea level"]
}}
]
Question: {question}
Answer: {answer}
Ground truth: {ground_truth}
Extracted statements:""",
)
# GPT-4 Turbo acts as the judge ("Judy"), with temperature 0 for consistent judgments.
judy_llm = ChatOpenAI(
    model="gpt-4-1106-preview",
    api_key=OPENAI_API_KEY,
    temperature=0.0,
    response_format={
        "type": "json_object"
    },
)
judy_chain = LLMChain(
    prompt=CORRECTNESS_PROMPT,
    llm=judy_llm
)
def evaluate_correctness(column_name: str):
    """Use the judge chain to score one model's answers against the ground truth."""
    key_map = {
        "TP": "statements that are present in both the answer and the ground truth",
        "FP": "statements present in the answer but not found in the ground truth",
        "FN": "relevant statements found in the ground truth but omitted in the answer",  # noqa: E501
    }
    TP, FP, FN = [], [], []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        resp = judy_chain.invoke({
            "question": row["question"],
            "answer": row[column_name],
            "ground_truth": row["answer"]
        })
        obj = json.loads(resp["text"])
        # The few-shot examples show a one-element list; unwrap it if the judge follows that format.
        if isinstance(obj, list):
            obj = obj[0]
        TP.append(len(obj[key_map["TP"]]))
        FP.append(len(obj[key_map["FP"]]))
        FN.append(len(obj[key_map["FN"]]))
    # Convert the per-row counts to numpy arrays for vectorized math.
    TP = np.array(TP)
    FP = np.array(FP)
    FN = np.array(FN)
    # Correctness is the harmonic mean (F1) of statement-level recall and precision.
    df[f"{column_name}_recall"] = TP / (TP + FN)
    df[f"{column_name}_precision"] = TP / (TP + FP)
    df[f"{column_name}_correctness"] = 2 * df[f"{column_name}_recall"] * df[f"{column_name}_precision"] / (df[f"{column_name}_recall"] + df[f"{column_name}_precision"])
evaluate_correctness("mistral_result")
evaluate_correctness("llama2_result")
evaluate_correctness("zephyr_result")
evaluate_correctness("orca-mini_result")
evaluate_correctness("phi_result")
print("|====Model====|=== Recall ===|== Precision ==|== Correctness ==|")
print(f"|mistral | {df['mistral_result_recall'].mean():.4f} | {df['mistral_result_precision'].mean():.4f} | {df['mistral_result_correctness'].mean():.4f} |")
print(f"|llama2 | {df['llama2_result_recall'].mean():.4f} | {df['llama2_result_precision'].mean():.4f} | {df['llama2_result_correctness'].mean():.4f} |")
print(f"|zephyr | {df['zephyr_result_recall'].mean():.4f} | {df['zephyr_result_precision'].mean():.4f} | {df['zephyr_result_correctness'].mean():.4f} |")
print(f"|orca-mini | {df['orca-mini_result_recall'].mean():.4f} | {df['orca-mini_result_precision'].mean():.4f} | {df['orca-mini_result_correctness'].mean():.4f} |")
print(f"|phi | {df['phi_result_recall'].mean():.4f} | {df['phi_result_precision'].mean():.4f} | {df['phi_result_correctness'].mean():.4f} |")
print("|==============================================================|")
df.to_csv("./data/cidr_lakehouse_qa_retrieval_prediction_correctness.csv", index=False)
Transitioning into the Operation Phase is like moving from dress rehearsals to opening night. Here, our RAG + LLM applications are no longer hypothetical entities; they become active participants in the daily workflows of real users. This phase is the litmus test for all the preparation and fine-tuning done in the development phase.
A/B Testing Framework: We split our user base into two segments: the control segment, which continues to use the established version of the application (Version 1), and the test segment, which tries out the new features in Version 2 (you can also run multiple A/B tests in parallel). This allows us to gather comparative data on user experience, feature receptivity, and overall performance; a minimal sketch of how users can be assigned to segments follows this list.
Operational Rollout: The operations team is tasked with the smooth rollout of both versions, ensuring that the infrastructure is robust and that any version transitions are seamless for the user.
Product Evolution: The product team, with its finger on the pulse of user feedback, works to iterate the product. This team ensures that the new features align with user needs and the overall product vision.
Analytical Insights: The analyst team rigorously examines the data collected from the A/B test. Their insights are critical in determining whether the new version outperforms the old and if it's ready for a wider release.
Performance Metrics: Key performance indicators (KPIs) are monitored to measure the success of each version. These include user engagement metrics, satisfaction scores, and the accuracy of the application's outputs.
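As a concrete illustration of the A/B testing item above, here is a minimal sketch of deterministic user assignment. The 50/50 split and the variant names are illustrative assumptions, not part of any particular framework.
# A minimal sketch of deterministic A/B assignment: hashing the user ID means the
# same user always lands in the same segment across sessions.
import hashlib

def assign_variant(user_id: str, test_fraction: float = 0.5) -> str:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range [0, 100)
    return "version_2" if bucket < test_fraction * 100 else "version_1"

print(assign_variant("user-42"))  # always returns the same variant for this user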