在大多数情况下,公司通过简单地直接查询 OLTP 数据库来保持非常基本的搜索级别。请求可能如下所示: SELECT id, title FROM, entities WHERE, description LIKE '%bow%
。
这些在处理复杂查询和使用过滤器方面往往相对更快并且更好。但是一旦将这些搜索引擎与谷歌搜索或 DuckDuckGo 等同类搜索引擎进行比较,就会发现开源解决方案无法构建适当的搜索上下文和查询模式——如果用户提供模糊的搜索请求,它们将无法理解查询。
这是因为从查询中提取意义是一项自然语言任务,没有 AI 组件就不太可能解决。 GPT-3擅长这项任务。
OpenAI 提供
text-search-ada-doc-001: 1024 text-search-babbage-doc-001: 2048 text-search-curie-doc-001: 4096 text-search-davinci-doc-001: 12288
文档通常很长,查询通常很短且不完整。因此,考虑到内容的密度和大小,任何文档的矢量化都与任何查询的矢量化有很大不同。 OpenAI 知道这一点,因此他们提供了两个配对模型, -doc
和-query
:
text-search-ada-query-001: 1024 text-search-babbage-query-001: 2048 text-search-curie-queryc-001: 4096 text-search-davinci-query-001: 12288
重要的是要注意查询和文档必须都使用相同的模型系列并且具有相同的输出向量长度。
通过示例可能最容易观察和理解此搜索解决方案的强大功能。对于这个例子,让我们借鉴
数据集包含大量列,但我们的矢量化过程将仅围绕标题和概述列构建。
Title: Harry Potter and the Half-Blood Prince Overview: As Harry begins his sixth year at Hogwarts, he discovers an old book marked as 'Property of the Half-Blood Prince', and begins to learn more about Lord Voldemort's dark past.
datafile_path = "./tmdb_5000_movies.csv" df = pd.read_csv(datafile_path) def combined_info(row): columns = ['title', 'overview'] columns_to_join = [f"{column.capitalize()}: {row[column]}" for column in columns] return '\n'.join(columns_to_join) df['combined_info'] = df.apply(lambda row: combined_info(row), axis=1)
def get_embedding(text, model="text-search-babbage-doc-001"): text = text.replace("\n", " ") return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding'] get_embedding(df['combined_info'][0])
此代码块输出一个列表,其大小等于模型正在运行的参数,在text-search-babbage-doc-001
情况下为 2048。
df['combined_info_search'] = df['combined_info'].apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001')) df.to_csv('./tmdb_5000_movies_search.csv', index=False)
列combined_info_search
将保存 combined_text 的矢量表示。
from openai.embeddings_utils import get_embedding, cosine_similarity def search_movies(df, query, n=3, pprint=True): embedding = get_embedding( query, engine="text-search-babbage-query-001" ) df["similarities"] = df.combined_info_search.apply(lambda x: cosine_similarity(x, embedding)) res = ( df.sort_values("similarities", ascending=False) .head(n) .combined_info ) if pprint: for r in res: print(r[:200]) print() return res res = search_movies(df, "movie about the wizardry school", n=3)
Title: Harry Potter and the Philosopher's StoneOverview: Harry Potter has lived under the stairs at his aunt and uncle's house his whole life. But on his 11th birthday, he learns he's a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school's kindly headmaster, Harry uncovers the truth about his parents' deaths — and about the villain who's to blame. Title: Harry Potter and the Goblet of FireOverview: Harry starts his fourth year at Hogwarts, competes in the treacherous Triwizard Tournament and faces the evil Lord Voldemort. Ron and Hermione help Harry manage the pressure — but Voldemort lurks, awaiting his chance to destroy Harry and all that he stands for. Title: Harry Potter and the Prisoner of AzkabanOverview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help.
Title: Harry Potter and the Philosopher's StoneOverview: Harry Potter has lived under the stairs at his aunt and uncle's house his whole life. But on his 11th birthday, he learns he's a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school's kindly headmaster, Harry uncovers the truth about his parents' deaths — and about the villain who's to blame. Title: Dumb and Dumberer: When Harry Met LloydOverview: This wacky prequel to the 1994 blockbuster goes back to the lame-brained Harry and Lloyd's days as classmates at a Rhode Island high school, where the unprincipled principal puts the pair in remedial courses as part of a scheme to fleece the school. Title: Harry Potter and the Prisoner of AzkabanOverview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help.
您可能想知道“阿呆与笨蛋:当哈利遇见劳埃德”在这里做什么?值得庆幸的是,这个问题没有在参数更多的参数上重现。
在上述示例中,我们使用了
距离计算算法倾向于用单个数字表示查询和文档之间的这种相似性(差异)。然而,我们不能依赖
蛮力解决方案的搜索时间复杂度为O(n * d * n * log(n)) 。参数d取决于模型(在 Babbage 的情况下,它等于 2048),而由于排序步骤,我们有O(nlog(n))块。
在搜索解决方案中,查询应该尽快完成。而这个代价通常是在训练和预缓存时间方面付出的。我们可以使用像 kd 树、r 树或 ball 树这样的数据结构。考虑来自的文章
Text-search-babbage-query-001 Number of dimensions: 2048 Number of queries: 100 Average duration: 225ms Median duration: 207ms Max duration: 1301ms Min duration: 176ms
Text-search-ada-query-002 Number of dimensions: 1536 Number of queries: 100 Average duration: 264ms Median duration: 250ms Max duration: 859ms Min duration: 215ms
Text-search-davinci-query-001 Number of dimensions: 12288 Number of queries: 100 Average duration: 379ms Median duration: 364ms Max duration: 1161ms Min duration: 271ms
参考
或者,您可以考虑尝试
根据 ElasticSearch 的
此外,还值得补充的是,微软最近 ____ Bing 搜索引擎现在使用 GPT 3.5 的新升级版本,称为“Prometheus”,最初是为搜索开发的。根据公告,新的 Prometheus 语言模型允许 Bing 增加相关性、更准确地注释片段、提供更新鲜的结果、理解地理定位并提高安全性。这可能会为将自回归语言模型用于搜索解决方案开辟新的可能性,我们一定会继续关注这一点。
参考: