How to Build Your Own Voice Assistant and Run It Locally Using Whisper + Ollama + Bark

By Duy Huynh | 13m | 2024/04/02

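Too Long; Didn't Read

Voice-based interaction: Users can start and stop recording their voice input, and the assistant responds by playing back the generated audio. Conversational context: The assistant maintains the context of the conversation, enabling more coherent and relevant responses. The use of the Llama-2 language model allows the assistant to provide concise and focused responses.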

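After my latest post about how to build your own RAG and run it locally, today we are taking it a step further: not only implementing the conversational abilities of large language models, but also adding listening and speaking capabilities. The idea is straightforward: we are going to create a voice assistant reminiscent of Jarvis or Friday from the iconic Iron Man movies, one that can operate offline on your computer.

Since this is an introductory tutorial, I will implement it in Python and keep it simple enough for beginners. At the end, I will provide some guidance on how to scale the application.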

Tech Stack

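First, you should set up a virtual Python environment. You have several options for this, including pyenv, virtualenv, poetry, and others that serve a similar purpose. Personally, I will use Poetry for this tutorial because of my own preferences. Here are the main libraries you will need to install: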

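rich: For visually appealing console output.
Whisper: A robust tool for speech-to-text conversion.
Bark: A cutting-edge text-to-speech synthesis library that ensures high-quality audio output.
Langchain: A straightforward library for interfacing with Large Language Models (LLMs).
sounddevice and two other audio libraries: Essential for audio recording and playback.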

For the detailed list of dependencies, please refer to the link.
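Before going further, it can help to confirm that the environment actually contains what the code in this tutorial imports. The following is a minimal sketch, not part of the original project: the module names are simply the top-level import names used in the listings of this article, and `check_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumption: these are the top-level import names used by the code in this
# tutorial. Package names on PyPI may differ from import names.
REQUIRED_MODULES = (
    "rich", "whisper", "transformers", "torch", "nltk",
    "numpy", "sounddevice", "langchain", "langchain_community",
)


def check_modules(names: Iterable[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:<20} {'ok' if found else 'MISSING'}")
```

Running the script prints one line per module, so a missing dependency shows up before the models start loading.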

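The most critical component here is the Large Language Model (LLM) backend, for which we will use Ollama. Ollama is widely recognized as a popular tool for running and serving LLMs offline. If Ollama is new to you, I recommend checking out my previous article on offline RAG. Basically, you just need to download the Ollama application, pull your preferred model, and run it.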

Architecture

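Okay, if everything has been set up, let's proceed to the next step. Below is the overall architecture of our application, which fundamentally comprises 3 main components:

Speech Recognition: Using Whisper, we convert spoken language into text. Whisper's training on diverse datasets ensures its proficiency across various languages and dialects.

Conversational Chain: For the conversational capabilities, we use the Langchain interface for the Llama-2 model, which is served by Ollama. This setup provides a seamless and engaging conversational flow.

Speech Synthesizer: The transformation of text to speech is achieved through Bark, a state-of-the-art model from Suno AI, renowned for its lifelike speech production.

The workflow is straightforward: record speech, transcribe it to text, generate a response using the LLM, and vocalize the response using Bark.

Sequence diagram for the voice assistant with Whisper, Ollama, and Bark.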
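The workflow above can be sketched as a single function that chains three stages. This is only an illustration of the data flow: `run_turn` and the stand-in lambdas are hypothetical, and in the real application the three callables would wrap Whisper, the LLM chain, and Bark.

```python
from typing import Any, Callable, Tuple


def run_turn(
    audio: Any,
    transcribe: Callable[[Any], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], Any],
) -> Tuple[str, str, Any]:
    """Run one conversational turn: audio -> text -> reply -> speech."""
    text = transcribe(audio)    # speech recognition stage
    reply = respond(text)       # conversational stage
    speech = synthesize(reply)  # speech synthesis stage
    return text, reply, speech


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without any model installed.
    text, reply, speech = run_turn(
        b"raw-audio",
        transcribe=lambda audio: "hello",
        respond=lambda text: f"You said: {text}",
        synthesize=lambda reply: [0.0] * len(reply),
    )
    print(text, "|", reply, "|", len(speech), "samples")
```

Keeping the stages as plain callables makes it easy to swap one model for another without touching the rest of the loop.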

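Implementation

The implementation begins with building a TextToSpeechService based on Bark, with one method for synthesizing speech from text and another for handling longer text inputs seamlessly, as follows: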

    import nltk
    import torch
    import warnings
    import numpy as np
    from transformers import AutoProcessor, BarkModel

    warnings.filterwarnings(
        "ignore",
        message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
    )


    class TextToSpeechService:
        def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
            """
            Initializes the TextToSpeechService class.

            Args:
                device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
                Defaults to "cuda" if available, otherwise "cpu".
            """
            self.device = device
            self.processor = AutoProcessor.from_pretrained("suno/bark-small")
            self.model = BarkModel.from_pretrained("suno/bark-small")
            self.model.to(self.device)

        def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
            """
            Synthesizes audio from the given text using the specified voice preset.

            Args:
                text (str): The input text to be synthesized.
                voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

            Returns:
                tuple: A tuple containing the sample rate and the generated audio array.
            """
            inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                audio_array = self.model.generate(**inputs, pad_token_id=10000)

            audio_array = audio_array.cpu().numpy().squeeze()
            sample_rate = self.model.generation_config.sample_rate
            return sample_rate, audio_array

        def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
            """
            Synthesizes audio from the given long-form text using the specified voice preset.

            Args:
                text (str): The input text to be synthesized.
                voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

            Returns:
                tuple: A tuple containing the sample rate and the generated audio array.
            """
            pieces = []
            sentences = nltk.sent_tokenize(text)
            silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

            for sent in sentences:
                sample_rate, audio_array = self.synthesize(sent, voice_preset)
                pieces += [audio_array, silence.copy()]

            return self.model.generation_config.sample_rate, np.concatenate(pieces)

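Initialization (__init__): The class takes an optional device parameter, which specifies the device to be used for the model (cuda if a GPU is available, otherwise cpu). It loads the Bark model and the corresponding processor from the suno/bark-small pre-trained model. You can also use the large version by specifying suno/bark for the model loader.

Synthesize (synthesize): This method takes a text input and a voice_preset parameter, which specifies the voice to be used for the synthesis. Other voice_preset values are available. It uses the processor to prepare the input text and voice preset, and then generates the audio array using the model.generate() method. The generated audio array is converted to a NumPy array, and the sample rate is returned along with the audio array.

Long-form synthesize (long_form_synthesize): This method is used for synthesizing longer text inputs. It first splits the input text into sentences using the nltk.sent_tokenize function. For each sentence, it calls the synthesize method to generate the audio array. It then concatenates the generated audio arrays, adding a short silence (0.25 seconds) after each sentence.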
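To make the long-form logic easier to follow in isolation, here is a minimal sketch that uses only the standard library. It is not the Bark-based implementation: a simple regular expression stands in for nltk.sent_tokenize, plain lists of floats stand in for NumPy arrays, and `split_sentences`, `join_with_silence`, and the fake synthesizer are hypothetical names.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def join_with_silence(
    text: str,
    synthesize: Callable[[str], List[float]],
    sample_rate: int,
    pause_seconds: float = 0.25,
) -> List[float]:
    """Synthesize sentence by sentence, appending a pause after each one."""
    silence = [0.0] * int(pause_seconds * sample_rate)
    samples: List[float] = []
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence))
        samples.extend(silence)
    return samples


if __name__ == "__main__":
    # Fake synthesizer: one sample per character, just to show the lengths.
    fake = lambda sentence: [1.0] * len(sentence)
    audio = join_with_silence("Hi. Bye!", fake, sample_rate=8)
    print(len(audio), "samples")
```

Splitting by sentence keeps each call to the model short, which is the reason the service has a separate long-form method at all.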

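Now that we have the TextToSpeechService set up, we need to prepare the Ollama server for serving the large language model (LLM). To do this, follow these steps:

Pull the latest Llama-2 model: Run the following command to download the latest Llama-2 model from the Ollama repository: ollama pull llama2.

Start the Ollama server: If the server is not yet started, execute the following command to start it: ollama serve.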
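After starting the server, a quick reachability check can save some debugging later. The sketch below assumes Ollama is listening on its default local address, http://localhost:11434; if your setup uses a different host or port, pass it explicitly. `ollama_is_up` is a hypothetical helper, not part of Ollama or Langchain.

```python
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    if ollama_is_up():
        print("Ollama server is reachable.")
    else:
        print("Ollama server not reachable. Try running: ollama serve")
```

This only confirms that something answers on that address; it does not verify that the llama2 model has been pulled.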

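Once you have completed these steps, your application will be able to use the Ollama server and the Llama-2 model to generate responses to user input.

Next, we move on to the main application logic. First, we need to initialize the following components: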

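Rich console: We will use the rich library to create a better interactive console for the user within the terminal.

Whisper speech-to-text: We will initialize a Whisper speech recognition model, a state-of-the-art open-source speech recognition system developed by OpenAI. We will use the base English model (base.en) to transcribe user input.

Bark text-to-speech: We will initialize an instance of the Bark text-to-speech synthesizer implemented above.

Conversation chain: We will use the built-in ConversationChain from the Langchain library, which provides a template for managing the conversational flow. We will configure it to use the Llama-2 language model with the Ollama backend.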

    import time
    import threading
    import numpy as np
    import whisper
    import sounddevice as sd
    from queue import Queue
    from rich.console import Console
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationChain
    from langchain.prompts import PromptTemplate
    from langchain_community.llms import Ollama
    from tts import TextToSpeechService

    console = Console()
    stt = whisper.load_model("base.en")
    tts = TextToSpeechService()

    template = """
    You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

    The conversation transcript is as follows:
    {history}

    And here is the user's follow-up: {input}

    Your response:
    """
    PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
    chain = ConversationChain(
        prompt=PROMPT,
        verbose=False,
        memory=ConversationBufferMemory(ai_prefix="Assistant:"),
        llm=Ollama(),
    )
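To show what the conversation memory contributes, here is a simplified, standard-library sketch of the idea: earlier turns are stored and rendered into the {history} slot of the prompt, and the new user text goes into {input}. This is not how Langchain is implemented internally; `BufferMemory` and `build_prompt` are hypothetical names used only for illustration.

```python
from typing import List, Tuple

# Shortened version of the prompt template used in this tutorial.
TEMPLATE = (
    "The conversation transcript is as follows:\n{history}\n\n"
    "And here is the user's follow-up: {input}\n\n"
    "Your response:"
)


class BufferMemory:
    """Keeps every (user, assistant) exchange and renders it as plain text."""

    def __init__(self, human_prefix: str = "Human", ai_prefix: str = "Assistant"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns: List[Tuple[str, str]] = []

    def save(self, user_text: str, ai_text: str) -> None:
        self.turns.append((user_text, ai_text))

    def history(self) -> str:
        return "\n".join(
            f"{self.human_prefix}: {user}\n{self.ai_prefix}: {ai}"
            for user, ai in self.turns
        )


def build_prompt(memory: BufferMemory, user_text: str) -> str:
    """Fill the template with the stored history and the new user input."""
    return TEMPLATE.format(history=memory.history(), input=user_text)


if __name__ == "__main__":
    memory = BufferMemory()
    memory.save("Hi", "Hello! How can I help?")
    print(build_prompt(memory, "What did I just say?"))
```

Because the whole transcript is resent on every turn, the prompt grows with the conversation; that is the trade-off of a simple buffer.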

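Now, let's define the necessary functions:

record_audio: This function runs in a separate thread and captures audio data from the user's microphone using sounddevice.RawInputStream. The callback function is called whenever new audio data is available, and it puts the data into data_queue for further processing.

transcribe: This function uses the Whisper instance to transcribe the audio data collected from data_queue into text.

get_llm_response: This function feeds the current conversation context to the Llama-2 language model (via the Langchain ConversationChain) and retrieves the generated text response.

play_audio: This function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user using a sound playback library (e.g. sounddevice).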

    def record_audio(stop_event, data_queue):
        """
        Captures audio data from the user's microphone and adds it to a queue for further processing.

        Args:
            stop_event (threading.Event): An event that, when set, signals the function to stop recording.
            data_queue (queue.Queue): A queue to which the recorded audio data will be added.

        Returns:
            None
        """
        def callback(indata, frames, time, status):
            if status:
                console.print(status)
            data_queue.put(bytes(indata))

        with sd.RawInputStream(
            samplerate=16000, dtype="int16", channels=1, callback=callback
        ):
            while not stop_event.is_set():
                time.sleep(0.1)


    def transcribe(audio_np: np.ndarray) -> str:
        """
        Transcribes the given audio data using the Whisper speech recognition model.

        Args:
            audio_np (numpy.ndarray): The audio data to be transcribed.

        Returns:
            str: The transcribed text.
        """
        result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
        text = result["text"].strip()
        return text


    def get_llm_response(text: str) -> str:
        """
        Generates a response to the given text using the Llama-2 language model.

        Args:
            text (str): The input text to be processed.

        Returns:
            str: The generated response.
        """
        response = chain.predict(input=text)
        if response.startswith("Assistant:"):
            response = response[len("Assistant:") :].strip()
        return response


    def play_audio(sample_rate, audio_array):
        """
        Plays the given audio data using the sounddevice library.

        Args:
            sample_rate (int): The sample rate of the audio data.
            audio_array (numpy.ndarray): The audio data to be played.

        Returns:
            None
        """
        sd.play(audio_array, sample_rate)
        sd.wait()
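The recording function relies on a common pattern: a worker thread produces chunks into a Queue until a threading.Event tells it to stop. The sketch below reproduces that pattern without a microphone, so it can be run anywhere. `fake_record` and `capture_for` are hypothetical stand-ins, and the silent 320-byte chunk is an arbitrary placeholder for real audio data.

```python
import threading
from queue import Queue

CHUNK = b"\x00\x00" * 160  # placeholder: 160 silent 16-bit samples


def fake_record(stop_event: threading.Event, data_queue: Queue) -> None:
    """Stand-in for record_audio: emit chunks until stop_event is set."""
    while True:
        data_queue.put(CHUNK)
        # wait() returns True as soon as the event is set, ending the loop.
        if stop_event.wait(0.01):
            break


def capture_for(seconds: float) -> bytes:
    """Run the worker for a while, stop it, and join the collected chunks."""
    data_queue: Queue = Queue()
    stop_event = threading.Event()
    worker = threading.Thread(target=fake_record, args=(stop_event, data_queue))
    worker.start()
    stop_event.wait(seconds)  # the main thread just sleeps here
    stop_event.set()
    worker.join()

    chunks = []
    while not data_queue.empty():
        chunks.append(data_queue.get())
    return b"".join(chunks)


if __name__ == "__main__":
    data = capture_for(0.05)
    print(len(data), "bytes captured")
```

Using an Event rather than a shared boolean gives the worker a clean, thread-safe stop signal, and join() guarantees no more chunks arrive while the queue is being drained.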

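Then, we define the main application loop. The main application loop guides the user through the conversational interaction, as follows:

The user is prompted to press Enter to start recording their input.
Once the user presses Enter, the record_audio function is called in a separate thread to capture the user's audio input.
When the user presses Enter again to stop the recording, the audio data is transcribed using the transcribe function.
The transcribed text is then passed to the get_llm_response function, which generates a response using the Llama-2 language model.
The generated response is printed to the console and played back to the user using the play_audio function.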

    if __name__ == "__main__":
        console.print("[cyan]Assistant started! Press Ctrl+C to exit.")

        try:
            while True:
                console.input(
                    "Press Enter to start recording, then press Enter again to stop."
                )

                data_queue = Queue()  # type: ignore[var-annotated]
                stop_event = threading.Event()
                recording_thread = threading.Thread(
                    target=record_audio,
                    args=(stop_event, data_queue),
                )
                recording_thread.start()

                input()
                stop_event.set()
                recording_thread.join()

                audio_data = b"".join(list(data_queue.queue))
                audio_np = (
                    np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
                )

                if audio_np.size > 0:
                    with console.status("Transcribing...", spinner="earth"):
                        text = transcribe(audio_np)
                    console.print(f"[yellow]You: {text}")

                    with console.status("Generating response...", spinner="earth"):
                        response = get_llm_response(text)
                        sample_rate, audio_array = tts.long_form_synthesize(response)

                    console.print(f"[cyan]Assistant: {response}")
                    play_audio(sample_rate, audio_array)
                else:
                    console.print(
                        "[red]No audio recorded. Please ensure your microphone is working."
                    )

        except KeyboardInterrupt:
            console.print("\n[red]Exiting...")

        console.print("[blue]Session ended.")
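One step in the loop that is easy to overlook is the conversion of the raw recording into the form passed to transcribe: the microphone delivers signed 16-bit integers, and dividing by 32768 scales them into the range from -1.0 to just under 1.0. The following sketch shows the same arithmetic with the standard library only. `pcm16_to_float` is a hypothetical helper, and like np.frombuffer with np.int16 it assumes the machine's native byte order.

```python
from array import array
from typing import List


def pcm16_to_float(raw: bytes) -> List[float]:
    """Convert raw signed 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    samples = array("h")    # "h" = signed short, native byte order
    samples.frombytes(raw)  # raises ValueError on a partial sample
    return [sample / 32768.0 for sample in samples]


if __name__ == "__main__":
    raw = array("h", [0, 16384, -32768]).tobytes()
    print(pcm16_to_float(raw))
```

An empty byte string yields an empty list, which corresponds to the "No audio recorded" branch in the main loop.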

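Result

Once everything is put together, we can run the application as shown in the video. Because the Bark model is large, even in its smaller version, the application runs quite slowly on my MacBook, so I have sped up the video slightly. For those with a CUDA-enabled computer, it may run faster. Here are the key features of our application: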

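Voice-based interaction: Users can start and stop recording their voice input, and the assistant responds by playing back the generated audio.

Conversational context: The assistant maintains the context of the conversation, enabling more coherent and relevant responses. With the Llama-2 language model, the assistant can provide concise and focused responses.

For those aiming to take this application to a production-ready state, the following enhancements are recommended: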

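Performance optimization: Incorporate optimized versions of the models, such as whisper.cpp, llama.cpp, and bark.cpp, which are designed to improve performance, especially on lower-end computers.

Customizable bot prompts: Implement a system that allows users to customize the bot's persona and prompt, enabling different types of assistants (e.g. personal, professional, or domain-specific).

Graphical user interface (GUI): Develop a user-friendly GUI to enhance the overall user experience, making the application more accessible and visually appealing.

Multimodal capabilities: Extend the application to support multimodal interactions, such as the ability to generate and display images, diagrams, or other visual content in addition to voice-based responses.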
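As one possible starting point for the customizable prompt idea, the sketch below generates a prompt template from a small persona description while keeping the {history} and {input} placeholders used by the conversation chain. The `Persona` class and `make_template` function are hypothetical and not part of the original code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical description of an assistant's character."""
    name: str
    style: str
    max_words: int = 20


def make_template(persona: Persona) -> str:
    """Build a prompt template that keeps {history} and {input} as placeholders.

    Note: persona fields must not contain curly braces, because the result
    is later filled in with str.format-style substitution.
    """
    return (
        f"You are {persona.name}, {persona.style}. "
        f"Aim to provide concise responses of less than {persona.max_words} words.\n\n"
        "The conversation transcript is as follows:\n{history}\n\n"
        "And here is the user's follow-up: {input}\n\n"
        "Your response:\n"
    )


if __name__ == "__main__":
    butler = Persona("Jarvis", "a calm and formal assistant", max_words=15)
    print(make_template(butler))
```

The resulting string could be passed as the template argument when constructing the PromptTemplate shown earlier, with the same two input variables.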

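Finally, we have completed our simple voice assistant application; the complete code is available at the link. This combination of speech recognition, language modeling, and text-to-speech technologies demonstrates how we can build something that sounds difficult but can actually run on your own computer. Enjoy coding, and don't forget to subscribe so you don't miss the latest articles on AI and programming.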