The key component here is the large language model (LLM) backend, for which we will use Ollama. Ollama is widely regarded as one of the most popular tools for running and serving LLMs offline. If you are not familiar with it, I recommend reading my previous article on offline RAG. In short, you simply download the Ollama application, pull your preferred model, and run it.
Sequence diagram of the Whisper, Ollama, and Bark voice assistant.
The implementation starts with a `TextToSpeechService` class based on Bark, combining a method for synthesizing speech from text with a method for seamlessly handling longer text inputs, as shown below:
```python
import nltk
import torch
import warnings
import numpy as np
from transformers import AutoProcessor, BarkModel

warnings.filterwarnings(
    "ignore",
    message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
)


class TextToSpeechService:
    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        """
        Initializes the TextToSpeechService class.

        Args:
            device (str, optional): The device to be used for the model, either "cuda" if a GPU is available
                or "cpu". Defaults to "cuda" if available, otherwise "cpu".
        """
        self.device = device
        self.processor = AutoProcessor.from_pretrained("suno/bark-small")
        self.model = BarkModel.from_pretrained("suno/bark-small")
        self.model.to(self.device)

    def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis.
                Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            audio_array = self.model.generate(**inputs, pad_token_id=10000)

        audio_array = audio_array.cpu().numpy().squeeze()
        sample_rate = self.model.generation_config.sample_rate
        return sample_rate, audio_array

    def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given long-form text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis.
                Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        pieces = []
        sentences = nltk.sent_tokenize(text)
        silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

        for sent in sentences:
            sample_rate, audio_array = self.synthesize(sent, voice_preset)
            pieces += [audio_array, silence.copy()]

        return self.model.generation_config.sample_rate, np.concatenate(pieces)
```
- `__init__`: The class takes an optional `device` parameter that specifies which device to run the model on (`cuda` if a GPU is available, otherwise `cpu`). It loads the Bark model and the corresponding processor from the `suno/bark-small` pretrained checkpoint. You can also use the large version by passing `suno/bark` to the model loader.
- `synthesize`: This method takes a `text` input and a `voice_preset` parameter that specifies the voice to use for the synthesis; other `voice_preset` values are available as well. It uses the `processor` to prepare the input text and voice preset, then generates the audio array with the `model.generate()` method. The generated audio is converted to a NumPy array and returned together with the sample rate.
- `long_form_synthesize`: This method synthesizes longer text inputs. It first splits the input text into sentences using the `nltk.sent_tokenize` function. For each sentence, it calls the `synthesize` method to generate an audio array, then concatenates the arrays, inserting a short silence (0.25 seconds) between sentences.
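To sanity-check the service before wiring it into the app, here is a minimal sketch that synthesizes a couple of sentences and writes them to a WAV file. It assumes the class above is saved as `tts.py` (it is imported under that name later) and that the NLTK `punkt` tokenizer data has been downloaded; the output filename is just a placeholder.

```python
import nltk
import numpy as np
from scipy.io.wavfile import write as write_wav

from tts import TextToSpeechService

# long_form_synthesize relies on nltk.sent_tokenize, which needs the punkt data.
# Newer NLTK releases may also require nltk.download("punkt_tab").
nltk.download("punkt")

tts = TextToSpeechService()

# Synthesize a short multi-sentence reply and save it for inspection.
sample_rate, audio_array = tts.long_form_synthesize(
    "Hello there. This is a quick test of the Bark text-to-speech service."
)

# Bark returns float audio roughly in [-1, 1]; convert to int16 for a standard WAV file.
pcm = (np.clip(audio_array, -1.0, 1.0) * 32767).astype(np.int16)
write_wav("bark_test.wav", sample_rate, pcm)
```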
Now that we have the `TextToSpeechService` set up, we need to prepare the Ollama server to serve the large language model (LLM). To do this, follow these steps:
1. Pull the Llama-2 model: `ollama pull llama2`.
2. Start the Ollama server, if it is not already running: `ollama serve`.
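If you want to confirm from Python that the server is reachable and the model has been pulled, the optional sketch below queries Ollama's local REST API. It assumes the default endpoint at `http://localhost:11434`; adjust it if you changed the host or port.

```python
import json
from urllib.request import urlopen

# Ollama exposes a local REST API; /api/tags lists the models that have been pulled.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default host/port

with urlopen(OLLAMA_TAGS_URL) as resp:
    models = [m["name"] for m in json.load(resp).get("models", [])]

print("Available models:", models)
if not any(name.startswith("llama2") for name in models):
    print("llama2 has not been pulled yet -- run `ollama pull llama2` first.")
```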
Next, we need two more pieces:

- Speech recognition: we use OpenAI's Whisper (the `base.en` model) to transcribe the user's input.
- Conversational chain: we use Langchain's `ConversationChain`, which provides a template for managing the conversation flow. We configure it to use the Llama-2 language model through the Ollama backend. (A quick standalone smoke test of both pieces follows the code block below.)

```python
import time
import threading
import numpy as np
import whisper
import sounddevice as sd
from queue import Queue

from rich.console import Console
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama

from tts import TextToSpeechService

console = Console()
stt = whisper.load_model("base.en")
tts = TextToSpeechService()

template = """
You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

The conversation transcript is as follows:
{history}

And here is the user's follow-up: {input}

Your response:
"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
chain = ConversationChain(
    prompt=PROMPT,
    verbose=False,
    memory=ConversationBufferMemory(ai_prefix="Assistant:"),
    llm=Ollama(),
)
```
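Before wiring everything together, it can help to smoke-test the transcription model and the LLM in isolation. The sketch below is optional; it assumes the Ollama server from the previous step is running with `llama2` pulled, and the audio file name `question.wav` and the sample prompt are only placeholders.

```python
import whisper
from langchain_community.llms import Ollama

# 1) Speech recognition: transcribe a short recording with the base.en model.
stt = whisper.load_model("base.en")
result = stt.transcribe("question.wav", fp16=False)  # placeholder file; set fp16=True on a GPU
print("Transcript:", result["text"].strip())

# 2) Language model: send a single prompt straight to the local Ollama server.
llm = Ollama(model="llama2")
print("LLM reply:", llm.invoke("Reply in under 20 words: what is Bark?"))
```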
Now, let's define the necessary functions:

- `record_audio`: This function runs in a separate thread, capturing audio data from the user's microphone with `sounddevice.RawInputStream`. A callback is invoked whenever new audio data becomes available, and it puts the data into the `data_queue` for further processing.
- `transcribe`: This function uses the Whisper instance to transcribe the audio data from the `data_queue` into text.
- `get_llm_response`: This function feeds the current conversation context to the Llama-2 language model (via Langchain's `ConversationChain`) and retrieves the generated text response.
- `play_audio`: This function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user with a sound playback library (here, `sounddevice`).

```python
def record_audio(stop_event, data_queue):
    """
    Captures audio data from the user's microphone and adds it to a queue for further processing.

    Args:
        stop_event (threading.Event): An event that, when set, signals the function to stop recording.
        data_queue (queue.Queue): A queue to which the recorded audio data will be added.

    Returns:
        None
    """
    def callback(indata, frames, time, status):
        if status:
            console.print(status)
        data_queue.put(bytes(indata))

    with sd.RawInputStream(
        samplerate=16000, dtype="int16", channels=1, callback=callback
    ):
        while not stop_event.is_set():
            time.sleep(0.1)


def transcribe(audio_np: np.ndarray) -> str:
    """
    Transcribes the given audio data using the Whisper speech recognition model.

    Args:
        audio_np (numpy.ndarray): The audio data to be transcribed.

    Returns:
        str: The transcribed text.
    """
    result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
    text = result["text"].strip()
    return text


def get_llm_response(text: str) -> str:
    """
    Generates a response to the given text using the Llama-2 language model.

    Args:
        text (str): The input text to be processed.

    Returns:
        str: The generated response.
    """
    response = chain.predict(input=text)
    if response.startswith("Assistant:"):
        response = response[len("Assistant:") :].strip()
    return response


def play_audio(sample_rate, audio_array):
    """
    Plays the given audio data using the sounddevice library.

    Args:
        sample_rate (int): The sample rate of the audio data.
        audio_array (numpy.ndarray): The audio data to be played.

    Returns:
        None
    """
    sd.play(audio_array, sample_rate)
    sd.wait()
```
Then, we define the main application loop. The main application loop guides the user through the conversational interaction as follows:
1. Once the user presses Enter, the `record_audio` function is invoked in a separate thread to capture the user's audio input.
2. When the user presses Enter again to stop recording, the audio data is transcribed with the `transcribe` function.
3. The transcribed text is then passed to the `get_llm_response` function, which generates a response using the Llama-2 language model.
4. The generated response is printed to the console and played back to the user with the `play_audio` function.
```python
if __name__ == "__main__":
    console.print("[cyan]Assistant started! Press Ctrl+C to exit.")

    try:
        while True:
            console.input(
                "Press Enter to start recording, then press Enter again to stop."
            )

            data_queue = Queue()  # type: ignore[var-annotated]
            stop_event = threading.Event()
            recording_thread = threading.Thread(
                target=record_audio,
                args=(stop_event, data_queue),
            )
            recording_thread.start()

            input()
            stop_event.set()
            recording_thread.join()

            audio_data = b"".join(list(data_queue.queue))
            audio_np = (
                np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
            )

            if audio_np.size > 0:
                with console.status("Transcribing...", spinner="earth"):
                    text = transcribe(audio_np)
                console.print(f"[yellow]You: {text}")

                with console.status("Generating response...", spinner="earth"):
                    response = get_llm_response(text)
                    sample_rate, audio_array = tts.long_form_synthesize(response)

                console.print(f"[cyan]Assistant: {response}")
                play_audio(sample_rate, audio_array)
            else:
                console.print(
                    "[red]No audio recorded. Please ensure your microphone is working."
                )

    except KeyboardInterrupt:
        console.print("\n[red]Exiting...")

    console.print("[blue]Session ended.")
```