
How to Build Your Own Voice Assistant and Run It Locally Using Whisper + Ollama + Bark

by Duy Huynh, 13m, 2024/04/02

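Too Long; Didn't Read

Voice-based interaction: Users can start and stop recording their voice input, and the assistant responds by playing back the generated audio. Conversational context: The assistant maintains the context of the conversation, enabling more coherent and relevant responses. Using the Llama-2 language model allows the assistant to provide concise and focused responses.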


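Following my latest post on how to build your own RAG and run it locally, today we are taking it a step further: in addition to the conversational abilities of a large language model, we are adding listening and speaking capabilities. The idea is simple. We will create a voice assistant, reminiscent of Jarvis or Friday from the iconic Iron Man movies, that can run offline on your computer.

Since this is an introductory tutorial, I will implement it in Python and keep it simple enough for beginners. At the end, I will offer some guidance on how to extend the application.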

Tech Stack

First, you should set up a virtual Python environment. There are several options for this, including pyenv, virtualenv, poetry, and others that serve a similar purpose. Personally, I will use Poetry for this tutorial. Here are the key libraries you will need to install:

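rich: for visually appealing console output.
Whisper: a robust tool for speech-to-text conversion.
Bark: a state-of-the-art library for text-to-speech synthesis with high-quality audio output.
Langchain: a straightforward library for interfacing with Large Language Models (LLMs).
sounddevice and companion audio libraries: essential for audio recording and playback.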

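For a detailed list of dependencies, refer to the link.

The most critical component here is the Large Language Model (LLM) backend, for which we will use Ollama. Ollama is widely recognized as a popular tool for running and serving LLMs offline. If you are new to Ollama, see my previous article on offline RAG. Basically, you just need to download the Ollama application, pull your preferred model, and run it.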

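Architecture

With everything set up, we can move on to the next step. Below is the overall architecture of the application, which consists of three main components:

Speech Recognition: We use OpenAI's Whisper to convert spoken language into text. Because Whisper is trained on diverse datasets, it handles a wide range of languages and dialects.

Conversational Chain: For the conversational capabilities, we employ the Langchain interface for the model served by Ollama. This keeps the conversational flow simple to manage.

Speech Synthesizer: Text-to-speech conversion is handled by Bark, a state-of-the-art model from Suno AI known for its lifelike speech generation.

The workflow is straightforward: record speech, transcribe it to text, generate a response using an LLM, and vocalize the response using Bark.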
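As a rough illustration of that workflow, here is a minimal sketch of a single assistant turn. The function `run_turn` and the three `fake_*` helpers are hypothetical names introduced only for this example; they stand in for Whisper, the Ollama-served LLM, and Bark so that the sketch runs without any model installed.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn: recorded audio -> text -> reply text -> reply audio."""
    text = transcribe(audio)    # speech recognition (Whisper in this article)
    reply = generate(text)      # LLM response (a model served by Ollama)
    return synthesize(reply)    # text-to-speech (Bark)


# Hypothetical stand-ins; the real components are introduced later.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_transcribe, fake_generate, fake_synthesize)
print(reply_audio)
```

Passing the three stages in as callables is just one way to keep them swappable; the article's own code wires them together directly in the main loop.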

Sequence diagram for the voice assistant with Whisper, Ollama, and Bark.

Implementation

The implementation begins with a TextToSpeechService based on Bark, which provides methods to synthesize speech from text and to handle longer text inputs, as follows:

    import nltk
    import torch
    import warnings
    import numpy as np
    from transformers import AutoProcessor, BarkModel

    warnings.filterwarnings(
        "ignore",
        message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
    )


    class TextToSpeechService:
        def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
            """
            Initializes the TextToSpeechService class.

            Args:
                device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
                Defaults to "cuda" if available, otherwise "cpu".
            """
            self.device = device
            self.processor = AutoProcessor.from_pretrained("suno/bark-small")
            self.model = BarkModel.from_pretrained("suno/bark-small")
            self.model.to(self.device)

        def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
            """
            Synthesizes audio from the given text using the specified voice preset.

            Args:
                text (str): The input text to be synthesized.
                voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

            Returns:
                tuple: A tuple containing the sample rate and the generated audio array.
            """
            inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            with torch.no_grad():
                audio_array = self.model.generate(**inputs, pad_token_id=10000)

            audio_array = audio_array.cpu().numpy().squeeze()
            sample_rate = self.model.generation_config.sample_rate
            return sample_rate, audio_array

        def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
            """
            Synthesizes audio from the given long-form text using the specified voice preset.

            Args:
                text (str): The input text to be synthesized.
                voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

            Returns:
                tuple: A tuple containing the sample rate and the generated audio array.
            """
            pieces = []
            sentences = nltk.sent_tokenize(text)
            # 0.25 seconds of silence inserted after each sentence
            silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

            for sent in sentences:
                sample_rate, audio_array = self.synthesize(sent, voice_preset)
                pieces += [audio_array, silence.copy()]

            return self.model.generation_config.sample_rate, np.concatenate(pieces)

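Initialization (__init__): The class takes an optional device parameter, which specifies the device used for the model (cuda if a GPU is available, otherwise cpu). It loads the Bark model and the corresponding processor from the suno/bark-small pre-trained model. You can also use the larger version by passing suno/bark to the model loader.

Synthesize (synthesize): This method takes a text input and a voice_preset parameter, which specifies the voice to use for the synthesis. Other voice_preset values are available. The method uses the processor to prepare the input text and voice preset, then generates the audio array with model.generate(). The generated audio array is converted to a NumPy array, and the sample rate is returned along with it.

Long-form synthesize (long_form_synthesize): This method is used to synthesize longer text inputs. It first splits the input text into sentences using nltk.sent_tokenize. For each sentence, it calls the synthesize method to generate an audio array. It then concatenates the generated audio arrays, adding a short silence (0.25 seconds) after each sentence.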
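To make the sentence-by-sentence idea concrete, here is a small standard-library sketch of the same pattern. It is not the Bark-based code: `split_sentences` is a naive regex splitter standing in for nltk.sent_tokenize, `fake_synthesize` is a hypothetical synthesizer that returns one sample per character, and `SAMPLE_RATE` is an assumed value chosen for illustration.

```python
import re
from typing import Callable, List, Tuple

SAMPLE_RATE = 24_000  # assumed sample rate, for illustration only


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def long_form(
    text: str,
    synthesize: Callable[[str], List[float]],
    pause_seconds: float = 0.25,
) -> Tuple[int, List[float]]:
    """Synthesize each sentence separately and append a pause after each one."""
    silence = [0.0] * int(pause_seconds * SAMPLE_RATE)
    samples: List[float] = []
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence))
        samples.extend(silence)
    return SAMPLE_RATE, samples


def fake_synthesize(sentence: str) -> List[float]:
    # Hypothetical synthesizer: one constant sample per character.
    return [0.5] * len(sentence)


rate, audio = long_form("Hello there. How are you?", fake_synthesize)
print(rate, len(audio))
```

Splitting by sentence keeps each call to the synthesizer short, which is the reason the article's long_form_synthesize exists in the first place.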

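Now that the TextToSpeechService is set up, we need to prepare the Ollama server to serve the large language model (LLM). To do this, follow these steps:

Pull the latest Llama-2 model: Run the following command to download the latest Llama-2 model from the Ollama repository: ollama pull llama2

Start the Ollama server: If the server is not already running, start it with the command ollama serve

Once these steps are complete, the application can use the Ollama server and the Llama-2 model to generate responses to user input.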
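Before launching the assistant, it can help to confirm that the server is reachable and that the model has been pulled. The sketch below is one possible check, assuming Ollama is listening on its default local address (http://localhost:11434) and using its /api/tags endpoint, which lists locally available models. The helper name `list_ollama_models` is hypothetical.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 2.0
) -> Optional[List[str]]:
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and HTTP errors;
        # ValueError covers a response that is not valid JSON.
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server not reachable; try running `ollama serve`.")
    elif not any(name.startswith("llama2") for name in models):
        print("Llama-2 not found; try running `ollama pull llama2`.")
    else:
        print("Ollama is ready.")
```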

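Next, we move on to the main application logic. First, we need to initialize the following components:

Rich Console: We use the rich library to create a nicer interactive console for the user in the terminal.

Whisper Speech-to-Text: We initialize a Whisper speech recognition model, a state-of-the-art open-source speech recognition system developed by OpenAI. We use the base English model (base.en) to transcribe user input.

Bark Text-to-Speech: We initialize an instance of the Bark text-to-speech synthesizer implemented above.

Conversational Chain: We use the built-in ConversationChain from the Langchain library, which provides a template for managing the conversational flow, and configure it to use the Llama-2 language model with the Ollama backend.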

    import time
    import threading
    import numpy as np
    import whisper
    import sounddevice as sd
    from queue import Queue
    from rich.console import Console
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationChain
    from langchain.prompts import PromptTemplate
    from langchain_community.llms import Ollama
    from tts import TextToSpeechService

    console = Console()
    stt = whisper.load_model("base.en")
    tts = TextToSpeechService()

    template = """
    You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

    The conversation transcript is as follows:
    {history}

    And here is the user's follow-up: {input}

    Your response:
    """
    PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
    chain = ConversationChain(
        prompt=PROMPT,
        verbose=False,
        memory=ConversationBufferMemory(ai_prefix="Assistant:"),
        llm=Ollama(),
    )
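To show what the {history} and {input} placeholders are doing, here is a simplified, standard-library sketch of a buffer-style memory feeding a prompt template. The `BufferMemory` class and `build_prompt` function are hypothetical stand-ins written for this example; they are not Langchain's implementation, which handles more cases.

```python
from typing import List

TEMPLATE = (
    "You are a helpful and friendly AI assistant.\n"
    "The conversation transcript is as follows:\n"
    "{history}\n"
    "And here is the user's follow-up: {input}\n"
    "Your response:"
)


class BufferMemory:
    """Hypothetical buffer memory: keeps every turn verbatim, in order."""

    def __init__(self, human_prefix: str = "Human", ai_prefix: str = "Assistant"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns: List[str] = []

    def add_turn(self, user_text: str, ai_text: str) -> None:
        self.turns.append(f"{self.human_prefix}: {user_text}")
        self.turns.append(f"{self.ai_prefix}: {ai_text}")

    def history(self) -> str:
        return "\n".join(self.turns)


def build_prompt(memory: BufferMemory, user_input: str) -> str:
    """Fill the template with the stored transcript and the new user input."""
    return TEMPLATE.format(history=memory.history(), input=user_input)


memory = BufferMemory()
memory.add_turn("Hi, I'm Sam.", "Hello Sam, how can I help?")
prompt = build_prompt(memory, "What is my name?")
print(prompt)
```

Because the whole transcript is replayed on every turn, the model can answer follow-up questions that depend on earlier turns, at the cost of a prompt that grows with the conversation.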

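Next, let's define the necessary functions:

record_audio: This function runs in a separate thread and captures audio data from the user's microphone using sounddevice.RawInputStream. The callback function is called whenever new audio data is available and puts the data into data_queue for further processing.

transcribe: This function uses the Whisper instance to transcribe the audio data collected from data_queue into text.

get_llm_response: This function feeds the current conversation context to the Llama-2 language model (via the Langchain ConversationChain) and retrieves the generated text response.

play_audio: This function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user using a sound playback library (e.g., sounddevice).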

    def record_audio(stop_event, data_queue):
        """
        Captures audio data from the user's microphone and adds it to a queue for further processing.

        Args:
            stop_event (threading.Event): An event that, when set, signals the function to stop recording.
            data_queue (queue.Queue): A queue to which the recorded audio data will be added.

        Returns:
            None
        """

        def callback(indata, frames, time, status):
            if status:
                console.print(status)
            data_queue.put(bytes(indata))

        with sd.RawInputStream(
            samplerate=16000, dtype="int16", channels=1, callback=callback
        ):
            while not stop_event.is_set():
                time.sleep(0.1)


    def transcribe(audio_np: np.ndarray) -> str:
        """
        Transcribes the given audio data using the Whisper speech recognition model.

        Args:
            audio_np (numpy.ndarray): The audio data to be transcribed.

        Returns:
            str: The transcribed text.
        """
        result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
        text = result["text"].strip()
        return text


    def get_llm_response(text: str) -> str:
        """
        Generates a response to the given text using the Llama-2 language model.

        Args:
            text (str): The input text to be processed.

        Returns:
            str: The generated response.
        """
        response = chain.predict(input=text)
        if response.startswith("Assistant:"):
            response = response[len("Assistant:") :].strip()
        return response


    def play_audio(sample_rate, audio_array):
        """
        Plays the given audio data using the sounddevice library.

        Args:
            sample_rate (int): The sample rate of the audio data.
            audio_array (numpy.ndarray): The audio data to be played.

        Returns:
            None
        """
        sd.play(audio_array, sample_rate)
        sd.wait()
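The recording function relies on a common pattern: a worker thread fills a queue until an event tells it to stop. Here is a minimal sketch of that pattern without a microphone. The names `fake_record` and `record_for` are hypothetical, and the worker emits a fixed two-byte chunk instead of real audio.

```python
import threading
import time
from queue import Queue


def fake_record(stop_event: threading.Event, data_queue: Queue, chunk: bytes = b"\x00\x01") -> None:
    """Hypothetical stand-in for the microphone callback: emit a chunk every 10 ms until stopped."""
    while True:
        data_queue.put(chunk)
        # wait() returns True as soon as the event is set, ending the loop promptly.
        if stop_event.wait(0.01):
            break


def record_for(seconds: float) -> bytes:
    """Run the worker for a fixed time, then stop it and collect everything it queued."""
    data_queue: Queue = Queue()
    stop_event = threading.Event()
    worker = threading.Thread(target=fake_record, args=(stop_event, data_queue))
    worker.start()
    time.sleep(seconds)  # the article waits for the user to press Enter here
    stop_event.set()
    worker.join()        # after join(), no more chunks can arrive
    chunks = []
    while not data_queue.empty():
        chunks.append(data_queue.get())
    return b"".join(chunks)


audio = record_for(0.05)
print(len(audio))
```

Joining the thread before draining the queue matters: it guarantees the producer has finished, so the collected bytes are complete.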

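Next, we define the main application loop. The loop guides the user through the conversational interaction as follows:

The user is prompted to press Enter to start recording their input.
Once the user presses Enter, the record_audio function is called in a separate thread to capture the user's audio input.
When the user presses Enter again to stop the recording, the audio data is transcribed using the transcribe function.
The transcribed text is passed to the get_llm_response function, which generates a response using the Llama-2 language model.
The generated response is printed to the console and played back to the user using the play_audio function.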

    if __name__ == "__main__":
        console.print("[cyan]Assistant started! Press Ctrl+C to exit.")

        try:
            while True:
                console.input(
                    "Press Enter to start recording, then press Enter again to stop."
                )

                data_queue = Queue()  # type: ignore[var-annotated]
                stop_event = threading.Event()
                recording_thread = threading.Thread(
                    target=record_audio,
                    args=(stop_event, data_queue),
                )
                recording_thread.start()

                input()
                stop_event.set()
                recording_thread.join()

                # Convert the raw int16 bytes to float32 samples in [-1.0, 1.0)
                audio_data = b"".join(list(data_queue.queue))
                audio_np = (
                    np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
                )

                if audio_np.size > 0:
                    with console.status("Transcribing...", spinner="earth"):
                        text = transcribe(audio_np)
                    console.print(f"[yellow]You: {text}")

                    with console.status("Generating response...", spinner="earth"):
                        response = get_llm_response(text)
                        sample_rate, audio_array = tts.long_form_synthesize(response)

                    console.print(f"[cyan]Assistant: {response}")
                    play_audio(sample_rate, audio_array)
                else:
                    console.print(
                        "[red]No audio recorded. Please ensure your microphone is working."
                    )

        except KeyboardInterrupt:
            console.print("\n[red]Exiting...")

        console.print("[blue]Session ended.")
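One step in the loop that is easy to overlook is the conversion of the recorded bytes into floating-point samples: each signed 16-bit value is divided by 32768 so that the result falls in [-1.0, 1.0). The article does this with NumPy; the sketch below shows the same arithmetic using only the standard library, with a hypothetical helper named `pcm16_to_floats`.

```python
import array
from typing import List


def pcm16_to_floats(raw: bytes) -> List[float]:
    """Convert native-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")  # "h" = signed short (16-bit)
    samples.frombytes(raw)      # raises ValueError if len(raw) is odd
    return [s / 32768.0 for s in samples]


# Three samples: silence, half of full scale, and the most negative value.
raw = array.array("h", [0, 16384, -32768]).tobytes()
print(pcm16_to_floats(raw))  # [0.0, 0.5, -1.0]
```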

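Result

Once everything is put together, we can run the application as shown in the video above. Because the Bark model is large even in its smaller version, the application runs quite slowly on my MacBook, so I sped up the video a little. If you have a CUDA-enabled computer, it will likely run faster. Here are the key features of the application:

Voice-based interaction: Users can start and stop recording their voice input, and the assistant responds by playing back the generated audio.

Conversational context: The assistant maintains the context of the conversation, enabling more coherent and relevant responses. Using the Llama-2 language model allows the assistant to provide concise and focused responses.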

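If you aim to bring this application to a production-ready state, I recommend the following enhancements:

Performance optimization: Incorporate optimized versions of the models, such as whisper.cpp, llama.cpp, and bark.cpp, which are designed to improve performance, especially on lower-end computers.

Customizable bot prompts: Implement a system that lets users customize the bot's persona and prompt, enabling different types of assistants (e.g., personal, professional, or domain-specific).

Graphical User Interface (GUI): Develop a user-friendly GUI to improve the overall user experience and make the application more accessible and visually appealing.

Multimodal capabilities: Extend the application to support multimodal interaction, such as generating and displaying images, diagrams, or other visual content in addition to voice-based responses.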

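Finally, our simple voice assistant application is complete. The full code is available via the link. This combination of speech recognition, language modeling, and text-to-speech shows how something that sounds difficult can be built and run on your own computer. Happy coding, and don't forget to subscribe so you don't miss the latest articles on AI and programming.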
