
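How to Build Your Own Voice Assistant and Run It Locally Using Whisper + Ollama + Bark

by Duy Huynh, 13m, 2024/04/02

Too Long; Didn't Read

Voice-based interaction: users can start and stop recording their voice input, and the assistant responds by playing back generated audio. Conversational context: the assistant maintains the context of the conversation, enabling more coherent and relevant responses. Using the Llama-2 language model allows the assistant to give concise, focused responses.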


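Following my recent post on how to build your own RAG and run it locally, today we are taking it a step further: we will not only implement the conversational abilities of a large language model but also add listening and speaking capabilities. The idea is straightforward. We will create a voice assistant, reminiscent of Jarvis or Friday from the iconic Iron Man movies, that can operate offline on your computer.

Since this is an introductory tutorial, I will implement it in Python and keep it simple enough for beginners. At the end, I will provide some guidance on how to scale the application.

Tech stack

First, you should set up a virtual Python environment. There are several options for this, including pyenv, virtualenv, Poetry, and others that serve a similar purpose. I will use Poetry for this tutorial out of personal preference. Here are several important libraries you will need to install: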

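Rich: for visually appealing console output.
Whisper: a robust tool for speech-to-text conversion.
Bark: a state-of-the-art library for text-to-speech synthesis, ensuring high-quality audio output.
Langchain: a straightforward library for interfacing with large language models (LLMs).
sounddevice and related audio libraries: essential for audio recording and playback.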

For the detailed dependency list, refer to the link.
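Before moving on, it can help to confirm that the environment actually contains what the code in this tutorial imports. The snippet below is a minimal sketch, not part of the original project: the module list is taken from the import statements used later in this post, and the helper name missing_modules is hypothetical. Note that import names and package names can differ (for example, the whisper module is normally installed as openai-whisper).

```python
import importlib.util

# Top-level import names used by the code in this tutorial.
# Assumption: package names on PyPI may differ from these import names.
REQUIRED_MODULES = [
    "rich",
    "whisper",
    "transformers",
    "torch",
    "nltk",
    "numpy",
    "sounddevice",
    "langchain",
    "langchain_community",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages such as torch.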

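The most critical component here is the large language model (LLM) backend, for which we will use Ollama. Ollama is widely recognized as a popular tool for running and serving LLMs offline. If you are new to Ollama, I recommend my previous article on offline RAG. Basically, you just need to download the Ollama application, pull your preferred model, and run it.

Architecture

Okay, if everything is set up, let's move on to the next step. Below is the overall architecture of our application, which consists of three main components:

Speech recognition: uses Whisper to convert spoken language into text. Whisper's training on diverse datasets makes it proficient across many languages and dialects.

Conversational chain: for the conversational capabilities, we use the Langchain interface for a model served with Ollama. This setup keeps the conversation flowing smoothly.

Speech synthesizer: converts text to speech with Bark, a state-of-the-art model from Suno AI known for its lifelike speech generation.

The workflow is straightforward: record speech, transcribe it to text, generate a response using the LLM, and vocalize the response using Bark.

Sequence diagram of the voice assistant with Whisper, Ollama, and Bark.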
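The four-step workflow can be sketched as a single function that chains three callables. This is only an illustration of the data flow: run_turn and the lambda stand-ins are hypothetical names, and the real Whisper, Ollama, and Bark calls appear in the implementation section.

```python
def run_turn(audio, transcribe, respond, synthesize):
    """Run one conversational turn: audio in, (text, reply, speech) out."""
    text = transcribe(audio)    # speech recognition (Whisper in this tutorial)
    reply = respond(text)       # conversational chain (LLM served by Ollama)
    speech = synthesize(reply)  # speech synthesis (Bark)
    return text, reply, speech


if __name__ == "__main__":
    # Stand-in callables so the flow can be exercised without any model.
    text, reply, speech = run_turn(
        b"\x00\x01",
        transcribe=lambda audio: "hello",
        respond=lambda text: f"You said: {text}",
        synthesize=lambda reply: [0.0] * len(reply),
    )
    print(text, "->", reply, f"({len(speech)} samples)")
```

Keeping the three stages behind plain callables makes it easy to swap any one of them later, for instance when replacing a model with an optimized variant.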

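Implementation

The implementation begins with a TextToSpeechService built on Bark. It includes one method for synthesizing speech from text and another for handling longer text inputs seamlessly, as follows: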

import nltk
import torch
import warnings
import numpy as np
from transformers import AutoProcessor, BarkModel

warnings.filterwarnings(
    "ignore",
    message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
)


class TextToSpeechService:
    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        """
        Initializes the TextToSpeechService class.

        Args:
            device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
            Defaults to "cuda" if available, otherwise "cpu".
        """
        self.device = device
        self.processor = AutoProcessor.from_pretrained("suno/bark-small")
        self.model = BarkModel.from_pretrained("suno/bark-small")
        self.model.to(self.device)

    def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            audio_array = self.model.generate(**inputs, pad_token_id=10000)

        audio_array = audio_array.cpu().numpy().squeeze()
        sample_rate = self.model.generation_config.sample_rate
        return sample_rate, audio_array

    def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given long-form text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        pieces = []
        sentences = nltk.sent_tokenize(text)
        silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

        for sent in sentences:
            sample_rate, audio_array = self.synthesize(sent, voice_preset)
            pieces += [audio_array, silence.copy()]

        return self.model.generation_config.sample_rate, np.concatenate(pieces)

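Initialization (__init__): the class takes an optional device parameter that specifies the device to use for the model (cuda if a GPU is available, otherwise cpu). It loads the Bark model and its processor from the suno/bark-small pretrained checkpoint. You can also use the large version by specifying suno/bark in the model loader.

Synthesize (synthesize): this method takes a text input and a voice_preset parameter that specifies the voice to use for synthesis. Other voice_preset values are available. It uses the processor to prepare the input text and voice preset, then generates the audio array with the model.generate() method. The generated audio array is converted to a NumPy array, and the sample rate is returned together with the audio array.

Long-form synthesize (long_form_synthesize): this method is used for synthesizing longer text inputs. It first tokenizes the input text into sentences using the nltk.sent_tokenize function. For each sentence, it calls the synthesize method to generate an audio array. It then concatenates the generated audio arrays, adding a short silence (0.25 seconds) after each sentence.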
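The stitching logic in long_form_synthesize can be tried without loading Bark. The sketch below is a simplified stand-in under stated assumptions: it uses a naive regular-expression splitter instead of nltk.sent_tokenize, plain Python lists instead of NumPy arrays, a hypothetical fake_synthesize that produces 10 ms of constant signal per character, and an assumed sample rate of 24,000 Hz (the service reads the real value from model.generation_config.sample_rate).

```python
import re

SAMPLE_RATE = 24_000  # assumed value for this sketch only


def split_sentences(text):
    """Naive splitter standing in for nltk.sent_tokenize."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def fake_synthesize(sentence):
    """Hypothetical stand-in for synthesize: 10 ms of signal per character."""
    return [0.1] * (len(sentence) * SAMPLE_RATE // 100)


def stitch(text, pause_seconds=0.25):
    """Synthesize sentence by sentence, appending a pause after each one."""
    silence = [0.0] * int(pause_seconds * SAMPLE_RATE)
    samples = []
    for sentence in split_sentences(text):
        samples.extend(fake_synthesize(sentence))
        samples.extend(silence)
    return SAMPLE_RATE, samples


if __name__ == "__main__":
    rate, samples = stitch("Hi there. How are you?")
    print(f"{len(samples)} samples, {len(samples) / rate:.2f} seconds")
```

At 24,000 Hz a 0.25 second pause is 6,000 samples, which is why the pause length is computed from the sample rate rather than hard-coded.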

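Now that the TextToSpeechService is set up, we need to prepare the Ollama server that will serve the large language model (LLM). To do this, follow these steps:

Pull the latest Llama-2 model: run the following command to download the latest Llama-2 model from the Ollama repository: ollama pull llama2

Start the Ollama server: if the server is not yet running, start it with the ollama serve command.

Once these steps are complete, the application can use the Ollama server and the Llama-2 model to generate responses to user input.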
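A quick way to confirm that the server is up and the model has been pulled is to query Ollama's HTTP API. This is a minimal sketch that assumes the server listens on its default address, http://localhost:11434, and uses the /api/tags endpoint, which lists locally available models. The function name list_ollama_models is hypothetical.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally available models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [model.get("name", "") for model in (payload.get("models") or [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama server not reachable. Try running: ollama serve")
    elif not any(name.startswith("llama2") for name in models):
        print("llama2 not found. Try running: ollama pull llama2")
    else:
        print("Ready:", ", ".join(models))
```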

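Next, we move on to the main application logic. First, we need to initialize the following components:

Rich Console: we use the Rich library to create a better interactive console for the user in the terminal.

Whisper Speech-to-Text: we initialize a Whisper speech recognition model, a state-of-the-art open-source speech recognition system developed by OpenAI. We use the base English model (base.en) to transcribe user input.

Bark Text-to-Speech: we initialize an instance of the Bark text-to-speech synthesizer implemented above.

Conversational chain: we use the built-in ConversationChain from the Langchain library, which provides a template for managing the conversational flow. We configure it to use the Llama-2 language model with the Ollama backend.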

import time
import threading
import numpy as np
import whisper
import sounddevice as sd
from queue import Queue
from rich.console import Console
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama
from tts import TextToSpeechService

console = Console()
stt = whisper.load_model("base.en")
tts = TextToSpeechService()

template = """
You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

The conversation transcript is as follows:
{history}

And here is the user's follow-up: {input}

Your response:
"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
chain = ConversationChain(
    prompt=PROMPT,
    verbose=False,
    memory=ConversationBufferMemory(ai_prefix="Assistant:"),
    llm=Ollama(),
)
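To see roughly what the chain sends to the model on each turn, the prompt can be filled in by hand. The sketch below is an approximation for illustration only: render_history and build_prompt are hypothetical helpers, and the "Human:" / "Assistant:" line format is a simplification rather than the exact output of Langchain's ConversationBufferMemory.

```python
TEMPLATE = """
You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

The conversation transcript is as follows:
{history}

And here is the user's follow-up: {input}

Your response:
"""


def render_history(turns):
    """Format past (user, assistant) pairs as a plain-text transcript."""
    return "\n".join(f"Human: {user}\nAssistant: {reply}" for user, reply in turns)


def build_prompt(turns, user_input):
    """Fill the template with the transcript so far and the new user input."""
    return TEMPLATE.format(history=render_history(turns), input=user_input)


if __name__ == "__main__":
    past_turns = [("Hi", "Hello! How can I help?")]
    print(build_prompt(past_turns, "What is Bark?"))
```

Because the whole transcript is re-sent on every turn, the prompt grows as the conversation gets longer, which is worth keeping in mind for long sessions.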

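Now let's define the necessary functions:

record_audio: this function runs in a separate thread and captures audio data from the user's microphone using sounddevice.RawInputStream. The callback function is called whenever new audio data is available, and it puts the data into data_queue for further processing.

transcribe: this function uses the Whisper instance to transcribe the audio data from data_queue into text.

get_llm_response: this function feeds the current conversation context to the Llama-2 language model (via the Langchain ConversationChain) and retrieves the generated text response.

play_audio: this function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user using a sound playback library (for example, sounddevice).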

def record_audio(stop_event, data_queue):
    """
    Captures audio data from the user's microphone and adds it to a queue for further processing.

    Args:
        stop_event (threading.Event): An event that, when set, signals the function to stop recording.
        data_queue (queue.Queue): A queue to which the recorded audio data will be added.

    Returns:
        None
    """

    def callback(indata, frames, time, status):
        if status:
            console.print(status)
        data_queue.put(bytes(indata))

    with sd.RawInputStream(
        samplerate=16000, dtype="int16", channels=1, callback=callback
    ):
        while not stop_event.is_set():
            time.sleep(0.1)


def transcribe(audio_np: np.ndarray) -> str:
    """
    Transcribes the given audio data using the Whisper speech recognition model.

    Args:
        audio_np (numpy.ndarray): The audio data to be transcribed.

    Returns:
        str: The transcribed text.
    """
    result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
    text = result["text"].strip()
    return text


def get_llm_response(text: str) -> str:
    """
    Generates a response to the given text using the Llama-2 language model.

    Args:
        text (str): The input text to be processed.

    Returns:
        str: The generated response.
    """
    response = chain.predict(input=text)
    if response.startswith("Assistant:"):
        response = response[len("Assistant:") :].strip()
    return response


def play_audio(sample_rate, audio_array):
    """
    Plays the given audio data using the sounddevice library.

    Args:
        sample_rate (int): The sample rate of the audio data.
        audio_array (numpy.ndarray): The audio data to be played.

    Returns:
        None
    """
    sd.play(audio_array, sample_rate)
    sd.wait()
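The stop-event and queue pattern used by record_audio can be exercised without a microphone. In this sketch, fake_record is a hypothetical stand-in that emits a fixed two-byte chunk instead of real audio, and record_for waits on a timer where the tutorial waits for the user to press Enter.

```python
import threading
import time
from queue import Queue


def fake_record(stop_event, data_queue, chunk=b"\x01\x00"):
    """Stand-in for record_audio: push a fixed chunk until asked to stop."""
    while not stop_event.is_set():
        data_queue.put(chunk)
        time.sleep(0.01)


def record_for(seconds):
    """Run the producer thread for a fixed time and return the joined bytes."""
    data_queue = Queue()
    stop_event = threading.Event()
    worker = threading.Thread(target=fake_record, args=(stop_event, data_queue))
    worker.start()

    time.sleep(seconds)  # the tutorial blocks on input() at this point
    stop_event.set()     # signal the producer to finish
    worker.join()        # wait until no more chunks can arrive

    chunks = []
    while not data_queue.empty():
        chunks.append(data_queue.get())
    return b"".join(chunks)


if __name__ == "__main__":
    data = record_for(0.1)
    print(f"Collected {len(data)} bytes")
```

Joining the thread before draining the queue matters: it guarantees the producer has stopped, so no chunk is lost or added while the data is being collected.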

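Then we define the main application loop, which guides the user through the conversational interaction as follows:

The user is prompted to press Enter to start recording their input.
Once the user presses Enter, the record_audio function is called in a separate thread to capture the user's audio input.
When the user presses Enter again to stop recording, the audio data is transcribed using the transcribe function.
The transcribed text is then passed to the get_llm_response function, which generates a response using the Llama-2 language model.
The generated response is printed to the console and played back to the user using the play_audio function.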

if __name__ == "__main__":
    console.print("[cyan]Assistant started! Press Ctrl+C to exit.")

    try:
        while True:
            console.input(
                "Press Enter to start recording, then press Enter again to stop."
            )

            data_queue = Queue()  # type: ignore[var-annotated]
            stop_event = threading.Event()
            recording_thread = threading.Thread(
                target=record_audio,
                args=(stop_event, data_queue),
            )
            recording_thread.start()

            input()
            stop_event.set()
            recording_thread.join()

            audio_data = b"".join(list(data_queue.queue))
            audio_np = (
                np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
            )

            if audio_np.size > 0:
                with console.status("Transcribing...", spinner="earth"):
                    text = transcribe(audio_np)
                console.print(f"[yellow]You: {text}")

                with console.status("Generating response...", spinner="earth"):
                    response = get_llm_response(text)
                    sample_rate, audio_array = tts.long_form_synthesize(response)

                console.print(f"[cyan]Assistant: {response}")
                play_audio(sample_rate, audio_array)
            else:
                console.print(
                    "[red]No audio recorded. Please ensure your microphone is working."
                )

    except KeyboardInterrupt:
        console.print("\n[red]Exiting...")

    console.print("[blue]Session ended.")
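One step in the loop deserves a closer look: the raw bytes from the microphone are 16-bit signed integers, and dividing by 32768.0 rescales them to floats in the range [-1.0, 1.0) before transcription. The sketch below reproduces that conversion with the standard library only so it runs without NumPy. The helper name pcm16_to_float is hypothetical, and it returns ordinary Python floats rather than a float32 array.

```python
from array import array


def pcm16_to_float(raw: bytes):
    """Convert native-endian 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")    # "h" = signed 16-bit integers
    samples.frombytes(raw)  # raises ValueError on a trailing odd byte
    return [sample / 32768.0 for sample in samples]


if __name__ == "__main__":
    raw = array("h", [0, 16384, -32768, 32767]).tobytes()
    print(pcm16_to_float(raw))
```

The divisor is 32768 because a signed 16-bit sample ranges from -32768 to 32767, so the most negative sample maps exactly to -1.0.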

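Result

Once everything is put together, we can run the application as shown in the video above. The application runs quite slowly on my MacBook because the Bark model is large, even in its smaller version, so I sped up the video a little. On a CUDA-enabled computer it may run faster. Here are the key features of our application:

Voice-based interaction: users can start and stop recording their voice input, and the assistant responds by playing back generated audio.

Conversational context: the assistant maintains the context of the conversation, enabling more coherent and relevant responses. Using the Llama-2 language model allows the assistant to give concise, focused responses.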

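If you want to take this application toward production-ready status, the following enhancements are recommended:

Performance optimization: incorporate optimized versions of the models, such as whisper.cpp, llama.cpp, and bark.cpp, which are designed to improve performance, especially on lower-end computers.

Customizable bot prompts: implement a system that lets users customize the bot's persona and prompt, so that different types of assistants can be created (for example, personal, professional, or domain-specific).

Graphical user interface (GUI): develop a user-friendly GUI to improve the overall user experience, making the application more accessible and visually appealing.

Multimodal capabilities: expand the application to support multimodal interactions, such as generating and displaying images, diagrams, or other visual content in addition to voice responses.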
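As one possible starting point for customizable bot prompts, the persona sentence can be separated from the rest of the template. This is a sketch under assumptions, not a feature of the tutorial's code: the PERSONAS dictionary and the build_template helper are hypothetical, and the doubled braces keep {history} and {input} as placeholders so the result could still be passed to a PromptTemplate with those two input variables.

```python
# Hypothetical persona descriptions; edit or extend as needed.
PERSONAS = {
    "friendly": "You are a helpful and friendly AI assistant who keeps answers under 20 words.",
    "expert": "You are a precise technical expert who answers in one or two short sentences.",
}

BASE_TEMPLATE = """
{persona}

The conversation transcript is as follows:
{{history}}

And here is the user's follow-up: {{input}}

Your response:
"""


def build_template(persona_key):
    """Return a prompt template for the chosen persona.

    The {history} and {input} placeholders are left unfilled on purpose.
    """
    return BASE_TEMPLATE.format(persona=PERSONAS[persona_key])


if __name__ == "__main__":
    print(build_template("expert"))
```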

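And with that, our simple voice assistant application is complete. The full code is available online. Combining speech recognition, language modeling, and text-to-speech technology may look daunting, but this project shows how to build something that actually runs on your own computer. Enjoy coding, and don't forget to subscribe so you don't miss the latest articles on AI and programming.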
