A Weekend AI Project: Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi

At the beginning of 2024, almost every tech reviewer wrote about the Rabbit R1 – the first portable "AI assistant," priced at $199, which, according to its authors, uses "neuro-symbolic programming" and a LAM ("Large Action Model") to perform different tasks. But how does it work? Well, the best way to find out is to build a prototype on our own!
Readers who have never heard about the Rabbit R1 before can easily find plenty of YouTube reviews of it.
This article was also inspired by a post from Nabil Alouani, who made an interesting analysis of how the Rabbit R1 could be built:
Rabbit's New AI Device Can "Do Anything" for You by Using Apps – But How Exactly Does It Work?
I will implement similar ideas in Python code, and we will see how it works on real Raspberry Pi hardware and what kind of challenges need to be solved.
Before we begin, a small note: I have no affiliation with the Rabbit team or its sales.
Components
In this article, we will make an AI assistant containing several components:
- A microphone and a Push-to-Talk (PTT) button.
- Automatic Speech Recognition (ASR), which can convert recorded audio data into text.
- A small language model that runs locally on the device. This model will parse actions from the text recognized by the ASR.
- If the action is unknown to the local model, the device will call a public API. Here, two options will be available: the OpenAI API (for those who have a key) and the LLaMA model (for those who want a free solution). A rough sketch of this fallback logic is shown after this list.
- The result (action for a local model or a text response from the "big" model) will be displayed on the device screen.
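As an illustration of that fallback logic, here is a minimal sketch; the helper functions below are hypothetical placeholders, not the actual models used later in the project:

def parse_action_local(text: str):
    """ Hypothetical stand-in for the small on-device model that maps text to an action """
    known_actions = {"play music": "music.play", "stop": "music.stop"}
    return known_actions.get(text.lower().strip())

def ask_remote_llm(text: str) -> str:
    """ Hypothetical stand-in for a call to the OpenAI API or a LLaMA model """
    return f"(answer from the 'big' model for: {text})"

def handle_request(text: str) -> str:
    """ Try the local model first; fall back to the public API """
    action = parse_action_local(text)
    if action is not None:
        return f"Action: {action}"
    return ask_remote_llm(text)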
The code in this article is made for the Raspberry Pi, but it can also be tested on a regular PC. And now, let's get started!
Hardware
For this project, I will use a Raspberry Pi 4, a single-board computer running Linux. The Raspberry Pi has plenty of GPIO (general-purpose input/output) pins, which allow us to connect different hardware. It is portable and needs only a 5 V DC power supply. I will also connect a 128×64 OLED display and a button; the connection diagram looks like this:

At the time of writing, a Raspberry Pi costs about $80–120, depending on the model (the RPi 5 is faster but more expensive) and the RAM size (at least 4GB is required to run a language model). A display, a button, and a set of wires can be bought on Amazon for an extra $10–15. For sound recording, any USB microphone will do the job. The Raspberry Pi setup is straightforward; there are enough tutorials about it. It is only important to mention that both 32- and 64-bit versions of Raspbian are available. We need the 64-bit version because most modern Python libraries are no longer available in 32-bit builds.
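To quickly check which OS variant is installed, one simple option (just an illustration, not the only way) is to look at the machine architecture from Python:

import platform

# 'aarch64' means a 64-bit OS; 'armv7l' or 'armv6l' indicates a 32-bit Raspberry Pi OS
print(platform.machine())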
Now, let's talk about software parts.
Push-to-Talk (PTT)
Implementing push-to-talk mode on the Raspberry Pi is relatively straightforward. As we can see in the wiring diagram, the PTT button is connected to one of the pins (in our case, pin 21). To read its value, we first need to import the GPIO library and configure the pin:
try:
    import RPi.GPIO as gpio
except (RuntimeError, ImportError):
    gpio = None  # not running on a Raspberry Pi

button_pin = 21
gpio.setmode(gpio.BCM)  # use the BCM pin numbering scheme
gpio.setup(button_pin, gpio.IN, pull_up_down=gpio.PUD_UP)
Here, I set pin 21 as an "input" and enabled the pull-up resistor. A "pull-up" means that when the button is not pressed, the input is connected via the internal resistor to the "power," and its value equals "1." When the button is pressed, the input value equals "0" (so the values in the Python code will be reversed: "1" if the button is not pressed, "0" otherwise).
When the input pin is configured, we need only one line of code to read its value:
value = gpio.input(button_pin)
To make the coding easier, I created a GPIOButton class, which allows me to remember the last button state. By comparing the current and previous states, I can easily detect whether the button was pressed or released:
class GPIOButton:
    def __init__(self, pin_number: int):
        self.pin = pin_number
        self.is_pressed = False
        self.is_pressed_prev = False
        if gpio is not None:
            gpio.setup(self.pin, gpio.IN, pull_up_down=gpio.PUD_UP)

    def update_state(self):
        """ Update button state """
        self.is_pressed_prev = self.is_pressed
        self.is_pressed = self._pin_read(self.pin) == 0

    def is_button_pressed(self) -> bool:
        """ Button was pressed by user """
        return self.is_pressed and not self.is_pressed_prev

    def is_button_hold(self) -> bool:
        """ Button still pressed by user """
        return self.is_pressed and self.is_pressed_prev

    def is_button_released(self) -> bool:
        """ Button released by user """
        return not self.is_pressed and self.is_pressed_prev

    def reset_state(self):
        """ Clear the button state """
        self.is_pressed = False
        self.is_pressed_prev = False

    def _pin_read(self, pin: int) -> int:
        """ Read pin value """
        return gpio.input(pin) if gpio is not None else 0
This approach also allows us to create a "virtual button" for those who don't have a Raspberry Pi. For example, this "button" stays pressed for the first 5 seconds after the application starts:
class VirtualButton(GPIOButton):
    def __init__(self, delay_sec: int):
        super().__init__(pin_number=-1)
        self.start_time = time.monotonic()
        self.delay_sec = delay_sec

    def update_state(self):
        """ Update button state: button is pressed first N seconds """
        self.is_pressed_prev = self.is_pressed
        self.is_pressed = time.monotonic() - self.start_time < self.delay_sec
With a "virtual button," this code can be easily tested on a Windows, Mac, or Linux PC.
Sound Recording and Speech Recognition
With the help of the PTT button, we can record the sound. To do this, I will be using the Python soundcard library. I will record the audio in 0.5s chunks; this granularity is good enough for our task:
import logging
import numpy as np
import soundcard as sc
from typing import Any, Optional


class SoundRecorder:
    """ Sound recorder class """
    SAMPLE_RATE = 16000
    BUF_LEN_SEC = 60
    CHUNK_SIZE_SEC = 0.5
    CHUNK_SIZE = int(SAMPLE_RATE*CHUNK_SIZE_SEC)

    def __init__(self):
        self.data_buf: np.array = None
        self.chunks_num = 0

    def get_microphone(self):
        """ Get the default microphone """
        mic = sc.default_microphone()
        logging.debug(f"Recording device: {mic}")
        return mic.recorder(samplerate=SoundRecorder.SAMPLE_RATE)

    def record_chunk(self, mic: Any) -> np.array:
        """ Record a new chunk of data """
        return mic.record(numframes=SoundRecorder.CHUNK_SIZE)

    def start_recording(self, chunk_data: np.array):
        """ Start recording a new phrase """
        self.chunks_num = 0
        self.data_buf = np.zeros(SoundRecorder.SAMPLE_RATE * SoundRecorder.BUF_LEN_SEC, dtype=np.float32)
        self._add_to_buffer(chunk_data)

    def continue_recording(self, chunk_data: np.array):
        """ Continue recording a phrase """
        self.chunks_num += 1
        self._add_to_buffer(chunk_data)

    def get_audio_buffer(self) -> Optional[np.array]:
        """ Get audio buffer """
        if self.chunks_num > 0:
            logging.debug(f"Audio length: {self.chunks_num*SoundRecorder.CHUNK_SIZE_SEC}s")
            return self.data_buf[:self.chunks_num*SoundRecorder.CHUNK_SIZE]
        return None

    def _add_to_buffer(self, chunk_data: np.array):
        """ Add new data to the buffer """
        ind_start = self.chunks_num*SoundRecorder.CHUNK_SIZE
        ind_end = (self.chunks_num + 1)*SoundRecorder.CHUNK_SIZE
        self.data_buf[ind_start:ind_end] = chunk_data.reshape(-1)
With a PTT button and a sound recorder, we can implement the first part of our "smart assistant" pipeline:
ptt = GPIOButton(pin_number=button_pin)
recorder = SoundRecorder()

with recorder.get_microphone() as mic:
    while True:
        new_chunk = recorder.record_chunk(mic)
        ptt.update_state()
        if ptt.is_button_pressed():
            # Recording started
            recorder.start_recording(new_chunk)
        elif ptt.is_button_hold():
            recorder.continue_recording(new_chunk)
        elif ptt.is_button_released():
            buffer = recorder.get_audio_buffer()
            if buffer is not None:
                # Recording is finished
                # ...
            # Ready for a new phrase
            ptt.reset_state()
The full code is presented at the end of the article, but this part is enough to get the idea. Here, we have an infinite "main" loop. The microphone is always active, but the recording starts only when the button is pressed. When the PTT button is released, the audio buffer can be used for speech recognition.
The ASR (Automatic Speech Recognition) part was already described in my previous article:
A Weekend AI Project: Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
To keep this text shorter, I will not repeat the code here; readers are welcome to check the previous article on their own.
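Just for reference, a minimal transcription sketch could look like this; it assumes the open-source whisper package, which is only one possible choice and not necessarily the exact code from the previous article:

import whisper

# A small model ("tiny" or "base") is more suitable for a Raspberry Pi-class device
asr_model = whisper.load_model("tiny")

def transcribe(audio_buffer: np.array) -> str:
    """ Convert a 16 kHz float32 audio buffer into text """
    result = asr_model.transcribe(audio_buffer, fp16=False)
    return result["text"].strip()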
Display
In this project, I am using a small 1.4" 128×64 OLED display, which can be bought on Amazon for $3–5. The code was already presented in the previous article; I only did a small refactoring and put all the methods into the OLEDDisplay class:
class OLEDDisplay:
    """ Display info on the I2C OLED screen """
    def __init__(self):
        self.pixels_size = (128, 64)
        ...
        self.image_logo = Image.open("bunny.png").convert('1')
        if adafruit_ssd1306 is not None and i2c is not None:
            self.oled = adafruit_ssd1306.SSD1306_I2C(self.pixels_size[0],
                                                     self.pixels_size[1],
                                                     i2c)
        else:
            self.oled = None

    def add_line(self, text: str):
        """ Add new line with scrolling """

    def add_tokens(self, text: str):
        """ Add new tokens with or without extra line break """

    def draw_record_screen(self, text: str):
        """ Draw logo and text """
        logging.debug(f"Draw_record_screen: \033[0;31m{text}\033[0m")
        if self.oled is None:
            return

        image = Image.new("1", self.pixels_size)
        img_pos = (self.pixels_size[0] - self.image_logo.size[0])//2
        image.paste(self.image_logo, (img_pos, 0))

        draw = ImageDraw.Draw(image)
        text_size = self._get_text_size(text)
        txt_pos = (self.pixels_size[0]//2 - text_size[0]//2,
                   self.pixels_size[1] - text_size[1])
        draw.text(txt_pos, text, font=self.font, fill=255, align="center")
        self._draw_image(image)

    def _get_text_size(self, text):
        """ Get size of the text """
        _, descent = self.font.getmetrics()
        text_width = self.font.getmask(text).getbbox()[2]
        text_height = self.font.getmask(text).getbbox()[3] + descent
        return (text_width, text_height)

    def _draw_image(self, image: Image):
        """ Draw image on display """
I also added a draw_record_screen method, which shows the "rabbit" logo and a text line indicating whether or not the PTT button is pressed. The text is also useful for other status messages. The display, connected to the Raspberry Pi, looks like this:

The flickering is an artifact of the video recording; it is not visible to the human eye. And I am not a visual artist, so please excuse my drawing skills.
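To tie the pieces together, here is a minimal sketch (using the classes defined above; the status messages are only an example) of how the display can be updated from the main PTT loop:

display = OLEDDisplay()
ptt = GPIOButton(pin_number=button_pin)
recorder = SoundRecorder()
display.draw_record_screen("Ready")

with recorder.get_microphone() as mic:
    while True:
        new_chunk = recorder.record_chunk(mic)
        ptt.update_state()
        if ptt.is_button_pressed():
            display.draw_record_screen("Recording...")
            recorder.start_recording(new_chunk)
        elif ptt.is_button_hold():
            recorder.continue_recording(new_chunk)
        elif ptt.is_button_released():
            display.draw_record_screen("Processing...")
            # ... run the ASR and the language model here ...
            display.draw_record_screen("Ready")
            ptt.reset_state()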