A Surveillance Based Interactive Robot
- URL: http://arxiv.org/abs/2508.13319v1
- Date: Mon, 18 Aug 2025 19:09:43 GMT
- Title: A Surveillance Based Interactive Robot
- Authors: Kshitij Kavimandan, Pooja Mangal, Devanshi Mehta,
- Abstract summary: We build a mobile surveillance robot that streams video in real time and responds to speech so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential drive base with camera, mic, and speaker, and a central unit that serves the live feed and runs perception. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We build a mobile surveillance robot that streams video in real time and responds to speech so a user can monitor and steer it from a phone or browser. The system uses two Raspberry Pi 4 units: a front unit on a differential drive base with camera, mic, and speaker, and a central unit that serves the live feed and runs perception. Video is sent with FFmpeg. Objects in the scene are detected using YOLOv3 to support navigation and event awareness. For voice interaction, we use Python libraries for speech recognition, multilingual translation, and text-to-speech, so the robot can take spoken commands and read back responses in the requested language. A Kinect RGB-D sensor provides visual input and obstacle cues. In indoor tests the robot detects common objects at interactive frame rates on CPU, recognises commands reliably, and translates them to actions without manual control. The design relies on off-the-shelf hardware and open software, making it easy to reproduce. We discuss limits and practical extensions, including sensor fusion with ultrasonic range data, GPU acceleration, and adding face and text recognition.
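The abstract's "recognises commands reliably, and translates them to actions" step can be sketched as a simple lookup from recognised utterances to differential-drive wheel speeds. This is an illustrative sketch, not the paper's implementation: the command vocabulary, speed values, and `command_to_wheel_speeds` helper are all assumptions.

```python
# Hypothetical mapping from recognised voice commands to
# (left, right) wheel speeds for a differential drive base.
# Values are illustrative only; unknown commands stop the robot.
COMMANDS = {
    "forward":  (0.5, 0.5),
    "backward": (-0.5, -0.5),
    "left":     (-0.3, 0.3),   # spin in place, counter-clockwise
    "right":    (0.3, -0.3),   # spin in place, clockwise
    "stop":     (0.0, 0.0),
}

def command_to_wheel_speeds(text: str) -> tuple[float, float]:
    """Map a recognised utterance to (left, right) wheel speeds in m/s.

    Falling back to (0.0, 0.0) means an unrecognised command halts
    the robot rather than guessing at a motion.
    """
    return COMMANDS.get(text.strip().lower(), (0.0, 0.0))
```

In a full pipeline these speeds would be sent to the motor driver on the front Raspberry Pi; the stop-on-unknown default is a safety choice for a teleoperated platform.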
Related papers
- An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction [0.0]
This work presents a novel HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%.
arXiv Detail & Related papers (2026-02-23T09:05:15Z)
- Extraction Of Cumulative Blobs From Dynamic Gestures [0.0]
Gesture recognition is based on computer vision technology that allows the computer to interpret human motions as commands. A simple night-vision camera can be used for motion capture. The video stream from the camera is fed into a Raspberry Pi running a Python program that uses the OpenCV module.
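The blob-extraction idea in this snippet can be sketched with frame differencing, the simplest way to isolate moving regions from a static camera. This is a minimal NumPy-only sketch; the threshold value and function name are assumptions, not taken from the paper.

```python
import numpy as np

def motion_mask(prev_frame: np.ndarray, frame: np.ndarray,
                threshold: int = 25) -> np.ndarray:
    """Return a boolean mask of pixels that changed between two frames.

    The absolute per-pixel difference is computed in int16 to avoid
    uint8 wraparound; the threshold suppresses sensor noise so only
    genuinely moving 'blobs' survive.
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold
```

In practice the mask would then be cleaned with morphological operations and grouped into connected components to track a gesture over time.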
arXiv Detail & Related papers (2025-01-07T18:59:28Z)
- ChatCam: Empowering Camera Control through Conversational AI [67.31920821192323]
ChatCam is a system that navigates camera movements through conversations with users.
To achieve this, we propose CineGPT, a GPT-based autoregressive model for text-conditioned camera trajectory generation.
We also develop an Anchor Determinator to ensure precise camera trajectory placement.
arXiv Detail & Related papers (2024-09-25T20:13:41Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Artificial Eye for the Blind [0.0]
The main backbone of our Artificial Eye model is the Raspberry Pi 3, which is connected to the webcam.
It also runs all our software models, i.e. object detection, optical character recognition, Google text-to-speech conversion, and the Mycroft voice assistant model.
arXiv Detail & Related papers (2023-07-07T10:00:50Z)
- InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language [82.92236977726655]
InternGPT stands for interaction, nonverbal, and chatbots.
We present an interactive visual framework named InternGPT, or iGPT for short.
arXiv Detail & Related papers (2023-05-09T17:58:34Z)
- Natural Language Robot Programming: NLP integrated with autonomous robotic grasping [1.7045152415056037]
We present a grammar-based natural language framework for robot programming, specifically for pick-and-place tasks.
Our approach uses a custom dictionary of action words, designed to store together words that share meaning.
We validate our framework through simulation and real-world experimentation, using a Franka Panda robotic arm.
arXiv Detail & Related papers (2023-04-06T11:06:30Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Implementation Of Tiny Machine Learning Models On Arduino 33 BLE For Gesture And Speech Recognition [6.8324958655038195]
TinyML models for hand gesture recognition and for speech recognition are trained and deployed using the Edge Impulse framework.
The Arduino Nano 33 BLE, which has a built-in microphone, lights an RGB LED red, green, or blue depending on the keyword pronounced.
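The keyword-to-LED behaviour described above amounts to a small lookup table. A minimal sketch in Python (the Arduino firmware itself would be C++; the function name and off-by-default behaviour here are assumptions):

```python
# Hypothetical keyword-to-colour table mirroring the Arduino demo:
# a recognised keyword selects an RGB value, anything else turns
# the LED off.
KEYWORD_TO_RGB = {
    "red":   (255, 0, 0),
    "green": (0, 255, 0),
    "blue":  (0, 0, 255),
}

def led_colour(keyword: str) -> tuple[int, int, int]:
    """Return the RGB value for a recognised keyword; off otherwise."""
    return KEYWORD_TO_RGB.get(keyword.strip().lower(), (0, 0, 0))
```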
arXiv Detail & Related papers (2022-07-23T10:53:26Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-02-06T15:25:23Z)
- VGAI: End-to-End Learning of Vision-Based Decentralized Controllers for Robot Swarms [237.25930757584047]
We propose to learn decentralized controllers based solely on raw visual inputs.
For the first time, our framework integrates the learning of two key components: communication and visual perception.
Our proposed learning framework combines a convolutional neural network (CNN) for each robot to extract messages from the visual inputs, and a graph neural network (GNN) over the entire swarm to transmit, receive and process these messages.
arXiv Detail & Related papers (2020-02-06T15:25:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.