

Browse the complete example on GitHub
A browser driving game you control with your hands and voice, powered by models running entirely locally. Steer by holding both hands up like a steering wheel. Speak commands to accelerate, brake, toggle headlights, and play music. No cloud calls, no server round-trips. Everything runs in your browser tab.

How it works

Two models run in parallel, entirely client-side:
  • MediaPipe Hand Landmarker tracks your hand positions via webcam at ~30 fps. The angle between your two wrists drives the steering.
  • LFM2.5-Audio-1.5B runs in a Web Worker with ONNX Runtime Web. It listens for speech via the Silero VAD and transcribes each utterance on-device. Matched keywords control game state.
The audio model loads from Hugging Face and is cached in IndexedDB after the first run, so subsequent starts are instant.
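
The wrist-angle steering described above can be sketched as a small pure function. The landmark objects and the `steeringAngle` name here are illustrative, not MediaPipe's exact API; MediaPipe's Hand Landmarker does return normalized coordinates per landmark, with the wrist at index 0.

```javascript
// Sketch: derive a steering angle from the two wrist landmarks.
// Coordinates are normalized to [0, 1] across the video frame.
function steeringAngle(leftWrist, rightWrist) {
  // Angle of the line between the wrists, as if gripping a wheel.
  const dx = rightWrist.x - leftWrist.x;
  const dy = rightWrist.y - leftWrist.y;
  return Math.atan2(dy, dx); // radians; 0 = hands level, ± = turning
}

// Right hand lower than left → positive angle → steer right.
const angle = steeringAngle({ x: 0.3, y: 0.4 }, { x: 0.7, y: 0.6 }); // ≈ 0.46 rad
```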

Voice commands

Say                     Effect
speed / fast / go       Accelerate to 120 km/h
slow / stop / brake     Decelerate to 0 km/h
lights on               Enable headlights
lights off              Disable headlights
music / play            Start the techno beat
stop music / silence    Stop the beat
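
The keyword matching step can be sketched as a lookup over the table above. The `COMMANDS` list and action names here are illustrative; the one real subtlety is ordering, since "stop music" must be checked before the bare "stop" that means brake.

```javascript
// Sketch: map a transcript to a game action by keyword.
// Multi-word phrases are listed first so "stop music" wins over "stop".
const COMMANDS = [
  { keywords: ["stop music", "silence"], action: "musicOff" },
  { keywords: ["lights on"],             action: "lightsOn" },
  { keywords: ["lights off"],            action: "lightsOff" },
  { keywords: ["speed", "fast", "go"],   action: "accelerate" },
  { keywords: ["slow", "stop", "brake"], action: "brake" },
  { keywords: ["music", "play"],         action: "musicOn" },
];

function matchCommand(transcript) {
  const text = transcript.toLowerCase();
  for (const { keywords, action } of COMMANDS) {
    if (keywords.some((k) => text.includes(k))) return action;
  }
  return null; // no recognized command in this utterance
}
```

A substring match like this is deliberately forgiving of transcription noise ("go faster please" still accelerates), at the cost of occasional false positives.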

Prerequisites

Requirements
  • Chrome 113+ or Edge 113+ (WebGPU recommended for fast audio inference; the app falls back to WASM without it)
  • Webcam and microphone access
  • Node.js 18+ (for the dev server)
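
The WebGPU-or-WASM fallback can be sketched as a small provider-selection step. The `pickProviders` helper is illustrative; `navigator.gpu` is the standard WebGPU feature check, and the commented session call shows roughly how the provider list is passed to ONNX Runtime Web.

```javascript
// Sketch: prefer WebGPU, fall back to WASM when it is unavailable.
function pickProviders(hasWebGPU) {
  return hasWebGPU ? ["webgpu", "wasm"] : ["wasm"];
}

// In the browser (illustrative, not the example's exact code):
// const providers = pickProviders(!!navigator.gpu);
// const session = await ort.InferenceSession.create(modelBytes, {
//   executionProviders: providers,
// });
```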

Run locally

npm install
npm run dev
Then open http://localhost:3001. On first load the audio model (~900 MB at Q4 quantization) downloads from Hugging Face and is cached in your browser. Hand detection assets load from CDN and MediaPipe’s model storage.

Architecture

Browser tab
β”œβ”€β”€ main thread
β”‚   β”œβ”€β”€ MediaPipe HandLandmarker  (webcam β†’ hand angles β†’ steering)
β”‚   β”œβ”€β”€ Canvas 2D renderer        (road, scenery, dashboard, HUD)
β”‚   └── Web Audio API             (procedural techno synthesizer)
└── audio-worker.js (Web Worker)
    β”œβ”€β”€ Silero VAD                (mic β†’ speech segments)
    └── LFM2.5-Audio-1.5B ONNX   (speech segment β†’ transcript β†’ keyword)
The game loop runs on requestAnimationFrame. Hand detection is throttled to ~30 fps so it does not block rendering. Voice processing happens off the main thread and delivers results via postMessage.
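
The throttling described above can be sketched as a tiny rate limiter inside the requestAnimationFrame loop. Names here are illustrative; the real loop also renders the scene and consumes worker results each frame.

```javascript
// Sketch: run expensive per-frame work (hand detection) at ~30 fps
// while the render loop itself runs at the display's full rate.
function makeThrottle(minIntervalMs) {
  let last = -Infinity;
  return (nowMs) => {
    if (nowMs - last < minIntervalMs) return false; // skip this frame
    last = nowMs;
    return true; // run the throttled work
  };
}

const shouldDetect = makeThrottle(1000 / 30); // ~30 fps hand detection

function frame(nowMs) {
  if (shouldDetect(nowMs)) {
    // handLandmarker.detectForVideo(video, nowMs) → wrist angle → steering
  }
  // render();                     // canvas drawing runs every frame
  // requestAnimationFrame(frame); // browser-only; reschedule the loop
}
```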

Need help?

Join our Discord

Connect with the community and ask questions about this example.