I need a single AI application that can see, hear, and speak to the user. Using my own OpenAI key (or, if you prefer, a Gemini or Claude endpoint), I want you to wire the conversational logic to the device camera so the assistant can recognise whatever the lens captures—faces, emotions, objects, actions, text, you name it—then hold a natural dialogue about what it sees.

The build has to run everywhere: a mobile version for iOS and Android, a web app that works in the browser, and a desktop release for Windows and macOS. Users should be able to create an account, log in, and start interacting immediately. Speech-to-text converts their voice into prompts, vision models process the live camera feed, and text-to-speech delivers the reply in real time. For LLM calls, default to ChatGPT via the OpenAI API, but keep the code modular so I can drop in GPT-5, Gemini, or Claude with minimal edits.

Deliverables
• Cross-platform source code with clear build/run instructions
• Login/registration module tied to the LLM calls
• Real-time camera inference for “everything” detection and contextual dialogue
• Speech recognition and synthesis wired into the chat flow
• A short demo video or live link proving the system works on all three platform families

I’ll test by installing each build, pointing the camera at random scenes, and confirming the assistant both describes what it sees and holds a coherent conversation about it. Let’s make something that feels like it came straight out of the year 3000.
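To make the "swap in GPT-5, Gemini, or Claude with minimal edits" requirement concrete, here is a minimal sketch of a provider-agnostic chat layer in TypeScript. All names here (`LLMProvider`, `registerProvider`, `chat`, `makeOpenAIProvider`) are assumptions for illustration, not part of any SDK; the OpenAI request follows the documented chat-completions endpoint, but treat the payload details as something to verify against the current API docs.

```typescript
// Provider-agnostic chat layer: each vendor is a small adapter behind one
// interface, so swapping models means registering a different provider.

export interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

export interface LLMProvider {
  name: string;
  complete(messages: ChatMessage[]): Promise<string>;
}

const providers = new Map<string, LLMProvider>();

export function registerProvider(p: LLMProvider): void {
  providers.set(p.name, p);
}

export async function chat(
  providerName: string,
  messages: ChatMessage[],
): Promise<string> {
  const p = providers.get(providerName);
  if (!p) throw new Error(`unknown provider: ${providerName}`);
  return p.complete(messages);
}

// Assumed OpenAI adapter: wraps the chat-completions REST endpoint directly.
// In a real build you would likely use the official SDK instead.
export function makeOpenAIProvider(
  apiKey: string,
  model = "gpt-4o",
): LLMProvider {
  return {
    name: "openai",
    async complete(messages) {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${apiKey}`,
        },
        body: JSON.stringify({ model, messages }),
      });
      const data: any = await res.json();
      return data.choices[0].message.content;
    },
  };
}
```

A Gemini or Claude adapter would be another ~20-line `LLMProvider` implementation; the rest of the app only ever calls `chat()`.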
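The voice-in, vision-grounded, voice-out loop described above can be sketched as one conversational turn with pluggable components. Every interface and function name here (`SpeechToText`, `VisionModel`, `ChatModel`, `TextToSpeech`, `runTurn`) is a hypothetical shape to structure the build around, not an existing API; on each platform you would plug in real implementations (e.g. the platform's speech services and a vision-capable LLM).

```typescript
// One end-to-end turn: transcribe the user's voice, caption the current
// camera frame, ground the LLM prompt in that caption, and speak the reply.

interface SpeechToText {
  transcribe(audio: Uint8Array): Promise<string>;
}
interface VisionModel {
  describe(frameJpeg: Uint8Array): Promise<string>;
}
interface ChatModel {
  reply(prompt: string): Promise<string>;
}
interface TextToSpeech {
  speak(text: string): Promise<void>;
}

async function runTurn(
  audio: Uint8Array,
  frame: Uint8Array,
  stt: SpeechToText,
  vision: VisionModel,
  llm: ChatModel,
  tts: TextToSpeech,
): Promise<string> {
  const question = await stt.transcribe(audio);          // user voice -> text
  const sceneDescription = await vision.describe(frame); // camera frame -> caption
  // Ground the model in what the camera currently sees before answering.
  const prompt =
    `The camera sees: ${sceneDescription}\nUser asks: ${question}`;
  const answer = await llm.reply(prompt);
  await tts.speak(answer);                               // reply text -> audio out
  return answer;
}
```

Because each stage is an interface, the same `runTurn` can drive the web, mobile, and desktop builds with platform-specific adapters underneath.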