Advanced Computing in the Age of AI | Monday, May 13, 2024

OpenAI Announces Voice and Image Interaction in ChatGPT 

Sept. 27, 2023 -- OpenAI has announced the rollout of new voice and image capabilities in ChatGPT. The update gives users a more intuitive interface, letting them hold voice conversations with ChatGPT or show it what they are talking about.

Credit: Ascannio/Shutterstock

Over the next two weeks, OpenAI plans to roll out voice and image features to Plus and Enterprise users. Voice will be available on iOS and Android (users opt in through settings), and image support will be available on all platforms.

Voice will let users hold a back-and-forth conversation with the assistant. They can ask for a story, seek clarification, or settle a debate by speaking.

Use Voice to Engage in a Back-and-Forth Conversation with Your Assistant

The voice capability is powered by a new text-to-speech model that generates human-like audio from just text and a brief speech sample. OpenAI worked with professional voice actors to create the voices. In the other direction, Whisper, OpenAI’s open-source speech recognition system, transcribes the user's spoken words into text.
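The voice round trip described above (speech recognition, a text reply, then speech synthesis) can be pictured as a three-stage pipeline. The following is only an illustrative sketch: `VoiceTurn`, `run_voice_turn`, and the three stand-in stage functions are hypothetical names, not OpenAI APIs, and a real system would replace each stand-in with an actual model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """One round trip of a voice conversation (hypothetical structure)."""
    transcript: str      # text recognized from the user's speech
    reply_text: str      # assistant's reply as text
    reply_audio: bytes   # synthesized speech for the reply


def run_voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> VoiceTurn:
    """Chain the three stages: speech -> text -> reply text -> speech."""
    transcript = transcribe(audio_in)
    reply_text = respond(transcript)
    reply_audio = synthesize(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-ins so the sketch runs without any model. Each one is an assumption
# and takes the place of a real speech-recognition, chat, or speech model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")   # pretend the audio bytes are already text


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")    # pretend these bytes are audio


if __name__ == "__main__":
    turn = run_voice_turn(
        b"tell me a story", fake_transcribe, fake_respond, fake_synthesize
    )
    print(turn.reply_text)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model or vendor; only the order of the three steps comes from the article.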

Chat About Images

Users can now show ChatGPT one or more images. They can troubleshoot why a device won't work, plan a meal from the food they have on hand, or analyze a complex graph of work-related data. A drawing tool in the mobile app lets users highlight a specific part of an image.

Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, including photographs, screenshots, and documents that combine text and visuals.

Deployment Strategy for Image and Voice Capabilities

OpenAI remains committed to building AGI that is safe and beneficial. Introducing tools gradually allows the company to make improvements and refine risk mitigations over time while preparing users for more advanced systems. This gradual approach is especially important for advanced models that incorporate voice and vision.

OpenAI acknowledges that the voice technology carries risks, particularly misuse by malicious actors. For that reason, its use is currently limited to voice chat, along with select collaborations such as Spotify's pilot of its Voice Translation feature.

Vision models bring challenges of their own, such as incorrect interpretations of images. Before broader deployment, OpenAI assessed the model for risks in domains such as extremism and scientific proficiency. That research informed its guidelines for responsible use.

ChatGPT's vision feature is most useful when it can see what the user sees. Collaboration with Be My Eyes, an app that assists visually impaired users, has informed the feature's development and its guidelines for user privacy. Continued real-world use and feedback are crucial for refining these safeguards.

Transparency Regarding Model Limitations

Users may rely on ChatGPT for specialized topics. OpenAI is open about the model’s limitations and discourages high-risk uses without proper verification. The model is proficient at transcribing English, but its accuracy drops for some other languages, especially those written in non-Roman scripts.

For a comprehensive understanding of OpenAI's safety approach and collaboration with Be My Eyes, users can refer to the system card for image input.

Future Expansion Plans

Voice and image features will be available to Plus and Enterprise users over the next two weeks. OpenAI plans to extend them to other groups of users, including developers, soon after.


Source: OpenAI

EnterpriseAI