Blind and visually impaired (BVI) individuals often face challenges related to orientation and mobility in their daily-life tasks. Navigation, or wayfinding, is one of the key components of mobility. Assistive devices and technologies, including mobile apps, are intended to help the BVI with a wide variety of daily-life tasks, including navigation.1
Navigation can be further classified as macro- or micronavigation. When one performs a navigation task of getting from point A to point B in its entirety, macronavigation refers to the high-level aspects of planning and following a path/route, generally over larger distances (say, over many miles), and often facilitated by geolocalization/mapping technologies.2,3 Micronavigation, on the other hand, refers to navigating within a close range (sometimes within a few meters) during the journey, precisely to a particular location: for instance, finding store entrances, train station exits, and so on. Planning and execution of micronavigation tasks are often ignored or taken for granted in the general context of navigation by sighted humans. Although a macronavigation tool can guide BVI individuals to the vicinity of a building, getting into the building or to the elevator entrance can be a challenging micronavigation task that is also an essential aspect of the entire trip. However, this is exactly the kind of task that remains largely unresolved for BVI individuals who want to travel independently,4,5 because there is no general-purpose micronavigation tool or aid, and they frequently have to resort to asking others for help.
Although a large number of devices and smartphone apps exist for navigation assistance, an overwhelming number of them are for macronavigation.2,3 Mapping- and geolocation-based apps are not particularly adequate for micronavigation, not only because of the inherent errors in geolocalization and mapping,6 but also because of an overall lack of mapping or insufficient mapping, as it is not feasible to accurately map all locations. Places away from streets or inside buildings are often not mapped. Some vision aids and assistance apps make use of computer vision algorithms to perform object detection1 or provide micronavigation assistance in certain specific scenarios, such as public transit-specific information.7,8 Because of the sheer variety of objects one could encounter in the real world, the development of custom tools for the many kinds of specific objects still has a long way to go. Moreover, using customized solutions for different scenarios, for instance, navigating to bus stops, has limited operational feasibility from the perspective of BVI users. Apps and services such as Aira, which provide live, remote, personalized assistance to BVI individuals, could be helpful in navigation.9–12 However, the cost and feasibility of human assistance mean that its utilization tends to be limited. A 24 × 7 virtual assistant could alleviate many of these micronavigation-related challenges for BVI individuals. Such a virtual assistant for BVI travelers is not out of the realm of possibility, given the recent advances in artificial intelligence models for computer vision, large language models (LLMs), and vision-language models (VLMs).13,14
VLMs are designed for tasks that require some combination of computer vision and natural-language inputs/outputs, such as image captioning and visual question answering, among others.13 VLMs have been used for navigation, especially robot navigation, where the idea is that the robot takes natural-language instructions, extracts salient information from the text (such as landmarks and their inter-relationships), and then performs the navigation task based on visual detection of those landmarks in previously unseen environments.15–25 Although vision-language navigation approaches are focused on understanding image- and language-based cues for navigation by autonomous agents, our goal in this study was to evaluate whether general-purpose foundational models such as ChatGPT 4o have the ability to provide direct and precise responses to the micronavigation-related queries of BVI users.
The task of providing navigation information by an AI agent involves two components: (1) scene understanding, where navigation-relevant information needs to be extracted from input images of the scene; and (2) generation of specific instructions/directions for the human user based on natural-language interaction, as sketched below.
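To make these two components concrete, the following is a minimal sketch of a single request that asks a VLM to both interpret a scene image and phrase directions for a human user. It assumes the OpenAI Python SDK and the gpt-4o chat completions endpoint with image input; the helper name (ask_micronavigation), the image file, and the prompt wording are illustrative assumptions, not the exact materials used in our study.

```python
# Minimal sketch: one request covering both components of the task --
# (1) scene understanding from the image and (2) generation of
# human-oriented directions via natural language.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_micronavigation(image_path: str, query: str) -> str:
    # Encode the scene photo as base64 for the image_url content part.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You assist a blind traveler. Using only what is "
                          "visible in this photo, answer the question with "
                          "specific, actionable directions, or say that the "
                          "question cannot be answered from the image. "
                          f"Question: {query}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example query against a hypothetical street-scene photo.
print(ask_micronavigation("scene.jpg", "Where is the store entrance?"))
```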
The ability of AI models to interact with humans is now evident, with the advent of AI agents like ChatGPT in our day-to-day lives. However, the scene understanding functionalities of these AI agents are relatively new, for example, ChatGPT 4o with image inputs26 or a large language-and-vision assistant.19 Their ability to understand navigation-specific scenarios and provide clear instructions to humans is not yet fully evident.
The Be My AI function in the Be My Eyes app was introduced as a service for visually impaired individuals in scene understanding tasks, where a picture of the scene is interpreted by OpenAI's model to generate a scene description. Other similar efforts, such as Aira's initiative to build AI datasets to assist BVI users, are also ongoing. Because these are relatively new services, rigorous studies and scientific reports about their effectiveness are scarce. Also, because they are designed to be general-purpose scene understanding agents/services, their effectiveness in navigation-specific scenarios is unknown.
In our evaluation study, we created a dataset of navigation scenarios by capturing a variety of real-world scene images. Scene descriptions for navigation purposes, based on the captured pictures, were written by humans. Multiple human-generated positive (answerable) and negative (non-answerable) query-response pairs specific to each scenario were created. From there on, we tried various combinations of input data formats and prompt-based training approaches with ChatGPT 4o, with the goal of eliciting actionable responses from the agent. We compared the accuracy of the navigation directions provided by the AI agent in response to our queries against the human-generated responses. Our goal in this study was to understand the feasibility and limitations of using AI agents like ChatGPT 4o in micronavigation scenarios. The study did not involve BVI users, but this early-stage work can guide future developments in the field, including whether and how human subject studies may need to be conducted.
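The sketch below illustrates the shape of such an evaluation loop, reusing the ask_micronavigation helper from the earlier sketch. The dataset layout, field names, example queries, and the exact-match agreement check are all assumptions for illustration; in the study itself, the agent's responses were compared against human-generated references by human judgment rather than scored automatically.

```python
# Hypothetical sketch of the evaluation loop: each scenario pairs a
# scene image with answerable (positive) and non-answerable (negative)
# queries and human-written reference responses.
scenarios = [
    {
        "image": "station_exit.jpg",
        "queries": [
            # Positive (answerable) query with a human reference answer.
            {"text": "Which way to the exit gates?",
             "reference": "The gates are about 10 feet ahead, slightly left."},
            # Negative (non-answerable) query: the correct behavior is
            # to state that the image does not contain the answer.
            {"text": "Where is the elevator?",
             "reference": "Cannot be determined from this image."},
        ],
    },
]

def matches_reference(answer: str, reference: str) -> bool:
    # Placeholder agreement check; the study relied on human comparison
    # of the agent's directions against the reference responses.
    return answer.strip().lower() == reference.strip().lower()

correct = total = 0
for scenario in scenarios:
    for query in scenario["queries"]:
        answer = ask_micronavigation(scenario["image"], query["text"])
        correct += matches_reference(answer, query["reference"])
        total += 1
print(f"Agreement with human references: {correct}/{total}")
```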