The Role of Consequential and Functional Sound in Human-Robot Interaction: Toward Audio Augmented Reality Interfaces

Stanford University

This website gives an overview of our paper, which explores the role of nonverbal sound in human-robot interaction.

Abstract

As robots become increasingly integrated into everyday environments, understanding how they communicate with humans is critical. Sound offers a powerful channel for interaction, encompassing both operational noises and intentionally designed auditory cues.

In this three-part study, we examined the effects of consequential and functional sounds on human perception and behavior, including a novel exploration of spatial sound through localization and handover tasks. The first part used a between-subjects design (N=48) to assess how the presence of consequential sounds from a Kinova Gen3 robotic manipulator influences human perceptions of the robot (in person and through video) during a simple pick-and-place task. The second part (N=51) investigated spatial sound localization accuracy using an augmented reality (AR) environment, in which participants identified the source of 3D sounds. The third part (N=41) evaluated the impact of functional and spatial auditory cues on user experience and perception of the robot through a within-subjects design.

Results show that consequential sounds of the Kinova Gen3 manipulator did not negatively affect perceptions, spatial localization is highly accurate for lateral cues but declines for frontal cues, and spatial sounds can simultaneously convey task-relevant information while promoting warmth and reducing discomfort. These findings highlight the potential of functional and transformative auditory design to enhance human-robot collaboration and inform future sound-based interaction strategies.

Video

Participant Recruitment

Fifty-one participants were recruited under IRB protocol 65022. Informed consent was obtained prior to participation, and each session lasted approximately 45 minutes. Participant demographics are summarized below. Likert-scale questions (1 = Strongly Disagree, 7 = Strongly Agree) assessed participants' experience with and attitudes toward robots.

Experiment A: An In-person Replication of the Study on the Effects of Consequential Sounds on Human Perception of Robots

Hypotheses

H1: When observing the robot with sound (whether via recording or co-location), participants will exhibit more negative associated affect, report lower levels of liking, and express a reduced desire for physical co-location.

H2: Participants exposed to the robot's consequential sounds through video recordings will perceive the robot more positively overall than participants directly co-located with the robot.

Experimental Design

To establish a baseline understanding of how sound influences human perceptions of the Kinova Gen3 manipulator, we replicated a previously conducted between-subjects study that examined similar effects using online videos and surveys. The primary objective was to determine whether consequential sounds elicit negative perceptions toward this specific robot and to assess potential differences between participants who were co-located with the robot and those who observed it through video recordings.

Accordingly, we designed four experimental conditions (shown below). Across all four conditions, the Kinova manipulator executed a standardized pick-and-place task (video below). Participants were not provided any prior information about the task or its purpose before observing the robot. Furthermore, the true objective of the study was withheld to minimize potential bias related to sound perception.

Part 1 methods image.

The four experimental conditions for Experiment A are shown. The Kinova Gen3 manipulator was selected for its suitability in collaborative and household tasks. Participants were assigned to conditions using a quasi-random procedure to ensure equal group sizes.

The Kinova Gen3 manipulator performing the pick-and-place task used in Experiment A.

Results

After observing the robot, participants completed an 11-item Likert-scale questionnaire assessing their perceptions of the robot. The questions measured four perceptual scales: Associated Affect, Distraction, Liking, and Physical Co-location Desire. Those assigned to the sound conditions were additionally asked questions specific to the robot's auditory characteristics.

Part 1 results image.

Box-and-whisker plots illustrating participant responses across the four experimental conditions and four perceptual scales in Experiment A. Higher scores indicate more positive perceptions. Black diamonds represent mean values, black horizontal lines indicate medians, plus signs denote outliers, boxes correspond to the interquartile range (25th-75th percentiles), and whiskers extend to 1.5 times the interquartile range. (N = 48)

One-way ANOVAs revealed no significant differences between the four conditions on any of the perceptual scales: Associated Affect (p = 0.1256), Distraction (p = 1.0000), Physical Co-location Desire (p = 0.7323), and Liking (p = 2.662). Trends in the data, ordered from less positive to more positive perception, were as follows:

  • Associated Affect: In-person, Sound < Recorded, Muted < Recorded, Sound < In-person, Muted (means: 5.58, 5.60, 6.12, 6.48)
  • Distraction: In-person, Muted < In-person, Sound < Recorded, Muted < Recorded, Sound (means: 4.75, 4.92, 5.00, 5.08)
  • Physical Co-location Desire: In-person, Sound < Recorded, Muted < In-person, Muted < Recorded, Sound (means: 3.92, 4.03, 4.33, 4.61)
  • Liking: Recorded, Muted < In-person, Sound < In-person, Muted < Recorded, Sound (means: 4.36, 5.03, 5.06, 5.19)
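
These comparisons are standard one-way ANOVAs, run separately for each perceptual scale across the four between-subjects conditions. As a minimal sketch of this kind of analysis in Python, assuming per-participant scale scores are stored in a table with a condition column (the file name and column names are illustrative, not taken from the study materials):

from scipy import stats
import pandas as pd

# Illustrative data layout: one row per participant, with the assigned
# condition and an averaged score for each perceptual scale.
df = pd.read_csv("experiment_a_scores.csv")  # hypothetical file name

conditions = ["In-person, Sound", "In-person, Muted",
              "Recorded, Sound", "Recorded, Muted"]

for scale in ["associated_affect", "distraction", "colocation_desire", "liking"]:
    # One group of scores per condition (between-subjects design).
    groups = [df.loc[df["condition"] == c, scale] for c in conditions]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"{scale}: F = {f_stat:.3f}, p = {p_value:.4f}")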

Experiment B: A Study on Spatial Sound Localization

Hypotheses

H3: Participants will more accurately distinguish static sounds originating from their left or right sides (i.e., at larger azimuth angles) than those coming from directly in front of them.

Experimental Design

To investigate participants' ability to discriminate spatial sounds and guide our spatial sound design, we conducted an AR experiment. Participants sat across from the robot while wearing a Microsoft HoloLens 2, which rendered spatialized 360° audio. A custom mixed-reality application presented three red spheres in the participants' field of view, approximately spanning the width of the robot's workspace.

Participants then experienced two additional scenes: one with three green spheres and another with five blue spheres. In each scene, they identified the locations of 3D audio sources, visually represented by the corresponding spheres. For Scenes 1 and 2, audio sources were initially presented sequentially with concurrent visual feedback (spheres oscillating with the sounds) and repeated twice. Visual cues were then removed, and sounds were presented in random order. After each sound, participants verbally indicated the corresponding sphere. This procedure was repeated over two trials, with each sound played once per trial. In Scene 3, participants identified audio sources without prior visual feedback.
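
H3 frames localization difficulty in terms of azimuth, the horizontal angle of a source relative to the direction the listener is facing. The short sketch below shows how a sphere's position maps to an azimuth angle in a simple head-centered frame (x to the listener's right, z straight ahead); the coordinate convention and example positions are assumptions for illustration, not values from the HoloLens application.

import math

def azimuth_deg(x: float, z: float) -> float:
    # Horizontal angle of a source at (x, z): 0 degrees is straight ahead,
    # positive angles are to the listener's right.
    return math.degrees(math.atan2(x, z))

# Three example sphere positions (meters) roughly spanning a workspace
# one meter in front of the listener; values are hypothetical.
for x in (-0.4, 0.0, 0.4):
    print(f"x = {x:+.1f} m -> azimuth = {azimuth_deg(x, 1.0):+.1f} deg")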

Part 2 methods image.

The three virtual scenes, in which spheres represent the positions of the static spatial sounds.

Results

For each scene, the true sequence of audio sources and the corresponding participant-identified (predicted) sequence were recorded across both trials. Quantitative results were analyzed using confusion matrices that compared the predicted and true labels for each scene across all participants. These matrices were used to compute classification accuracy and examine error patterns, providing insight into overall spatial sound localization performance and potential sources of misclassification.
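
As a minimal sketch of this analysis, assuming the true and predicted sphere labels for a scene have been pooled across participants into two lists (the sphere labels and example responses below are illustrative):

from sklearn.metrics import confusion_matrix, accuracy_score

# Pooled labels for one scene; sphere identifiers are illustrative.
y_true = ["left", "center", "right", "left", "center", "right"]
y_pred = ["left", "center", "right", "left", "left",   "right"]
labels = ["left", "center", "right"]

# Row-normalized confusion matrix: each row shows, for a given true
# source, the fraction of responses assigned to each sphere.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"overall localization accuracy: {acc:.2%}")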

Part 2 results image.

Normalized confusion matrices aggregated for all participants across three scenes.

Participants' accuracy generally declined as scene complexity increased. Accuracy decreased slightly from Trial 1 to Trial 2 in Scene 1, whereas it showed slight improvements in Scenes 2 and 3 across trials. Specifically, overall accuracy was 95% and 93% for Scene 1 (Trials 1 and 2, respectively), 84% and 86% for Scene 2, and 74% and 78% for Scene 3.

Experiment C: Exploring Functional Sounds in Human-Robot Collaboration

Hypotheses

H4: Adding functional sound will increase participants' feelings of competence and reduce participants' feelings of discomfort compared to consequential sounds alone.

H5: Adding spatial sound will increase participants' feelings of warmth and competence, and reduce participants' feelings of discomfort compared to consequential sounds alone.

H6: Participants will accurately interpret the intended meaning of the functional sounds presented.

Experimental Design

Participants were tasked with completing a Lego structure over the course of three trials, with the robot assisting by providing additional Lego pieces as needed. Each trial was conducted under a distinct sound condition: Consequential, Functional, or Spatial.

Participants were informed that they would complete a brief survey after each trial to provide feedback on the robot and the associated sounds; no additional information about the sound conditions was provided to prevent participants from focusing explicitly on the auditory stimuli. Following the three trials, participants completed a post-experiment questionnaire, reflecting on their experiences and offering opinions and recommendations for each of the three sound conditions from memory. Towards the conclusion of the survey, the two augmented sound conditions were replayed, and the experimenter provided an explanation of their functional design. Finally, participants indicated their preferred sound condition and offered any additional comments regarding the sound designs.

Videos demonstrating the three sound conditions used in Experiment C: Consequential (left), Functional (center), and Spatial (right). Note: During the experiment, the music in the Spatial condition was played from the AR headset and spatialized.

Results

Participants completed the Robotic Social Attributes Scale (RoSAS) after each trial and a custom survey at the end of the experiment. The RoSAS is a validated instrument for assessing social perceptions of robots along three dimensions: Warmth, Competence, and Discomfort.
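
RoSAS dimension scores are typically computed as the mean of the items loading on each dimension. The sketch below assumes the standard 18-item inventory (Carpinella et al., 2017) and illustrative column names for the raw per-item ratings; it is not the study's analysis code.

import pandas as pd

# Standard RoSAS item groupings (Carpinella et al., 2017); the rating
# column names are assumed for illustration.
ROSAS_DIMENSIONS = {
    "Warmth":     ["happy", "feeling", "social", "organic",
                   "compassionate", "emotional"],
    "Competence": ["capable", "responsive", "interactive", "reliable",
                   "competent", "knowledgeable"],
    "Discomfort": ["scary", "strange", "awkward", "dangerous",
                   "awful", "aggressive"],
}

def score_rosas(ratings: pd.DataFrame) -> pd.DataFrame:
    # One Warmth/Competence/Discomfort score per row (i.e., per response).
    return pd.DataFrame({
        dim: ratings[items].mean(axis=1)
        for dim, items in ROSAS_DIMENSIONS.items()
    })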

Part 3 results image.

Box-and-whisker plots illustrating participant responses across the three experimental conditions and three attribute scales in Experiment C. Statistical significance from exploratory paired comparisons is indicated by asterisks above brackets (*corresponds to p < 0.05). (N = 41)

Analyses from the post-trial surveys indicated no statistically significant differences across the three attribute scales: Warmth (p = 0.205), Competence (p = 0.384), and Discomfort (p = 0.081). Exploratory post-hoc paired comparisons suggested trends toward differences between specific conditions.

Specifically, the comparison between the Consequential and Spatial conditions showed trends toward higher Warmth (p = 0.090) and lower Discomfort (p = 0.042) in the Spatial condition, while the comparison between the Functional and Spatial conditions showed a trend toward lower Discomfort (p = 0.052) in the Spatial condition. Although these findings should be interpreted cautiously, given the non-significant overall tests and uncorrected multiple comparisons, they point to potential differences in how sound design influences participants' perceptions and may guide future research.

Trends in the data were as follows:

  • Warmth: Consequential < Functional < Spatial (means: 1.99, 2.01, 2.49)
  • Competence: Consequential < Functional < Spatial (means: 3.47, 3.57, 3.91)
  • Discomfort: Spatial < Consequential < Functional (means: 1.41, 1.70, 1.74)
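
The paper does not state which statistical tests were used for these within-subjects comparisons. As one plausible sketch under that caveat, a nonparametric omnibus test (Friedman) followed by uncorrected pairwise Wilcoxon signed-rank tests could be run as follows; the data layout, file name, and choice of tests are assumptions for illustration only.

from itertools import combinations
from scipy import stats
import pandas as pd

# One row per participant, one column per condition; values are the
# participant's score on a single RoSAS dimension (e.g., Discomfort).
scores = pd.read_csv("rosas_discomfort_wide.csv")  # hypothetical file
conditions = ["Consequential", "Functional", "Spatial"]

# Omnibus within-subjects comparison across the three sound conditions.
stat, p = stats.friedmanchisquare(*(scores[c] for c in conditions))
print(f"omnibus: chi2 = {stat:.3f}, p = {p:.3f}")

# Exploratory, uncorrected pairwise comparisons between conditions.
for a, b in combinations(conditions, 2):
    w, p_pair = stats.wilcoxon(scores[a], scores[b])
    print(f"{a} vs {b}: p = {p_pair:.3f}")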

The following images summarize additional results from Experiment C:

Part 3 results image.

Distribution of responses to the Likert-scale questions for the augmented sound conditions.

Part 3 results image.

Summary of participants' predicted functions for the two augmented sound conditions.

Part 3 results image.

Participants' preferred sound condition.

Additional Results

Design Implications

Additional data were collected to inform future sound design for human-robot interaction.

Part 4 methods image.

Heatmaps depicting participants' ratings of the importance of communicating four distinct categories of information through sound. The left panel shows overall ratings across all participants, the middle panel shows ratings from participants reporting low experience with robots, and the right panel shows ratings from participants reporting high experience with robots. (N = 41)

BibTeX

@misc{smith2025roleconsequentialfunctionalsound,
      title={The Role of Consequential and Functional Sound in Human-Robot Interaction: Toward Audio Augmented Reality Interfaces}, 
      author={Aliyah Smith and Monroe Kennedy III},
      year={2025},
      eprint={2511.15956},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.15956}, 
}