
No.33, Jun. 2023 | Feature Story: Aiming for natural spoken dialogue between machines and humans

Aiming for natural spoken dialogue between machines and humans

Norihide Kitaoka


Nowadays, speech recognition technology is used in devices such as smartphones and smart speakers for a variety of purposes in work and daily life. Nevertheless, there are still limitations to its capabilities, in particular when attempting to create natural dialogue with humans. To fill this gap, Professor Norihide Kitaoka has been researching ways to enable spoken dialogue systems to be "usable" in a wide range of scenarios. The key to natural dialogue, he says, lies not only in giving computers the necessary knowledge for conversation, but also in teaching them how to realize human-like responses, such as the timing of utterances and conversational signals.

Interview and report by Madoka Tainaka

Reducing the burden on healthcare workers with a voice-input system that supports the creation of medical charts

In less than a decade, speech recognition technologies such as Apple’s "Siri" and Amazon’s "Alexa" have advanced to the extent that they have become household names, thanks to deep learning-based end-to-end learning (i.e. network learning that is carried out by a single large neural network model until an output is obtained from inputs). Professor Norihide Kitaoka, whose 30 years of involvement in speech recognition stretch back to the early days of this field, believes that moving forward it will be less important to achieve a high level of precision for speech recognition itself than to identify specific applications of this technology and ways in which they can be used.
"Thanks to the advancement of deep learning," he noted, " and the utilization of huge amounts of speech data, the accuracy of speech recognition itself has been improving substantially. Nevertheless, the ways in which these technologies can actually be used are still limited. That is why we are focusing our efforts on researching interfaces that are appropriate to the application."

'Smart Hospital' project carried out in collaboration with Toyohashi Heart Center

As a part of this, Prof. Kitaoka and his colleagues are currently working on the development of a "voice-input medical-chart creation support system" that can be utilized at medical workplaces. This is a tool that creates medical charts on the spot. When doctors perform rounds, the system listens to pick up information on physical condition, body temperature, and so on from dialogues with the patients, and at the same time, it automatically converts the utterances into text, and understands and structures the information.
"This research is part of the 'smart hospital' project that the Toyohashi Heart Center (a hospital specializing in cardiovascular diseases) and Toyohashi University of Technology have been jointly carrying out since 2021, which has already led to the development of a prototype. For example, the system picks up comments such as, 'There are no particular problems with your physical condition," and 'You are on schedule to be able to leave the hospital next week,' automatically transcribes them, and then generates a summary of the information. Thus, this is a system that enables doctors to easily create medical charts, simply by selecting the relevant results for summary on the screen of a smartphone or tablet," explains Professor Kitaoka.

In fact, when they first began their joint research, they had no idea how speech recognition technology could be used in the medical field. Thus, before starting the development, they observed conditions on the ground at the hospital, held extensive discussions with doctors, and identified key challenges.
As Prof. Kitaoka explains, "Although there are already many existing types of software for preparing medical charts using voice input, what doctors needed was a system that could create records on the spot during their rounds. This is because doctors had previously been unable to input the information they had received from patients while visiting their rooms. They would have to take notes, and then input the information into personal computers in the hallways later on. Using our current system makes it possible to create medical charts while listening to patients on the spot, which saves time and effort. Going forward, we will work on linking our system with the electronic medical-chart systems that have already been introduced, and at the same time, we plan to perform fine tuning on our system to facilitate its wider use."

Utilizing ChatGPT for creation of training data and medical charts

In addition to saving labor for doctors, the medical support system also needs to be capable of outputting correct results by learning in advance the technical terms and information essential for medical practice. The key to achieving this is training data that improves accuracy.
"In the smart hospital project," says Prof. Kitaoka, "we are also developing a diagnostic support system using CT images of the heart, but again, the creation of training data is a challenge. When creating training data, we have to avoid overloading the already busy doctors and technicians. As such, we have been attempting to proactively use the existing technologies that are applicable in this area. For example, we have been using the technology of the moment, ChatGPT, along with existing databases in order to summarize information that has been picked up during rounds, and to reinforce the training data."

Meanwhile, Prof. Kitaoka points out that the emergence of ChatGPT presents us with an opportunity to fundamentally rethink the nature of artificial intelligence research.
"It seems that even without expressly making an effort to reproduce the thoughts of humans in computers, if one has access to a large-scale language model, in other words vast amounts of data, one can produce dialogue that is human-like in text form.
In this sense, it appears that our research is standing at a crossroads. This is because we cannot keep up in terms of data volume. Thus, we intend to focus meticulously on our area of expertise, which is 'speech,' and thereby pursue a more natural dialogue between humans and machines."

Natural dialogues with CG character

In his quest to realize a more natural dialogue system, Professor Kitaoka has been focusing on "the timing of responses." Over a decade ago, he developed a chat dialogue system that incorporates the unique characteristics of human dialogue, such as conversational backchanneling that is well-timed or slightly interruptive, and instances of taking over the conversation midway through utterances of the conversation partner. In addition, he has been working on a system that uses machine learning to predict whether a conversation will continue or end based on the pitch of the other person's voice and intonation patterns, and on speech synthesis that can change the tone of the voice to express anger, joy, and other emotions.
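The prediction of whether a turn is continuing or ending from pitch and intonation can be sketched very simply. The rule below, in which a falling pitch contour plus a long pause signals a turn end and a falling contour with only a short pause invites a backchannel, is a hypothetical toy heuristic standing in for the machine-learned models mentioned above; the thresholds are invented for illustration.

```python
def pitch_slope(f0_values):
    """Least-squares slope of an F0 (pitch) contour, in Hz per frame."""
    n = len(f0_values)
    mean_x = (n - 1) / 2
    mean_y = sum(f0_values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(f0_values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def predict_turn_end(f0_tail, pause_ms, slope_thresh=-1.0, pause_thresh=200):
    """Toy turn-taking rule: falling pitch plus a long pause -> take the turn;
    falling pitch with a short pause -> insert a backchannel ('uh-huh');
    otherwise keep listening. Thresholds are illustrative assumptions."""
    falling = pitch_slope(f0_tail) < slope_thresh
    if falling and pause_ms >= pause_thresh:
        return "take_turn"
    if falling:
        return "backchannel"
    return "keep_listening"
```

In practice such decisions are made by trained classifiers over many prosodic features, but the sketch shows why timing, not just content, is the quantity being predicted.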

3D computer graphics character 'Saya' on screen

"One area in which this research can be applied is the development of a dialogue system with the 3D computer graphic character 'Saya' which was launched in 2015 by the 3D computer graphic artist TELYUKA. Saya shot to fame when, in spite of her being an imaginary character, she was selected as the winner of one of the 'Miss iD' auditions for entertainers hosted by a publishing company. Saya is now able to interact with humans by incorporating speech recognition, text-to-speech, and chatting functions, as well as image recognition tech nology provided by Aisin (Kariya, Japan). She is equipped with functions that enable her to pick out 'focus words' when she is listening to someone speaking, and to change the topic of conversation after there has been an extended pause. Furthermore, we have continuously been making improvements so that she is able to express herself more naturally by conversational backchanneling and the conveyance of feelings."
Prof. Kitaoka adds that, going forward, his team hopes to expand the Saya dialogue system to include giving street directions, as well as watching over elderly persons and children at facilities.

Prof. Kitaoka is still looking to the future, examining whether it will be possible to install the same system in anime characters other than Saya. "Machines can talk to humans indefinitely, without ever getting tired," he points out. "If you think about this point, you can see how the potential for applications will likely multiply."

Shifting to multimodal communication

If one is aiming for truly natural dialogue between humans and machines, however, conversational ability alone is not enough. To this end, Prof. Kitaoka and his colleagues have been working on the development of a "multimodal dialogue" system, which communicates not only through speech but also incorporates gestures such as body and hand motions, eye movements, and so on.

An autonomous vehicle enabling multimodal communication

"I worked on a similar application while at Tokushima University in 2018, where we jointly developed a multimodal interactive self-driving car in collaboration with Nagoya University and Aisin Seiki (now Aisin). The experience of using this system was designed to function something like giving instructions to a taxi driver. While the vehicle is in motion, it is possible to look at something and ask the vehicle what it is, and to instruct the vehicle to change directions through gestures."

Prof. Kitaoka has also been working on developing a system that can, for example, recognize information written on a blackboard when it is pointed to with a finger, such as angles during a math class, and then input the information as symbols or formulas on a display.
"In the future, I believe that it may be possible for us to release our dialogue system as a toolkit that can be integrated into existing spoken dialogue applications," says Professor Kitaoka. If he can achieve this, people will surely be able to enjoy more natural dialogue with their smartphones and PCs, such as by using gestures. We look forward to future developments.

Reporter's Note

Prof. Kitaoka has been fascinated by personal computers since elementary school, when they were first introduced. Although he did not have a personal computer, his parents were sufficiently moved by seeing him copy programming notes onto paper that they eventually bought him one. This led him toward his long-held dream of a place in a natural language processing laboratory at university, but the overwhelming number of applicants there persuaded him to switch to speech research. Since then, he has never looked back.

He points out, "When I initially began my research, it was the middle of the second AI boom, and I wanted to find an elegant way to engage in dialogue with machines by giving the machines human-like intelligence. Now however, I have come to believe that 'human-like' qualities reside in behaviors that are expressed outside of language. Realizing these is the real challenge."

I believe it is precisely in these difficult-to-quantify aspects that the key lies to removing the sense of discomfort that arises when communicating with machines.


Researcher Profile

Norihide Kitaoka


Norihide Kitaoka received his PhD in 2000 from Toyohashi University of Technology, Aichi, Japan. Since starting his career at Toyohashi University of Technology as a research associate in 2001, he has been involved in speech information processing. He is currently a professor in the Department of Computer Science and Engineering.

Reporter Profile

Madoka Tainaka

Editor and writer. Former member of the Information Science and Technology Committee of the Council for Science and Technology at the Ministry of Education, Culture, Sports, Science and Technology, and former editor at NII Today, a publication of the National Institute of Informatics. She interviews researchers at universities and businesses, produces content for executives, and also plans, edits, and writes books.