

Aiming for natural spoken dialogue between machines and humans

Norihide Kitaoka

Nowadays, speech recognition technology is used in devices such as smartphones and smart speakers for a variety of purposes in work and daily life. Nevertheless, there are still limitations to its capabilities, in particular when attempting to create natural dialogue with humans. To fill this gap, Professor Norihide Kitaoka has been researching ways to enable spoken dialogue systems to be "usable" in a wide range of scenarios. The key to natural dialogue, he says, lies not only in giving computers the necessary knowledge for conversation, but also in teaching them how to realize human-like responses, such as the timing of utterances and conversational signals.

Interview and report by Madoka Tainaka

Reducing the burden on healthcare workers with a voice-input system that supports the creation of medical charts

In less than a decade, speech recognition technologies such as Apple’s "Siri" and Amazon’s "Alexa" have advanced to the extent that they have become household names, thanks to deep learning-based end-to-end learning (i.e., learning carried out by a single large neural network model, from input all the way to output). Professor Norihide Kitaoka, whose 30 years of involvement in speech recognition stretch back to the early days of the field, believes that moving forward it will be less important to push the precision of speech recognition itself than to identify specific applications of the technology and the ways in which they can be used.
"Thanks to the advancement of deep learning," he noted, " and the utilization of huge amounts of speech data, the accuracy of speech recognition itself has been improving substantially. Nevertheless, the ways in which these technologies can actually be used are still limited. That is why we are focusing our efforts on researching interfaces that are appropriate to the application."

'Smart Hospital' project carried out in collaboration with Toyohashi Heart Center

As part of this, Prof. Kitaoka and his colleagues are currently working on the development of a "voice-input medical-chart creation support system" for use in medical workplaces, a tool that creates medical charts on the spot. When doctors perform their rounds, the system listens to the dialogue with patients to pick up information such as physical condition and body temperature, automatically converts the utterances into text, and then understands and structures the information.
"This research is part of the 'smart hospital' project that the Toyohashi Heart Center (a hospital specializing in cardiovascular diseases) and Toyohashi University of Technology have been jointly carrying out since 2021, which has already led to the development of a prototype. For example, the system picks up comments such as, 'There are no particular problems with your physical condition," and 'You are on schedule to be able to leave the hospital next week,' automatically transcribes them, and then generates a summary of the information. Thus, this is a system that enables doctors to easily create medical charts, simply by selecting the relevant results for summary on the screen of a smartphone or tablet," explains Professor Kitaoka.

In fact, when they first began their joint research, they had no idea how speech recognition technology could be used in the medical field. Thus, before starting the development, they observed conditions on the ground at the hospital, held extensive discussions with doctors, and identified key challenges.
As Prof. Kitaoka explains, "Although many types of software for preparing medical charts by voice input already exist, what doctors needed was a system that could create records on the spot during their rounds. Previously, doctors were unable to input the information they received from patients while visiting their rooms; they had to take notes and then enter the information into personal computers in the hallways later on. Our current system makes it possible to create medical charts while listening to patients on the spot, which saves time and effort. Going forward, we will work on linking our system with the electronic medical-chart systems that have already been introduced, and at the same time we plan to fine-tune our system to facilitate its wider use."

Utilizing ChatGPT for creation of training data and medical charts

In addition to saving labor for doctors, the medical support system also needs to be capable of outputting correct results by learning, in advance, the technical terms and information essential to medical practice. The key to achieving this is training data that improves accuracy.
"In the smart hospital project," says Prof. Kitaoka, "we are also developing a diagnostic support system using CT images of the heart, but again, the creation of training data is a challenge. When creating training data, we have to avoid overloading the already busy doctors and technicians. As such, we have been attempting to proactively use the existing technologies that are applicable in this area. For example, we have been using the technology of the moment, ChatGPT, along with existing databases in order to summarize information that has been picked up during rounds, and to reinforce the training data."

Meanwhile, Prof. Kitaoka points out that the emergence of ChatGPT presents an opportunity to fundamentally rethink the nature of artificial intelligence research.
"It seems that even without expressly making an effort to reproduce the thoughts of humans in computers, if one has access to a large-scale language model, in other words vast amounts of data, one can produce dialogue that is human-like in text form.
In this sense, it appears that our research is standing at a crossroads. This is because we cannot keep up in terms of data volume. Thus, we intend to focus meticulously on our area of expertise, which is 'speech,' and thereby pursue a more natural dialogue between humans and machines."

Natural dialogue with a CG character

In his quest to realize a more natural dialogue system, Professor Kitaoka has been focusing on "the timing of responses." Over a decade ago, he developed a chat dialogue system that incorporates the unique characteristics of human dialogue, such as well-timed or slightly interruptive conversational backchannels, and taking over the conversation midway through the partner's utterance. In addition, he has been working on a system that uses machine learning to predict whether a conversation will continue or end based on the pitch and intonation patterns of the other person's voice, and on speech synthesis that can change the tone of voice to express anger, joy, and other emotions.
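The turn-taking idea can be illustrated with a toy example: extract a few prosodic cues (pitch level, final pitch slope, final energy) from an utterance fragment and feed them to a simple classifier that predicts whether the speaker will continue. The features and classifier below are generic stand-ins, not the models used in this research.

```python
# Toy illustration of prosody-based end-of-turn prediction.
# Features and classifier are generic stand-ins, not the actual research models.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def prosodic_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    energy = librosa.feature.rms(y=y)[0]
    # Mean pitch, final pitch slope, and final energy: cues human listeners also use.
    slope = (np.polyfit(np.arange(20), f0[-20:], 1)[0] if len(f0) >= 20 else 0.0)
    return np.array([f0.mean() if len(f0) else 0.0, slope, energy[-5:].mean()])

# X: features from utterance fragments; y: 1 = turn continues, 0 = turn ends.
# (Labelled data would come from annotated dialogue recordings.)
clf = LogisticRegression()
# clf.fit(X_train, y_train)
# clf.predict([prosodic_features("fragment.wav")])
```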

3D computer graphics character 'Saya' on screen

"One area in which this research can be applied is the development of a dialogue system with the 3D computer graphic character 'Saya' which was launched in 2015 by the 3D computer graphic artist TELYUKA. Saya shot to fame when, in spite of her being an imaginary character, she was selected as the winner of one of the 'Miss iD' auditions for entertainers hosted by a publishing company. Saya is now able to interact with humans by incorporating speech recognition, text-to-speech, and chatting functions, as well as image recognition tech nology provided by Aisin (Kariya, Japan). She is equipped with functions that enable her to pick out 'focus words' when she is listening to someone speaking, and to change the topic of conversation after there has been an extended pause. Furthermore, we have continuously been making improvements so that she is able to express herself more naturally by conversational backchanneling and the conveyance of feelings."
Prof. Kitaoka adds that, going forward, his team hopes to expand the Saya dialogue system to applications such as giving street directions, as well as watching over elderly persons and children at facilities.
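Two of the behaviors mentioned above, picking out "focus words" and changing the topic after a long pause, can be sketched in a few lines. The word list, pause threshold, and topic pool below are hypothetical placeholders; Saya's actual dialogue manager is naturally far more elaborate.

```python
# Toy sketch of focus-word spotting and pause-driven topic change.
# All lists and thresholds are hypothetical placeholders.
import random
from typing import Optional

FOCUS_WORDS = {"weather", "lunch", "movie", "travel"}   # hypothetical topic cues
TOPIC_POOL = ["By the way, have you seen any good movies lately?",
              "Speaking of which, what did you have for lunch?"]
PAUSE_THRESHOLD = 5.0  # seconds of silence before changing the subject

def respond(utterance: Optional[str], silence: float) -> str:
    if utterance is None and silence >= PAUSE_THRESHOLD:
        return random.choice(TOPIC_POOL)                 # keep the conversation alive
    focus = [w for w in (utterance or "").lower().split() if w in FOCUS_WORDS]
    if focus:
        return f"Tell me more about the {focus[0]}."     # stay on the user's topic
    return "I see."                                      # fall back to a backchannel

print(respond("I went to see a movie yesterday", silence=0.4))
print(respond(None, silence=6.2))
```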

Prof. Kitaoka is still looking to the future, examining whether it will be possible to install the same system in anime characters other than Saya. "Machines can talk to humans indefinitely, without ever getting tired," he points out. "If you think about this point, you can see how the potential applications will likely multiply."

Shifting to multimodal communication

If one is aiming for truly natural dialogue between humans and machines, however, conversational ability alone is not enough. For this reason, Prof. Kitaoka and his colleagues have been working on the development of a "multimodal dialogue" system: one that communicates not only through speech but also incorporates gestures such as body and hand motions, eye movements, and so on.

Multimodal communication-enabled autonomous vehicle

"I worked on a similar application while at Tokushima University in 2018, where we jointly developed a multimodal interactive self-driving car in collaboration with Nagoya University and Aisin Seiki (now Aisin). The experience of using this system was designed to function something like giving instructions to a taxi driver. While the vehicle is in motion, it is possible to look at something and ask the vehicle what it is, and to instruct the vehicle to change directions through gestures."

Prof. Kitaoka has also been working on a system that can, for example, recognize information written on a blackboard and pointed to with one’s finger, such as angles during a math class, and then input that information as symbols or formulas on a display.
"In the future, I believe that it may be possible for us to release our dialogue system as a toolkit that can be integrated into existing spoken dialogue applications," says Professor Kitaoka. If he can achieve this, people will surely be able to enjoy more natural dialogue with their smartphones and PCs, such as by using gestures. We look forward to future developments.

Reporter's Note

Prof. Kitaoka has been fascinated by personal computers ever since they were first introduced when he was in elementary school. Although he did not have one of his own, his parents were sufficiently moved by the sight of him copying out programs on paper that they eventually bought him one. This led him toward his long-held dream of a place in a natural language processing laboratory at university, but the overwhelming number of applicants there persuaded him to switch to speech research. He has never looked back since.

He points out, "When I initially began my research, it was the middle of the second AI boom, and I wanted to find an elegant way to engage in dialogue with machines by giving the machines human-like intelligence. Now however, I have come to believe that 'human-like' qualities reside in behaviors that are expressed outside of language. Realizing these is the real challenge."

It is precisely in these difficult-to-quantify aspects, I believe, that the key lies to removing the sense of discomfort that arises when communicating with machines.


Researcher Profile

Norihide Kitaoka

Norihide Kitaoka received his PhD degree in 2000 from Toyohashi University of Technology, Aichi, Japan. Since starting his career at Toyohashi University of Technology as a research associate in 2001, he has been involved in speech information processing. He is currently a professor in the Department of Computer Science and Engineering.

Reporter Profile

Madoka Tainaka

Editor and writer. Former member of the Information Science and Technology Committee of the Council for Science and Technology at the Ministry of Education, Culture, Sports, Science and Technology, and former editor of NII Today, a publication of the National Institute of Informatics. She interviews researchers at universities and businesses, produces content for executives, and also plans, edits, and writes books.
