Spoken Dialogue Systems: syllabus

This page details the general structure of the first part of the course, devoted to Spoken Dialogue Systems.

This is the Nao robot (from Aldebaran Robotics) that will be used in the first assignment, to demonstrate how to develop spoken dialogue systems for real applications.

Spoken dialogue systems are defined as computer systems designed to interact with humans using everyday spoken language, usually in order to accomplish specific, practical tasks.  The goal of the first part of the course will be to present the most important technologies, algorithms and frameworks used in this rapidly developing research area.  We'll detail how spoken dialogue works, how to process and generate it, and how to transfer these ideas into real systems. We'll start with a general overview of dialogue systems and their applications, and then look more closely at individual components, such as speech recognition, dialogue understanding and management, generation and speech synthesis.

The application domain which we will investigate in more detail this year is human-robot interaction, i.e. the design of (physically embodied) robotic systems able to communicate with humans in an intuitive way, using spoken language and other modalities.  One assignment will consist precisely of developing a simple interaction module and integrating it into a small humanoid robot.

The course will cover the following topics:

  1. Introduction to spoken dialogue systems: What is a spoken dialogue system, and what is it good for?  We'll provide a brief introduction to the research field and the variety of practical applications, some already turned into commercial products, others still at the development stage.
  2. Generalities about spoken dialogue: What is dialogue anyway, and how does it work?  How are human-human dialogues characterised from a linguistic perspective?  What are the cognitive processes underlying our abilities for verbal and non-verbal interaction?
  3. Dialogue system architectures: What are the main software architectures for spoken dialogue systems?  What are the building blocks of these technical systems, and how are they interconnected?
  4. Case study: Human-robot interaction: We'll also focus on one application in more detail, namely human-robot interaction.  We'll briefly describe the main architectures and functionalities of robotic systems, and the place of "interactive" components such as dialogue systems within them. We'll also mention some of the core challenges that are likely to arise when one tries to build such "talking robots".
  5. Probabilistic modelling: In order to understand the most important algorithms and frameworks used in dialogue architectures, we need to first review some core concepts from AI/NLP, and more specifically the use of probabilistic graphical models such as Bayesian Networks.
  6. Speech recognition: How does speech recognition work?  How to estimate the acoustic and language models? What kind of speech features are extracted?  How to adapt the estimated models to the user and environment?  
  7. Dialogue understanding: How to map noisy hypotheses from speech recognition to their intended meanings?  How to deal with disfluencies, resolve references, perform dialogue-level interpretation?
  8. Models for decision-making: We'll see how to extend standard probabilistic models with decision-making information (utilities of various actions), and how to exploit such information to select optimal actions, learn from reinforcements, and perform forward planning.
  9. Dialogue management: We'll investigate how to practically develop the "decision-making" component of a dialogue system, namely the dialogue manager.  How are these components developed, and how can they be optimised (based on the decision-making models seen in the previous section)?  We'll review some state-of-the-art techniques, which often cast the problem as an instance of "planning under uncertainty", well known in AI.
  10. Generation: Once the dialogue manager has decided to say something, how do we map this abstract representation of a dialogue act onto surface words?  In other words, how do we choose how to present a given piece of information, and realise it at the surface level?
  11. Speech synthesis:  The last component of a dialogue system is the speech synthesiser (text-to-speech), which converts a string of words into a speech signal.  We'll analyse how this operation is performed in practice, using different methods.  We'll also describe some recent work seeking to "modulate" the generated signal (e.g. its intonation), notably to express information structure or emotional states.
  12. System evaluation: Finally, we'll try to understand how we can evaluate the performance of a dialogue system according to specific (intrinsic & extrinsic) metrics.
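To give a first taste of how topics 5 and 8 fit together, here is a minimal sketch in Python of Bayesian inference over a user's intent followed by expected-utility action selection.  All variable names, intents, actions and probability values below are invented purely for illustration; real dialogue systems estimate such models from data rather than writing them down by hand.

```python
# Toy example: infer the user's intent from a (noisy) recognised utterance,
# then pick the system action with the highest expected utility.
# All numbers here are hypothetical, chosen only for illustration.

# Prior over the user's intent, P(intent):
prior = {"request_help": 0.6, "greet": 0.4}

# Likelihood of the recognised utterance given each intent,
# P(utterance | intent) -- a stand-in for an ASR/understanding model:
likelihood = {"request_help": 0.2, "greet": 0.7}

# Bayes' rule: P(intent | utterance) is proportional to
# P(utterance | intent) * P(intent), renormalised to sum to 1.
unnormalised = {i: likelihood[i] * prior[i] for i in prior}
z = sum(unnormalised.values())
posterior = {i: p / z for i, p in unnormalised.items()}

# Utility table U(action, intent) -- again, hypothetical values:
utility = {
    ("offer_help", "request_help"): 10, ("offer_help", "greet"): -2,
    ("say_hello", "request_help"): -5, ("say_hello", "greet"): 5,
}

def expected_utility(action):
    """Expected utility of an action under the intent posterior."""
    return sum(posterior[i] * utility[(action, i)] for i in posterior)

# The dialogue manager selects the action maximising expected utility:
best_action = max(["offer_help", "say_hello"], key=expected_utility)
```

With these made-up numbers the posterior favours "greet" (0.7 vs. 0.3), so the expected-utility calculation selects "say_hello" even though "request_help" had the higher prior.  Dialogue management (topic 9) extends this one-shot decision to sequences of turns, which is what turns the problem into planning under uncertainty.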

This part of the course should last about 6 weeks, and will comprise both a theoretical part (lectures) and a practical part (exercise sessions).  See the course plan for details.  There will also be two obligatory assignments.  The first assignment will consist of various exercises related to the material seen in the first four weeks of the course.  The second assignment will take the form of a small project where students will develop and test a small dialogue domain for human-robot interaction.  Further information will be provided during the course. 

Published July 13, 2012 11:03 AM - Last modified Aug. 9, 2012 9:37 AM