Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

A Multimodal Teleconferencing System Using Hands-Free Voice Control

D. A. Berkley, James L. Flanagan, K. L. Shipley, Lawrence R. Rabiner

AT&T Bell Laboratories, Information Principles Research Laboratory, Murray Hill, NJ, USA

This talk describes the design and implementation of a digital teleconferencing system that seamlessly integrates a number of speech and image processing technologies with data transmission capability. The goal is to provide a variety of sophisticated communication features that are easy to learn and fairly natural to use. The system is called HuMaNet, for Human/Machine Network, and it is controlled entirely and interactively by spoken commands, in a hands-free manner. HuMaNet combines speech coding, speech recognition, text-to-speech synthesis, and talker verification with autodirective microphone arrays, image compression, and data and hypertext management to provide high-quality audio, image, and video conferencing over basic-rate ISDN (Integrated Services Digital Network). Basic-rate ISDN on the present public switched network provides "2B+D" transport: two 64 kbps circuit-switched channels (2B) and one 16 kbps packet-switched channel (D). For teleconferencing, audio can be coded with a wideband (7 kHz) speech coder at 32 kbps, and video is coded at 128 kbps with a commercial video conferencing codec. Still images are transmitted at 64-128 kbps using 2D subband coding. For group teleconferencing, adaptive camera control can be achieved with the speech detection capability of the autodirective microphone array, which automatically and rapidly repositions the camera toward each new talker.
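The channel arithmetic behind "2B+D" can be checked with a short back-of-envelope sketch. It uses only the rates quoted above and assumes, as a simplification, that each stream is carried on whole 64 kbps B channels with no framing or multiplexing overhead. The helper names (`total_capacity_kbps`, `b_channels_needed`) are hypothetical and illustrative; they do not describe how HuMaNet actually multiplexes its streams.

```python
import math

# Basic-rate ISDN "2B+D" channel rates in kbit/s, as quoted in the abstract.
B_CHANNEL_KBPS = 64   # circuit-switched bearer (B) channel
D_CHANNEL_KBPS = 16   # packet-switched (D) channel
NUM_B_CHANNELS = 2


def total_capacity_kbps() -> int:
    """Aggregate raw rate of a 2B+D interface (B channels plus D channel)."""
    return NUM_B_CHANNELS * B_CHANNEL_KBPS + D_CHANNEL_KBPS


def b_channels_needed(rate_kbps: float) -> int:
    """Whole 64 kbit/s B channels a stream of this rate would occupy.

    Hypothetical simplification: ignores framing and multiplexing overhead.
    """
    return math.ceil(rate_kbps / B_CHANNEL_KBPS)


# Stream rates taken from the abstract (still images span 64-128 kbit/s).
STREAMS_KBPS = {
    "wideband audio (7 kHz)": 32,
    "video codec": 128,
    "still image, low rate": 64,
    "still image, high rate": 128,
}

if __name__ == "__main__":
    print(f"2B+D total: {total_capacity_kbps()} kbit/s")
    for name, rate in STREAMS_KBPS.items():
        n = b_channels_needed(rate)
        print(f"{name}: {rate} kbit/s -> {n} B channel(s), "
              f"fits in 2B: {n <= NUM_B_CHANNELS}")
```

Under these assumptions the interface carries 144 kbps in total (128 kbps on the B channels). Wideband audio fits in one B channel, while 128 kbps video or high-rate still images occupy both.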

