PORTLAND, Ore. - SpeechWorks International is applying neural-network technology in a practical speech-recognition system that achieves a new level of natural-speech interaction by letting telephone callers ask for information in ordinary, unstructured English. Federal Express Inc. is deploying the approach in a shipping-rate information system in cooperation with NextLink Interactive.
"We have designed SpeechWorks from the ground up to make it easy for developers like NextLink to create conversational interfaces to automated systems like FedEx's rate information. Our DialogModules encapsulate the intricacies of speech recognition into easy-to-use building blocks with which engineers can quickly build applications," said Mark Holthouse, vice president of technology at SpeechWorks.
Instead of forcing engineers to develop speech applications using a dedicated tool, the company provides a tool kit of premade DialogModules, with applications programming interfaces (APIs) to either a C-language library or a set of ActiveX controls. The intricacies of neural learning and other speech-recognition techniques remain hidden to programmers using the DialogModules, letting application developers integrate the building blocks quickly into their own environments.
Digestible elements
At the lowest level, DialogModules gather the raw speech signal into easily digestible processing segments. Most other speech systems arbitrarily divide raw speech signals into 20-ms segments and then use smart software to glue the fixed-sized segments into variable-sized pieces that correspond to spoken phonemes. In contrast, SpeechWorks DialogModules use neural learning to guess how phonemes naturally divide the speech stream into segments right from the start.
That results in "as much as five times less data to work with," said Holthouse.
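The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: the 20-ms frame size comes from the description above, while the utterance length, the segment count, and the function names are assumptions chosen to reproduce the "five times" case, not measurements from SpeechWorks.

```python
def fixed_frame_count(duration_ms, frame_ms=20):
    """Number of fixed-size frames needed to cover an utterance."""
    return -(-duration_ms // frame_ms)  # ceiling division on integers


def reduction_factor(duration_ms, num_segments, frame_ms=20):
    """How many times fewer units phoneme-sized segments give than fixed frames."""
    return fixed_frame_count(duration_ms, frame_ms) / num_segments


# Hypothetical utterance: one second of speech containing 10 phonemes.
frames = fixed_frame_count(1000)      # 50 frames of 20 ms each
factor = reduction_factor(1000, 10)   # 50 frames vs. 10 segments
print(frames, factor)
```

Longer phonemes push the factor up and shorter ones pull it down, which fits the "as much as" qualifier in the quote.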
Segmentation is learned by a neural network that assigns probabilities to its first-pass guesses as to how the continuous-speech signal is divided into separate phonemes. After smart segmentation, a hidden Markov model tests the segments against all the known phonemes, using standard Gaussian statistical distributions to identify the phonemes and assemble them into words. A probability lattice is generated, and a traditional grammar search outputs the segmented spoken words to the application.
The recognizer "compares the various possible paths through the probability lattice with the phonetic representation of the words in its vocabulary, providing a probability score for each," said Holthouse.
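A toy version of that comparison might look like the following. It is a minimal sketch under strong assumptions, not the SpeechWorks recognizer: the lattice is reduced to one probability table per segment, each word must have exactly one phoneme per segment, and the phoneme labels, probabilities, and function names are invented for illustration.

```python
import math


def score_word(lattice, phonemes, floor=1e-6):
    """Log-probability of one phoneme path through the lattice.

    Returns -inf when the word's phoneme count differs from the number
    of segments (a simplification; real systems align flexibly).
    """
    if len(phonemes) != len(lattice):
        return float("-inf")
    return sum(math.log(seg.get(p, floor)) for seg, p in zip(lattice, phonemes))


def rank_words(lattice, vocabulary):
    """Score every vocabulary word and sort best-first."""
    scored = [(word, score_word(lattice, phs)) for word, phs in vocabulary.items()]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)


# Hypothetical three-segment lattice: phoneme -> probability per segment.
lattice = [
    {"y": 0.7, "n": 0.3},
    {"eh": 0.6, "ow": 0.4},
    {"s": 0.8, "p": 0.2},
]
vocabulary = {
    "yes": ["y", "eh", "s"],
    "yep": ["y", "eh", "p"],
    "no": ["n", "ow"],
}
ranking = rank_words(lattice, vocabulary)
print(ranking)
```

Working in log-probabilities keeps long paths from underflowing and turns the product of segment probabilities into a sum.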
A range of general rules governing language models can be used to constrain the number of possible recognized words by determining which words are more likely to follow one another. Application developers can also introduce their own constraints and obtain a ranked list of the most likely word strings just spoken.
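One common way to express "which words are more likely to follow one another" is a bigram model used to rerank candidate word strings. The sketch below assumes that technique for illustration; the article does not say which language model SpeechWorks uses, and every probability, score, and name here is hypothetical.

```python
import math


def lm_log_prob(words, bigram, unseen=1e-4):
    """Sum of log bigram probabilities, starting from a sentence-start token."""
    total = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        total += math.log(bigram.get((prev, cur), unseen))
    return total


def rerank(hypotheses, bigram, lm_weight=1.0):
    """Combine each acoustic log-score with the language-model score."""
    rescored = [
        (text, acoustic + lm_weight * lm_log_prob(text.split(), bigram))
        for text, acoustic in hypotheses
    ]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Hypothetical bigram probabilities and acoustic log-scores.
bigram = {
    ("<s>", "ship"): 0.5, ("ship", "to"): 0.6, ("to", "boston"): 0.3,
    ("<s>", "sheep"): 0.01, ("sheep", "to"): 0.01,
    ("ship", "two"): 0.01, ("two", "boston"): 0.001,
}
hypotheses = [
    ("sheep to boston", -4.0),
    ("ship to boston", -4.2),
    ("ship two boston", -4.1),
]
ranked = rerank(hypotheses, bigram)
print([text for text, _ in ranked])
```

Here the acoustically strongest guess, "sheep to boston," drops below "ship to boston" once word-sequence likelihood is taken into account.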
Prepackaged DialogModules can then identify specific utterances corresponding to queries made by the natural-language processing in the application.
Assume, for instance, that a "yes/no" answer is expected to an application query to the user. There are about 30 different ways that people can say "yes" ("yep," "yeah," "correct" and the like) and about 20 different ways they can say "no." Application programmers can check all of the variations simultaneously by merely making a call to the "Yes/No" DialogModule.
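From the application's side, the idea is that one call collapses many surface forms into a single answer. The following is a rough sketch of that behavior operating on already-recognized text. The function name `yes_no` and the "no" variants are assumptions, and the real DialogModule API (a C library or ActiveX controls working on speech) is not shown in the article.

```python
# "yep", "yeah" and "correct" come from the article; the remaining
# entries are illustrative stand-ins for the full variant lists.
YES_VARIANTS = {"yes", "yep", "yeah", "correct"}
NO_VARIANTS = {"no", "nope", "nah", "incorrect"}


def yes_no(utterance):
    """Map recognized text to True, False, or None if it is neither."""
    text = utterance.lower().strip(" .,!?")
    if text in YES_VARIANTS:
        return True
    if text in NO_VARIANTS:
        return False
    return None


print(yes_no("Yep!"), yes_no(" nope "), yes_no("maybe"))
```

Returning `None` for unmatched input lets the caller decide whether to reprompt, rather than forcing a guess.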
Other premade DialogModules are available for quickly identifying telephone numbers, zip codes, dates, currency and item lists specific to an application. Engineers can also add their own DialogModules to capitalize on the natural-language sequences that make sense for specific applications. For instance, a United Airlines reservation system will weight "Boston," the name of one of its hubs, heavily. At this level, even whole phrases can be recognized.
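Application-specific weighting of this kind can be pictured as multiplying the recognizer's candidate probabilities by per-word weights and renormalizing. The sketch below assumes that simple scheme; the candidate words, the weight of 3.0, and the function name are hypothetical and do not describe how SpeechWorks implements it.

```python
def apply_priors(candidates, priors, default=1.0):
    """Scale candidate probabilities by per-word weights, then renormalize."""
    weighted = {w: p * priors.get(w, default) for w, p in candidates.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}


# Hypothetical: the recognizer cannot tell two similar city names apart.
candidates = {"Austin": 0.5, "Boston": 0.5}
priors = {"Boston": 3.0}  # the application favors its hub city
result = apply_priors(candidates, priors)
print(result)
```

With these numbers the tie is broken three to one in favor of "Boston," while words without an explicit weight keep their relative standing.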
Tuning response time
Once the vocabulary and grammar have been perfected, SpeechWorks monitors behavior in real time to fine-tune the response time of a deployed system. The tuning tools log every activity as it occurs, including the caller's actual speech. System administrators can use the "on-the-job" application logging to pinpoint problem areas so that the app improves its response over time. They can go directly to the calls that are most successful and compare them with those that are least successful.
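As an illustration of how such logs could be used to pinpoint problem areas, the sketch below groups logged interactions by prompt and ranks the prompts by failure rate. The record layout (`prompt`, `success`) and the sample entries are assumptions made for the example; the article does not describe the actual log format.

```python
from collections import defaultdict


def failure_rates(call_log):
    """Return (prompt, failure rate) pairs, worst prompt first."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for entry in call_log:
        totals[entry["prompt"]] += 1
        if not entry["success"]:
            failures[entry["prompt"]] += 1
    rates = {p: failures[p] / totals[p] for p in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical log entries, one per caller interaction.
call_log = [
    {"prompt": "zip_code", "success": True},
    {"prompt": "zip_code", "success": False},
    {"prompt": "zip_code", "success": False},
    {"prompt": "yes_no", "success": True},
    {"prompt": "yes_no", "success": True},
]
worst = failure_rates(call_log)
print(worst)
```

An administrator could then pull the recorded speech for the worst-ranked prompt and compare it with calls that went through cleanly.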