While speech technology and text-based language technology development have much in common, processing and producing speech data poses unique challenges of its own. The ambiguity introduced by homophones and near-homophones (words that are spelled differently but sound the same or nearly the same; in a medical record transcription context, a “hard problem” is very different from a “heart problem”) and the variation in speech data across speakers are just the tip of the iceberg. People also tend to use different vocabulary when they talk than when they write; e.g., a customer might use more idiosyncratic words when calling a customer care center than when writing an email or commenting on a company blog, since in the latter scenario they are often biased by the immediate context of the company website and the “official” product terminology. Speakers also tend to adapt to their dialog partner, which is where cultural issues come into play: people speak differently to someone with a perceived age difference or a regional accent, characteristics that are much less apparent in a written dialog. All of these factors make speech a particularly intriguing focal point for new technologies and applications.
There is a lot of energy around consumer-focused mobile voice applications, driven by continued strong growth in smartphones worldwide and by the fact that network, database, and storage capacities have finally caught up with the requirements of speech recognition and text-to-speech technology. As users become more and more attached to their mobile phones and devices, they expect them to really “work” for them and help them accomplish tasks in new ways and in new situations: up to half of some mobile applications’ usage occurs while driving, which calls for “eyes-free” and hands-free modalities. Voice-enabled applications are the answer, and they need to move towards truly conversational interfaces. An interesting challenge is that users might use several voice-enabled applications at the same time, so the ideal speech recognition engine would process voice input across different applications and domains rather than requiring the user to issue clunky voice commands to switch between constrained-domain apps. We anticipate increased efforts to fully integrate speech into the OS and to ground it in a broad-coverage syntactic and semantic representation layer.
Enterprises, meanwhile, are looking to balance cost-saving demands with customer experience requirements for their IVRs (interactive voice response systems, as in phone banking or call center routing). Many large global corporations, from airlines to car rental agencies to health care providers, attended the conference to survey new solutions for efficiently handling incoming customer calls. Such a solution might include deep natural language processing (NLP), which is especially valuable for first-time caller and long-tail scenarios. Beyond the general challenges of implementing a high-quality automated system, linguistic and cultural challenges abound once you adapt the system for new markets. For example, a voice navigation system in the Japanese market might use a voice that sounds friendly, animated, and eager, demonstrating “helpfulness” – a style that would not work in the German market, where the same system might use a voice conveying “confidence” instead.
Another topic with a lot of buzz at SpeechTEK was speech analytics. While parts of these solutions overlap with text analytics (mining unstructured text data and turning it into actionable insights), speech analytics has additional information to work with: prosody (intonation, stress, and rhythm), pitch, pauses in the utterance, and how loud the speaker gets, all of which help analyze and categorize data. One might think of this as “sentiment analysis with additional dimensions.” There are many valuable applications for speech analytics in the enterprise, e.g. mining customer data from call centers, and elsewhere, as in the legal field and in forensics.
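To make two of those extra dimensions a little more concrete, here is a minimal Python sketch of how loudness and pauses might be read off a raw audio signal. It is an illustration only, not a description of any vendor's product: the function names (`frame_rms`, `find_pauses`), the frame size, and the silence threshold are all assumptions chosen for the example, and the "audio" is a synthetic tone rather than real speech. Real speech analytics systems use far more robust methods for voice activity and pitch.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame (a rough proxy for loudness)."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in starts
    ]

def find_pauses(rms, silence_threshold, min_frames):
    """Return (start_frame, end_frame) spans where energy stays below the threshold."""
    pauses, start = [], None
    for i, energy in enumerate(rms):
        if energy < silence_threshold:
            if start is None:
                start = i          # a quiet stretch begins here
        else:
            if start is not None and i - start >= min_frames:
                pauses.append((start, i))
            start = None
    # Handle a quiet stretch that runs to the end of the signal.
    if start is not None and len(rms) - start >= min_frames:
        pauses.append((start, len(rms)))
    return pauses

RATE = 8000  # samples per second (hypothetical telephone-quality audio)

def tone(amplitude, n_samples, freq=200.0):
    """Synthetic stand-in for speech: a sine wave of the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * freq * t / RATE)
            for t in range(n_samples)]

# Quiet "speech" (0.5 s), a pause (0.3 s), then louder "speech" (0.5 s).
signal = tone(0.2, 4000) + [0.0] * 2400 + tone(0.8, 4000)

frame_len = 200  # 25 ms frames at 8 kHz
rms = frame_rms(signal, frame_len)
pauses = find_pauses(rms, silence_threshold=0.01, min_frames=4)

for start, end in pauses:
    print(f"pause from {start * frame_len / RATE:.2f}s to {end * frame_len / RATE:.2f}s")
print(f"loudness before pause: {rms[0]:.3f}, after pause: {rms[-1]:.3f}")
```

Even in this toy form, the output carries information that a transcript alone would not: where the speaker fell silent, and that they came back louder afterwards.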
We at Butler Hill look forward to continuing to partner with companies developing consumer- and enterprise-focused speech technologies – and we’ll see you at SpeechTEK 2011!