Corpus-based Technologies, Web Search
Many companies are developing statistically based technologies that require large sets of training data. Our client was developing web search technology and required a corpus of human-annotated query-result relevance data. They had quality issues with their long-time vendor. Butler Hill Group was given a five-week trial, in which we outperformed the current vendor on all quality metrics. Since then, we have doubled the project size repeatedly and partnered with the client to improve specs, quality metrics and processes.
Language specification and data collection project, Text-to-speech
Some companies use in-house developers to extend products to new markets. For these teams we often provide language descriptions (e.g. definitions of character sets, lists of common and function words). In one of our recent projects, we were brought in to support a development team improving text-to-speech technology (used, for example, in directory assistance or navigation systems). We provided language- and country-specific address specifications and a data corpus conforming to these specs. The focus was on describing genuine abbreviation expansions in this domain.
Lexicon Maintenance, Proofing Tools
Customers who wish to extend their proofing tools beyond the English-speaking market often require word lists and language specifications (e.g. descriptions of morphological variation) for the languages of their new markets. We were tasked with updating existing lexica to reflect current word usage in multiple languages. We helped develop guidelines for lexicon inclusion and analyzed end-user data. Once the list of new words was determined, we imported the words using morphological templates to generate their inflected forms.
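As an illustration only (not the tooling used on the project), template-based generation of inflected forms can be sketched as follows. The template names, the suffix rules, and the `inflect` function are all hypothetical, and the rules cover only simple English suffixation; a real lexicon would need language-specific paradigms.

```python
# Hypothetical morphological templates: each template is a list of
# (suffix_to_strip, suffix_to_add) rules applied to the lemma.
TEMPLATES = {
    "noun_regular": [("", ""), ("", "s")],
    "verb_regular": [("", ""), ("", "s"), ("", "ed"), ("", "ing")],
    "verb_e": [("", ""), ("", "s"), ("e", "ed"), ("e", "ing")],
}


def inflect(lemma, template):
    """Return the inflected forms of `lemma` under the named template."""
    forms = []
    for strip, add in TEMPLATES[template]:
        if strip and not lemma.endswith(strip):
            raise ValueError(f"{lemma!r} does not fit template {template!r}")
        # Remove the suffix to strip (if any), then attach the new suffix.
        stem = lemma[: len(lemma) - len(strip)]
        form = stem + add
        if form not in forms:  # keep order, skip duplicates
            forms.append(form)
    return forms


if __name__ == "__main__":
    for lemma, template in [("selfie", "noun_regular"), ("tweet", "verb_regular")]:
        print(lemma, "->", ", ".join(inflect(lemma, template)))
```

Assigning each new word to a template means that only the lemma and its paradigm need to be reviewed by hand, while the inflected forms are produced mechanically.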
Search Engine Development
Companies in the search space often require word lists and morphological specifications. We supported a client extending a search engine for a new and complex language. We identified efficient ways to create and annotate frequency-based word lists and to expand existing morphological rules. Working with the client, we developed effective processes that were then also used to enhance several existing languages and will be applied to additional languages in the future.
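A frequency-based word list of the kind mentioned above can be sketched in a few lines. This is a minimal example under stated assumptions, not the process used for the client: the `frequency_wordlist` function, the `min_count` threshold, and the letters-only tokenizer are illustrative choices, and many languages would need a proper tokenizer.

```python
import re
from collections import Counter

# Assumption: a token is a run of Unicode letters. This splits on
# apostrophes and hyphens, which is too crude for many languages.
TOKEN = re.compile(r"[^\W\d_]+")


def frequency_wordlist(texts, min_count=2):
    """Return (word, count) pairs seen at least `min_count` times,
    sorted by descending count and then alphabetically."""
    counts = Counter()
    for text in texts:
        counts.update(tok.lower() for tok in TOKEN.findall(text))
    items = [(word, n) for word, n in counts.items() if n >= min_count]
    items.sort(key=lambda pair: (-pair[1], pair[0]))
    return items


if __name__ == "__main__":
    sample = ["The cat sat.", "The dog sat down.", "A cat ran."]
    for word, n in frequency_wordlist(sample):
        print(f"{word}\t{n}")
```

Sorting by frequency puts the most common words first, so annotators can concentrate on the entries that matter most for search coverage.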
Our Clients:
- ATG
- Educational Testing Service
- Getty Images
- IBM Research, Natural Language Processing
- Microsoft Live Search
- Microsoft Office and related groups including Natural Language Group, Tablet, Speech, Terminology
- Microsoft Research, Natural Language Processing
- Microsoft Ireland
- Nuance
- Vista Higher Learning