Today’s speech transcription systems are built on technology that was originally developed for English and a small subset of world languages, with markedly lower performance on other languages. With more than 7 000 languages in the world, methods to build speech recognition technology for a much larger set of languages are required. The BABEL project was an international collaborative project aimed at solving the spoken term detection task in previously unstudied languages.
International collaboration
The BABEL project was initiated and funded by the Intelligence Advanced Research Projects Activity (IARPA) of the United States Government. Initially four consortia participated in the project, led by IBM, Carnegie Mellon University (CMU), the International Computer Science Institute (ICSI), and Raytheon BBN Technologies (BBN), respectively. Each consortium consisted of five to seven partners from both industry and academia around the world.
MuST (Multilingual Speech Technologies), a research niche area of the Faculty of Economic Sciences and Information Technology of the NWU's Vanderbijlpark Campus, was invited to participate in the BabelOn consortium, led by BBN. Partners were MIT and Johns Hopkins University (USA); Brno University of Technology (Czech Republic); and LIMSI and Vocapia Research (France). According to Prof Marelie Davel, Director of MuST, this was a great opportunity: “We had an excellent team to work with and were able to learn much, while contributing specific knowledge we had of working in under-resourced, multilingual environments”.
More about the project
The goal of the BABEL project was to develop methods for building speech recognition technology for a much larger set of languages than had ever been addressed before. The project required innovations in rapidly modelling a novel language with significantly less training data, data that was also much noisier and more heterogeneous than that used by state-of-the-art approaches at the time. BABEL's technical measures of success focused on how effectively word-based searches could be performed on noisy speech in the languages being investigated.
BABEL’s most ambitious goal was to demonstrate the ability to generate a speech transcription system for any new language within one week and to support keyword search for effective triage of massive amounts of speech recorded in challenging real-world situations. After practicing on a variety of languages each year, consortia were evaluated on an annual surprise language with limited resources and time to build speech transcription systems.
The entire project was organised as an annual challenge, with transcribed speech data for 'development languages' made available during the course of the year. The number of languages per year kept increasing, in order to encourage teams to develop language-independent technologies. At the end of each year, a 'surprise language' was released, and each team had a limited amount of time to create a fully fledged speech recognition and spoken term detection system for that language. This approach was very effective and provided an element of friendly but fierce competition over the four and a half years of the project. It also helped to build partnerships and networks, many of which still exist today.
All the research staff at MuST were actively involved in the project. MuST focused on pronunciation modelling and creating subword units for subword-based modelling, using minimal resources.
A successful project
The success of the project can be measured as follows. Before the project was initiated, only a limited number of languages could be transcribed and monitored for keyword recognition, and between 100 and 1 000 hours of recordings were typically required to develop a transcription and keyword search system for a language. By the end of the project, only 10 to 40 hours of recordings were needed, and these did not have to be made under ideal conditions: they could come from varied real-world settings such as home offices (landline or mobile), public places, streets, vehicles and phone car kits. Most significant of all, the time needed to develop a transcription system for a new language, originally several months to a year, was reduced to one week or less by the end of the project. In fact, the final system created by the BabelOn team was built in 2.5 days, with no knowledge of the language before that point.
Only selected consortia were funded at the end of each year’s challenge, with teams eliminated based on annual performance. During the final challenge (2016), only the BBN and IBM teams remained, with the BabelOn team (BBN) achieving the highest accuracies in the final evaluation by a significant margin. When the BABEL project ended in September 2016, an astounding number of new techniques had been created and demonstrated, making speech technology systems in under-resourced languages a reality.
Members of the BabelOn team at the final Principal Investigators’ meeting in Baltimore, Maryland. From the left: Karthik Narasimhan (MIT), John Makhoul (BBN), Sri Harish Mallidi (JHU), Damianos Karakos (BBN), Richard Hsiao (BBN), Hynek Hermansky (JHU), Tanel Alumäe (BBN), William Hartmann (BBN), Marelie Davel (NWU MuST), Neil Kleynhans (NWU MuST), Stavros Tsakalidis (BBN), František Grézl (BUT), Lori Lamel (LIMSI), Richard Schwartz (BBN), Jean-Luc Gauvain (LIMSI) and Martin Karafiát (BUT).