Started tinkering with speech technologies

Wed, 03 Nov 2010 06:51:07 -0400

Tags: scientific, hackerspace, floss


Last week was pretty intense, putting a somewhat academic knowledge of speech technologies into action. My previous real-world experience was maintaining and integrating a concept-to-speech system for a complex medical report system back in graduate school. That, of course, entailed listening to hours and hours of computer-generated speech. Joy of joys *grin*. The good part is that, once you get used to it, learning emacspeak is not so difficult (but that is a subject for another post). Still, even though I have been quite literally surrounded by speech scientists for the last ten years (I had two on my thesis committee), I had never taken the time to get properly acquainted with the two main tasks of speech technologies: training an ASR system and building a speech synthesis voice.

Interestingly, there is an art installation project I have been toying with for a while. Its provisional name is "The Talk of The Town" (TToTT for short) and it involves GNU Radio, some very light speech recognition and word concatenation. The idea is to switch on a radio and listen to small segments from different radio stations, as if you were browsing the airwaves. The key part is that the small audio segments from each radio station are actually complete, discernible words, and the words coming from the different stations make up a syntactically correct message. The semantics of the message are meant to be meaningless in general, but some emergent message may appear due to the similarity of topics on the airwaves (thus the name, "The Talk of The Town"). This project is in its early stages and, judging from my experience in the past week, it won't be finished anytime soon. I'm keeping the relevant material at (there is nothing there beyond the above description at the moment).

Even though the idea of building TToTT has been circling in my thoughts for a while, what triggered me to start working on it was Foulab, the upcoming mini Maker Faire Ottawa, and my desire to have something to share at the show.

So I decided to do a proof of concept using CMU Sphinx and a bunch of Ottawa podcasts (to make it local). Getting CMU Sphinx to work was an insane amount of work. I finally went for CMU Sphinx 4, which is written in a language I am very familiar with (Java) and has the best engineering so far. Still, the models available through the Sphinx project are really lacking (5,000 words) or incomplete (as far as I could tell). Having dealt with the realities of distributing research corpora myself, I think it is really great that the people behind Sphinx put in all this effort (and I am hoping I can contribute something to the project at some point), but without the source speech it is very difficult to troubleshoot the ASR. And I don't have $25k for a membership to the Linguistic Data Consortium, where their training data originates. Initially, I went around in circles using the default WSJ model, thinking I had something wrong with my setup, until I found this page on their site:

A simple Sphinx-4 application that transcribes a continuous audio file that has multiple utterances and creates a lattice using the results. The default file, called "10001-90210-01803.wav", contains three utterances, separated by silences. Please note that the recognition accuracy will be very poor due to the choice of wav file.


I heard: what's there owns their o. zero one


I heard: dynamics to one oh


I heard: zero one eight searle three

So you feed numbers in and get this output back: a word error rate (WER) of 50% or more. Clearly the model is really lacking, but without this official confirmation it is very easy to assume otherwise.
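For readers who have not met WER before, here is a minimal sketch of how it is usually computed: the word-level edit distance (substitutions, deletions and insertions) between what was said and what was recognized, divided by the number of words actually said. The function name `word_error_rate` is my own, and the reference transcript in the usage note is an assumption based on the file name (I am guessing the last utterance was read as "zero one eight zero three"), so take the number as illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra recognized word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Assumed reference for the third utterance; not taken from the Sphinx docs.
    print(word_error_rate("zero one eight zero three",
                          "zero one eight searle three"))  # 0.2
```

Note that insertions count as errors too, so a WER can go above 100% when the recognizer outputs more words than were spoken.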

But hope is not lost. ASR technology is very important, particularly for people with disabilities, and the good folks at VoxForge have assembled a really nice 40,000-word ASR model with LDA features (this is among the things I learned this week, it has been a long week!). Using it made things a little better, but I was still getting a WER in excess of 70% with the Ottawa podcasts. Even the word segmentation was not correct.

Finally, I turned to LibriVox, where volunteers read Project Gutenberg e-books and put the resulting recordings into the public domain. As I was saying, Sphinx4 is really very nicely engineered, so it can easily do alignment instead of recognition. My initial test (Chapter I of The Golden Snare by James Oliver Curwood) has yielded astonishingly good results, although it required some manual trimming of the input text and waves. I am now planning to streamline this manual process, because this aligned corpus can be fantastic for a variety of uses. For instance, it can provide high-quality speech voices and improved ASR for the Free and Open Source world (which is in dire need of nice, user-friendly, well-trained tools). Which brings me to the last item this week: speech synthesis.

Happy with the alignments I am getting from CMU Sphinx4 and LibriVox, I started looking into Festival for a better blending of the word samples. The FestVox project deals with adding new voices to Festival and it includes a 200+ page work-in-progress book detailing the process. So I have been digesting it chapter after chapter and deciding where to go from here.
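To make "word samples" concrete, this is roughly what the naive approach looks like before any Festival-style blending: cut each aligned word out of the recording and glue the pieces back to back. It is only a sketch using Python's standard `wave` module. The alignment format, a list of `(word, start_seconds, end_seconds)` tuples, and the names `slice_words` and `join_clips` are hypothetical and not what Sphinx4 actually outputs, and it assumes all clips share the same sample rate, sample width and channel count.

```python
import wave


def slice_words(wav_path, alignment):
    """Cut one clip of raw PCM frames per aligned word.

    `alignment` is a hypothetical list of (word, start_sec, end_sec)
    tuples; it is an assumption, not the Sphinx4 output format.
    """
    clips = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for word, start, end in alignment:
            first = round(start * rate)
            last = round(end * rate)
            src.setpos(first)
            clips.append((word, src.readframes(last - first)))
    return params, clips


def join_clips(out_path, params, frames_list):
    """Write the clips back to back, with no smoothing at the joins."""
    with wave.open(out_path, "wb") as dst:
        # The frame count in `params` is corrected when the file is closed.
        dst.setparams(params)
        for frames in frames_list:
            dst.writeframes(frames)
```

Joining clips this way produces audible jumps at word boundaries, which is exactly the problem a proper synthesis voice is meant to smooth over.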

In the meantime, for this weekend's demo, at Foulab we decided to go for something involving coherent speech and we settled on voicing the Canadian Copyright Act, as it fits the tone of a Maker Faire in Ottawa. My personal target is to voice it with five different voices from LibriVox, so time is tight. Last but not least, if you have a nice voice, time to give back to humanity and reasonable recording accommodations, consider contributing to VoxForge and LibriVox.

