Tags: floss, speech, java, art
When I started fiddling with speech technologies back in 2010, I was interested in doing real-time part-of-speech tagging of radio broadcasts. That exceeded my technical knowledge on the subject, but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such a system is almost straight out of the Sphinx4 examples, but it took a bit of time configuring it and finding the right models to make it work. I used the aligner to make a funky installation in the Foulab booth at the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system, and this month I released it as open source. This blog post describes it in a little more detail: how I used it and how other people can use it for their own projects.
I will try to very briefly sketch how the state of the art in speech recognition works. For a more comprehensive discussion I recommend Dan Jurafsky and James H. Martin's book Speech and Language Processing (shameless plug: Dan was part of my thesis committee). In ASR (Automatic Speech Recognition), the input to the system is an audio file, a binary vector of samples taken from a sound wave. The output is a sequence of (English) words. The sound wave is segmented into parts corresponding to the different sounds to which we ascribe meaning (physical vowels and consonants, called phones). That transformation of binary vectors into sequences of phones is done by an Acoustic Model. While obtaining a sequence of phones might seem like all there is to ASR, it is only part of the puzzle: there are plenty of words that sound the same. To obtain actual (English) words from the sequence of phones, another model is added, a Language Model, which predicts the likelihood of each sequence of words. Now, these processes are error-prone, so instead of working on a fixed sequence of phones and words, they operate on an interconnected network of the most likely interpretations. The process of reading out the most likely sequence is called decoding. I like to think that this process is similar to the way the Fremen see the future in Dune, but that won't make any sense to you unless you have read that novel (and if you haven't, drop this blog and go read it, it's so much worth it). (Note: I have chosen not to talk about the noisy channel model nor HMMs, but if you want to understand how ASR truly works, those are the concepts to start searching for on the Internet.)
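To make the Acoustic Model / Language Model split concrete, here is a deliberately tiny Java sketch (the class, the word list, and the probabilities are all invented for illustration, not part of the aligner): for homophones the acoustic score is identical, so the language model alone breaks the tie.

```java
import java.util.*;

public class ToyDecoder {
    // A toy bigram language model: P(word | "at"). The numbers are invented.
    private static final Map<String, Double> LM = new HashMap<>();
    static {
        LM.put("two", 0.6);
        LM.put("too", 0.1);
        LM.put("to", 0.3);
    }

    // The phone sequence [T UW] maps to several homophones; the acoustic
    // model cannot choose between them, so the language model decides.
    public static String decode(List<String> homophones) {
        return homophones.stream()
                .max(Comparator.comparingDouble(LM::get))
                .get();
    }

    public static void main(String[] args) {
        String best = decode(Arrays.asList("two", "too", "to"));
        System.out.println("meet me at " + best); // prints "meet me at two"
    }
}
```

A real decoder does this over a whole lattice of phone and word hypotheses at once (that is the "interconnected network" above), but the tie-breaking principle is the same.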
The ASR package I used for this speech aligner is Sphinx4, the Java implementation of the open source Sphinx speech recognition engine. It started as a simple port but it is now their main implementation. It also helps dispel the myth that Java is a slow language (for instance, most of the code we wrote for the Watson system was in Java, and speed was of the utmost importance). The documentation on their website is excellent; for a deeper discussion I like this white paper.
A speech recognizer is not of much use without trained acoustic and language models. The lack of publicly available data to train machine learning models is a problem I have been concerned about for a while. Luckily, there are some excellent public domain projects to the rescue:
Project Gutenberg, which provides transcriptions of books in the public domain. (I'd like to stress the cultural importance of reasonable copyright terms: the current term in the US, life of the author plus 70 years, thanks to the "Mickey Mouse" extension laws, has created a "lost century" in terms of cultural heritage.)
Inspired by Project Gutenberg, the enthusiastic volunteers at LibriVox have been recording Project Gutenberg works as audiobooks, also in the public domain.
Finally, trained on this trove of data, VoxForge provides the necessary models.
With the models in hand, building a speech aligner requires just a small Java main that loads the audio and text files and sets the audio sources for Sphinx to work over. Most of the system is specified in the config.xml; special care must be taken there to point to the model files and folders in a way Sphinx4 can find them. A special threshold class was needed for this application, but I will talk about it in a later post. Please note: I wrote this code 5 years ago, so it is most probably using an old, outdated API; I am making it available because of its intrinsic utility.
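As an illustration of the kind of wiring involved, here is a hypothetical excerpt of such a config.xml in the Sphinx4 1.x format; the paths are placeholders for wherever the VoxForge files were unpacked, and property names can differ between Sphinx4 releases, so check the ones your version expects.

```xml
<config>
    <!-- Point the dictionary at the VoxForge model files.
         Paths here are placeholders, not the project's actual layout. -->
    <component name="dictionary"
               type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
        <property name="dictionaryPath" value="file:models/voxforge/etc/voxforge.dic"/>
        <property name="fillerPath"     value="file:models/voxforge/etc/filler.dic"/>
    </component>
</config>
```

Getting these `file:` URLs right (absolute vs. relative, files vs. folders) was most of the "special care" mentioned above.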
There is a catch, though: the alignment has to be done on pronounced words, not written ones. For example, if the text says "call me at 3PM", that will be pronounced something like "call me at three p m". Transforming written text into "text ready to be spoken aloud" is one of the many steps a Text-To-Speech (TTS) system performs. I have fished the required bits out of the Festival TTS system to make this work. This wasn't trivial and required writing some custom Scheme code; it may well be the biggest contribution behind this project. On a GNU/Linux computer with Festival TTS installed, use the tokens_to_words.pl script to obtain the spoken words to align.
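To give a flavor of what this normalization step does, here is a minimal Java sketch (my own toy illustration, not the Festival logic nor the tokens_to_words.pl script): ordinary words pass through, while tokens containing digits are expanded character by character. Real TTS front ends handle vastly more cases (dates, currency, abbreviations, and so on).

```java
public class TokensToWords {
    private static final String[] ONES = {"zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"};

    // Expand one token into the words a reader would say aloud.
    public static String expand(String token) {
        if (token.chars().allMatch(Character::isLetter)) {
            return token.toLowerCase();            // ordinary words pass through
        }
        StringBuilder out = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (out.length() > 0) out.append(' ');
            if (Character.isDigit(c)) {
                out.append(ONES[c - '0']);         // digits become number words
            } else {
                out.append(Character.toLowerCase(c)); // letters are spelled out
            }
        }
        return out.toString();
    }

    // Normalize a whole line, token by token.
    public static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(expand(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("call me at 3PM")); // call me at three p m
    }
}
```

The aligner then works over this spoken-word form, which is why the written text cannot be fed to it directly.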
To use the aligner, the easiest way is to change the command line in the build.xml (line 64) to point to the wave file and the text to align. Alternatively, you can build the classpath (or print it from ant) and execute Align directly.
With this aligner I distilled a vocabulary of word sample sounds (which I might release at some point) and used it to read aloud the Canadian copyright act. Words missing from the vocabulary (such as "copyright") were replaced by random instrument samples (why? because I could, of course) and printed on an old dot-matrix printer at the booth. No visitor was left undazzled.
Some potential next steps:
Use the aligner to produce diphone voices for Festival TTS.
Produce a larger dictionary of word sound samples and make it available for other art projects.
Make a full Java end-to-end system using FreeTTS to replace the current Festival/Scheme dependency.