Tags: floss, python, programming
One of the key tenets of open source development is to scratch your own itch, that is, to build something of use and value to yourself, its author (compare that with commercial development --building something for a customer-- or research --finding new solutions to challenging problems--). However, for a project to survive and to attract that elusive second contributor, it is important to make the project useful to others. Which brings us to the problem of needing some sort of user interface. Being a back-end person, I usually struggle to create them. Some of my projects (for example, see http://poissonmagique.net) remain in a state of "working back-end, good enough for me". That is of course useful enough for me but a far cry from useful to anybody else. I have recently found a relatively new technology and workflow that might let me escape this local equilibrium: widgets for Jupyter Notebook.
Tags: floss, speech, java, art
When I started fiddling with speech technologies back in 2010, I was interested in doing real-time part-of-speech tagging of radio broadcasts. That exceeded my technical knowledge on the subject, but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such a system is almost straight out of the Sphinx4 examples, but it took a bit of time to configure it and to find the right models to make it work. I used the aligner to make a funky installation in the Foulab booth at the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system, and I released it as open source this month. This blog post describes it in a little more detail: how I used it and how other people can use it for their own projects.
Tags: Open Source, Challenge, Git, GitHub
Since 2012 I have been participating in an inspiring open source challenge called "24 pull requests". This post is an extended version of a lightning talk I gave at the Observe, Hack, Make hackercamp in The Netherlands in 2013.
Tags: Machine Learning
A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evaluate the results, the researchers employed cross-validation, a very popular technique in machine learning where the training set is split into n parts (called folds). The machine learning system is then trained and evaluated n times, each time trained on all the training data minus one fold and evaluated on the remaining fold. In that way it is possible to obtain evaluation results over a set the same size as the training set without committing the "mortal sin" of evaluating on training data. The technique is very useful and widely employed. However, it doesn't stop you from overfitting at the methodological level: if you repeat multiple experiments over the same data, you will gain enough insight into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not very easy to spot, due to the Warm Fuzzy Feeling (TM) that comes with using cross-validation. That is, many times we as practitioners feel that by using cross-validation we buy some magical insurance policy against overfitting.
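To make the folding concrete, here is a minimal sketch of the procedure described above, in plain Python (the function names and the trivial evaluation callback are mine, for illustration only; in practice you would reach for a library implementation):

```python
import random

def cross_validate(data, n_folds, train_and_eval, seed=0):
    """n-fold cross-validation: train on n-1 folds, evaluate on the
    held-out fold, and average the n evaluation scores."""
    data = list(data)                  # don't mutate the caller's list
    random.Random(seed).shuffle(data)  # shuffle so folds aren't ordered
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds)
                    if j != i for x in fold]
        scores.append(train_and_eval(training, held_out))
    return sum(scores) / n_folds
```

Note that every data point gets evaluated exactly once, on a model that never saw it during training. That is the property the technique buys you; what it does not buy you is protection against re-running this loop many times while tuning on the same data.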
Tags: Natural Language Processing, Computational Linguistics, Graduate School
A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques within graduate research projects outside NLP itself.
Given the nature of the talk, I did not find it that useful to share the slides, as I do for my other talks. Instead I'm putting that content into this blog post.
Tags: PDF, MEncoder, YouTube
I have made the slides and audio recordings of the classes available on-line at the class' site (in Spanish, sorry) under a CC-BY-SA license. I recently dovetailed the audio and PDF into videos, which I'm uploading to a YouTube playlist. In this blog post I want to describe the tools I used to record and create the final videos.
Tags: IRC, QA
If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" for its development process as well: development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.
A number of key people in the Tiki world live around Montreal; I have met some of them and have been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year, so I got to attend for a few days and work on an interesting project: mining question/answer pairs from the #tikiwiki history. While the topic of mining QA pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)
The process involves (see the page on dev.tiki.org for plenty of details):
To avoid having to annotate training data for question identification, I'm using an approximation: finding IRC nicks that have said only 2 to 10 things in the whole logging history. The expectation is that such users appear on #tikiwiki, ask a question, receive an answer, and leave.
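The heuristic above can be sketched in a few lines, assuming the IRC logs have already been parsed into (nick, message) pairs (the function name and thresholds below are mine; the actual pipeline lives on dev.tiki.org):

```python
from collections import Counter

def drive_by_nicks(messages, low=2, high=10):
    """Return nicks that spoke between `low` and `high` times over the
    whole log history. `messages` is an iterable of (nick, text) pairs.
    Regulars talk far more; one-off noise talks less."""
    counts = Counter(nick for nick, _ in messages)
    return {nick for nick, n in counts.items() if low <= n <= high}
```

The messages by those nicks are then treated as likely questions, and the replies around them as candidate answers, without any manual annotation.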
If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.
Tags: UIMA, Python
In a new project we will be deploying in the upcoming weeks we are searching for job postings using (among other things) a large dictionary of job titles.
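As a back-of-the-envelope illustration of that kind of dictionary matching (the function name and the toy title list are made up for this post; the real system uses a much larger dictionary):

```python
import re

def find_titles(text, titles):
    """Scan `text` for any known job title, preferring the longest match
    (so "senior software engineer" beats "software engineer")."""
    alternation = "|".join(re.escape(t)
                           for t in sorted(titles, key=len, reverse=True))
    pattern = r"\b(?:%s)\b" % alternation
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]
```

A single compiled alternation like this works for a toy dictionary; with tens of thousands of titles you would want a trie-based or annotator-based matcher instead.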
The original version of this post was at the (now defunct) Hack the Job Market blog from MatchFWD, an (also now defunct) local startup. I reproduce it here.
Tags: debian, sysadmin, encryption
Update: The situation seems to have improved, see this comment.
I run Debian testing (currently codenamed 'wheezy') on both my desktop and my laptop. For backups, I have been using an AES256-encrypted ZFS external USB hard drive, with date-based snapshots and deduplication. I have two hard drives, one of which I keep off-site, and I do weekly (or so) backups. I have the same set-up on my laptop, and I have done some backup-recovery drills from it (from the latest backup, not from the snapshots, as zfs-fuse doesn't support the .snapshot folder yet).
This solution worked quite well until I went to Argentina to teach. There, while I upgraded Debian testing on the laptop, FUSE stopped working properly in strange ways. When I got back to Canada and tried to back up my desktop, zfs-fuse was broken there too. The problem took quite a bit of work to fix, so I'm documenting the problem and its solution here (in a nutshell: loop-AES, in the package loop-aes-utils, diverts /bin/mount to a version that is incompatible with FUSE; the solution is to stop using loop-AES and move to dm-crypt).