Because The Air Is Free -- Pablo Duboue's Blog

Quick Prototyping Use Cases with Widgets in Jupyter Notebook

Sat, 02 Jan 2016 23:07:32 -0500

Tags: floss, python, programming

One of the key tenets of open source development is to scratch your own itch, that is, to build something of use and value to its authors (compare that with commercial development --building something for a customer-- or research --new solutions to challenging problems--). However, for a project to survive and to attract that elusive second contributor, it is important to make the project useful to others. Which brings us to the problem of having some sort of user interface. Being a back-end person, I usually struggle to create them. Some of my projects (for example, see http://poissonmagique.net) remain in a state of "working back-end good enough for me". This is of course useful enough for me but a far cry from useful to anybody else. I have recently found a relatively new technology and workflow that might enable me to get out of this local equilibrium: widgets for Jupyter Notebook.

Building a Speech Aligner in Java with Sphinx4 and VoxForge models

Wed, 16 Dec 2015 01:38:27 -0500

Tags: floss, speech, java, art

When I started fiddling with speech technologies back in 2010, I was interested in doing real-time part-of-speech tagging of radio broadcasts. That exceeded my technical knowledge on the subject, but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such a system is almost straight out of the Sphinx4 examples, but it took a bit of time configuring it and finding the right models to make it work. I used the aligner to make a funky installation in the Foulab booth for the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system and I released it as open source this month. This blog post describes a little bit more about it, how I used it and how other people can use it for their own projects.

24 Pull Requests: What, Why and How

Sun, 06 Dec 2015 22:04:25 -0500

Tags: Open Source, Challenge, Git, GitHub

Since 2012 I have been participating in an inspiring open source challenge called "24 pull requests". This post is an extended version of a lightning talk I gave at the Observe, Hack, Make hackercamp in The Netherlands in 2013.

Overfitting Machine Learning Experiments: When Cross-validation is No Silver Bullet

Mon, 30 Nov 2015 13:07:43 -0500

Tags: Machine Learning

A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evaluate the results, the researchers employed cross-validation, a very popular technique in machine learning where the training set is split into n parts (called folds). The machine learning system is then trained and evaluated n times, each time trained on all the training data minus one fold and then evaluated on the remaining fold. In that way it is possible to have evaluation results for an evaluation set of the same size as the training set without committing the "mortal sin" of evaluating on the training data. The technique is very useful and widely employed. However, that doesn't stop you from overfitting at the methodological level: if you repeat multiple experiments over the same data, you will gain enough insights into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not very easy to spot due to the Warm Fuzzy Feeling (TM) that comes with using cross-validation. That is, many times we as practitioners feel that by using cross-validation we buy some magical insurance policy against overfitting.
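
To make the fold mechanics concrete, here is a minimal sketch of n-fold cross-validation in plain Python. The function names (k_fold_indices, cross_validate) and the toy majority-label "learner" are assumptions made up for illustration; they are not taken from the talk or from any particular library.

```python
import random

def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle the item indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(data, labels, n_folds, train_fn, predict_fn):
    """Return one prediction per item, each made by a model that never saw that item."""
    folds = k_fold_indices(len(data), n_folds)
    predictions = [None] * len(data)
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(data)) if i not in held_out_set]
        # Train on everything except the held-out fold...
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # ...and evaluate only on the held-out fold.
        for i in held_out:
            predictions[i] = predict_fn(model, data[i])
    return predictions

# Hypothetical toy learner: always predicts the most frequent training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

data = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
predictions = cross_validate(data, labels, 5, train_majority, predict_majority)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

Every item receives a prediction from a model trained without it, which is what makes the score look trustworthy. Nothing in this loop protects against rerunning it many times with different features or parameters and keeping the best score. One common guard is to set aside a final test set that is only looked at once.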

NLP survival tips for non-NLP Graduate Students

Wed, 25 Nov 2015 07:30:31 -0500

Tags: Natural Language Processing, Computational Linguistics, Graduate School

A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques within graduate research projects outside NLP itself.

Given the nature of the talk, I did not think sharing the slides, as I do for my other talks, would be that useful. Instead I'm putting that content into this blog post.

Turn your class audio and PDF into YouTube videos using Free Software Tools

Fri, 20 Nov 2015 08:31:15 -0500

Tags: PDF, MEncoder, YouTube

Last year I taught a graduate-level, semester-length class in Machine Learning over Large Datasets at Facultad de Matemática, Astronomía y Física de la Universidad Nacional de Córdoba, in Argentina.

I have made the slides and audio recordings of the classes available on-line at the class's site (in Spanish, sorry) under a CC-BY-SA license. I have recently been dovetailing the audio and PDF into videos, which I'm uploading to a playlist on YouTube. In this blog post I want to describe the tools I used to record and create the final videos.

Mining QA Pairs from IRC Logs Using Simple Heuristics and a Chat Disentangler

Mon, 22 Apr 2013 03:56:46 -0400

Tags: IRC, QA

If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" for its development process too: the development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.

A number of key people in the Tiki world live around Montreal, and I have met some of them and have been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year, so I got to attend for a few days and work on an interesting project: mining Question/Answer pairs from the #tikiwiki history. While the topic of mining QA pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)

The process involves (see the page on dev.tiki.org for plenty of details):

  • Downloading the logs
  • Normalizing the IRC logs across different IRC logging clients
  • Identifying users that are likely to have asked questions
  • Identifying the threads where said users participated
  • Assembling the final corpus
  • Indexing

To avoid having to annotate training data for the question identification, I'm using the approximation of finding IRC nicks that have only said 2 to 10 things in the whole logging history. The expectation is that said users appeared on #tikiwiki, asked a question, received an answer and left.
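
The nick-counting approximation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log line format ("[HH:MM] <nick> message"), the function names and the sample log are all made up here, and the real logs need the normalization step listed above before anything like this would apply.

```python
import re
from collections import Counter

# Assumed (hypothetical) normalized log format: "[12:34] <nick> message"
LINE_RE = re.compile(r"^\[\d{2}:\d{2}\] <([^>]+)> (.*)$")

def count_utterances(lines):
    """Count how many lines each nick has said across the whole log."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def likely_askers(lines, low=2, high=10):
    """Nicks that spoke between low and high times: probable question askers."""
    counts = count_utterances(lines)
    return sorted(nick for nick, n in counts.items() if low <= n <= high)

# Made-up sample log: one drive-by asker, one regular, one lurker.
sample_log = [
    "[10:00] <newbie42> how do I enable the wiki feature?",
    "[10:01] <regular> go to the admin panel and check Wiki",
    "[10:02] <newbie42> thanks, that worked",
    "[10:05] <lurker> hi",
] + ["[11:%02d] <regular> chatter" % i for i in range(12)]

askers = likely_askers(sample_log)
```

The lower bound drops nicks that only said one thing (a greeting, say), while the upper bound drops the channel regulars, who are more likely to be the ones answering.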

For the extraction step, I'm using a publicly available implementation of Disentangling chat (2010) by Elsner and Charniak.

If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.

Using Apache UIMA Concept Mapper Annotator with Python via JPype

Wed, 28 Mar 2012 17:44:15 -0400

Tags: UIMA, Python

I have been using a lot of Python lately in work for a customer. Programming in Python has many positives, but when it comes to processing large amounts of text, I still choose Apache UIMA.

In a new project we will be deploying in the upcoming weeks we are searching for job postings using (among other things) a large dictionary of job titles.

This is the task for the Apache UIMA Concept Mapper Annotator, which is part of the Apache UIMA Addons.
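
To give a feel for what dictionary-based lookup of job titles means, here is a toy sketch in plain Python. It is not the Concept Mapper API and does not use UIMA or JPype at all; the function names, the greedy longest-match strategy and the sample titles are assumptions chosen purely for illustration.

```python
def build_dictionary(titles):
    """Map each title's lowercase token tuple back to its canonical form."""
    return {tuple(title.lower().split()): title for title in titles}

def find_titles(text, dictionary):
    """Greedy longest-match lookup of dictionary entries over whitespace tokens."""
    tokens = text.lower().split()
    max_len = max((len(key) for key in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shrink it.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, dictionary[candidate]))
                i += length
                break
        else:
            # No dictionary entry starts at this token; move on.
            i += 1
    return matches

# Hypothetical dictionary and posting text.
titles = ["Software Engineer", "Senior Software Engineer", "Data Analyst"]
dictionary = build_dictionary(titles)
matches = find_titles(
    "We are hiring a senior software engineer and a data analyst", dictionary)
```

Each match is a (token position, canonical title) pair. Preferring the longest span keeps "Senior Software Engineer" from being reported as plain "Software Engineer".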

The original version of this post was at the (now defunct) Hack the Job Market blog from MatchFWD, an (also now defunct) local startup. I reproduce it here.

Loop-AES blues on Debian testing

Fri, 22 Apr 2011 23:33:44 -0500

Tags: debian, sysadmin, encryption

Update: The situation seems to have improved, see this comment.

I use Debian testing (currently codenamed 'wheezy') on both my desktop and laptop. For backups, I have been using an AES256-encrypted ZFS external USB hard drive, with date-based snapshots and deduplication. I have two hard drives, one of which I keep off-site, doing weekly (or so) backups. I have the same set-up on my laptop, and I have done some drill backup recoveries from it (from the latest backup, not from the snapshots, as zfs-fuse doesn't support the .snapshot folder yet).

This solution had been working quite well until I went to Argentina to teach. There, when I upgraded Debian testing on the laptop, FUSE stopped working properly in strange ways. When I got back to Canada and tried to back up my desktop, zfs-fuse was also not working. The problem took me quite a bit of work to fix, so I'm documenting here the problem and its solution (in a nutshell: loop-AES, in the package loop-aes-utils, diverts /bin/mount to a version that is incompatible with FUSE; the solution is to stop using loop-AES and move to dm-crypt).
