Because The Air Is Free -- Pablo Duboue's Blog

RSS Feed

Programming With Computers, Partnering With Machines To Create Programs

Thu, 30 Jun 2016 12:21:25 -0400

Tags: academic, philosophical


I have been invited to write a book chapter on lexical choice for translators (contact me if you want to see a preprint). To get acquainted on this audience different from my usual computer science I read a few papers on professional translators use of technology. Two of them are quite interesting and I recommend them not only because they make for a good read and they have implications outside translation: Translation Skill-sets in a Machine-translation Age by Anthony Pym (2013) and Is Machine Translation Post-editing Worth the Effort?: A Survey of Research into Post-editing and Effort by Maarit Koponen (2016). This search finished by reading a short ebook by researchers at the MIT Center for Digital Business titled Race Against the Machine: How the Digital Revolution Is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. In that book plus the papers there's this call for humans, if we want to remain employed, to hybridize our work and to seek out ways to work with the computer as some sort of partnership. That process is clear in human translation: checking from previously translated similar sentences or the output of machine translation (instead of creating new translations from scratch).

The question is then what about our trade? What it means to be working on a partnership with the computer rather than for the computer? As other people, I have argued that machine learning (more specifically supervised learning) is akin to traditional programing (in the old soft computing style). It follows many of the pros and cons of the redefined labor of the human translators.

But that's not all. Other areas of programmer / computer partnership that are less deployed (but nonetheless quite explored in the scientific space) are declarative programming techniques for both program verification and program synthesis using automatic theorem provers. The idea here is that instead of writing test cases you write test cases generators and the property checkers for the output of your programs over those generated test cases. I have experience with the Haskell library QuickCheck2 and it's quite pleasant to use (Thanks to teaching me how to use that library, gracias che!). There are now similar libraries for other programming languages. How can this be described as a programmer / computer partnership, you might ask? At the end is just another test framework. The difference is in the type of task the human is doing (enunciating properties) and the computer (doing the grunt of checking the said properties). Traditional unit testing has much more grunt work on the side of the programmer.

That focus on overall properties rather than the code behind it bring us to the hope of automatic programming using theorem provers. There has been some massive improvement in theorem prover capabilites using general SAT solvers in recent years. Maybe it's time this new technology start finding its way into the desktops of professional developers.

Now these skills are different from regular developers. The same can be said from machine learning. Many great practitioners in machine learning ("data scientists") are average / poor developers but come with backgrounds in engineering or science that makes them thrive in an extended programming task considering supervised machine learning as programming. It reminds me of the fact (as brought by Race Against The Machine) that the best chess players in present times are neither humans nor computer but a thriving partnership of not necessarily the best humans nor the best computers.

Borrowing a page from the experience of human translators, there'll be a time when painstaking 100% human created programs will be deemed too expensive for most but few mission critical situations. And the rest will be created by a redefined computer professional. At this stage this is a mental exercise but given the example from human translators, definitely an exercise worth engaging.

Using Recommendation Systems To Bring A Human Touch To Billions of Humans

Tue, 03 May 2016 03:55:31 -0400

Tags: academic, philosophical


Since I was born, the planet population increased by 50%. (I even heard half the humans ever existed are alive, that's false, more like 6%.) This is all anecdotal but my recollections from childhood speak of a place with just fewer people, where shopkeepers know you by name and expected you to buy certain items regularly. They would know what you like and bring products catering to their audience. Such experience for the most part is lost (it might remain in small towns and such).

Interestingly, human population was rather small for tens of thousands of years. Our human expectations about relating to each other are in line with small communities. This topic is outside my realm, but I heard about it before, Google points to an Urban Ecology paper from 1978 that says

the persistent human propensity to identify with small groups is a consequence of our evolution as a mammal

A machine learning / information retrieval technique that has grown immensely in popularity in the last two decades are recommendation systems. When teaching them before (see my class on the topic, in Spanish), I realized they bring that much needed small village feeling to on-line transactions. When you enter Amazon or Netflix, they "know" you and recommend things based on what they know about you. It is paradoxical that we now need computers to bring a much needed human touch to our interactions.

Moreover, some of the techniques used in recommendation systems (such as user-based recommender), build such small villages as part of their algorithms. In that sense, when Netflix recommends you to watch Anastasia for the fourth time, it actually has enough information to recommend you other people who, you never know, might actually want to form a small village with you.

What Random Forests Tell Us About Democracy

Thu, 21 Apr 2016 05:15:16 -0400

Tags: academic, political


A popular method for learning from large data sets is Random Forests (see my class on the topic, in Spanish). I would like to drive a paralellism between the way they work and our political decision structures and the so called Wisdom of the crowd.

Random Forests are what is called an ensemble method as they perform better than individual methods by combining their results. The individual method used in Random Forests are Decision Trees, trained from a subset of all the available data (and because of this property of operating on subsets of the data, they are a good method for applying on large datasets).

More interestingly, Random Forests (as discussed in the Machine Learning article by Leo Breiman in 2001), can not only train each of their trees on a subset of the data but also use a subset of the available information (features) when training each decision node in the tree. That makes each of the trees that are part of the ensemble truly random! When creating each individual tree we only see a subset of the data and only a subset of its characteristics. To decide the outcome of the decision, each of these random trees is given a vote. The most voted decision wins.

Now, the "magical" part is that they perform better than a decision tree trained on all available data. Even if the tree were made "smart" by prunning poorly constructed branches (the trees that make the ensemble are unpruned). And they are so high performant that a recent comparative study of 179 different classifiers found them to be consistently top performing across a large set of problems.

Now, if you think for a second, this is the way direct democracy works: each voter has access to a subset of the information and only sees that subset from a particular perspective (their own unique perspective). By using a majority vote, we are actually implementing a Random Forest. And from the theory (Breiman paper is quite delightful) we can see that we don't need more informed voters, just more of them. Food for thought.

The Spouse of the PhD Candidate

Tue, 05 Apr 2016 03:53:31 -0400

Tags: personal


Many moons ago, I did my PhD. That was years of hard-ship, unknowns, anxiety. But also self-discovery, with plenty of fun and exciting times.

Then I met a wonderful woman, we got married and she decided to go back to school and pursue a PhD on her own. I have to admit the process of accompanying a loved one through graduate school is far, far worse than going through it yourself. Even knowing full well the challenges ahead, it is much worse to have to see her suffer through them without being able to do anything about it.

The feeling of helplessness while watching a person you care for deeply being in distress is nothing like I have experienced before. We were lucky the decision of going back to graduate school was well thought out and discussed at length even before she applied. But even then there were years the whole process took a terrible toll on our marriage.

It is thus I want to extend my congratulations to Dr. Ying for successfully defending her PhD thesis entitled "Code Fragment Summarization"; to all other PhD candidate spouses out there, I hear your pain. There is light at the end of the tunnel!

Turning Your Class Project Into a Published Academic Paper

Thu, 18 Feb 2016 08:09:39 -0500

Tags: academic, research


In recent years, teaching has been to me a great source of satisfaction. A common situation when teaching is for a student to want to turn a class project into a full-fledged research publication. I have decided to put my ideas on the topic in this blog posting for helping students in such situation. This is also relevant to me as this year marks the 20th anniversary of my first international publication, which started as a class project.

First and foremost, for the students reading this, congratulations on successfully completing an outstanding class project! Even if the path to publication is arduous and can be discouraging, the fact you are entertaining the idea of publishing your work to a larger audience is a success on itself. Now, there are as many reasons to publish as there are class projects. Some are better than others, let's look at them in turn.

Quick Prototyping Use Cases with Widgets in Jupyter Notebook

Sat, 02 Jan 2016 23:07:32 -0500

Tags: floss, python, programming


One of the key tenants of open source development is to scratch your own itch, that is, to build something of use and value to their authors (compare that with commercial development --building something for a customer-- or research --new solutions to challenging problems--). However, for a project to survive and to attract that elusive second contributor, it is important to make the project useful to others. Which bring us to the problem of having some sort of user interface. Being a back-end person, I usually struggle creating them. Some of my projects (for example, see remain in a state of "working back-end good enough for me". This is of course useful enough for me but a far cry from useful to anybody else. I have found recently some relatively new technology and workflow that might enable to get out of this local equilibrium: widgets for Jupyter Notebook.

Building a Speech Aligner in Java with Sphinx4 and VoxForge models

Wed, 16 Dec 2015 01:38:27 -0500

Tags: floss, speech, java, art


When I started fiddling with speech technologies back in 2010, I was interested in doing a real-time part of speech tagging of radio broadcasts. That exceeded my technical knowledge on the subject but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such system is almost straight out of Sphinx4 examples but it took a bit of time configuring and finding the right models to make it work. I used the aligner to make a funky installation in the Foulab booth for the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system and I released it open source this month. This blog post describes a little bit more about it, how I used it and how other people can use it for their own projects.

24 Pull Requests: What, Why and How

Sun, 06 Dec 2015 22:04:25 -0500

Tags: Open Source, Challenge, Git, GitHub


Since 2012 I have been participating in an inspiring open source challenge called "24 pull requests". This post is an extended version of a lighting talk I gave at the Observe, Hack, Make hackercamp in The Netherlands in 2013.

Overfitting Machine Learning Experiments: When Cross-validation is No Silver Bullet

Mon, 30 Nov 2015 13:07:43 -0500

Tags: Machine Learning


A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evalute the results, the researchers employed cross-validation, a very popular technique in machine learning where the train set is split in n parts (called folds). The machine learning system is then trained and evaluated n times, each time in all the training data minus one fold and then evaluated in the remaining fold. In that way it is possible to have evaluation results for an evaluation set of the same size as the train set without doing the "mortal sin" of evaluating in training. The technique is very useful and widely employed. However, that doesn't stop you from overfitting at the methodological level, meaning if you repeat multiple experiments over the same data you will get enough insights into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not very easy to spot due to the Warm Fuzzy Feeling (TM) that comes with using cross validation. That is, many times we as practitioners feel that by using cross-validation we buy some magical insurance policy against overfitting.

NLP survival tips for non-NLP Graduate Students

Wed, 25 Nov 2015 07:30:31 -0500

Tags: Natural Language Processing, Computational Linguistics, Graduate School


A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques within graduate research projects outside NLP itself.

Given the nature of the talk, I did not find sharing the slides as I do for my other talks to be that useful. Instead I'm putting that content into this blog post.

Turn your class audio and PDF into YouTube videos using Free Software Tools

Fri, 20 Nov 2015 08:31:15 -0500

Tags: PDF, MEncoder, YouTube


Last year I taught a graduate level, semester length class in Machine Learning over Large Datasets at Facultad de Matématica, Astronomía y Física de la Universidad Nacional de Córdoba, in Argentina.

I have made the slides and audio recordings of the classes available on-line at the class' site (in Spanish, sorry) under a CC-BY-SA license. I have recently dovetailing the audio and PDF into videos which I'm uploading to a playlist in YouTube. In this blog post I want to describe the tools I used to record and create the final videos.

Mining QA Pairs from IRC Logs Using Simple Heuristics and a Chat Disentagler

Mon, 22 Apr 2013 03:56:46 -0400

Tags: IRC, QA


If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" also for development process: the development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.

A number key people in the Tiki world live around Montreal and I have met some of them and been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year so I got to attend for a few days and work on an interesting project: Mining Question/Answer pairs from #tikiwiki history. While the topic of mining QA-pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)

The process involves (see the page on for plenty of details):

  • Downloading the logs
  • Normalizing the IRC logs across different IRC logging clients
  • Identifying users that are likely to have asked questions
  • Identify the threads where said users participated
  • Assembling the final corpus
  • Indexing

To avoid having to annotate training data for the question identification, I'm using the approximation of finding IRC nicks that have only said 2 to 10 things in the whole logging history. The expectation is that the said users appear on #tikiwiki, ask a question receive an answer and left.

For the extraction step, I'm using a publically available implementation of Disentangling chat (2010) by Elsner and Charniak.

If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.

Older Posts