Tags: academic, philosophical
Since I was born, the planet's population has increased by 50%. (I even heard that half of all humans who ever existed are alive today; that's false, the real figure is more like 6%.) This is all anecdotal, but my recollections from childhood speak of a place with simply fewer people, where shopkeepers knew you by name and expected you to buy certain items regularly. They knew what you liked and stocked products catering to their audience. That experience is for the most part lost (it might survive in small towns and the like).
Interestingly, the human population remained rather small for tens of thousands of years, and our expectations about relating to each other are in line with small communities. This topic is outside my realm, but I have heard about it before; Google points to an Urban Ecology paper from 1978 that says
the persistent human propensity to identify with small groups is a consequence of our evolution as a mammal
Recommendation systems are a machine learning / information retrieval technique that has grown immensely in popularity in the last two decades. When teaching them (see my class on the topic, in Spanish), I realized they bring that much-needed small-village feeling to on-line transactions. When you enter Amazon or Netflix, they "know" you and recommend things based on what they know about you. It is paradoxical that we now need computers to bring a human touch back to our interactions.
Moreover, some of the techniques used in recommendation systems (such as user-based recommenders) build such small villages as part of their algorithms. In that sense, when Netflix recommends that you watch Anastasia for the fourth time, it actually has enough information to recommend other people who, you never know, might actually want to form a small village with you.
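For the curious, a user-based recommender can be sketched in a few lines of plain Python. The ratings data, the similarity measure (cosine), and the neighbourhood size below are all illustrative choices on my part, not what Netflix or Amazon actually use; the "small village" is simply the set of most similar users.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two {item: rating} dicts."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def recommend(ratings, user, k=2):
    """Suggest items the k most similar users rated but `user` hasn't."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    village = [o for _, o in sorted(others, reverse=True)[:k]]
    scores = {}
    for o in village:
        for item, r in ratings[o].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + r
    return sorted(scores, key=scores.get, reverse=True)
```

Given a toy table of ratings, `recommend(ratings, "ana", k=1)` finds ana's nearest neighbour and returns the items that neighbour rated but ana has not seen yet.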
Tags: academic, political
A popular method for learning from large data sets is Random Forests (see my class on the topic, in Spanish). I would like to draw a parallel between the way they work, our political decision structures, and the so-called wisdom of the crowd.
Random Forests are what is called an ensemble method: they perform better than individual methods by combining their results. The individual method used in Random Forests is the Decision Tree, trained on a subset of all the available data (and because of this property of operating on subsets, they are a good method to apply to large datasets).
More interestingly, Random Forests (as discussed in Leo Breiman's 2001 article in Machine Learning) not only train each of their trees on a subset of the data but also use a subset of the available information (features) when training each decision node in the tree. That makes each of the trees in the ensemble truly random: when creating an individual tree, we only see a subset of the data and only a subset of its characteristics. To decide the outcome, each of these random trees is given a vote, and the decision with the most votes wins.
Now, the "magical" part is that they perform better than a decision tree trained on all the available data, even if that tree is made "smart" by pruning poorly constructed branches (the trees that make up the ensemble are unpruned). And they perform so well that a recent comparative study of 179 different classifiers found them to be consistently top performers across a large set of problems.
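To make the two sources of randomness concrete, here is a toy sketch in Python. It is emphatically not Breiman's full algorithm: each "tree" is reduced to a single decision stump, and the tree count and feature-subset size are made up for illustration. What it does preserve is the recipe: bootstrap the rows, restrict each learner to a random subset of the features, and decide by majority vote.

```python
import random
from collections import Counter

def train_stump(X, y, feature_pool):
    """Pick the threshold on one randomly chosen feature that best splits y."""
    f = random.choice(feature_pool)
    best = None
    for t in sorted({row[f] for row in X}):
        left = [y[i] for i, row in enumerate(X) if row[f] <= t]
        right = [y[i] for i, row in enumerate(X) if row[f] > t]
        if not left or not right:
            continue
        # score a split by how many points its two majority labels cover
        lm, rm = Counter(left).most_common(1)[0], Counter(right).most_common(1)[0]
        if best is None or lm[1] + rm[1] > best[0]:
            best = (lm[1] + rm[1], f, t, lm[0], rm[0])
    if best is None:  # degenerate sample: fall back to the majority class
        maj = Counter(y).most_common(1)[0][0]
        return lambda row: maj
    _, f, t, lpred, rpred = best
    return lambda row: lpred if row[f] <= t else rpred

def random_forest(X, y, n_trees=25, n_features=1):
    forest, n = [], len(X)
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]        # bootstrap the rows
        feats = random.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feats))
    def predict(row):  # majority vote over all stumps
        return Counter(tree(row) for tree in forest).most_common(1)[0][0]
    return predict
```

Each stump alone is a weak, partially blind learner; the vote of many of them is what recovers the accuracy, which is the whole point of the analogy that follows.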
Now, if you think about it for a second, this is the way direct democracy works: each voter has access to a subset of the information and sees that subset from a particular perspective (their own unique one). By using a majority vote, we are effectively implementing a Random Forest. And from the theory (Breiman's paper is quite delightful) we can see that we don't need better-informed voters, just more of them. Food for thought.
Many moons ago, I did my PhD. Those were years of hardship, unknowns, and anxiety, but also of self-discovery, with plenty of fun and exciting times.
Then I met a wonderful woman; we got married, and she decided to go back to school and pursue a PhD of her own. I have to admit that accompanying a loved one through graduate school is far, far worse than going through it yourself. Even knowing full well the challenges ahead, it is much worse to watch her suffer through them without being able to do anything about it.
The feeling of helplessness while watching a person you care for deeply in distress is unlike anything I had experienced before. We were lucky that the decision to go back to graduate school was well thought out and discussed at length even before she applied. But even then, there were years when the whole process took a terrible toll on our marriage.
It is thus that I want to extend my congratulations to Dr. Ying for successfully defending her PhD thesis, entitled "Code Fragment Summarization". To all other PhD candidate spouses out there: I hear your pain. There is light at the end of the tunnel!
Tags: academic, research
In recent years, teaching has been a great source of satisfaction for me. A common situation when teaching is for a student to want to turn a class project into a full-fledged research publication. I have decided to put my ideas on the topic in this blog post to help students in that situation. This is also relevant to me personally, as this year marks the 20th anniversary of my first international publication, which started as a class project.
First and foremost, for the students reading this: congratulations on successfully completing an outstanding class project! Even if the path to publication is arduous and can be discouraging, the fact that you are entertaining the idea of publishing your work to a larger audience is a success in itself. Now, there are as many reasons to publish as there are class projects. Some are better than others; let's look at them in turn.
Tags: floss, python, programming
One of the key tenets of open source development is to scratch your own itch, that is, to build something of use and value to its author (compare that with commercial development, building something for a customer, or research, finding new solutions to challenging problems). However, for a project to survive and to attract that elusive second contributor, it is important to make it useful to others. Which brings us to the problem of having some sort of user interface. Being a back-end person, I usually struggle to create them. Some of my projects (for example, see http://poissonmagique.net) remain in a state of "working back-end, good enough for me". This is of course useful enough for me but a far cry from useful to anybody else. I have recently found some relatively new technology and a workflow that might let me get out of this local equilibrium: widgets for the Jupyter Notebook.
Tags: floss, speech, java, art
When I started fiddling with speech technologies back in 2010, I was interested in doing real-time part-of-speech tagging of radio broadcasts. That exceeded my technical knowledge of the subject, but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such a system is almost straight out of the Sphinx4 examples, but it took a bit of time to configure and to find the right models to make it work. I used the aligner to make a funky installation in the Foulab booth at the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system, and this month I released it as open source. This blog post describes a little more about it, how I used it, and how other people can use it for their own projects.
Tags: Open Source, Challenge, Git, GitHub
Since 2012 I have been participating in an inspiring open source challenge called "24 Pull Requests". This post is an extended version of a lightning talk I gave at the Observe, Hack, Make hacker camp in The Netherlands in 2013.
Tags: Machine Learning
A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evaluate the results, the researchers employed cross-validation, a very popular technique in machine learning where the training set is split into n parts (called folds). The machine learning system is then trained and evaluated n times, each time trained on all the training data minus one fold and evaluated on the held-out fold. In that way it is possible to obtain evaluation results over an evaluation set the same size as the training set, without committing the "mortal sin" of evaluating on the training data. The technique is very useful and widely employed. However, it doesn't stop you from overfitting at the methodological level: if you repeat multiple experiments over the same data, you will gain enough insights into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not very easy to spot, due to the Warm Fuzzy Feeling (TM) that comes with using cross-validation: many times we as practitioners feel that by using cross-validation we buy some magical insurance policy against overfitting.
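For concreteness, the folding procedure described above can be sketched in plain Python. The majority-class "classifier" here is a stand-in of my own invention for whatever learner you are actually evaluating; only the train/held-out split pattern is the point.

```python
from collections import Counter

def k_fold(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n folds."""
    for f in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, test

def cross_validate(X, y, n_folds=5):
    """Average accuracy of a majority-class baseline over all folds."""
    accs = []
    for train, test in k_fold(len(X), n_folds):
        # "train" the stand-in model: remember the most common label
        majority = Counter(y[i] for i in train).most_common(1)[0][0]
        hits = sum(1 for i in test if y[i] == majority)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)
```

Note that every sample ends up in a test fold exactly once, which is what produces an evaluation set as large as the training set; nothing in this mechanism, however, protects you from re-running the loop until the data has leaked into your design decisions.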
Tags: Natural Language Processing, Computational Linguistics, Graduate School
A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well-attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques into graduate research projects outside NLP itself.
Given the nature of the talk, I did not find it useful to share the slides as I do for my other talks. Instead, I am putting the content into this blog post.
Tags: PDF, MEncoder, YouTube
I have made the slides and audio recordings of the classes available on-line at the class site (in Spanish, sorry) under a CC-BY-SA license. I have recently been dovetailing the audio and PDFs into videos, which I am uploading to a YouTube playlist. In this blog post I want to describe the tools I used to record and create the final videos.
Tags: IRC, QA
If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" in its development process as well: development happens in the SourceForge SVN, and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.
A number of key people in the Tiki world live around Montreal; I have met some of them and been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was held in Montreal/Ottawa this year, so I got to attend for a few days and work on an interesting project: mining question/answer pairs from the #tikiwiki IRC history. While the topic of mining QA pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)
The process involves (see the page on dev.tiki.org for plenty of details):
To avoid having to annotate training data for question identification, I am using the approximation of finding IRC nicks that have said only 2 to 10 things in the whole logging history. The expectation is that such users appear on #tikiwiki, ask a question, receive an answer, and leave.
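The drive-by-asker heuristic can be sketched as follows. The log format (a flat list of nick/message pairs) and the question test (the line ends with "?") are simplifying assumptions on my part, not what the actual pipeline does:

```python
from collections import Counter

def mine_qa_pairs(log, lo=2, hi=10):
    """Pair questions from low-volume nicks with the next reply by someone else."""
    counts = Counter(nick for nick, _ in log)
    # nicks that spoke only a handful of times are likely drive-by askers
    askers = {n for n, c in counts.items() if lo <= c <= hi}
    pairs = []
    for i, (nick, msg) in enumerate(log):
        if nick in askers and msg.rstrip().endswith("?"):
            # take the next line by a different nick as a candidate answer
            for other, reply in log[i + 1:]:
                if other != nick:
                    pairs.append((msg, reply))
                    break
    return pairs
```

The candidate pairs this produces are of course noisy; the point of the heuristic is to get a usable training signal without any manual annotation.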
If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.