Because The Air Is Free -- Pablo Duboue's Blog

RSS Feed

Mining QA Pairs from IRC Logs Using Simple Heuristics and a Chat Disentangler

Mon, 22 Apr 2013 03:56:46 -0400

Tags: IRC, QA

permalink

If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" in its development process as well: development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.

A number of key people in the Tiki world live around Montreal; I have met some of them and been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year, so I got to attend for a few days and work on an interesting project: mining question/answer pairs from the #tikiwiki history. While the topic of mining QA pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)

The process involves (see the page on dev.tiki.org for plenty of details):

  • Downloading the logs
  • Normalizing the IRC logs across different IRC logging clients
  • Identifying users that are likely to have asked questions
  • Identifying the threads where said users participated
  • Assembling the final corpus
  • Indexing

To avoid having to annotate training data for question identification, I'm using an approximation: finding IRC nicks that have said only 2 to 10 things in the whole logging history. The expectation is that such users appeared on #tikiwiki, asked a question, received an answer, and left.
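The heuristic above is easy to sketch. Here is a minimal, hypothetical version; the one-line-per-utterance, tab-separated log format and all nick names are invented for illustration, not the actual normalized TikiWiki format:

```python
# Sketch of the nick-frequency heuristic: keep nicks that spoke only
# a handful of times, on the assumption they came, asked, and left.
from collections import Counter

def likely_askers(lines, lo=2, hi=10):
    """Return nicks that spoke between `lo` and `hi` times in the whole history."""
    counts = Counter(line.split("\t")[1] for line in lines if line.count("\t") >= 2)
    return {nick for nick, n in counts.items() if lo <= n <= hi}

# Invented example log: "timestamp<TAB>nick<TAB>message"
log = [
    "2013-04-01T10:00\tnewbie42\thow do I enable the wiki3d plugin?",
    "2013-04-01T10:05\tregular\tcheck the admin panel under plugins",
    "2013-04-01T10:06\tnewbie42\tthanks, that worked!",
] + ["2013-04-01T11:%02d\tregular\tmsg %d" % (i, i) for i in range(20)]

print(likely_askers(log))  # regular spoke 21 times and is filtered out
```

The 2-10 window is a trade-off: a single message is usually a drive-by remark, while very chatty nicks tend to be regulars answering questions rather than asking them.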

For the extraction step, I'm using a publicly available implementation of Disentangling Chat (2010) by Elsner and Charniak.

If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.

Using Apache UIMA Concept Mapper Annotator with Python via JPype

Wed, 28 Mar 2012 17:44:15 -0400

Tags: UIMA, Python

permalink

I have been using a lot of Python lately at work for a customer. Programming in Python has many positives, but when it comes to processing large amounts of text, I still choose Apache UIMA.

In a new project, which we will be deploying in the upcoming weeks, we search job postings using (among other things) a large dictionary of job titles.

This is the task for the Apache UIMA Concept Mapper Annotator, which is part of the Apache UIMA Addons.
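For intuition, ConceptMapper is essentially a longest-match dictionary lookup over a token stream. The sketch below is not the UIMA API, just a toy pure-Python illustration of that matching strategy; the job-title dictionary and token input are made up:

```python
def match_concepts(tokens, dictionary, max_len=5):
    """Greedy longest-match lookup of multi-token dictionary entries,
    roughly the matching strategy ConceptMapper applies to a token stream."""
    matches = []
    i = 0
    while i < len(tokens):
        # try the longest window first, shrink until a dictionary hit
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + span]).lower()
            if candidate in dictionary:
                matches.append((i, i + span, dictionary[candidate]))
                i += span
                break
        else:
            i += 1  # no entry starts at this token
    return matches

titles = {"software engineer": "TITLE",
          "senior software engineer": "TITLE",
          "data analyst": "TITLE"}
toks = "looking for a Senior Software Engineer or Data Analyst".split()
print(match_concepts(toks, titles))  # longest match wins over "software engineer"
```

In the real annotator the dictionary lives in an XML resource and matches become annotations in the CAS, but the longest-match-first behavior is the part that matters for dictionary design.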

Read the rest of this post at the Hack the Job Market blog.

Loop-AES blues on Debian testing

Fri, 22 Apr 2011 23:33:44 -0500

Tags: debian, sysadmin, encryption

permalink

Update: The situation seems to have improved, see this comment.

I run Debian testing (currently codenamed 'wheezy') on both my desktop and my laptop. For backups, I have been using an AES256-encrypted ZFS external USB hard drive, with date-based snapshots and deduplication. I have two hard drives, one of which I keep off-site, doing weekly (or so) backups. I have the same set-up on my laptop, and I have done some backup-recovery drills from it (from the latest backup, not from the snapshots, as zfs-fuse doesn't support the .snapshot folder yet).

This solution had been working quite well until I went to Argentina to teach. There, after I upgraded the Debian testing on my laptop, FUSE stopped working properly in strange ways. When I got back to Canada and tried to back up my desktop, zfs-fuse was broken there too. The problem took quite a bit of work to fix, so I am documenting the problem and its solution here (in a nutshell: loop-AES, in the package loop-aes-utils, diverts /bin/mount to a version that is incompatible with FUSE; the solution is to drop loop-AES and move to dm-crypt).

Building Recursive-Descent Natural Language Generators

Fri, 22 Apr 2011 23:33:44 -0500

Tags: research, nlg, nlp

permalink

I am currently teaching a graduate-level course on Natural Language Generation and as such I have been neglecting this blog a little. Now I have found a topic that scratches both itches: a straightforward way to build NLG systems for small-size projects.

In the same vein that recursive descent parsers are popular for writing quick (quick to write, that is) parsers, simple NLG can be tackled by means of mutually recursive functions that closely mimic the generation grammar.

I have used this approach in the past for the ProGenIE generator (briefly sketched in slides 2-5 of the following PDF file; I should release the code at some point) and it is quite effective. As came up in a discussion with one of the students in the class, the approach can be extended beyond syntactic structures to simple semantic predicates.
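As a concrete illustration of the technique (not ProGenIE itself), here is a tiny generator where each non-terminal of the generation grammar becomes a function, and the functions call each other recursively, just as a recursive-descent parser would, but producing text instead of consuming it. The predicate format is invented for the example:

```python
# Each function realizes one grammar non-terminal: sentence -> NP VP.
def sentence(pred):
    return " ".join([np(pred["agent"]), vp(pred)]) + "."

def np(entity):
    # A noun phrase: determiner choice driven by a semantic feature.
    det = "the " if entity.get("definite") else "a "
    return det + entity["head"]

def vp(pred):
    # A verb phrase recursing back into np() for the object.
    return pred["action"] + " " + np(pred["patient"])

fact = {
    "action": "wrote",
    "agent": {"head": "student", "definite": True},
    "patient": {"head": "generator", "definite": False},
}
print(sentence(fact))  # -> the student wrote a generator.
```

Adding aggregation, morphology, or referring-expression choices then amounts to adding branches inside these functions, which is why the approach scales nicely for small projects.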

The Free Software vs. Open Source debate and Hackerspaces

Tue, 28 Dec 2010 01:31:09 -0500

Tags: floss, hackerspaces

permalink

After some very interesting conversations at Foulab, I decided to put my thoughts on the issue of pragmatists vs idealists in a blog post.

The issue: what is the societal role of the hackerspace? As an association of citizens, the hackerspace is at an arm's length from political activity.

Now, I like to think the political role of hackerspaces is that of "self-determination through making", that is, empowering members of society to move beyond a passive consumer role (be it of products or even of political candidates) to one of active participation.

How does such a role translate into day-to-day activities at a hackerspace? Well, it involves outreach, education, collaboration with other groups with similar goals and, of course, making awesome things and having lots of fun in the process.

It all sounds great, doesn't it? Well, not quite. For many people, hearing the word "politics" is enough to set their teeth on edge. Moreover, hackerspaces attract people from a variety of positions within the political spectrum. (The whole concept of building things and using tools is a primeval instinct for homo sapiens.) Therefore, activism can be regarded as a topic that will divide the hackerspace rather than strengthen it.

Interestingly, I see a very similar distinction in the free software world between the Free Software movement and the Open Source proponents (see this article by the FSF for a canonical treatment of the subject, or the recent talk Free as in what? A debate on open source vs. free software). Both parties agree on practical issues. Both want to have the best possible software, with complete source available. The Open Source people are content with stopping at that. The Free Software people want more. They want society to change, to evolve into a freer society.

At the hackerspace level, we all seem to agree on the "self-determination through making" bit. We all want to hack cool stuff, share knowledge and become better makers. We want to own what we use beyond just having bought it. We want to solve our needs in a unique way that captures our distinctive self. But for many people, it all stops there. I would like to call them the pure-makers. Driven by the desire and satisfaction of creation, they have little or no interest in the larger picture of how the hackerspace might fit into society and whether or not it has a chance of changing society for the better.

Aside from the pure-makers, you then have the maker-activists: people who want to share this newfound self-determination through making with a larger set of people. I like to think of myself as falling in that latter group, although you can never be sure, of course. Maybe being a maker-activist is more something to aspire to than something that can ever be truly achieved.

To me, the parallels between pure-makers / maker-activists and Open Source / Free Software are very striking. It is the eternal strife between pragmatists and idealists.

When it comes to free software, the two camps are large enough to have separate quarters, so to speak. At this stage in the life of the hackerspace movement, it might be too early to try to do that. Moreover, as hackerspaces are all about making, the differences between pragmatists and idealists are not so profound, as we are talking about quite hands-on idealists. In my experience, many of the best makers I have met in my life are so absorbed in their making activities that they have no time to entertain more abstract ideas (such as the impact of their activities on society, as a departure from the accepted role of silent consumer). I do not think a maker-activists-only hackerspace that misses out on these people would be a good idea; quite the opposite, they clearly enrich the place.

Another problem with a maker-activists-only hackerspace is that it might get polluted by non-maker activists: people who just talk but have no interest in making things. That would be highly detrimental to a hackerspace, yet difficult to weed out in a more activism-oriented hackerspace.

To summarize my views on the topic: in hybrid hackerspaces such as Foulab, the diversity brought by having pure-makers and maker-activists should be welcomed and embraced. Why? Because it helps strengthen the making aspect of the organization. However, both sides should be clear about the truce-like nature of the situation. The maker-activists should not try to hijack the name of the organization for their activism, but they are welcome to use the place as they see fit. The pure-makers have to show a level of respect for the activist agenda of their fellow hackerspace colleagues, but they by no means have to espouse such ideals. In a nutshell: respect, communication, and non-interference. Being an optimistic person, I hope this can be achieved at least until the movement grows to the point of having separate spaces, even if I'm currently unsure whether that would actually be a good idea.

Montreal DebConf12 bid

Thu, 04 Nov 2010 11:22:56 -0400

Tags: debian, floss, montreal

permalink

Last DebConf, Antoine Beaupré (IRC: anarcat) and others put together a bid for DebConf12 in Montreal, an effort I am (of course!) supporting.

On the last Tuesday of October, Antoine and I had a meeting at Foulab. The invitation went out to a larger list of people in the province and nearby cities, but for this first meeting only he and I could make it. Antoine took some notes on a gobby server; I am posting my notes here.

Possible Venue Locations

The place where DebConf happens is a key element in winning the bid. Things to take into account are costs (can it be fully sponsored?), the proximity of sleeping quarters to talk rooms / hacklabs, and on-site food halls (which should accommodate all the attendees at once). These are wishes rather than requirements, but they can help choose among options.

We discussed:

In general, we want to make sure the bid stresses that, for comparable cities, food and lodging in Montreal are cheap.

Network

Good network speed is key for even thinking about organizing a DebConf. Imagine multiple talks streaming video simultaneously, dozens of developers git-cloning remote repositories, etc. Sadly, my short experience so far is that Internet access in Québec is really subpar (something everybody seems to be aware of). In short, this is a key issue to focus on while bidding.

So far, we discussed aligning ourselves with non-profits that are pushing for better and more widespread Internet access.

Organizations

We discussed some umbrella organizations that could help us with logistics and volunteers. For now, this is more a list of organizations to contact.

  • Foulab (through me)

  • Koumbit (through Antoine)

  • Communautique (might help us reach out to potential volunteers)

  • Facil (might want to help us out with fundraising)

Finally, these organizations might want to help spread the word.

Government grants

Having worked on fundraising for DebConf10, government fundraising remains a little bit of an unresolved issue for me. Both DebConf9 and DebConf11 rely heavily on government funding, up to 50% (or more!) of the total costs. As I have said elsewhere, by the time I started working on DebConf10 fundraising, it was too late to go for government sources.

Antoine has some experience dealing with the Québec government for funding, but as it usually goes, it wasn't easy and it took time for the funds to appear.

For searching for grants, we might want to reach out to a local start-up specializing in that.

When requesting Québec funding, I believe we can emphasize the support for the French language and the Francophone community within Debian (we can ask Bubulle for some pointers about it).

Overall, after talking with Antoine, I feel government funding is possible, but it won't arrive in time for the bid. Moreover, there is an intrinsic tension between the insularity of a DebConf and the goals of a local government. Also, I wonder whether by accepting Québec funds we would be forced to run a bilingual conference. We should definitely check that.

Important issues at bid-decision time

  • Internet (discussed above)

  • Venue (discussed above)

  • Accessibility: DebConf prides itself on taking care of its attendees, including of course the ones with non-standard needs. This issue is a little bit venue-related but also has to do with the city in general (where Montreal seems OK, save for the underground city) and the activities organized during DebConf.

  • Getting to the venue (flying to the city / from the airport): Montreal is decently connected to the world, particularly through Toronto and New York, so getting to Montreal is not so much of an issue (beyond costs, of course). But getting from the airport into the city will require plenty of hand-holding, as it is quite a bit of an ordeal at the moment.

  • Local language: we should stress that Montreal is perfectly accommodating of English-speaking tourists. I had heard that before, but after living here for a few months, I can attest it is true.

  • Visas: it is not going to be easy, so we had better look for smart ways to deal with it. For DebConf10, we were blessed with a lawyer who generously volunteered his time to help our attendees with letters, support and insights (Franklin Bynum, thanks!). Finding somebody local with such skills will strengthen our bid considerably.

Overall, the meeting was very good. I have been talking with people I met here and there and have found a few Debian supporters who expressed interest in hearing more about this bid. Finding Debian supporters in Montreal, by the way, is not easy, as Montreal has a very strong Ubuntu (and Canonical) presence. Antoine suggested using that fact to spice up the bid as a DebConf for "building bridges between Debian and Ubuntu". I have no clear opinion, but it is definitely an idea worth reflecting upon.

The next stage will be to organize a larger meeting with these volunteers plus others. Whether we will win this bid for 2012 is, as always, very uncertain (as it depends on the quality of the competing bids) but the goal is to assemble a team interested in making DebConf-Montreal a reality and then keep improving the bid until it is good enough.

Started tinkering with speech technologies

Wed, 03 Nov 2010 06:51:07 -0400

Tags: scientific, hackerspace, floss

permalink

Last week has been pretty intense, putting a somewhat academic knowledge of speech technologies into action. My previous real-world experience was maintaining and integrating a concept-to-speech system for a complex medical report system back in graduate school. That of course entailed listening to hours and hours of computer-generated speech. Joy of joys *grin*. The good part is, once you get used to it, learning emacspeak was not so difficult (but that is a subject for another post). As such, even though I have been quite literally surrounded by speech scientists in the last ten years (I had two on my thesis committee), I had never taken the time to get properly acquainted with the two main tasks of speech technologies: training an ASR system and building a speech synthesis voice.

Interestingly, there is an art installation project I have been toying with for a while. Its provisional name so far is "The Talk of The Town" (TToTT for short) and it involves GNU Radio, some very light speech recognition and word concatenation. The idea is to switch on a radio and listen to small segments from different radio stations as if you were browsing the airwaves. The key part is that the small audio segments from each radio station are actually complete, discernible words, and the different words coming from the different stations make up a syntactically correct message. The semantics of the message are meant to be meaningless in general, but some emergent message may appear due to the similarity of topics on the airwaves (thus the name, "The Talk of The Town"). This project is in its early stages and, from my experience in the past week, it won't be finished anytime soon. I'm keeping the relevant material at http://www.duboue.net/ttott/ (there is nothing there beyond the above description at the moment).
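To make the word-assembly idea concrete, here is a deliberately tiny sketch of that step. The station vocabularies and the template are invented for illustration; the real installation would pull recognized words from live GNU Radio captures rather than from fixed lists:

```python
import itertools

# Toy version of TToTT: each "station" contributes one recognized word
# at a time, and a simple template keeps the result syntactically well formed.
stations = {
    "news":    ["mayor", "storm", "budget"],
    "sports":  ["wins", "loses", "scores"],
    "culture": ["quietly", "loudly", "tonight"],
}

def talk_of_the_town(template, pools):
    """Fill a template (a sequence of station names) with the next
    word heard from each named station, cycling through its pool."""
    cursors = {name: itertools.cycle(words) for name, words in pools.items()}
    return " ".join(next(cursors[station]) for station in template)

print(talk_of_the_town(["news", "sports", "culture"], stations))
```

The syntax comes from the template, while the semantics stay accidental, which is exactly the emergent-message effect described above.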

Even though the idea of building TToTT has been circling in my thoughts for a while, what triggered me to start working on it was Foulab, the upcoming mini Maker Faire Ottawa, and my desire to have something to share at the show.

So I decided to do a proof of concept using CMU Sphinx and a bunch of Ottawa podcasts (to make it local). Getting CMU Sphinx to work was an insane amount of work. I finally went with CMU Sphinx 4, which is written in a language I am very familiar with (Java) and has the best engineering so far. Still, the models available through the Sphinx project are really lacking (5,000 words) or incomplete (as far as I could tell). Having dealt with the realities of research corpora distribution myself, I think it is really great that the people behind Sphinx put in all this effort (and I am hoping I can contribute something to the project at some point), but without the source speech, it is very difficult to troubleshoot the ASR. And I don't have $25k for a membership to the Linguistic Data Consortium, where their training data originates. Initially, I went around in circles using the default WSJ model, thinking I had something wrong in my setup, until I found this page on their site:

A simple Sphinx-4 application that transcribes a continuous audio file that has multiple utterances and creates a lattice using the results. The default file, called "10001-90210-01803.wav", contains three utterances, separated by silences. Please note that the recognition accuracy will be very poor due to the choice of wav file.

(...)

I heard: what's there owns their o. zero one

(...)

I heard: dynamics to one oh

(...)

I heard: zero one eight searle three

So you feed numbers in and get this output back (a word error rate, WER, of 50% or more). Clearly the model is really lacking, but without this official confirmation it is very easy to assume otherwise.

But hope is not lost. ASR technology is very important, particularly for people with disabilities, and the good folks at VoxForge have assembled a really nice 40,000-word ASR model with LDA features (this is among the things I learned this week; it has been a long week!). Using it made things a little better, but I was still getting a WER in excess of 70% on the Ottawa podcasts. Even the word segmentation was not correct.
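For reference, WER is just word-level edit distance (substitutions, insertions, deletions) divided by the length of the reference transcript. A self-contained sketch follows; the reference digit string is my guess at the transcript behind the quoted Sphinx output, so treat the numbers as illustrative only:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance, but over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Guessed reference for the first quoted "I heard" line above
print(wer("one zero zero zero one", "what's there owns their o. zero one"))
```

Note that WER can exceed 100% when the recognizer inserts more words than the reference contains, which is consistent with the very poor outputs quoted above.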

Finally, I turned to LibriVox, where volunteers read Project Gutenberg e-books and put the resulting recordings into the public domain. As I was saying, Sphinx4 is very nicely engineered, so it can easily do alignment instead of recognition. My initial test (Chapter I of The Golden Snare by Oliver Curwood) produced astonishingly good results, although it required some manual trimming of the input text and waves. I am now planning on streamlining this manual process in http://www.duboue.net/ttott/librivox/ because this aligned corpus can be fantastic for a variety of uses. For instance, it can provide high-quality speech voices and improved ASR for the Free and Open Source world (which is in dire need of nice, user-friendly, well-trained tools). Which brings me to the last item this week: speech synthesis.

Happy with the alignments I am getting from CMU Sphinx4 and LibriVox, I started looking into Festival for a better blending of the word samples. The FestVox project deals with adding new voices to Festival, and it includes a 200+ page work-in-progress book detailing the process. So I have been digesting it chapter by chapter and deciding where to go from here.

In the meantime, for this weekend's demo at Foulab, we decided to go with coherent speech and settled on voicing the Canadian Copyright Act, as it fits the tone of a Maker Faire in Ottawa. My personal target is to voice it with five different voices from LibriVox, so time is tight. Last but not least, if you have a nice voice, time to give back to humanity and reasonable recording accommodations, consider contributing to VoxForge and LibriVox.

The need for shared intelligence among local chairs in academic conferences

Fri, 22 Oct 2010 02:45:26 -0400

Tags: academic, conferences, debian

permalink

I just got back from visiting a few friends in NYC. We were discussing how tricky it is to deal with local arrangements, from my passing experience helping with DebConf10. One of my friends had recently been through a similar ordeal, but at a much more engaged level (a smaller team, and she was the chair of the local arrangements team). Interestingly, she had plenty of trouble with her caterers at the beginning, and after complaining profusely, the problem got resolved. I was not particularly thrilled about some of our eating arrangements for DebConf10 (no innuendo here, the cafeteria food was atrocious), but given our uncertain status with the host university it was not in our best interest to complain about it. But some sort of picture starts to emerge... it seems that, from a business perspective, caterers realize that less is more, particularly when talking about food and profits. I mean, it is an outstanding business model: you charge people and provide no food. What can go wrong? Well, of course you will get no repeat customers, but (and here is the key part) conferences change locations and teams, and the caterers know it.

That is not to say all caterers are like that, but the ones who play by the rules and provide what was agreed upon before the conference are, in a sense, penalized. They spend more money than the competitors who will "forget" to bring food for a coffee break (another horror story, from a different person and conference), or bring 10 muffins for 25+ participants, or serve an "all you can eat buffet" for dinner with only pizza and ice cream (I lived through that, although I do not expect it helped my life expectancy). The extra money-for-nothing can then be used to buy more machines, improve advertising, or just go to Maui. The point is that conference organizers (particularly of the academic style; I would put DebConf somewhat in that category, at least for the sake of this discussion) have no way to protect themselves without some information sharing.

Why is this more of a problem for academic-style conferences? Because within larger companies, things like meetings, retreats, conventions, etc. are organized by people who organize these things for a living. These people share lists of preferred vendors across the organization for the different services they contract. And the vendors know that. If a vendor tries to pad its profits at the attendees' expense, it will get blacklisted and miss a lot of business. And the vendors know that, too.

To improve on the current state of affairs and to make the life of local arrangements people nicer (and to gain more weight eating delicious foodstuffs at conferences), we can try to build yet another review site, but with dynamics that make sense for academics.

General purpose review sites are a tricky business these days, so a resource like this is better kept semi-confidential, in a members-only manner. Where members, of course, are current and past local arrangements team members. The site can be very straightforward, just using mediawiki and restricting the content to registered users. The content can be bootstrapped by kindly contacting past local arrangement chairs through friends and colleagues in a few seed academic disciplines. Once the site has a certain momentum, new members will join in seeking information and they will (hopefully) share back their feedback.

If no such site exists, I am willing to put it together if there is interest by other people, especially in rounding up the first batch of contributors. Contact me if you want to jump in.

Finally, a blog

Sun, 17 Oct 2010 23:45:11 -0400

Tags: personal, technical, meta

permalink

My site is still a work in progress and I am very likely not going to be using a blogging engine as such, but something built on top of DocBook Website. Why DocBook? First, it allows for writing the pages in slightly more semantic markup à la LaTeX, without having to write them in, for example, LaTeX (grin). But it also automatically checks internal links for correctness, generates different output formats (text-only, PDF), and opens the possibility of extending the Website generation through XSL. The biggest selling point for me, nevertheless, is that it makes reasonable multilingual sites possible, using something like DocBook.sml. Given that my mother tongue is Spanish and that I now live in a French-speaking province, this site might need to grow into two more languages over time. We can only hope.
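For readers who have not seen it, a DocBook Website page source looks roughly like the fragment below. This is from memory, so check the element names against the actual Website DTD before relying on it:

```xml
<!-- A minimal DocBook Website page: semantic markup plus metadata
     that the XSL toolchain uses for navigation and link checking. -->
<webpage id="finally-a-blog">
  <head>
    <title>Finally, a blog</title>
  </head>
  <para>My site is built on top of DocBook Website.</para>
  <para>Internal links such as
    <ulink url="index.html">home</ulink> are checked at build time.</para>
</webpage>
```

The `id` attributes are what make build-time internal-link validation possible, which is one of the advantages mentioned above.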

But DocBook Website in its current incarnation is definitely lacking for blogs. I will need to fiddle with the XSL a little to get it into something more akin to a blogging engine. (I just tried searching for "docbook blog engine" and found only a bunch of blog postings about DocBook. Sometimes I really miss Semantic Search. But then again, how would one get all those nifty annotations? Clearly a topic for a future blog post.)