Because The Air Is Free -- Pablo Duboue's Blog


The Value of Outliers

Tue, 03 Oct 2017 02:49:30 -0700

Tags: academic, philosophical


Moving coast-to-coast has taken most of my energy since the last post but we're finally established in Vancouver so I can come back to blogging more regularly. This post is about a piece of advice from the classic book How To Lie With Statistics extended to groups and society in general. It continues ideas from two other blog posts (What Random Forests Tell Us About Democracy and Have Humans Evolved to be Inaccurate Decision Makers?) regarding statistics, decision making and politics.

Let's sample 10 (pseudo)random numbers from a normal distribution centered around 100:

       >>> import random
       >>> [int(random.normalvariate(100, 10)) for _ in range(10)]
       [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]

Now, if you look at these numbers, you'll be tempted to conclude that 112 is just... wrong. A measurement error. That it does not belong there. However, the average of the 10 numbers is 92.3, still far from the real mean (100) but within the sigma we used to generate the sample (10). If we were to drop 112, the average of the remaining numbers would go down to 90.1, making it a worse estimator than before.
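The arithmetic above can be checked with a few lines of Python. This is only a sketch over the ten numbers printed earlier; the variable names are illustrative.

```python
from statistics import mean

# The ten sampled values shown above; 112 is the apparent outlier.
sample = [85, 99, 97, 78, 87, 93, 91, 112, 90, 91]
without_outlier = [x for x in sample if x != 112]

true_mean = 100  # the mean used to generate the sample

full_mean = mean(sample)
trimmed_mean = mean(without_outlier)

# Distance of each estimate from the real mean.
error_full = abs(true_mean - full_mean)
error_trimmed = abs(true_mean - trimmed_mean)

print(round(full_mean, 1), round(trimmed_mean, 1))    # 92.3 90.1
print(round(error_full, 1), round(error_trimmed, 1))  # 7.7 9.9
```

Dropping the value that "looks wrong" moves the estimate further from 100, not closer.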

I believe the same happens in the realm of ideas. If each person has a piece of the truth, shutting down their contributions, irrespective of how far from the truth they might sound, will lead you farther away from the truth. A similar concept in the business world is Groupthink. This of course does not mean that outliers need to dominate, just that they should not be eliminated completely.

And if you haven't read the 1954 book by Darrell Huff, it is very short and makes for a great read. Its starting premise "democracy needs voters informed on basic statistical matters" is as up-to-date in our data-driven world as ever.

More Like This Queries on SQLite3 FTS

Mon, 12 Dec 2016 03:05:48 -0500

Tags: open source


SQLite is a great embedded SQL engine, and part of the Android platform. It has an extension FTS ("Full Text Search") that enables Boolean search queries (that is, mostly unranked). For small collections of documents (like a blog), Boolean searches could be a viable temporary solution until a full solution (like elasticsearch) can be deployed.

A common type of query supported by elasticsearch is the More Like This (MLT) query, which lets you find documents similar to given ones. This type of query is also very useful for blogs, for example. If you're using SQLite FTS, you can construct a query that approximates MLT by issuing an OR over all the terms in a document (in FTS, the terms are lowercase and an uppercase 'OR' or 'AND' is treated as a Boolean operator). The only issue is obtaining the terms, as assigned by FTS. To do so, it is necessary to access the FTS tokenizer by creating a virtual table, for example:

CREATE VIRTUAL TABLE tok1 USING fts3tokenize('simple');

Then, given a document, the terms for it can be extracted by doing (in PHP):

$tokens = $db->query("SELECT token FROM tok1 WHERE input='" . SQLite3::escapeString($all_text) ."';");

The query itself can be assembled by taking an OR of the set of terms:

$query = "";
$query_array = array();  // tokens already added to the query
while($row = $tokens->fetchArray()){
    $token = $row['token'];
    if(! isset($query_array[$token])){
        $query_array[$token] = 1;
        if(strlen($query) > 0) {
            $query = $query . " OR ";
        }
        $query = $query . $token;
    }
}

This is the query that can then be used against the FTS to provide a makeshift MLT functionality.
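The same deduplicate-and-OR step can be sketched in Python without touching the database. Note the assumptions: `simple_tokens` is only a rough, ASCII-only stand-in for the FTS 'simple' tokenizer (in a real setup the tokens would come from the `fts3tokenize` table, as above), and both function names are made up for this illustration.

```python
import re


def simple_tokens(text):
    """Rough stand-in for the FTS 'simple' tokenizer (ASCII only):
    lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def mlt_query(text):
    """Join the distinct tokens with OR, keeping first-seen order."""
    distinct = dict.fromkeys(simple_tokens(text))  # dict preserves insertion order
    return " OR ".join(distinct)


print(mlt_query("The cat sat on the mat"))
# the OR cat OR sat OR on OR mat
```

The resulting string is what would go on the right-hand side of a MATCH against the FTS table, preferably as a bound parameter rather than concatenated into the SQL.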

Some ideas for the A.I. XPRIZE

Wed, 23 Nov 2016 00:31:29 -0500

Tags: research


As I'm married to a current IBM employee, I'm disqualified from participating in the AI XPRIZE sponsored by IBM. So I'm putting my ideas in this blog post, in the hope that they might inspire other people.

The XPRIZE follows the path of other multi-year challenges that have resulted in great accomplishments such as commercial rockets. The AI challenge diverges from previous challenges by being completely open-ended: any major AI breakthrough can win the 3M USD prize.

What I'd like to see is a team tackling improvement in scientific communications by leveraging recent advances in machine reading and taking them to the next level. I would like to see some work on scientific metadata (possibly in the direction of the Semantic Web) that captures the main discoveries in a scientific paper. It should be feasible for a human to produce this metadata; the machine reading aspect is there just to bring enough value to the metadata during the transition to entice humans to self-annotate.

The case for this improvement lies in the number of researchers in many key fields who physically have no time to keep up to date with published results. A high-level summary or the possibility to query "has anybody applied method X to problem Y" would be invaluable. Moreover, this type of setting allows for very constrained inference, simplifying scientific discovery of new findings that are sometimes obvious, sometimes overlooked.

I'm no stranger to this approach. My most cited paper came out of my work as a contributor to a multidisciplinary project on automatic extraction and inference in the genomics domain (some form of automated inference was realized many years after I left the project).

This is further simplified by reporting standards in many scientific disciplines. Take, for example, the one from the American Psychological Association (thanks to Emily Sheepy for pointing me to that report). Such standards specify the type of contributions and the information to be expected in them, even down to the headers of each section.

I believe all these pieces together have a chance of, if not winning, at least doing well in the competition. And irrespective of the competition, this technology deserves to exist and help accelerate human discovery.

Regarding business aspects, it would be nice if the metadata format is an open format and the commercialization centers on extracting metadata from existing publications and authoring tools. Extracting metadata and doing inferences for profit is somewhat contrary to the goals of accelerating research, but that's speaking as a scientist, not a business person.

Let me summarize the concept with an example:

  1. Given an existing paper, for example Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network (Toutanova et al., 2003) produce metadata of this type:


    Part-of-Speech tagging (as a link to an ontology)
    Conditional Markov Model (or CMM with features x,y,z; all linked to an ontology)
    97.24% accuracy over Penn Treebank WSJ (the metric and the corpus are also links)

  2. These entries can be further populated by the scientists upon publication, maybe with the help of an authoring tool.

  3. From this metadata, a system can answer "what is the best performance for POS and what technique does it use?" but also "POS and role labeling are similar problems (fictional fact): both use similar techniques and both rank their performance similarly; however, the best performance in POS is using skip decision lists (also a fictional fact) but that technique has never been attempted on role labeling"
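The three steps above can be sketched in Python. Everything here is an illustrative assumption rather than a proposed format: the `Result` record, its field names and the two query helpers are hypothetical, the plain strings stand in for links to an ontology, and only the first entry comes from the example above. The other two entries are fictional, in the same spirit as the fictional facts in step 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    paper: str
    task: str      # would be a link to an ontology entry
    method: str    # idem
    metric: str    # idem
    corpus: str    # idem
    score: float


RESULTS = [
    Result("Toutanova et al., 2003", "pos-tagging", "conditional-markov-model",
           "accuracy", "penn-treebank-wsj", 97.24),
    # Fictional entries, only here to make the queries interesting.
    Result("Fictional paper A", "pos-tagging", "skip-decision-lists",
           "accuracy", "penn-treebank-wsj", 98.0),
    Result("Fictional paper B", "role-labeling", "conditional-markov-model",
           "f1", "fictional-corpus", 80.0),
]


def best_result(results, task, metric, corpus):
    """Best reported score for a task on a given metric and corpus."""
    candidates = [r for r in results
                  if (r.task, r.metric, r.corpus) == (task, metric, corpus)]
    return max(candidates, key=lambda r: r.score, default=None)


def untried_methods(results, source_task, target_task):
    """Methods reported for source_task but never for target_task."""
    tried = {r.method for r in results if r.task == target_task}
    known = {r.method for r in results if r.task == source_task}
    return sorted(known - tried)


best = best_result(RESULTS, "pos-tagging", "accuracy", "penn-treebank-wsj")
print(best.method, best.score)
print(untried_methods(RESULTS, "pos-tagging", "role-labeling"))
```

The second query is the "has anybody applied method X to problem Y" question turned around: it lists the techniques nobody has reported on the target problem yet.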

I wish the participants the best of luck. I look forward to seeing great technology being developed as a result of the challenge!

FADT analysis of an MMORPG: Anarchy Online

Mon, 24 Oct 2016 05:26:30 -0400

Tags: academic, games


More than a decade ago, back in graduate school, I took a class on Video Games technology & design, taught by David Sturman and Bernard Yee. One of the class assignments was to do a Formal Abstract Design Tools (FADT) analysis of any MMORPG of our choosing. I went for Anarchy Online, which had been released recently (and which is still running, making it one of the oldest surviving games in the genre). I recently found the sources for the write-up and decided to take advantage of the fact that the report was written in LaTeX and I'm blogging using DocBook, so I can transform one into the other easily. I have of course shortened it and cleaned it up for style. I reckon I am a better writer now than I was back in graduate school (which, among many things, is the whole point of attending graduate school).

All in all, writing essays for school work is a little bit sad because, even if the essays are of good quality, only the instructor or the TA gets to read them. If I find any other interesting essays I will post them here, and I encourage people to do the same on their own blogs.

From the write-up, the most interesting part is how I seem to long for something in the style of Second Life, which was being launched at the time I was doing the write-up. And I really liked using the FADT; it might tempt other people to learn more on the topic. (Note: I would expect the original links are all dead by now, but the Internet Archive can be of help.)

Have Humans Evolved to be Inaccurate Decision Makers?

Wed, 19 Oct 2016 06:56:25 -0400

Tags: academic, philosophical


Many years back, when I had just moved to Montreal (a great city to meet some of the most fascinating minds in the world, a topic for another blog post), I met Dr. Costopoulos from the Department of Anthropology at McGill University. He mentioned some work he and his students had been doing simulating animal decision making using ice core data.

The premise is simple, particularly coming from a machine learning, optimization and genetic algorithms background: in an environment punctuated by slow, progressive changes followed by cataclysmic changes in the opposite direction, individuals that track the environment better will overfit (and die). More inaccurate individuals will be the ones surviving long term.

I finally tracked down the paper behind this research: Xue JZ, Costopoulos A, Guichard F (2011) Choosing Fitness-Enhancing Innovations Can Be Detrimental under Fluctuating Environments. PLoS ONE 6(11): e26770. While it explicitly states that its assumptions are too simplistic for human populations, the theoretical ground is sound: the best decision makers among human populations should have been wiped out during the rapid deglaciation periods (think woolly mammoth), some of them as recent as 20,000 years ago. This of course has many interesting implications, some I will discuss next.

Some Reasons to Wait to Create your Own Startup upon Graduation

Tue, 09 Aug 2016 03:45:23 -0400

Tags: business, academic, philosophical


I was asked for my thoughts about joining a startup as a technical co-founder straight out of an undergraduate degree in Computer Science. Even though I appreciate how little it contributes to just say "do not do it", in this case I feel it is more of a "do not do it... yet". The key reasons are that the interpersonal aspects of technical work have not truly been exercised in college, and that you would miss out on the simplified straight-out-of-college hiring process. This is a very opinionated post. If successful, any random reader should disagree with plenty of its content.

First, software development is an inherently communal task, a fact that is usually missed given the technical focus of academic instruction. Even though many courses offer team assignments and projects, that is a far cry from programming in a team, particularly with seasoned developers. There is plenty to be learned from these people that you will miss out on by jumping to create your own company right away. These are lessons that you can then apply when working with a team of your own employees.

In the same vein, a co-founder role will involve management duties at some point. Trying to manage people without having been exposed to any management whatsoever seems quite difficult. You might be able to do it, but management is something where following the example of a previous manager in your life can be very positive.

Regarding technical skills themselves, starting a project from scratch, choosing the full stack yourself and having absolute control of the technical decisions would be the most appealing reason to be the tech co-founder in a startup. No question there. But at the technical level you will become a jack-of-all-trades type of person who does not necessarily have very deep knowledge of any of the technologies involved. Now, most startups fail; what is your plan B? Will the experience you gain doing this help you advance your professional career? From what I have seen, it pays off to specialize deeply in a technology (when that technology is of interest to the market, that is). But hey, I went on to do a PhD so that's what I know.

Now, doing something or not doing it revolves around the opportunity cost. Time is unidimensional. If you spend your time on the startup, you will not spend it doing other things. So what is it you would miss by going for the startup? The excellent hiring process straight out of college. When you are being hired by a company, there is a process (one that many people are trying to improve; I contributed to a now defunct startup in that space a few years ago) that still boils down to keyword matching. They are looking for a person with knowledge of technology "Jabberwocky" and if your previous work experience does not include "Jabberwocky", you are out of luck. It is simpler to hire somebody who knows that technology than to let you train on the job. But the process for people coming straight out of college is different: it is based on target schools and GPAs. Therefore, after one or two years in your startup, you will need to job hunt based on the technology stack you used in the startup. If you tried to inflate it (including "Jabberwocky" when you didn't really need it), you were doing your startup a disservice and that could be partially a reason for its failure. But if you use straightforward, less fancy technology, you might have a hard time marketing your skills.

Now, the main reason not to go is... because you are asking. There are things in life where hesitation is a big negative sign (going to graduate school to pursue a PhD and getting married come to mind). From what I have seen, entrepreneurship is a personality trait. A true entrepreneur would have started a couple of ventures during their undergraduate years because, if any of them truly panned out, there would be no reason to get a degree. If you are not an entrepreneur at heart, but the technical self-determination that comes with being a technical co-founder attracts you, gain more technical skills to maximize the chance of success and earn some money while you wait for the right business co-founder. With a stable job you can judge the feasibility of startup ideas "on a full stomach" (whether that is good or bad is debatable, but for technical co-founders, I do believe it is a good mindset). If you read all this and wholeheartedly disagree with it, good luck in your new venture! Courage goes a long way.

Sharing a Folder Between Debian and Android using Sparkleshare and SGit

Fri, 05 Aug 2016 03:29:53 -0400

Tags: floss, debian


I have been looking for a self-hosted alternative to commercial "cloud" products for a while. I initially started using rsync but it had the problem that you need to remember the directionality of the updates: when a file is deleted in one copy, there is not enough information left to know whether the file is a new file in the other copy or whether the deletion was a purposeful act on behalf of the user. Therefore it is necessary to indicate the directionality of the deletions, which is error prone. And some updates might have happened on both sides.

Looking for alternatives I peered into OwnCloud, even though a PHP implementation was not really my cup of tea. I found the project has rejected being packaged by Debian (a major red flag for me, as I trust the Debian security team to keep old versions securely patched on my personal servers) and that the project has been forked. So no OwnCloud for me.

Given my wishlist of a tight Debian GNU/Linux and Android integration, I looked into a few other alternatives (Syncthing and dvcs-autosync come to mind) but decided to settle on sparkleshare as I thought it had a version on the F-Droid FOSS Android Apps Repository. I have heard from some friends using it that it has its glitches but that overall it was just git underneath so you can use it/repair it as you see fit outside of sparkleshare. This is the case so far and it ended up being a key feature for its interoperability with Android.

I thus set up a remote account for the share, with git installed on the remote server and nothing else (there is no need to install sparkleshare on the server, even though the instructions for using the sparkleshare App seem to indicate so). Now, the existing App only allows you to download files on demand and you cannot modify files. It is thus non-operational as a shared folder. As sparkleshare simply maintains a git repo automatically, I access that repo through SGit, an incredibly resourceful git client for Android.

My workflow is then as follows: on my desktop and on my laptop, I use sparkleshare (one on Debian testing, the other on Debian stable; both interoperate just fine so far). These copies receive plenty of changes and are kept up-to-date just fine. Then on my phone and tablet I use SGit to pull the updates. In the few situations when I need to modify a file (usually a text file with notes on a paper I'm reading or writing), I use a text editor and then I have to go through the slightly clumsy steps of: adding the file to stage, committing the changes (making sure the checkbox "Auto stage modified files", which is checked by default, is unchecked, as that takes forever on a mobile device) and then pushing the changes (which requires clicking on "origin"; the interface is slightly confusing there). This workflow makes sense for a regular user of git, which is my case.

Now for the bad news: the maximum file size that can be unpacked on the Android version I have seems to be capped at 64 MB (or slightly less). This is really a let down, but it seems to be a limitation that lies outside SGit's internals and might be related to the zlib version shipped with a particular Android version. That is not a show stopper but requires a little too much vigilance for my taste (if I drop a file over 64 MB in the sparkleshare folder, it will immediately pollute the git repo with a large blob and the repo will need to be recovered or restarted afresh). On the good news front, if that happens and the repo gets borked, setting up a new one is very simple. And as the repo gets bigger and bigger (it stores all files and their history), resetting it by changing to a fresh one from time to time is necessary anyway.

I have been using this solution for a few months and I am quite happy, but would consider switching to something handling bigger files, limited history and the same binary on the Debian and Android sides. I would expect that will take some time to come around, though.

Programming With Computers, Partnering With Machines To Create Programs

Thu, 30 Jun 2016 12:21:25 -0400

Tags: academic, philosophical


I have been invited to write a book chapter on lexical choice for translators (contact me if you want to see a preprint). To get acquainted with this audience, different from my usual computer science one, I read a few papers on professional translators' use of technology. Two of them are quite interesting and I recommend them not only because they make for a good read but also because they have implications outside translation: Translation Skill-sets in a Machine-translation Age by Anthony Pym (2013) and Is Machine Translation Post-editing Worth the Effort?: A Survey of Research into Post-editing and Effort by Maarit Koponen (2016). I finished this search by reading a short ebook by researchers at the MIT Center for Digital Business titled Race Against the Machine: How the Digital Revolution Is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. In that book plus the papers there's this call for humans, if we want to remain employed, to hybridize our work and to seek out ways to work with the computer in some sort of partnership. That process is clear in human translation: working from previously translated similar sentences or the output of machine translation (instead of creating new translations from scratch).

The question is then what about our trade? What does it mean to work in partnership with the computer rather than for the computer? Like other people, I have argued that machine learning (more specifically supervised learning) is akin to traditional programming (in the old soft computing style). It shares many of the pros and cons of the redefined labor of the human translators.

But that's not all. Other areas of programmer / computer partnership that are less deployed (but nonetheless quite explored in the scientific space) are declarative programming techniques for both program verification and program synthesis using automatic theorem provers. The idea here is that instead of writing test cases you write test case generators and property checkers for the output of your programs over those generated test cases. I have experience with the Haskell library QuickCheck2 and it's quite pleasant to use (Thanks to teaching me how to use that library, gracias che!). There are now similar libraries for other programming languages. How can this be described as a programmer / computer partnership, you might ask? In the end it is just another test framework. The difference is in the type of task the human is doing (enunciating properties) and the computer is doing (the grunt work of checking said properties). Traditional unit testing has much more grunt work on the side of the programmer.
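To make the division of labour concrete, here is a hand-rolled sketch in Python. It is not QuickCheck's API (and it has none of its conveniences, such as shrinking counterexamples); the generator, the property functions and `check` are all hypothetical names chosen for this illustration.

```python
import random
from collections import Counter


def generate_int_lists(rng, cases=200, max_len=20):
    """Test case generator: yields random lists of integers."""
    for _ in range(cases):
        length = rng.randint(0, max_len)
        yield [rng.randint(-1000, 1000) for _ in range(length)]


def prop_sorted_is_ordered_permutation(xs):
    """Property: sorted() returns the same elements, in non-decreasing order."""
    ys = sorted(xs)
    ordered = all(a <= b for a, b in zip(ys, ys[1:]))
    return ordered and Counter(ys) == Counter(xs)


def prop_reverse_is_identity(xs):
    """A deliberately false property: reversing leaves a list unchanged."""
    return list(reversed(xs)) == xs


def check(prop, cases):
    """Return the first counterexample found, or None if the property held."""
    for case in cases:
        if not prop(case):
            return case
    return None


# A fixed seed keeps the run reproducible.
print(check(prop_sorted_is_ordered_permutation, generate_int_lists(random.Random(0))))
print(check(prop_reverse_is_identity, generate_int_lists(random.Random(0))))
```

The human part is the two `prop_` functions; the machine part is generating the cases and hunting for a counterexample. The first call should come back empty-handed, while the second should return some list that is not a palindrome.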

That focus on overall properties rather than the code behind them brings us to the hope of automatic programming using theorem provers. There has been some massive improvement in theorem prover capabilities using general SAT solvers in recent years. Maybe it's time this new technology started finding its way onto the desktops of professional developers.

Now these skills are different from those of regular developers. The same can be said of machine learning. Many great practitioners in machine learning ("data scientists") are average / poor developers but come with backgrounds in engineering or science that make them thrive in an extended programming task, if we consider supervised machine learning as programming. It reminds me of the fact (as brought up by Race Against The Machine) that the best chess players in present times are neither humans nor computers but a thriving partnership of not necessarily the best humans nor the best computers.

Borrowing a page from the experience of human translators, there'll be a time when painstaking 100% human-created programs will be deemed too expensive for all but a few mission-critical situations. And the rest will be created by a redefined computer professional. At this stage this is a mental exercise but, given the example from human translators, definitely an exercise worth engaging in.

Using Recommendation Systems To Bring A Human Touch To Billions of Humans

Tue, 03 May 2016 03:55:31 -0400

Tags: academic, philosophical


Since I was born, the planet's population has increased by 50%. (I even heard that half the humans who ever existed are alive today; that's false, it is more like 6%.) This is all anecdotal, but my recollections from childhood speak of a place with just fewer people, where shopkeepers knew you by name and expected you to buy certain items regularly. They would know what you liked and bring in products catering to their audience. Such an experience is for the most part lost (it might remain in small towns and such).

Interestingly, the human population was rather small for tens of thousands of years. Our human expectations about relating to each other are in line with small communities. This topic is outside my realm, but I had heard about it before, and Google points to an Urban Ecology paper from 1978 that says

the persistent human propensity to identify with small groups is a consequence of our evolution as a mammal

A machine learning / information retrieval technique that has grown immensely in popularity in the last two decades is recommendation systems. When teaching them before (see my class on the topic, in Spanish), I realized they bring that much needed small village feeling to on-line transactions. When you enter Amazon or Netflix, they "know" you and recommend things based on what they know about you. It is paradoxical that we now need computers to bring a much needed human touch to our interactions.

Moreover, some of the techniques used in recommendation systems (such as user-based recommenders) build such small villages as part of their algorithms. In that sense, when Netflix recommends that you watch Anastasia for the fourth time, it actually has enough information to recommend you other people who, you never know, might actually want to form a small village with you.
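A minimal sketch of a user-based recommender shows where the "village" appears. The users and ratings below are made up, the function names are hypothetical, and cosine similarity with k = 2 neighbours is just one common choice, not a description of what Amazon or Netflix actually run.

```python
from math import sqrt

# Made-up users and ratings (1 to 5), for illustration only.
RATINGS = {
    "ana":   {"Anastasia": 5, "Mulan": 4, "Alien": 1},
    "bruno": {"Anastasia": 5, "Mulan": 5, "Shrek": 4},
    "carla": {"Alien": 5, "Blade Runner": 5, "Anastasia": 1},
    "dario": {"Anastasia": 4, "Shrek": 5, "Mulan": 4},
}


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def village(user, ratings, k=2):
    """The k users most similar to `user`: their small village."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    return [name for _, name in sorted(others, reverse=True)[:k]]


def recommend(user, ratings, k=2):
    """Unseen items, ranked by the summed ratings of the village."""
    seen = set(ratings[user])
    scores = {}
    for neighbour in village(user, ratings, k):
        for item, rating in ratings[neighbour].items():
            if item not in seen:
                scores[item] = scores.get(item, 0) + rating
    return sorted(scores, key=lambda item: (-scores[item], item))


print(village("ana", RATINGS))    # ['bruno', 'dario']
print(recommend("ana", RATINGS))  # ['Shrek']
```

The list returned by `village` is exactly the by-product mentioned above: to recommend a film, the algorithm first has to find the people who resemble you.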

What Random Forests Tell Us About Democracy

Thu, 21 Apr 2016 05:15:16 -0400

Tags: academic, political


A popular method for learning from large data sets is Random Forests (see my class on the topic, in Spanish). I would like to draw a parallel between the way they work and our political decision structures and the so-called Wisdom of the crowd.

Random Forests are what is called an ensemble method, as they perform better than individual methods by combining their results. The individual method used in Random Forests is the Decision Tree, each trained on a subset of all the available data (and because of this property of operating on subsets of the data, they are a good method to apply to large datasets).

More interestingly, Random Forests (as discussed in the Machine Learning article by Leo Breiman in 2001), can not only train each of their trees on a subset of the data but also use a subset of the available information (features) when training each decision node in the tree. That makes each of the trees that are part of the ensemble truly random! When creating each individual tree we only see a subset of the data and only a subset of its characteristics. To decide the outcome of the decision, each of these random trees is given a vote. The most voted decision wins.

Now, the "magical" part is that they perform better than a decision tree trained on all available data. Even if the tree were made "smart" by pruning poorly constructed branches (the trees that make up the ensemble are unpruned). And they perform so well that a recent comparative study of 179 different classifiers found them to be consistently top performing across a large set of problems.

Now, if you think about it for a second, this is the way direct democracy works: each voter has access to a subset of the information and only sees that subset from a particular perspective (their own unique perspective). By using a majority vote, we are actually implementing a Random Forest. And from the theory (Breiman's paper is quite delightful) we can see that we don't need more informed voters, just more of them. Food for thought.
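The "just more of them" intuition can be made concrete with a small calculation. The sketch below is neither Breiman's analysis nor a Random Forest: it assumes an idealised setting (a binary decision, an odd number of voters, every voter independently right with the same probability) and computes exactly how often the majority is right. The function name and the value 0.6 are illustrative assumptions.

```python
from math import comb


def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is right."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Mediocre voters (right 60% of the time), in growing numbers.
for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the majority gets more reliable as voters are added, as long as each one is right more often than not. The independence assumption carries most of the weight: voters (or trees) that all make the same mistakes gain nothing from being numerous, which is why each tree is trained on its own random slice of the data and features.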

The Spouse of the PhD Candidate

Tue, 05 Apr 2016 03:53:31 -0400

Tags: personal


Many moons ago, I did my PhD. Those were years of hardship, unknowns, anxiety. But also self-discovery, with plenty of fun and exciting times.

Then I met a wonderful woman, we got married and she decided to go back to school and pursue a PhD of her own. I have to admit the process of accompanying a loved one through graduate school is far, far worse than going through it yourself. Even knowing full well the challenges ahead, it is much worse to have to see her suffer through them without being able to do anything about it.

The feeling of helplessness while watching a person you care for deeply being in distress is like nothing I had experienced before. We were lucky that the decision to go back to graduate school was well thought out and discussed at length even before she applied. But even then there were years when the whole process took a terrible toll on our marriage.

It is thus I want to extend my congratulations to Dr. Ying for successfully defending her PhD thesis entitled "Code Fragment Summarization"; to all other PhD candidate spouses out there, I hear your pain. There is light at the end of the tunnel!
