Mining QA Pairs from IRC Logs Using Simple Heuristics and a Chat Disentangler

Mon, 22 Apr 2013 03:56:46 -0400

Tags: IRC, QA

If you haven't heard of TikiWiki, it is a Wiki/CMS that follows the "Wiki way" in its development process as well: development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.

A number of key people in the Tiki world live around Montreal; I have met some of them and have been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year, so I got to attend for a few days and work on an interesting project: mining question/answer pairs from the #tikiwiki history. While the topic of mining QA pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)

The process involves (see the page on dev.tiki.org for plenty of details):

Downloading the logs
Normalizing the IRC logs across different IRC logging clients
Identifying users that are likely to have asked questions
Identifying the threads where said users participated
Assembling the final corpus
Indexing
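
To give a feel for the normalization step, here is a minimal Python sketch (not the code used for the project). It assumes two hypothetical line layouts, `12:34 <nick> text` and `[12:34:56] <nick> text`; the actual formats produced by the logging clients may differ, and the `Message` type and `normalize_line` function are names made up for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Message(NamedTuple):
    time: str  # "HH:MM"
    nick: str  # lowercased nick, without any @ or + status prefix
    text: str


# Two ASSUMED line layouts; real logging clients may use others.
_PATTERNS = [
    # e.g. "12:34 <@alice> hello"
    re.compile(
        r"^(?P<h>\d{2}):(?P<m>\d{2})\s+<[@+ ]?(?P<nick>[^>\s]+)>\s+(?P<text>.*)$"
    ),
    # e.g. "[12:34:56] <bob> hi"
    re.compile(
        r"^\[(?P<h>\d{2}):(?P<m>\d{2}):\d{2}\]\s+<[@+ ]?(?P<nick>[^>\s]+)>\s+(?P<text>.*)$"
    ),
]


def normalize_line(line: str) -> Optional[Message]:
    """Map one raw log line to a common Message, or None if it is not chat."""
    line = line.rstrip("\n")
    for pattern in _PATTERNS:
        match = pattern.match(line)
        if match:
            time = f"{match['h']}:{match['m']}"
            return Message(time, match["nick"].lower(), match["text"])
    # Joins, parts, topic changes and unknown layouts are skipped.
    return None
```

Lines that match neither layout (joins, parts, and so on) come back as `None`, so the later steps only ever see actual utterances.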

To avoid having to annotate training data for question identification, I'm using the approximation of finding IRC nicks that have said only 2 to 10 things in the whole logging history. The expectation is that these users appeared on #tikiwiki, asked a question, received an answer and left.
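
The heuristic boils down to counting utterances per nick. A minimal sketch, assuming the logs have already been reduced to `(nick, text)` pairs (the `likely_askers` function and its parameter names are hypothetical, chosen for this example):

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def likely_askers(
    messages: Iterable[Tuple[str, str]], low: int = 2, high: int = 10
) -> Set[str]:
    """Return nicks that spoke between `low` and `high` times (inclusive)."""
    counts = Counter(nick for nick, _text in messages)
    return {nick for nick, n in counts.items() if low <= n <= high}


# Tiny made-up history: a regular, a drive-by asker and a one-liner.
history = (
    [("regular", "answer %d" % i) for i in range(11)]
    + [
        ("newbie", "how do I enable blogs?"),
        ("newbie", "found it"),
        ("newbie", "thanks!"),
    ]
    + [("oneoff", "hi")]
)
print(likely_askers(history))  # only "newbie" falls in the 2..10 range
```

The lower bound of 2 drops nicks that said a single thing, and the upper bound of 10 drops the channel regulars.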

For the extraction step, I'm using a publicly available implementation of Disentangling chat (2010) by Elsner and Charniak.
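
Once every utterance has a thread label, keeping only the threads where a candidate asker took part is a simple filter. The sketch below assumes the disentangler output has been converted to `(thread_id, nick, text)` triples; that layout is an assumption made for this example and not the tool's actual output format, and `threads_with_askers` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# ASSUMED layout: (thread_id, nick, text), in chronological order.
Utterance = Tuple[int, str, str]


def threads_with_askers(
    utterances: Iterable[Utterance], askers: Set[str]
) -> Dict[int, List[Tuple[str, str]]]:
    """Group utterances by thread and keep threads involving an asker."""
    threads: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for thread_id, nick, text in utterances:
        threads[thread_id].append((nick, text))
    return {
        thread_id: msgs
        for thread_id, msgs in threads.items()
        if any(nick in askers for nick, _text in msgs)
    }


# Made-up example: thread 1 involves the asker, thread 2 does not.
labelled = [
    (1, "newbie", "how do I enable blogs?"),
    (2, "regular", "anyone going to TikiFest?"),
    (1, "helper", "check the features admin page"),
    (2, "helper", "yes, see you there"),
    (1, "newbie", "thanks!"),
]
qa_threads = threads_with_askers(labelled, {"newbie"})
```

Each surviving thread keeps its utterances in order, so the question and the replies stay together.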

If this approach works, I can imagine packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.

Copyright © 2010-2024 Pablo Duboue.