Pablo A. Duboue
Kathleen R. McKeown
Vasileios Hatzivassiloglou
Columbia University,
Dept. of Computer Science
New York, NY, 10025, USA
{pablo,kathy,vh}@cs.columbia.edu
http://www.cs.columbia.edu/nlp/
Our goal is to provide intelligence and law enforcement personnel with the means to quickly and concisely communicate information about military and political personnel from foreign countries, as well as terrorists and criminals. Working on different scenarios, different users of the system will require different presentations of the available data about a given individual. For example, one analyst might want an overview of all data for a particular person, while another may be looking for ties between a well-known terrorist and a particular country. We intend to fulfill these requirements via on-the-fly generation of such person descriptions.
This paper is organized as follows: we first briefly describe the motivation for and relevance of our system. In Section 3, we describe PROGENIE's three major components. Some final remarks conclude the paper.
Person description has been addressed in the past with IR, summarization, and NLG techniques. IR-based systems [4] look for existing biographies in a large textual database such as the Internet. Summarization techniques [5] produce a new biography by integrating pieces of text from various textual sources. Natural language generation systems for biography generation [6] create text from structured information sources. Ours is a novel approach that builds on the NLG tradition: we combine a generator with an agent-based infrastructure, ultimately aiming to mix textual sources (such as existing biographies and news articles) with non-textual ones (such as airline passenger lists and bank records). PROGENIE will offer significant advantages, as pure knowledge sources can be mixed directly with text sources and numeric databases. It diverges from the NLG tradition in that we use examples from the domain to automatically construct content plans; such plans then guide the generation of biographies of unseen people. Moreover, the output of the system can be personalized, and because the system learns from examples, it can be personalized dynamically.
Our system is made up of three components: a knowledge component, a learning component (our research focus), and a generation component.
Learning Component. The key to greater flexibility in biography generation lies in a particular piece of the generation pipeline, the Content Planner. A content planner is responsible for distributing the information among the different paragraphs, bulleted lists, and other textual elements. Information-rich inputs require thorough filtering, with only a small amount of the available data being conveyed in the output. The selection and structuring of content, performed by the content planner, thus provides the flexibility we seek.
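The filtering step above can be pictured as selecting, from an information-rich input, only those facts relevant to the task at hand. The following is a minimal sketch, not PROGENIE's implementation; the attribute names and the whitelist are invented for illustration.

```python
# Hypothetical content-selection filter: keep only input facts whose
# attribute appears in a set of task-relevant attributes (here a
# hard-coded whitelist; in practice this would be learned).
RELEVANT = {"name", "birth-date", "occupation", "nationality"}

def select_content(facts):
    """Filter an information-rich input down to task-relevant facts."""
    return [(attr, val) for attr, val in facts if attr in RELEVANT]

facts = [("name", "Jane Roe"), ("shoe-size", "9"),
         ("occupation", "diplomat"), ("favorite-color", "blue")]
selected = select_content(facts)  # the two relevant facts survive
```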
Our research objectives focus on the automatic acquisition of schemas, data structures that guide the content planning process [3], by means of machine learning techniques. We employ an aligned corpus of input data and output text to induce schemas using stochastic search [1]. Such schemas are then used to generate biographies of new people, different from those used to learn them. The final system can then be easily customized for new needs or scenarios by the end users, extending the current work to possibilities unforeseen by us.
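To give a flavor of schema induction by stochastic search, the toy sketch below hill-climbs over orderings of content items, scoring each candidate by how many pairwise orderings it shares with example content plans. The example plans, item names, and scoring function are all invented for this illustration and greatly simplify the actual learning problem.

```python
import random

# Invented aligned examples: orderings of content items observed in a corpus.
EXAMPLE_PLANS = [
    ["name", "birth", "education", "career"],
    ["name", "birth", "career", "education"],
]

def score(schema):
    """Count pairwise orderings that the schema shares with the examples."""
    s = 0
    for plan in EXAMPLE_PLANS:
        for i, a in enumerate(plan):
            for b in plan[i + 1:]:
                if schema.index(a) < schema.index(b):
                    s += 1
    return s

def induce_schema(items, steps=200, seed=0):
    """Hill-climb by swapping two items, keeping non-worsening candidates."""
    rng = random.Random(seed)
    best = list(items)
    rng.shuffle(best)
    for _ in range(steps):
        cand = best[:]
        i, j = rng.sample(range(len(cand)), 2)
        cand[i], cand[j] = cand[j], cand[i]
        if score(cand) >= score(best):
            best = cand
    return best

schema = induce_schema(["career", "education", "birth", "name"])
```

The real search in [1] operates over far richer plan representations; this sketch only conveys the search-and-score structure of the approach.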
Knowledge Component. While the data employed to generate the biographies can be supplied by internal databases and networks such as Intelink or IAFIS [7], we plan to provide input to the generator by running information extraction agents over the Internet. Publicly available data offers both a rich source of information about well-known personalities and a test bed for the final system running on private intranets.
To represent the input to the generator, we chose a variation of RDF. This choice favors generality, reuse, and portability.
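An RDF-style input can be pictured as a set of subject-predicate-object triples. The sketch below is purely illustrative; the identifiers, predicates, and helper function are invented and do not reflect PROGENIE's actual vocabulary.

```python
# Hypothetical biography input as RDF-style (subject, predicate, object)
# triples; all names here are made up for this sketch.
triples = [
    ("person:p1", "rdf:type", "class:Person"),
    ("person:p1", "bio:name", "John Doe"),
    ("person:p1", "bio:birthDate", "1950-01-01"),
    ("person:p1", "bio:affiliation", "org:o1"),
    ("org:o1", "bio:name", "Example Organization"),
]

def objects(graph, subject, predicate):
    """Return all objects attached to a subject via a given predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]

names = objects(triples, "person:p1", "bio:name")
```

A triple store of this shape lets text sources, databases, and extraction agents all feed the generator through one uniform representation, which is what the generality and portability goals refer to.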
Generation Component. PROGENIE will consist of seven modules in a pipeline: an Inference Module, a Content Planner, a Text Planner, a Referring Expression Generator, an Aggregation Module, a Lexical Chooser, and a Surface Realizer. In this setting, the Content Planner executes the learned schemas. We use a variation of the Lexical Chooser (which selects words for concepts) from the MAGIC generator, and the FUF/SURGE unification-based package for the Surface Realizer. The remaining modules behave as follows: the Inference Module performs limited world-knowledge inferencing; the Text Planner splits a rhetorical tree into paragraphs; the Referring Expression Generator handles mostly pronominalization, although it can scan the input to generate descriptions such as his father; and the Aggregation Module merges clauses with similar structure in order to avoid repetition.
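The pipeline architecture just described can be sketched as a sequence of stages, each consuming the previous stage's output. The placeholder functions below merely mark where each module sits; they are not PROGENIE's implementations.

```python
# Skeleton of a seven-stage generation pipeline; every stage body is a
# placeholder standing in for the corresponding PROGENIE module.
def inference(data):           return data                    # limited world-knowledge inferencing
def content_planning(data):    return {"plan": data}          # executes learned schemas
def text_planning(data):       return {"paragraphs": [data]}  # splits rhetorical tree into paragraphs
def referring_expressions(d):  return d                       # pronominalization, descriptions
def aggregation(d):            return d                       # merge clauses with similar structure
def lexical_choice(d):         return d                       # select words for concepts
def surface_realization(d):    return str(d)                  # produce the final text

PIPELINE = [inference, content_planning, text_planning,
            referring_expressions, aggregation, lexical_choice,
            surface_realization]

def generate(data):
    for stage in PIPELINE:
        data = stage(data)
    return data
```

The pipeline design keeps each module replaceable; for instance, swapping in a different surface realizer only changes the last stage.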
We have presented a biography generation system with three components. A prototype of the learning component inferred plans for an earlier domain [1]. A new version of it, focusing on selecting appropriate pieces of content for a given biographical task, succeeded in halving the available data while keeping the correct data for further verbalization [2]. The generation component currently has five operational modules, at different levels of completion. The remaining two modules are the Lexical Chooser, undergoing knowledge acquisition, and the Aggregation Module, in the design phase. PROGENIE addresses an existing need of intelligence and law enforcement personnel. Its design has benefited greatly from interviews with potential users, a fact reflected in the architecture presented here. We plan to have an integrated biography generator, operating over publicly available Internet sources, by the end of 2003.