NLG for Statistical Reports

MTL Data Meetup 2015/02/11

Created by Pablo Duboue / @pabloduboue

What is NLG

Natural Language Generation

The other NLP

Decisions, decisions, decisions

Continuous enrichment

Pablo Duboue

My thesis (defended ten years ago!) was in Machine Learning for NLG. I worked on two full NLG systems:

  • MAGIC, a bypass surgery report system written in LISP.
  • ProGenIE, a biography generator written in Java.

Even though I gravitated towards ML and IR, half my papers are in NLG and I'm coming back to the field. (I recently ran for a position on the board of SIGGEN.)

Standard Pipeline

Adapted from

Document Planner

content determination decides what information will appear in the output text. This depends on what your goal is, who the audience is, what sort of input information is available to you in the first place and other constraints such as allowed text length.

document structuring decides how chunks of content should be grouped in a document, how to relate these groups to each other and in what order they should appear. For instance, when describing last month’s weather, you might talk first about temperature, then rainfall. Or you might start off generally talking about the weather and then provide specific weather events that occurred during the month.

OpenSchema performs both tasks.

Microplanner


lexicalization decides what specific words should be used to express the content. For example, the actual nouns, verbs, adjectives and adverbs to appear in the text are chosen from a lexicon. Particular syntactic structures are chosen as well. For example you can say ‘the car owned by Mary’ or you might prefer the phrase ‘Mary’s car’.

referring expression generation decides which expressions should be used to refer to entities (both concrete and abstract). The same entity can be referred to in many ways. For example, March of last year can be referred to as:

  • March 2014
  • March
  • March of the previous year
  • it
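
The choice among these forms can be pictured as a small set of context rules. The sketch below is only an illustration under assumed rules (explicit on first mention, a pronoun when the entity is in focus, a relative form otherwise); the class name, method and flags are hypothetical and do not come from any NLG library.

```java
import java.time.YearMonth;
import java.time.format.TextStyle;
import java.util.Locale;

class ReferringExpressionSketch {

    // Pick a referring expression for a month, given a (hypothetical) discourse context.
    static String refer(YearMonth target, YearMonth reportDate,
                        boolean firstMention, boolean inFocus) {
        String month = target.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
        if (firstMention) {
            return month + " " + target.getYear();      // fully explicit the first time
        }
        if (inFocus) {
            return "it";                                // just mentioned: a pronoun is enough
        }
        if (target.getYear() == reportDate.getYear() - 1) {
            return month + " of the previous year";     // relative to the report date
        }
        return month;                                   // year is clear from context
    }

    public static void main(String[] args) {
        YearMonth march2014 = YearMonth.of(2014, 3);
        YearMonth now = YearMonth.of(2015, 2);
        System.out.println(refer(march2014, now, true, false));
        System.out.println(refer(march2014, now, false, true));
        System.out.println(refer(march2014, now, false, false));
    }
}
```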


aggregation decides how the structures created by document planning should be mapped onto linguistic structures such as sentences and paragraphs. For instance, two ideas can be expressed in two sentences or in one:

The month was cooler than average. The month was drier than average.


The month was cooler and drier than average.
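
The cooler/drier example can be sketched in a few lines of plain Java. This is a toy version that assumes every clause fits the fixed frame "<subject> was <adjective> than average" and merges clauses that share a subject; real aggregation works on richer structures, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AggregationSketch {

    // Each clause is {subject, comparative adjective}; clauses with the same
    // subject are merged into one sentence with coordinated adjectives.
    static List<String> aggregate(List<String[]> clauses) {
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (String[] clause : clauses) {
            bySubject.computeIfAbsent(clause[0], k -> new ArrayList<>()).add(clause[1]);
        }
        List<String> sentences = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : bySubject.entrySet()) {
            sentences.add(e.getKey() + " was " + coordinate(e.getValue()) + " than average.");
        }
        return sentences;
    }

    // Joins items as "a", "a and b" or "a, b and c".
    static String coordinate(List<String> items) {
        if (items.size() == 1) {
            return items.get(0);
        }
        String head = String.join(", ", items.subList(0, items.size() - 1));
        return head + " and " + items.get(items.size() - 1);
    }

    public static void main(String[] args) {
        List<String[]> clauses = List.of(
            new String[] {"The month", "cooler"},
            new String[] {"The month", "drier"});
        System.out.println(aggregate(clauses));
    }
}
```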

Surface Realiser

linguistic realisation uses rules of grammar (about morphology and syntax) to convert abstract representations of sentences into actual text.

structure realization converts abstract structures such as paragraphs and sentences into mark-up symbols which are used to display the text.

SimpleNLG performs the last part, namely surface realisation.

Statistical Reports

PostGraphe, a system developed as part of
Dr. Fasciano's thesis at UdeM

Basic intentions covered in PostGraphe:

  • The presentation of a variable
  • The comparison of variables or sets of variables
  • The evolution of a variable along another one
  • The correlation of variables
  • The distribution of a variable over another one


OpenSchema takes care of selecting what to say and structuring the selected information. It does so by executing an augmented transition network (ATN), which for the purposes of this software package is a grammar for a regular language (think regular expressions) over discourse predicates that are themselves defined as part of the schema.
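
To make the "regular expression over predicates" idea concrete, here is a minimal sketch in plain Java. It is not OpenSchema's API or algorithm: it assumes a schema is just an ordered list of steps, each either required or starred, and that facts are already grouped by predicate name. Only `pred-person` appears in these slides; the other predicate names and the fact strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class SchemaSketch {

    // One step of the schema: a predicate name, optionally starred (zero or more).
    static final class Step {
        final String predicate;
        final boolean star;
        Step(String predicate, boolean star) {
            this.predicate = predicate;
            this.star = star;
        }
    }

    // Walk the steps in order and collect the facts that instantiate each predicate.
    // An empty Optional means a required predicate could not be instantiated.
    static Optional<List<String>> plan(List<Step> schema, Map<String, List<String>> facts) {
        List<String> selected = new ArrayList<>();
        for (Step step : schema) {
            List<String> matches = facts.getOrDefault(step.predicate, List.of());
            if (step.star) {
                selected.addAll(matches);        // zero or more instances
            } else if (matches.isEmpty()) {
                return Optional.empty();         // required step fails
            } else {
                selected.add(matches.get(0));    // exactly one instance
            }
        }
        return Optional.of(selected);
    }

    public static void main(String[] args) {
        List<Step> biography = List.of(
            new Step("pred-person", false),
            new Step("pred-alias", true),
            new Step("pred-parent", true),
            new Step("pred-education", true));
        Map<String, List<String>> facts = Map.of(
            "pred-person", List.of("person(jane)"),
            "pred-education", List.of("education(jane, phd)", "education(jane, bsc)"));
        System.out.println(plan(biography, facts));
    }
}
```

The order of the steps fixes the order of the text, which is how a single traversal covers both content selection and structuring in this toy version.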

Input: RDF

RDF is a graph description notation used in the Semantic Web.
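
As a rough picture of what "graph description" means here, the sketch below stores statements as subject-predicate-object triples and looks up the objects attached to a subject. It is plain Java rather than any RDF library's API, and the `ex:` names and the data are hypothetical placeholders for real URIs.

```java
import java.util.ArrayList;
import java.util.List;

class TripleSketch {

    // One edge of the graph: subject --predicate--> object.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Collect every object reachable from `subject` through `predicate`.
    static List<String> objectsOf(List<Triple> graph, String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : graph) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data; the ex: prefix stands in for real URIs.
        List<Triple> graph = List.of(
            new Triple("ex:person-1", "ex:occupation", "ex:physicist"),
            new Triple("ex:person-1", "ex:alias", "J. Doe"),
            new Triple("ex:person-1", "ex:alias", "Jane D."));
        System.out.println(objectsOf(graph, "ex:person-1", "ex:alias"));
    }
}
```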

Output: Clauses

The output of the OpenSchemaPlanner is a DocumentPlan, which contains a list of paragraphs, each of which is a list of aggregation segments. An aggregation segment, in turn, is a list of clauses, where each clause is a hierarchical attribute-value matrix represented as a Java Map from String to Object.
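
The nesting described above can be mimicked with standard collections. This sketch does not use the real DocumentPlan class; it only assumes the shape given in the text (paragraphs, then aggregation segments, then clauses as maps). The keys `pred`, `pred0` and `pred1` follow the predicate example in these slides, while the values are invented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DocumentPlanSketch {

    // document = list of paragraphs; paragraph = list of aggregation segments;
    // segment = list of clauses; clause = attribute-value matrix (nested maps).
    static List<List<List<Map<String, Object>>>> build() {
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("name-first", "Jane");      // hypothetical values
        person.put("name-last", "Doe");

        Map<String, Object> clause = new LinkedHashMap<>();
        clause.put("pred", "attributive");
        clause.put("pred0", person);           // a nested matrix as a value
        clause.put("pred1", "physicist");

        List<Map<String, Object>> segment = List.of(clause);
        List<List<Map<String, Object>>> paragraph = List.of(segment);
        return List.of(paragraph);
    }

    // Count the clauses across all paragraphs and segments.
    static int countClauses(List<List<List<Map<String, Object>>>> plan) {
        int count = 0;
        for (List<List<Map<String, Object>>> paragraph : plan) {
            for (List<Map<String, Object>> segment : paragraph) {
                count += segment.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countClauses(build()));
    }
}
```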


schema biography(self: c-person)  
  ; name of the schema 'biography'
  ; self is the person the bio is about, required

  ; first paragraph, the person
  star ; zero or more aliases
  star ; zero or more parents
  star ; zero or more education


predicate pred-person
    req def person : c-person 
    occupation : c-occupation
  properties  ; properties that the variables have to hold
    occupation == person.occupation
    ; use this for template generation
    template "{{name-first}} {{name-last}} is a {{occupation}}. "
    occupation    occupation.#TYPE
    ; use these preds for SimpleNLG
    pred attributive
    pred0 person 
    pred1 occupation
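
The template line in the predicate uses `{{slot}}` placeholders. How OpenSchema fills them is not shown here, so the following is just a minimal sketch of mustache-style substitution with `java.util.regex`; the class name, the treatment of unknown slots (replaced by an empty string) and the sample values are assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateSketch {

    // Matches {{slot-name}} and captures the name between the braces.
    private static final Pattern SLOT = Pattern.compile("\\{\\{([^}]+)\\}\\}");

    // Replace every slot with its value; unknown slots become the empty string.
    static String fill(String template, Map<String, String> values) {
        Matcher m = SLOT.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "{{name-first}} {{name-last}} is a {{occupation}}. ";
        Map<String, String> values = Map.of(
            "name-first", "Jane",          // hypothetical values
            "name-last", "Doe",
            "occupation", "physicist");
        System.out.println(fill(template, values));
    }
}
```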


Tutorial, adapted from

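        import simplenlg.framework.*;
        import simplenlg.lexicon.*;
        import simplenlg.realiser.english.*;
        import simplenlg.phrasespec.*;
        import simplenlg.features.*;

        Lexicon lexicon = Lexicon.getDefaultLexicon();
        NLGFactory nlgFactory = new NLGFactory(lexicon);
        Realiser realiser = new Realiser(lexicon);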

        NLGElement s1 = 
            nlgFactory.createSentence("my dog is happy");

        String output = realiser.realiseSentence(s1);

My dog is happy.

        SPhraseSpec p = nlgFactory.createClause();
        p.setSubject("Mary");
        p.setVerb("chase");
        p.setObject("the monkey");

        String output = realiser.realiseSentence(p);

Mary chases the monkey.

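        NPPhraseSpec subject1 = 
          nlgFactory.createNounPhrase("Mary");
        NPPhraseSpec subject2 = 
          nlgFactory.createNounPhrase("your", "giraffe");

        CoordinatedPhraseElement subj = 
          nlgFactory.createCoordinatedPhrase(subject1, subject2); 
        p.setSubject(subj);  // the coordinated phrase replaces the plain subject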

Mary and your giraffe chase the monkey.

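        NPPhraseSpec object1 = 
            nlgFactory.createNounPhrase("the monkey");
        NPPhraseSpec object2 = 
            nlgFactory.createNounPhrase("George");

        CoordinatedPhraseElement obj = 
            nlgFactory.createCoordinatedPhrase(object1, object2); 
        obj.addCoordinate("Martha");
        p.setObject(obj);

        // the default conjunction is "and"
        obj.setFeature(Feature.CONJUNCTION, "or");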

Mary and your giraffe chase the monkey, George or Martha.

        p.setFeature(Feature.TENSE, Tense.FUTURE);
        p.setFeature(Feature.NEGATED, true);

Mary will not chase the monkey.


Going from the OpenSchema predicates to SimpleNLG classes:

  • Aggregation
  • Lexicalization

Coming soon

Data Visualization

Dimensionality reduction

What is missing?

Capturing generalizations




Do you want to learn more?

I have here the material for my NLG class from 2011.
(A graduate level, semester long course.)

Data Science Córdoba

Lessons learned:

  • Visiting companies is a great idea.
  • Publishing LinkedIn rosters.


Keep in touch with Pablo at: