Tags: UIMA, Python
I have been using a lot of Python lately at work for a customer. Programming in Python has many upsides, but when it comes to processing large amounts of text, I still choose Apache UIMA.
The original version of this post was at the (now defunct) Hack the Job Market blog from MatchFWD, an (also now defunct) local startup. I reproduce it here.
Python is great not only for its flexibility but also for its ability to interface with other languages.
In a new project we will be deploying in the upcoming weeks, we are searching job postings using (among other things) a large dictionary of job titles.
This is the task for the Apache UIMA Concept Mapper Annotator, which is part of the Apache UIMA Addons. The dictionary itself is an XML file that looks like this (see SVN for the full sample):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="United States" DOCNO="276145">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="the United States of America"/>
  </token>
</synonym>
```

In our case it looks something like this:
```xml
<token canonical="Community Managers" french="Gestionnaire de communauté">
  <variant base="Writer &amp; Community Manager" source="1"/>
  <variant base="Marketing and Community Manager" source="1"/>
  <variant base="Community Manager" source="2"/>
  <variant base="Blog Editor" source="1"/>
  <variant base="Community manager social media" source="2"/>
</token>
```

The power of Concept Mapper lies in the fact that it keeps all the entries structured in RAM and matches against large dictionaries in linear time. Moreover, you can add any extra information to the XML and refer back to it from within UIMA by using custom types. In our case, we have job titles, aliases, and extra information (such as language, source information, etc.). Setting up a Java pipeline to use the UIMA Concept Mapper Annotator requires fiddling with three XML descriptors (unless you are using uimaFIT, for descriptorless UIMA, but I haven't tried it with Concept Mapper yet) and a type descriptor. The descriptor files build on the type descriptor changes, so let's start there. To access extra fields in the dictionary XML file, you need to change DictTerm.xml and define a UIMA feature for each piece of information:
```xml
<featureDescription>
  <name>DictCanon</name>
  <description>canonical form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
  <name>DictFrench</name>
  <description>French form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
  <name>DictSource</name>
  <description>Source for the alias</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
```

Next come the three descriptor files. First, Concept Mapper needs the path to its tokenizer descriptor:
```xml
<nameValuePair>
  <name>TokenizerDescriptorPath</name>
  <value>
    <string>/path/to/OffsetTokenizer.xml</string>
  </value>
</nameValuePair>
```

Concept Mapper also needs to know how to map the attributes in the dictionary XML to the features of the dictionary term. This is accomplished by two parameters, AttributeList and FeatureList, both arrays of strings that must have the same length (this is quite ugly; anybody feel like submitting a patch for it? *grin*):
```xml
<nameValuePair>
  <name>AttributeList</name>
  <value>
    <array>
      <string>canonical</string>
      <string>french</string>
      <string>source</string>
    </array>
  </value>
</nameValuePair>
<nameValuePair>
  <name>FeatureList</name>
  <value>
    <array>
      <string>DictCanon</string>
      <string>DictFrench</string>
      <string>DictSource</string>
    </array>
  </value>
</nameValuePair>
```

Finally, at the end of ConceptMapperOffsetTokenizer.xml you need to point to your dictionary:
```xml
<fileResourceSpecifier>
  <fileUrl>file:/path/to/your/dictionary.xml</fileUrl>
</fileResourceSpecifier>
```
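Because AttributeList and FeatureList are parallel arrays, a mismatch between them only shows up as a runtime error deep inside the pipeline. As a quick safeguard, here is a small sketch (my own helper, not part of ConceptMapper) that parses a descriptor fragment and pairs the two lists up, failing fast if their lengths differ:

```python
import xml.etree.ElementTree as ET

def check_attribute_feature_alignment(descriptor_xml):
    """Return (attribute, feature) pairs from a ConceptMapper descriptor
    fragment, raising ValueError if the two parallel arrays are misaligned.
    This is a hypothetical sanity-check helper, not part of ConceptMapper."""
    root = ET.fromstring(descriptor_xml)
    arrays = {}
    for pair in root.iter('nameValuePair'):
        name = pair.findtext('name')
        if name in ('AttributeList', 'FeatureList'):
            # collect the <string> children of <value><array>
            arrays[name] = [s.text for s in pair.find('value/array')]
    attrs = arrays.get('AttributeList', [])
    feats = arrays.get('FeatureList', [])
    if len(attrs) != len(feats):
        raise ValueError('AttributeList and FeatureList lengths differ')
    return list(zip(attrs, feats))
```

Running it over the fragment above would pair `canonical` with `DictCanon`, `french` with `DictFrench`, and `source` with `DictSource`.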
With the descriptors in place, a small Java class wraps the pipeline:

```java
import java.io.IOException;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.XMLInputSource;

public class Analyzer {
    private AnalysisEngine ae;
    private CAS aCAS;
    private TypeSystem ts;
    private Type termType;
    private Feature sourceFeat;

    public Analyzer() throws IOException, InvalidXMLException, ResourceInitializationException {
        XMLInputSource in = new XMLInputSource("/path/to/OffsetTokenizerMatcher.xml");
        ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        ae = UIMAFramework.produceAnalysisEngine(specifier);
        aCAS = ae.newCAS();
        ts = aCAS.getTypeSystem();
        termType = ts.getType("org.apache.uima.conceptMapper.DictTerm");
        sourceFeat = termType.getFeature("DictSource");
    }

    public String[] analyze(String s) throws AnalysisEngineProcessException {
        aCAS.reset();            // reuse the Common Analysis Structure object
        aCAS.setDocumentText(s); // set the text to analyze
        ae.process(aCAS);        // run Concept Mapper
        // any job titles?
        FSIterator<AnnotationFS> it = aCAS.getAnnotationIndex(termType).iterator();
        if (!it.hasNext()) {
            return new String[0]; // no job titles, nothing to see here
        } else {
            // return the first one, with offsets and source
            AnnotationFS ann = it.next();
            return new String[]{
                String.valueOf(ann.getBegin()),
                String.valueOf(ann.getEnd()),
                ann.getFeatureValueAsString(sourceFeat)
            };
        }
    }
}
```

Now, calling this from Python is very straightforward. I am using an embedded JVM through JPype:
```python
import jpype

if not jpype.isJVMStarted():
    _jvmArgs = ["-Djava.class.path="
                "/path/to/uima/jars/uimaj-bootstrap.jar:"
                "/path/to/uima/jars/uimaj-core.jar:"
                "/path/to/uima/jars/uimaj-document-annotation.jar:"
                "/path/to/uima/jars/uima-ConceptMapper.jar:"
                "/path/to/your/code/bin"]
    jpype.startJVM("/path/to/your/jvm/libjvm.so", *_jvmArgs)

Analyzer = jpype.JClass("your.packages.Analyzer")
analyzer = Analyzer()
analysis = analyzer.analyze("Sample text talking about what are the tasks for a Blog Editor")
if len(analysis) > 0:
    print 'start', analysis[0]
    print 'end', analysis[1]
    print 'source', analysis[2]
```

Our internal tests show that processing 200+ megabytes of text against a dictionary with hundreds of thousands of entries takes a little more than 4 minutes. While the integration works, it is still far from nice or "pythonic". I hope to gather enough insight from this to contribute to an NLTK ticket outstanding since 2010.
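One small step toward something more pythonic would be a thin wrapper that turns the raw `String[]` coming back from `Analyzer.analyze()` into a proper Python value. This is a hedged sketch of my own (the `Match` type and `to_match` helper are hypothetical names, not part of the code above or of JPype):

```python
from collections import namedtuple

# a typed result instead of a bare array of strings
Match = namedtuple('Match', ['begin', 'end', 'source'])

def to_match(raw):
    """Convert the [begin, end, source] string array returned by
    Analyzer.analyze() into a Match with integer offsets, or None
    when the array is empty (no job title found)."""
    if not raw:
        return None
    return Match(begin=int(raw[0]), end=int(raw[1]), source=raw[2])
```

With this in place the call site becomes `match = to_match(analyzer.analyze(text))`, and callers can test for `None` and use `match.begin` instead of remembering positional indices.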