Introducing SIMON: An Open-Source Tool for Analyzing Semantic NLP in Email and Social Media Content
SIMON, or Semantic Inference for the Modeling of Ontologies, is an open-source text classification tool developed by NK Labs in Austin, TX. Using character-level analysis, SIMON enables researchers to study natural language samples such as blogs, email and social media posts with impressive accuracy and greater efficiency than existing programs. The project is supported by DARPA, and is part of a machine learning system that aims to replace some of the more tedious work data scientists do.
Down to the Letter
SIMON’s technical features support increasingly precise and efficient study of online communication. Because it operates at the character level rather than the word or sentence level, SIMON can be applied with minimal modification to different languages, different alphabets and even non-linguistic character collections such as emoji.
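To make the character-level idea concrete, here is a minimal sketch of how text in any script can be turned into the same kind of numeric input. The article does not describe SIMON’s actual encoding, so the function name encode_chars, the use of Unicode code points and the zero padding are all assumptions made for illustration.

```python
def encode_chars(text, max_len=8):
    """Hypothetical encoder: one integer per character, padded to max_len.

    Uses Unicode code points as character ids (an assumption for this
    sketch, not SIMON's documented scheme). 0 is used as padding.
    """
    codes = [ord(ch) for ch in text[:max_len]]
    return codes + [0] * (max_len - len(codes))


# The same function handles Latin, Cyrillic, Japanese and emoji input,
# with no tokenizer or vocabulary specific to any one language.
for sample in ["hello", "привет", "こんにちは", "🌮🔥"]:
    print(sample, encode_chars(sample))
```

Because nothing in the encoder depends on word boundaries or a fixed vocabulary, moving to a new alphabet requires no change to the preprocessing step.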
Which “Teddy”? Contextual Analysis
Because it is bidirectional, SIMON reads text both forward and backward, so it can resolve the meaning of a term from what appears on either side of it, even when we use that term in radically different ways.
“He said teddy bears are on sale.”
“He said Teddy Roosevelt was a great president.”
In the examples above, “teddy” clearly means two very different things. A system that reads in only one direction, and that has read only as far as that word, could not distinguish the two contexts.
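The point can be checked directly on the two example sentences. The sketch below is not SIMON’s model; it simply splits each sentence around the target word (at the word level, for readability) to show that the left-hand context is identical and only the right-hand context differs. The helper name contexts is hypothetical.

```python
def contexts(sentence, target):
    """Return (words before target, words after target), lowercased."""
    words = sentence.lower().rstrip(".").split()
    i = words.index(target)
    return words[:i], words[i + 1:]


a = "He said teddy bears are on sale."
b = "He said Teddy Roosevelt was a great president."

left_a, right_a = contexts(a, "teddy")
left_b, right_b = contexts(b, "teddy")

# A forward-only reader has seen only the left context at this point.
print(left_a == left_b)    # identical: ['he', 'said'] in both sentences
# A bidirectional reader also sees what follows, which is where they differ.
print(right_a == right_b)
```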
How Was that Taco?
With SIMON a user can gain a better understanding of context and tone.
“I want a taco bad.”
“I just had a taco. It was bad.”
In the examples above, SIMON’s sentiment analysis can classify the first sentence as positive and the second as negative. While sentiment analysis itself isn’t new, SIMON’s ability to analyze sentiment at the character level makes it more effective, because it gracefully accommodates spelling errors, language variations and similar surface-level differences.
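One way to see why character-level features tolerate misspellings is to compare a word with a misspelled variant. The sketch below uses character trigram overlap (Jaccard similarity), which is a stand-in chosen for illustration and not the method SIMON uses; the function names are hypothetical.

```python
def char_trigrams(word):
    """Set of all 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity between the trigram sets of two words."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    return len(ta & tb) / len(ta | tb)


correct, typo = "delicious", "delicous"

# A word-level lookup treats the misspelling as a completely unknown token.
print(correct == typo)
# Character-level features still overlap substantially.
print(trigram_similarity(correct, typo))
```

A word-based system sees no relationship between the two spellings, while the character-level view retains much of the signal.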
Going Beyond the Usual ML Analysis
SIMON is part of DARPA’s AutoML system that automates data science pipelines. To get started, a user supplies a task in natural language, such as determining whether an email is from a legitimate sender or from a bot trying to sell you a new mortgage. The machine applies all necessary preprocessing and machine learning algorithms, and communicates the results to the user in natural language. SIMON’s role in the AutoML system is preprocessing to semantically classify tabular data.
SIMON goes beyond traditional natural language analysis by relying on transfer learning, which is still a relatively new practice in natural language processing. SIMON is first trained on an initial set of data and then applies that knowledge to another domain with significantly reduced training requirements.
SIMON takes a tabular dataset and semantically classifies each column into base classes (such as integers, data strings, addresses, etc.) that can be used to choose an appropriate analysis. Whereas traditional type-inference approaches (such as pandas’ dtypes) assign a single label to each column, SIMON is formulated as a multi-class, multi-label classifier. In other words, SIMON can take a single input column, such as a column of cities, and classify it as a city, a name and text (string) at the same time. Likewise, it can recognize data as a mailing address, a geographic location, or both. This versatility is powerful, uncommon and at the forefront of current NLP research trends.
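The difference between single-label and multi-label output can be sketched in a few lines. The scores below are invented for illustration, and the independent sigmoid with a 0.5 threshold is a common multi-label formulation assumed here; the article does not state how SIMON produces its labels.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def single_label(logits):
    """Single-label view: keep only the highest-scoring class."""
    return max(logits, key=logits.get)


def multi_label(logits, threshold=0.5):
    """Multi-label view: keep every class whose probability clears the threshold."""
    return sorted(label for label, z in logits.items() if sigmoid(z) >= threshold)


# Hypothetical raw scores for one column containing city names.
logits = {"city": 2.1, "name": 0.4, "text": 3.0, "integer": -4.0}

print(single_label(logits))  # text
print(multi_label(logits))   # ['city', 'name', 'text']
```

Each class is scored independently, so a column is not forced to choose between being a city and being text.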
High Success Rate Without Metadata
To test SIMON’s efficacy, NK Labs’ data scientists procured 38 open-source data sets collected and manually annotated by MIT Lincoln Laboratory, and assessed their baseline similarity score. They defined the “similarity score” as the percentage of labels for which any of the automated annotations matched the manual annotations produced by human judgment. Whereas the more traditional pandas dtypes annotation achieved an average similarity score of 71 percent, SIMON achieved a similarity score of 92 percent. SIMON also goes beyond the manual annotations by providing more insight into the data. For example, when processing tabular data containing physical addresses, SIMON recognizes a “Zip” column not merely as a series of numbers, but also as a postal code.
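The similarity score described above can be computed roughly as follows. This sketch reads the definition as a per-column check (a column counts as a match if at least one automated label also appears among its manual labels), which is one interpretation of the wording. The column names and labels are made-up examples, not data from the 38 test sets.

```python
def similarity_score(automated, manual):
    """Percentage of manually annotated columns where any automated label matches."""
    matches = sum(
        1
        for column, labels in manual.items()
        if set(automated.get(column, ())) & set(labels)
    )
    return 100.0 * matches / len(manual)


# Hypothetical annotations for a four-column table.
manual = {
    "zip": {"integer"},
    "city": {"text"},
    "dob": {"datetime"},
    "id": {"integer"},
}
automated = {
    "zip": {"integer", "postal_code"},  # extra label does not hurt the score
    "city": {"city", "text"},
    "dob": {"text"},                    # no overlap with the manual label
    "id": {"integer"},
}

print(similarity_score(automated, manual))  # 75.0
```

Under this reading, additional labels such as “postal_code” never lower the score, which matches the observation that a multi-label tool can add insight without disagreeing with the human annotation.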
When testing with email spam classification, SIMON achieved ~99 percent results in training, validation and test accuracy, which amounts to parity with existing state-of-the-art spam classification systems. But unlike those systems, which leverage a host of metadata beyond the content of the message (such as mail server configuration, originating IP address, etc.), SIMON achieves these high results with just the raw semantic elements of the characters and does not use any metadata.
SIMON in Action
Interested in seeing firsthand how useful SIMON is? Check it out in our GitHub repository here. Want to see it in action? NK Labs will demonstrate SIMON at the AI Community Mixer and Showcase during SXSW 2019. Registration is free and a badge is not required.
Where Does SIMON Go from Here?
Having experienced the benefits of using character-based, bidirectional transfer learning in natural language analysis, NK Labs hopes the work that went into developing SIMON will advance the state of the art in machine learning. NK Labs is now augmenting SIMON to better support parallel computing (multi-GPU, multi-CPU) and to improve the explainability of neural networks and transfer learning in natural language.