Monday, July 18, 2016

Faking NLP With Lucene

Since I'm going back to work at Microsoft soon, I decided I ought to start up a C# / .NET project to stretch my skills a bit. I went looking for good ideas and came up with a Microsoft Bot Framework bot to answer queries for public traffic camera data. I started a GitHub project here, so check it out. I will blog a little about it as opportunities arise.

One thing I wanted to do was support as natural a conversational style as possible, as is the trend with bots. Here is an example dialog between me and my bot:

I spent a little bit of time investigating how I might do "true" natural language processing (NLP) to answer queries like "Show me traffic at Sunset", perhaps using something like Stanford CoreNLP. The question then arose as to how I would train an appropriate model. With traffic cameras, the camera names sometimes look sorta like addresses, which is clearly trainable. But many times, they don't, and in fact I finally decided they didn't really fall into any trainable pattern whatsoever.

Instead, I decided to apply search techniques. I set up a Lucene index in which each document represents one traffic camera, and gave each document text covering the different combinations of possible abbreviations. For example, a camera named "NE 85th St" might be added to the index with a document like:

ne 85th st
northeast 85th st
ne 85th street
northeast 85th street

        private Document CreateDocument(string title, IEnumerable<string> altNames)
        {
            var doc = new Document();
            // Store the display name verbatim; we never search on it directly.
            doc.Add(new Field(CAMERA_NAME_FIELD, title, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // The searchable content is the title plus every alternate name,
            // one per line.
            var sb = new StringBuilder();
            sb.Append(title);
            sb.Append('\n');
            foreach (var altName in altNames)
            {
                sb.Append(altName);
                sb.Append('\n');
            }
            var content = sb.ToString();
            doc.Add(new Field(CONTENT_FIELD, content, Field.Store.YES, Field.Index.ANALYZED));
            return doc;
        }
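
Generating those alternate names is mechanical: find each abbreviated word in the title and emit every combination of expansions. Here is a rough sketch of how that can be done; the abbreviation table and method name are illustrative, not the project's exact code:

        private static readonly Dictionary<string, string> Abbreviations =
            new Dictionary<string, string>
            {
                { "ne", "northeast" }, { "nw", "northwest" },
                { "se", "southeast" }, { "sw", "southwest" },
                { "st", "street" }, { "ave", "avenue" }, { "blvd", "boulevard" }
            };

        private IEnumerable<string> GenerateAltNames(string title)
        {
            var words = title.ToLowerInvariant().Split(' ');
            // Positions of words that have a known expansion.
            var positions = new List<int>();
            for (int i = 0; i < words.Length; i++)
                if (Abbreviations.ContainsKey(words[i]))
                    positions.Add(i);
            // Emit one variant per non-empty subset of expandable positions.
            for (int mask = 1; mask < (1 << positions.Count); mask++)
            {
                var variant = (string[])words.Clone();
                for (int b = 0; b < positions.Count; b++)
                    if ((mask & (1 << b)) != 0)
                        variant[positions[b]] = Abbreviations[variant[positions[b]]];
                yield return string.Join(" ", variant);
            }
        }

For "NE 85th St" this yields the three lowercased variants beyond the original title.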

When it comes time to process a query, we first look for exact matches (all terms required) and then fuzzy matches. Both will fail for natural-language-style text, so at that point (and as a last resort) the whole query string gets passed directly to the index with no preprocessing other than lowercasing, and the results are scored. All of the documents whose score passes a certain threshold are returned.

        private IList<string> RunQuery(Query query)
        {
            // Collect the top-scoring documents for this query.
            var collector = TopScoreDocCollector.Create(MAX_SEARCH_RESULTS, false);
            searcher.Search(query, collector);
            var scoreDocs = collector.TopDocs().ScoreDocs;
            logger.Debug("Searching for " + query +
                ", top score is " + (scoreDocs.Any() ? scoreDocs[0].Score.ToString() : "non-existent"));
            // Keep only hits above the score threshold, and return their
            // stored camera names.
            var results = from hit in scoreDocs
                          where hit.Score >= HitScore
                          select searcher.Doc(hit.Doc).Get(CAMERA_NAME_FIELD);
            return results.ToList();
        }

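The escalation from exact to fuzzy to raw query can be sketched as follows. This is illustrative rather than the project's exact code: BooleanQuery, FuzzyQuery, and QueryParser are standard Lucene.NET 3.x types, but the method name and the analyzer field are my assumptions.

        private IList<string> FindCameras(string userText)
        {
            var terms = userText.ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            // 1. Conjunctive: every term must appear in the document.
            var exact = new BooleanQuery();
            foreach (var t in terms)
                exact.Add(new TermQuery(new Term(CONTENT_FIELD, t)), Occur.MUST);
            var results = RunQuery(exact);
            if (results.Count > 0) return results;

            // 2. Fuzzy: tolerate small misspellings in each term.
            var fuzzy = new BooleanQuery();
            foreach (var t in terms)
                fuzzy.Add(new FuzzyQuery(new Term(CONTENT_FIELD, t)), Occur.MUST);
            results = RunQuery(fuzzy);
            if (results.Count > 0) return results;

            // 3. Last resort: hand the whole lowercased string to the query
            //    parser and let relevance scoring sort it out.
            var parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
            return RunQuery(parser.Parse(QueryParser.Escape(userText.ToLowerInvariant())));
        }
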
If there is only one matching document, we have achieved "magic" and present the camera directly to the user. Otherwise (as in the example dialog above), we present a choice menu.

What I found in practice is that this works for a wide variety of queries and camera names. Typically, the desired camera document(s) will score around 0.3, and there is an order-of-magnitude drop-off in the scores of the other "matching" documents (which perhaps just match a generic term like "avenue").

So at the end of the day, with no true NLP algorithms in play at all, it seems the bot can do a fairly decent job of handling natural-language style queries in this limited domain.
