Last year I finally had the opportunity to work on a real-life AI scenario. One of our customers was looking to auto-tag their data to improve findability. Based on the customer's needs, a colleague (Rutger) and I spent some time preparing a Proof of Concept. The goal of this PoC was to prove that AI can help in classifying the data. Rutger had started working at Mavention only a few months earlier, after graduating here. His graduation project was on using AI and chatbots to interact with one of our products. While writing his thesis he spent some time on language understanding with LUIS. So when our customer asked us to see what we could do to classify their data, we ended up spending some time together. We drafted some requirements to prove that we could get it all together.
The use case
Imagine a customer that has migrated to SharePoint Online. They run in a hybrid scenario due to legislation. They also have terabytes of old data that might one day be migrated. So there is a set of data that needs to be updated with classification or metadata. The data is saved on file systems or network shares. There is no real way to identify the type of data, as the files carry no metadata of their own. The only metadata available is in the contents of the files themselves.
Using SharePoint Search
The quickest way to provide insights is to add the file share to the SharePoint Search results. The legislative requirements can be met by adding it to the local index rather than including it in SharePoint Online. That way on-premises users can search their file share for files, and SharePoint provides a full-text experience. SharePoint itself provides Custom Content Enrichment to enrich your search results. The search engine also provides Entity Extraction to help determine the type of content.
Using Azure AI
Azure provides several AI solutions that can help you work with content. You can use them to determine the language of a file. You can extract several known properties such as locations, keywords, numbers, and names. And you can use LUIS to determine the context of specific elements. The only downside of these AI services is that you have to feed them small chunks of information. You cannot push in the complete document. Depending on the service, the Azure AI services accept somewhere between 400 and 5000 characters at most. So you will end up with some custom code to split up content.
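To give an idea of what that custom splitting code could look like, here is a minimal sketch in Python. It is not the code from our PoC. The function name `split_into_chunks` and the default limit of 5000 characters are assumptions for illustration, so check the actual limit of the service you call.

```python
def split_into_chunks(text: str, max_chars: int = 5000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Breaks on whitespace where possible so words stay intact.
    The default of 5000 is an assumption; use the limit of your service.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be cut hard.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]

        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding the word would exceed the limit: close the chunk.
            chunks.append(current)
            current = word

    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the service separately. Note that this sketch collapses runs of whitespace, which is usually fine for language detection and entity extraction.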
Back to our proof of concept. We added our documents to SharePoint Search for full-text search capabilities. We then implemented a content enrichment service. Such a service lets you extend the search results by adding additional metadata to each item as it is indexed. In our case, we used this service to retrieve the contents of the document. These contents were then scrubbed to retrieve whatever metadata we could determine. We retrieved the footer of the document. Additionally, we retrieved the first few pages to check for a cover page with information. Based on this data we validated whether the security classification allowed us to process the document in the cloud.
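The classification check can be pictured roughly like this. This is a simplified sketch and not our actual implementation: the label format ("Classification: ..."), the allowed labels, and the function name `may_process_in_cloud` are all made up for illustration, since every organisation has its own scheme.

```python
import re

# Hypothetical labels that are allowed to leave the premises.
CLOUD_ALLOWED = {"public", "internal"}

# Hypothetical label format, e.g. "Classification: Internal".
LABEL_PATTERN = re.compile(r"classification\s*:\s*([A-Za-z]+)", re.IGNORECASE)


def may_process_in_cloud(footer: str, cover_page: str) -> bool:
    """Return True only if at least one label is found and all labels are allowed."""
    labels = [
        match.group(1).lower()
        for text in (footer, cover_page)
        for match in LABEL_PATTERN.finditer(text)
    ]
    # No label at all means we cannot tell, so we stay on the safe side.
    return bool(labels) and all(label in CLOUD_ALLOWED for label in labels)
```

The sketch deliberately refuses documents without a recognisable label, so unknown content never reaches the cloud by accident.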
If the data could be processed in the cloud, it is sent to the Azure AI services to determine the language. Once the language is detected, the matching trained LUIS model is used to get intents and entities. This information is passed back to the Content Enrichment service. By returning this information you can re-use it in your SharePoint Search center. The result is that you can use LUIS entities as refiners and filters.
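The flow above, sketched in Python under a few assumptions. The app identifiers in `LUIS_APPS` are placeholders, and `detect_language` and `query_luis` are hypothetical callables standing in for the real calls to the Azure services, so no actual Azure API is shown here.

```python
from typing import Callable, Optional

# Hypothetical mapping from detected language code to a LUIS app identifier.
LUIS_APPS = {"en": "luis-app-en", "nl": "luis-app-nl"}


def enrich(
    text: str,
    detect_language: Callable[[str], str],
    query_luis: Callable[[str, str], dict],
) -> Optional[dict]:
    """Detect the language, then query the LUIS model trained for it.

    Returns None when no model exists for the detected language.
    """
    language = detect_language(text)
    app_id = LUIS_APPS.get(language)
    if app_id is None:
        return None

    result = query_luis(app_id, text)
    # These values are what would be handed back to the enrichment service.
    return {
        "language": language,
        "intent": result.get("intent"),
        "entities": result.get("entities", []),
    }
```

Passing the service calls in as parameters keeps the routing logic easy to test without touching the cloud.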
The samples we played around with ranged from simple ones like project numbers or ISBNs to more complex ones like authors (based on their display names). You can even train a model for topics for reports and letters. By injecting that data into the search index you can use those entities as refiners. Another option is to build your own display template to show those properties to your users.
I loved it!
I loved playing around with some of the AI options we have in a real-life scenario. The SharePoint Search experience combined with AI services proved to be a powerful combination. It allowed us to extract massive quantities of data that were not available before. In only a few days we managed to set up a Proof of Concept. This allowed us to prove what type of data we could retrieve. And as training LUIS takes minimal effort, we could show the whole process in a matter of days. Once you have proof that a scenario works, you can move to a larger project to train a more complex model or retrieve more information using the AI services.
Originally posted at: https://www.cloudappie.nl/ai-classify-sharepoint-data/