What is it really all about?
Vocabulary Building
Enterprise vocabularies are typically built to suit the content used by the organization. In the process of creating or maintaining a vocabulary, terms are added or modified over time as new concepts are discovered or needs are identified. While text analytics can be viewed as an exploratory tool for content, analyzing known content for terminology to include in enterprise vocabularies is a way of reinforcing established concepts.
Deliberately selecting a small body of known training content to match to existing vocabulary terms is a way of ensuring the taxonomy is still providing adequate coverage for existing content. By the same token, known content can be used to identify and extract unknown concepts to build out the taxonomy in areas which are undeveloped. Either method is a way to build and maintain the enterprise taxonomy for use in categorizing content.
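Both uses of known content can be sketched in a few lines of Python. The vocabulary, stopword list, sample documents, and function names below are hypothetical and chosen only for illustration. A real text analytics tool would use far richer linguistic processing than this whole-word matching.

```python
import re
from collections import Counter

# Hypothetical vocabulary and training content, for illustration only.
VOCABULARY = {"interest rate", "inflation", "central bank"}
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "as", "on", "is", "its"}
DOCUMENTS = [
    "The central bank raised the interest rate as inflation climbed.",
    "Rising inflation pushed mortgage costs higher.",
    "Mortgage lenders tightened their rules.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def matched_terms(text, vocabulary):
    """Return the vocabulary terms that appear as whole-word phrases in text."""
    normalized = " ".join(tokenize(text))
    return {
        term for term in vocabulary
        if re.search(rf"\b{re.escape(term)}\b", normalized)
    }


def coverage(documents, vocabulary):
    """Fraction of documents matched by at least one vocabulary term."""
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if matched_terms(doc, vocabulary))
    return hits / len(documents)


def candidate_terms(documents, vocabulary, min_docs=2):
    """Words outside the vocabulary that occur in at least min_docs documents."""
    vocab_words = {word for term in vocabulary for word in term.split()}
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)) - STOPWORDS - vocab_words)
    return sorted(word for word, count in doc_freq.items() if count >= min_docs)


if __name__ == "__main__":
    print(f"coverage: {coverage(DOCUMENTS, VOCABULARY):.2f}")
    print("candidates:", candidate_terms(DOCUMENTS, VOCABULARY))
```

A low coverage figure suggests the taxonomy no longer fits the content. The candidate list is only a starting point for a taxonomist to review, not a set of terms to add automatically.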
Auto-Categorization
One of the most common enterprise use cases for text analytics is the auto-categorization of content. While using controlled vocabularies offers benefits in applying consistent metadata to content for identification and retrieval, the barrier to taxonomy adoption has often been the labor-intensive work of building, maintaining, and applying taxonomies and their values to content. Tools may be getting better at the automatic creation of taxonomies, but few, if any, generate a taxonomy in a complete and usable state. What is automated, however, is the application of taxonomy values as metadata to content.
Auto-categorization works best with known vocabularies and known content. For example, a news publisher may write and publish news articles which need to be discoverable on a publicly accessible website. Onsite search or web search applications index content and make it discoverable. Most work from the content itself and use various methods to rank the page for returned results. The most notable, of course, is Google’s PageRank. Although modern search engines are far more sophisticated than simply matching keywords, embedded meta tags which describe the document as a whole are best supplied from a common vocabulary so they are consistent across all content on the site and even between sites. The rapid velocity of news story generation, publication, and sharing requires meta tags to be applied more quickly than is practical by hand. Thus, using a text analytics tool to identify and match concepts appearing in the content to controlled vocabulary concepts speeds the application of metadata. As content changes between versions and over time, the tagging taxonomy is also continually updated to cover new concepts. Likewise, the rules-based categorization in the text analytics tool continues to evolve so that concepts which are directly or even indirectly found in the text can be tagged.
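As a rough illustration of rules-based categorization, the Python sketch below maps trigger phrases found in an article to preferred vocabulary labels and renders them as a keywords meta tag. The rule table, labels, and function names are assumptions made for this example and do not reflect any particular product. Commercial tools typically add stemming, weighting, and disambiguation on top of this kind of matching.

```python
import re
from html import escape

# Hypothetical controlled vocabulary: preferred label -> trigger phrases.
# Triggers include the label itself, synonyms, and indirect cues.
TAG_RULES = {
    "Elections": ["election", "ballot", "polling station"],
    "Monetary Policy": ["monetary policy", "interest rate", "central bank"],
    "Weather": ["weather", "hurricane", "heat wave"],
}


def tag_article(text, rules=TAG_RULES, min_hits=1):
    """Return the sorted labels whose trigger phrases occur in the text."""
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    tags = []
    for label, phrases in rules.items():
        # Count whole-phrase matches, allowing a simple plural "s".
        hits = sum(
            len(re.findall(rf"\b{re.escape(phrase)}s?\b", normalized))
            for phrase in phrases
        )
        if hits >= min_hits:
            tags.append(label)
    return sorted(tags)


def meta_keywords(tags):
    """Render the tags as an HTML keywords meta tag."""
    content = escape(", ".join(tags), quote=True)
    return f'<meta name="keywords" content="{content}">'


if __name__ == "__main__":
    article = (
        "Voters queued at polling stations as the central bank "
        "held its interest rate."
    )
    print(meta_keywords(tag_article(article)))
```

Here the article never uses the words "elections" or "monetary policy", yet both labels are applied because the indirect cues "polling stations" and "central bank" are listed as triggers. Keeping the rules in a table separate from the matching code makes it easier to update them as the taxonomy changes.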
