break

Text Mining

This is a research paper I turned in for my Computing Seminar class at Harding University.

If you have any comments about my research, or find anything that is a concern please e-mail me at: daniel@danielisaacs.com.

Text Mining
Written by: Daniel Isaacs 

Introduction

As we enter the second decade of the World Wide Web, the textual revolution has seen a tremendous change in the availability of online information. Finding information for just about any need has never been more automatic, just a keystroke or mouse click away. While the digitalization and creation of textual materials continues at light speed, the ability to navigate, mine, or casually browse through documents too numerous to read (or print) lags far behind.

How can we make use of all the information we find on the internet, sorting out the useful from the not so useful? One way is to use keyword search engines but it has its limitations. In many languages (especially in English) the same word can have multiple, completely separate meanings. This fact brings a problem of context. A simple keyword search ignores the context and blindly returns just what was asked for. An alternative is to have human knowledge engineers sort through the information, and classify the documents into hierarchies. That would allow searches to browse through the hierarchies looking for items we want. This sort of manual hierarchical classification is extremely effective for retrieval but is enormously expensive in manpower.

What we need to do is to exploit the speed of computers, but give them the power to look beyond simple keyword searching. Text Mining provides this ability.

Text mining roots

The roots of what we now call text mining are deep in the area of information retrieval. Text mining methods like document classification which is similar in many ways to document indexing was extensively studied in the late 1950s and 1960s. (Weiss 13)

Representation of a document as a bag of words, each with an attached frequency measure, became popular by the 1970s. (Weiss 13) As artificial intelligence methods became more popular in the 1980s, they also were applied to problems we now classify as text mining. (Weiss 13) The real impetus for the kinds of applications we see today comes from two sources: the availability of cheap, fast computing and of enormous amounts of text in digital form. In the past, when text collections were formed by punching them into cards or paper tape before being read into an expensive computer with limited memory, only a few centers could do the kinds of experiments now conducted on desktop computer systems.

Much of the advancement of technology in text mining has come in connection with government sponsored challenge competitions, each capped by a conference in which the participants present their results.

What is text mining?

Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources (Hearst).

Text Mining is loosely defined as “data mining for text”. It involves using natural language processing and statistical/probabilistic techniques for finding nuggets and patters in text collections (Ford). A nugget is a new, unknown item where a pattern is the occurrence of a known item.

In text mining, the goal is to discover unknown information, something that no one yet knows and so could not have yet written down.

Text Mining is a variation on a field called data mining, which tries to find valuable patterns in data. The difference between regular data mining and text mining is that in text mining the patters are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is usually a collection of unstructured documents with no special requirements for composing the documents.

Text mining phases

Text mining can be visualized as consisting of two phases: text refining and knowledge distillation. Text refining transforms free form text documents into a chosen intermediate form. Knowledge distillation deduces patters or knowledge from the intermediate form.

Intermediate form can be semi-structured or structured. Intermediate form can also be document based or concept based.

A document based intermediate form for each entity represents a document. (Tan) Mining a document based intermediate form deduces patterns and relationship across documents. Clustering and organizing documents and document classification are examples of mining from a document based intermediate form.

A concept based intermediate form; each entity represents an object or concept of interests in a specific domain. (Tan) Mining a concept based intermediate form derives pattern and relationship across objects or concepts. Data mining operations, such as predictive modeling and associative discovery, fall into this category.

A document based intermediate form can be transformed into a concept based intermediate form by realigning or extracting the relevant information according to the objects of interests in a specific domain. It follows that document based intermediate form is usually domain independent and concept based intermediate form is domain dependent. For example, given a set of news articles, text refining first converts each document into a document based intermediate form. One can then perform knowledge distillation on the document based intermediate form for the purpose of organizing the articles, according to their content, for visualization and navigation purposes. For knowledge discovery in specific domain, the document based intermediate form of the news articles can be projected onto a concept based intermediate form depending on the task requirement. For example, one can extract information related to “company” from the document based intermediate form and form a company database. Knowledge distillation can then be performed on the company database to derive company related knowledge.

Text mining methods

The methods that will be discussed apply to major areas in text mining which are: clustering and classification, information retrieval and trend detection. I do not pretend to be exhaustive, but rather have chosen to present some of the main approaches.

The first method that will be discussed is on the discovery of similar words from a large corpus, and it is applied to clustering and classification. One popular approach on the discovery of similar words is the SEXTANT System implemented by Gregory Grefenstette. SEXTANT can be run on any English text, without any pre-coding of domain knowledge or manual editing of the text (Berry 29). The input text passes through the following steps: morphological analysis, where each word is morphologically analyzed and looked up in a 100,000 word dictionary to find its possible parts of speech. The second step is grammatical disambiguation in which a stochastic parser assigns one grammatical category to each word in the text. The third step is noun and verb phrase splitting. Each sentence is divided into verb and noun phrases by a simple regular grammar. The fourth step is syntagmatic relation extraction, in which a four pass algorithm attaches modifiers to nouns, noun phrases to noun phrases and verbs to noun phrases. The fifth step is context isolation, where the modifying words attached to each word in the text are isolated for all nouns. Thus the context of each noun is given by all the words with which it is associated throughout the corpus. The last step is similarity matching. Contexts are compared by using similarity measures.

The second method discussed is vector space model which is applicable to information retrieval. It was originated by Gerard Salton and it represents documents as vectors in a vector space. Vector space model is also a way of representing documents through the words that they contain. It also allows decisions to be made about which documents are similar to each other and to keyword queries. (Berry 105)

Its basic idea is to represent each document as a vector of certain weighted word frequencies. In order to do so, the following parsing and extraction steps are needed. The first step is to extract all unique words from the entire set of documents. Then it eliminates non content bearing stop words such as “a”, “and”, “the”, etc. The third step is that for each document, it counts the number of occurrences of each word. The fourth step is using heuristic or information theoretic criteria, it eliminates non content bearing high-frequency and low-frequency words. The last step is that after the above elimination, suppose “w” unique words remain. Assign a unique identifier between 1 and “w” to each remaining word, and a unique identifier between 1 and “d” to each document. This steps outline a simple preprocessing scheme which yields the number of occurrences of word j in document i, say, fji, and the number of documents which contain the word j, say, dj . Using these counts, we can represent the i-th document as a w-dimensional vector xi as follows. For 1 <= j <= w, set the j-th component of xi, to be the product of three terms xji = tji * gj * si where tji is the term weighting component and depends only on fji, while gj is the global weighting component and depends on dj , and si is the normalization component for xi. Intuitively, tji captures the relative importance of a word in a document, while gj captures the overall importance of a word in the entire set of documents. The objective of such weighting schemes is to enhance discrimination between various document vectors for better retrieval effectiveness.

The third method discussed is the Zipf’s law which was named after the Harvard linguistic professor George Kingsley Zipf and it is related to trend and behavior detection from web queries. One of the enduring curiosities in text analysis and mining has been the fit of Zipf’s law and its variation. Because query statements submitted to the search engines were from diverse users and extremely short as a text unit, it is difficult to predict if the vocabulary will show similar statistical distributions.

The objective of this law is to know the frequency of use of the nth most frequently used word in any natural language is approximately inversely proportional to n. (Berry 178) Zipf’s law is an experimental law, not a theoretical one. Zipfian distributions are commonly observed in many kinds of phenomena. The causes of Zipfian distributions in real life are a matter of some controversy, however. Zipf’s law is often demonstrated by scatter plotting the data, with the axes being log (rank order) and log (frequency). If the points are close to a single straight line, the distribution follows Zipf’s law.

The classic case of Zipf’s law is a “1/f function”. Given a set of Zipfian distributed frequencies, sorted from most common to least common, the second most common frequency will occur 1/2 as often as the first. The third most common frequency will occur 1/3 as often as the first. The nth most common frequency will occur 1/n as often as the first.

Some examples that follow Zipf’s Law are the frequency of accesses to web pages, words in the English language, income distribution amongst individuals, etc.

Applicable areas

Text mining is applicable to areas such as: document classification, information retrieval, clustering and organizing documents, information extraction and prediction and evaluation.

Text categorization is the widely used, but ponderous, name for document classification. Documents are organized into folders, one folder for each topic. A new document is presented, and the objective is to place this document in the appropriate folders. (Weiss 53) For example, we might have a folder for household or financial documents and we want to add new documents to the correct folder. The application is almost always binary classification because a document can usually appear in multiple folders.

Originally, this type of problem was considered a form of indexing, much like the index of a book. As more and more documents have become available online, the applicability of this task has broadened. Some of the more obvious tasks are related to e-mail: for example, automatically forwarding e-mail to the appropriate company department or detecting spam mail.

Information retrieval is the topic most commonly associated with online documents. A collection of documents is obtained, we give clues as to the documents that we want to retrieve from the collection, and then documents matching the clues are presented as answers to our query. (Weiss 85) Two questions may arise, what are the clues, and how are they used to retrieve relevant documents? The clues are words that help identify the relevant stored documents. In a typical instance of invoking a search engine, a few words are presented, and these words are matched to the stored documents. The best matches are presented as the responses. The process can be generalized to a document matcher, where instead of a few words; a complete document is presented as a set of clues. The input document is then matched to all stored documents, retrieving the best-matched documents.

A basic concept for information retrieval is measuring similarity: a comparison is made between two documents, measuring how similar the documents are. For comparison, even a small set of words input into a search engine can be considered as a document that can be matched to others. From one perspective, measuring similarity is related to predictive methods for learning and classification that are called nearest-neighbor methods. The common theme is measuring similarity, and variations of these methods are fundamental to information retrieval.

For text categorization, we saw that the objective was to place new documents into the appropriate folders. These folders were created by someone with knowledge of the document structure, someone who knew the expected topics. However, what if we have a collection of documents with no known structure? For example, a company may have a help desk that receives and records calls by users of their products. The company might want to learn about the categories and types of complaints. The general objective of clustering and organizing documents is that given a collection of documents; find a set of folders such that each holds similar documents. (Weiss 106)

The clustering process is equivalent to assigning the labels needed for text categorization. Because there are many ways to cluster documents, it is not quite as powerful a process as assigning answers (i.e., known correct labels) to documents. Still, clustering can be insightful. By studying key words that characterize a cluster, a company could learn about its customers. In the help-desk example, a computer company might find that the largest cluster of complaints is for networking problems. It might also identify an unexpected type of problem, documents indicating many calls for help, for which they do not have a good solution.

Let’s concentrate on how text mining works on information extraction. Our representation of data looks at information in terms of words. This is a rudimentary formulation that is surprisingly successful for many applications. In comparison with classical numerical data mining representations, these measurements are very shallow. We are only measuring the presence or absence of words. A data mining representation may look the same, but the measurements themselves will be far broader and more complex. They may be real valued variables, such as sales volume, or a code, such as an industry code like “auto industry.” Someone is responsible for defining these attributes and depositing them in a database.

An alternative way of looking at these distinctions is that data mining expects highly structured data, and text is naturally unstructured. To make text structured, we have employed a very shallow representation that measures the simple occurrence of words. Information extraction is a subfield of text mining that attempts to move text mining onto an equal footing with the structured world of data mining. The objective of information extraction is to take an unstructured document and automatically fill in the values in a place we would like to store it, for example a spreadsheet. (Weiss 129)

A database, at least one organized by fields or tables, is structured. When the information is unstructured, such as the found in a collection of documents, then a separate process is needed to extract data from an unstructured format. For example, we may examine documents about companies and extract the sales volumes and industry codes from text that has not been structured for storage in a database. The attribute that is measured will not have a fixed position in the text and may not be described in the same way in different documents. For example, if we have chosen to use a spreadsheet model, the objective is to fill in the cells. The columns are not just words but can be higher level concepts that are found by the information extraction process. This process examines documents and fills in the cells, a process that is equivalent to populating a database table. Once the process is completed, the usual learning methods can be applied.

Lastly, let’s address prediction and evaluation. Our ultimate goal is prediction, projecting from a sample of prior examples to new unseen examples (Berry 173). The learning program studies documents and finds some generalized rules that will give correct answers on new examples. However, how do we know that we will be successful on new examples? The classic approach is to “hold out” some examples with known answers, not allowing the learning program to train on those examples. These new examples are used solely for evaluation. For many text mining scenarios, the holdout evaluation will be effective (i.e., assigning topics to news stories, such as financial or sports stories). Even so, there are special twists. News stories change over time, we must be sensitive to dates of publication in our selection of test documents.

One basic concepts of prediction is the measurement of error. For topic assignment, we can readily determine whether a program’s answer is right or wrong. The classical measures of accuracy will be applicable, but not all errors will be evaluated equally. That’s why measures of accuracy such as “recall” and “precision” are especially important to document analysis.

Conclusion

The large volumes of textual information proliferate at a disastrous rate, and the human eyes and brain are increasingly unable to meet the challenges of this growth. Nonetheless, text mining is a field that deals with this challenge. It helps us to extract the meaning of a text in a concise form, efficiently navigate through large text bases and to perform natural language information retrieval. However, they are some problems that need to be solved. It remains a challenge to see how semantic analysis can be made much more efficient and scalable for every large text corpora. Whereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop algorithms that process multilingual text documents and produce language independent intermediate forms. Mankind is already searching for intelligent electronic assistants that can help in text analysis, which in my opinion will mean that in a near future, text mining will be something common in everybody’s life.

 

Bibliography

Berry, Michael. Survey of Text Mining: Clustering, Classification, and Retrieval. New York: Springer, 2005.

Ford, Roger. “Text Mining with Oracle Text”. ZDNet. Oracle. January 2005.

< http://whitepapers.zdnet.co.uk/0,39025945,60091534p-39000454q,00.htm>

Hearst, Martin. “Untangling Text Data Mining”. The ACM Digital Library. 26 June 1999.

<http://portal.acm.org/citation.cfm?id=1034678.10346797coll=GUIDE&dl=GUIDE &type=series&idx=1034678∂=Proceedings&WantType=Proceedings&
&title=Annual%20Meeting%20of%20the%20ACL&CFID=39167266&
&CFTOKEN=3048127>

Nasukawa, T and Nagano, T. “Text analysis and knowledge mining system”. IBM Systems Journal. 26 June 2001.

< http://www.research.ibm.com/journal/sj/404/nasukawa.html>

Tan, Ah-Hwee. “Text Mining: The state of the art and the challenges”. Electronic Warfare Associates, Inc. 18 May. 2001.

< http://www.ewastrategist.com/papers/text_mining_kdad99.pdf>

Weiss, Sholom, et al. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer, 2005.