Text Mining

Published by James Taylor

INTRODUCTION

As technology advances, more and more data is becoming available in digital form. Most of this data, however, is unstructured text, which makes it essential to develop better techniques for extracting interesting and useful information from bulk textual data (Miner, 2012). The process through which this information is extracted is referred to as text mining. It is important to note that text mining should not be confused with data mining, since the two are distinct disciplines. The process involves several stages, such as text pre-processing, text cleanup and post-processing. It is on this premise that text mining and text analytics have become an essential aspect of research. In this chapter, I will discuss a methodology for extracting interesting and useful news from a database. Existing text mining procedures and algorithms will be the primary areas of concern in the project.

By definition, text mining involves a system that analyzes large quantities of natural language text and detects lexical or linguistic usage patterns in an attempt to extract useful or valuable information. In this case, the text mining process is used to retrieve emerging news from a news database, and information retrieval plays a significant part in this process (Berry & Kogan, 2010). The process has social and economic benefits. For instance, by mining emerging news, society can learn of emerging trends in the economy or the social world and take advantage of them. Through text mining and analytics, I will be able to extract new knowledge and hidden insights from the large data set that may be of great importance to society (Weiss, 2005).

BACKGROUND

The background of this work is presented in two subsections. The first presents association rules and clustering, while the second is concerned with frequent pattern mining.

Association rule and clustering

Association rules have been used in text mining for a long time. The concept helps uncover relationships between seemingly unrelated data in a relational database or other information repository. Association rules play a significant role in text classification, and they are an important technique for deriving feature sets from pre-classified text documents. The approach is based on the assumption that there are relationships between texts in a database or text repository: there are instances where, if an antecedent X occurs, a consequent Y is also likely to occur (Tagarelli, 2012). Association rules are created by analyzing data for frequent if/then patterns and using the criteria of support and confidence to identify the most significant relationships.

Identifying the relationships that exist in a database can be very instrumental in decision making. In this case, data classification and clustering will help determine emerging news from a news database. An association rule is an implication of the form X ⇒ Y. Each database has a set of items, and association rules help identify the frequency of every itemset in the database. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset (Thaicharoen, 2009). Frequent itemsets are essential in the clustering process, and identifying existing relationships in the database is an important aspect of clustering algorithms.
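The support and confidence criteria mentioned above can be illustrated with a minimal sketch. It assumes each news headline is treated as one transaction (a set of words); the function names and the sample headlines are hypothetical and serve only as illustration.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent.

    Defined as support(antecedent and consequent) / support(antecedent).
    """
    antecedent_count = support_count(antecedent, transactions)
    if antecedent_count == 0:
        return 0.0
    return support_count(antecedent | consequent, transactions) / antecedent_count


# Hypothetical transactions: one set of words per headline.
headlines = [
    frozenset({"syrian", "conflict", "displaced"}),
    frozenset({"syrian", "conflict", "talks"}),
    frozenset({"market", "rally"}),
    frozenset({"syrian", "aid"}),
]

x = frozenset({"syrian"})    # antecedent X
y = frozenset({"conflict"})  # consequent Y

rule_support = support_count(x | y, headlines)  # 2 headlines contain both
rule_confidence = confidence(x, y, headlines)   # 2 of the 3 "syrian" headlines
```

Here the rule {syrian} ⇒ {conflict} has a support count of 2 and a confidence of 2/3, since two of the three headlines containing "syrian" also contain "conflict".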

Frequent pattern mining

In text mining, frequent pattern mining (FPM) is among the most intensively investigated areas of algorithmic development. For a long time, data mining has faced numerous challenges as far as the computation of these algorithms is concerned, and improving the computational efficiency of FPM algorithms has been the major area of concern. To understand the concept clearly, one must first understand the primitive baseline algorithm that underlies most frequent pattern mining algorithms (King, 2006).

The following analysis is important for understanding the concept of frequent pattern mining. It also gives insight into how the concept relates to association rules in data mining.

Let T = {T1, T2, ..., Tn} be a transaction database, where each transaction Ti ∈ T, for i = 1, ..., n, consists of a set of items, say Ti = {x1, x2, x3, ..., xl}. A set P ⊆ Ti is called an itemset. The size of an itemset is the number of items it contains. We refer to an itemset as an l-itemset (or l-pattern) if its size is l. The number of transactions containing P is referred to as the support of P. A pattern P is defined to be frequent if its support is at least equal to the minimum threshold (Thaicharoen, 2009).
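These definitions can be made concrete with a small brute-force sketch. It enumerates every possible itemset and keeps those whose support reaches the minimum threshold. This is only practical for tiny examples; the function name and the sample database are hypothetical.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):           # size l of the l-itemset
        for candidate in combinations(items, size):
            pattern = frozenset(candidate)
            # Support: number of transactions that contain the pattern.
            support = sum(1 for t in transactions if pattern <= t)
            if support >= min_support:
                result[pattern] = support
    return result


# Hypothetical transaction database T with n = 4 transactions.
T = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

frequent = frequent_itemsets_bruteforce(T, min_support=2)
```

With a minimum threshold of 2, all three 1-itemsets (support 3) and all three 2-itemsets (support 2) are frequent, while the 3-itemset {a, b, c} appears in only one transaction and is not.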

Frequent patterns satisfy a downward closure property, according to which every subset of a frequent pattern is also frequent. This is because if a pattern P is a subset of a transaction T, then every pattern P′ ⊆ P is also a subset of T. Therefore, the support of P′ can be no less than that of P. The space of exploration of frequent patterns can be arranged as a lattice in which every node is one of the 2^d possible itemsets over d distinct items, and an edge represents an immediate subset relationship between two itemsets. It is important to understand that frequent patterns are often used to generate association rules.
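The downward closure property is what allows a level-wise search of the lattice. The sketch below follows the general Apriori idea: candidates of size k are built only from frequent itemsets of size k - 1, and a candidate is discarded without counting if any of its (k - 1)-subsets is not frequent. It is a simplified illustration, not an optimized implementation, and the function name and sample data are hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise frequent itemset search using downward closure for pruning."""
    transactions = [frozenset(t) for t in transactions]

    def support(pattern):
        return sum(1 for t in transactions if pattern <= t)

    # Level 1: frequent single items.
    items = set().union(*transactions)
    level = {frozenset({i}) for i in items if support(frozenset({i})) >= min_support}
    frequent = {p: support(p) for p in level}

    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Count step: keep only candidates that reach the threshold.
        level = {c for c in candidates if support(c) >= min_support}
        frequent.update({p: support(p) for p in level})
        k += 1
    return frequent


# Hypothetical database: "c" and "d" are infrequent, so no candidate
# containing them is ever generated.
database = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d"}]
found = apriori_sketch(database, min_support=2)
```

For this database the search finds {a} with support 3, and {b} and {a, b} with support 2 each. The pruning step is the design choice that matters here: it avoids scanning the database for itemsets that cannot be frequent.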

OUR APPROACH

As a first step toward finding emerging news, we require online RSS news feeds. We therefore store the news in database tables for pre-processing. Fig. 1 gives an overview of the system.

To find emerging news, we need to convert the text news headlines into numeric data. In pre-processing we maintain a dictionary of distinct words; some words are already in that dictionary. Each news feed is parsed into words, and any stop word is dropped by the stop word removal module. Each remaining word is checked against the dictionary (Miner, 2012). If the word already exists in the dictionary, its ID is extracted; if not, a new ID is assigned to the word and the word is added to the dictionary (Fig. 2).
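The pre-processing step described above can be sketched as follows. The stop word list, the punctuation handling and the ID scheme (the next unused integer, assuming IDs run contiguously from 1) are assumptions made for illustration; the text does not specify them.

```python
# Hypothetical stop word list; a real module would use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "has", "is", "to"}


def encode_headline(headline, dictionary, stop_words=STOP_WORDS):
    """Convert a headline to a list of word IDs, growing the dictionary as needed."""
    ids = []
    for word in headline.lower().split():
        word = word.strip(".,;:!?\"'")
        if not word or word in stop_words:
            continue  # dropped by the stop word removal step
        if word not in dictionary:
            # New word: assign the next ID (assumes contiguous IDs from 1).
            dictionary[word] = len(dictionary) + 1
        ids.append(dictionary[word])
    return ids


# The dictionary already holds some words before the feed is processed.
dictionary = {"conflict": 1}
ids = encode_headline("The Syrian conflict has displaced millions", dictionary)
```

After the call, `ids` is `[2, 1, 3, 4]`: "conflict" keeps its existing ID 1, while "syrian", "displaced" and "millions" receive the new IDs 2, 3 and 4.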

After pre-processing, the news feeds are in numeric form. These numbers are passed on to the text mining engine (Fig. 3), whose module generates candidates from the news. To find the frequent patterns, it counts how often each word occurs in the news (Srivastava & Sahami, 2009). The IDs of the words with the highest frequency are extracted, and the frequent items are sorted in ascending order.
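A minimal sketch of this counting step is given below. It assumes that a word is counted once per headline and that "highest frequency" means reaching a chosen minimum count; both are assumptions, as are the function name and the sample data.

```python
from collections import Counter


def frequent_word_ids(encoded_headlines, min_count):
    """Return the IDs occurring in at least min_count headlines, in ascending order."""
    counts = Counter()
    for ids in encoded_headlines:
        counts.update(set(ids))  # assumption: count each word once per headline
    return sorted(word_id for word_id, count in counts.items() if count >= min_count)


# Hypothetical encoded headlines (lists of word IDs) from several sources.
encoded = [
    [2, 1, 3, 4],
    [2, 1, 5],
    [6, 7],
    [2, 8],
]

top_ids = frequent_word_ids(encoded, min_count=2)
```

Here ID 2 occurs in three headlines and ID 1 in two, so `top_ids` is `[1, 2]`.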

The output of the text mining engine is the top frequent items. All the frequent items from all the sources are listed in this module. In post-processing, the actual words corresponding to the frequent item IDs are extracted from the dictionary, sorted according to their IDs, and assembled.

The words are assembled, the stop words are added back, and the result is matched against the actual news feeds. The matching news is the emerging news across all the news sources.
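The post-processing steps can be sketched as below. The matching criterion is an assumption: a headline is treated as a match when it contains all of the frequent words, ignoring case and punctuation. The function names and sample data are hypothetical.

```python
def decode_ids(frequent_ids, dictionary):
    """Map frequent item IDs back to words, ordered by ID."""
    id_to_word = {word_id: word for word, word_id in dictionary.items()}
    return [id_to_word[i] for i in sorted(frequent_ids) if i in id_to_word]


def matching_headlines(frequent_words, headlines):
    """Return the original headlines that contain every frequent word."""
    wanted = set(frequent_words)
    matches = []
    for headline in headlines:
        words = {w.strip(".,;:!?\"'").lower() for w in headline.split()}
        if wanted <= words:
            matches.append(headline)
    return matches


# Hypothetical dictionary and feed contents.
word_dictionary = {"conflict": 1, "syrian": 2, "displaced": 3, "millions": 4}
feed = [
    "The Syrian conflict has displaced millions",
    "Talks on the Syrian conflict resume",
    "Markets rally on earnings",
]

words = decode_ids([2, 1], word_dictionary)
emerging = matching_headlines(words, feed)
```

With the frequent IDs 1 and 2, `words` is `["conflict", "syrian"]` and `emerging` contains the first two headlines, with their stop words intact because the original text is returned.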

RELATED WORKS

This section first looks further at association rules and then at the applications of frequent pattern mining.

Association rule and clustering

Association rules are a critical element in generating associative word sets. It is, however, important to understand that the clustering process involves numerous challenges. To solve the problem, it is critical to deal with the weaknesses of the Apriori algorithm, which is commonly used in frequent itemset clustering (Xu, Yasinzai & Lev, 2013). Association rule mining provides the basis for efficient clustering in Web document clustering. Other procedures, such as the Multi-Tire Hashing Frequent Termsets (MTHFT) algorithm, have also been devised to improve the efficiency of mining association rules by improving the mining of frequent termsets. The performance of association-rule-based clustering can be evaluated with measures such as F-measure, recall, and precision, and compared on these measures with other clustering algorithms.
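As a brief illustration of these evaluation measures, the sketch below computes precision, recall and the balanced F-measure for a single cluster compared against a single reference class. Evaluating a full clustering usually aggregates such values over all clusters; that aggregation is not shown. The document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and balanced F-measure of one set against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cluster produced by an algorithm and the reference class.
cluster = {"d1", "d2", "d3", "d4"}
reference = {"d1", "d2", "d5"}

p, r, f = precision_recall_f1(cluster, reference)
```

Two of the four clustered documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3), which gives an F-measure of 4/7.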

Application of frequent pattern mining (FPM)

Frequent pattern mining has a broad range of applications, including classification, clustering, software bug detection, recommendations, and other broad problems. Over the years, FPM has been used as a primary source of insights that support pattern-centered conclusions. The concept has been widely used in market basket analysis as well as in risk analysis (King, 2006). It has also been used in commercial environments, clinical settings, medicine and crime prevention, to mention a few. Frequent pattern mining makes it possible to predict the consequent once the antecedent is known.

It is also worth looking at other applications that are relevant to this methodology. For instance, FPM can be of great importance in network forensic analysis, where there are proposals to implement the analysis by applying the Apriori algorithm. Network forensics is an important part of protecting a network against intrusion (Xu, Yasinzai & Lev, 2013). Large amounts of data are captured and analyzed in network forensics. After network data packets are captured and filtered, the Apriori algorithm is used to mine association rules according to the relevance of the evidence in order to build and update a signature database of offenses; this significantly reduces the number of matching operations and improves the efficiency of crime detection (Kanellis, 2006). Simulation results show that applying the Apriori algorithm can raise the speed, accuracy and intelligence of data analysis for network forensics, and that the approach can help address the problems of real-time operation, efficiency and adaptability in network forensics. The concept is also highly applicable to the prevention of cybercrime.

PROJECT EVALUATION

The approach used in the project involves mining emerging news from a set of news items. The process uses online RSS news feeds to gather emerging news from leading media platforms such as CNN and Reuters. RSS is significant in forming the dataset on which the mining process operates. Using RSS's XML-based format, it is possible to form a dataset in which news is stored in a semi-structured format from which emerging news can be derived. It is worth noting that, despite the presence of online aggregators, sorting out emerging news from this dataset is not an easy process (Kanellis, 2006). Forming the dataset is also complex, since information gathered from the online tools is generated at intervals, and it is important to ensure that the mining process leads to the desired results.
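To show how the XML-based RSS format yields semi-structured records, the sketch below parses a small RSS 2.0 document with Python's standard library. The feed content is made up for illustration, and fetching a live feed over the network is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed would be downloaded from the publisher.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>The ongoing Syrian conflict has displaced millions</title>
      <pubDate>Mon, 06 Jan 2014 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Markets rally on earnings</title>
      <pubDate>Mon, 06 Jan 2014 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Extract the title and publication date of every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return records


news_items = parse_rss_items(SAMPLE_RSS)
```

Each record is a small dictionary that could be written to a database table as one row, which is the form the later pre-processing stage works on.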

The data input process is also of great importance in our approach. To subscribe to an online RSS news feed, we need a news aggregator or feed reader. With the help of a feed reader, it is possible to subscribe to and view as many news feeds as needed, which makes our experiment worthwhile. The news reader retrieves news updates automatically, so updates are delivered as soon as they are published. To make the process more effective, we will use a web-based feed reader that is compatible with our browser, to enhance effectiveness and efficiency in the extraction of emerging news (Xu, Yasinzai & Lev, 2013). It is also worth noting that the RSS feed reader lets us receive only the required news, and only in a formatted form.

All news received by the news reader is stored in a semi-structured form in the database, and the RSS items are sorted into different sets for pre-processing. This is an experimental process that involves the use of keywords to derive the ID of each news item. The sorting of news items uses the hottest news headline received by our browser. In this case, the process involves extracting news regarding the current state of affairs in Syria. To get the essential words, stop words such as "the", "on" and "has" are dropped. Take the news item "the ongoing Syrian conflict has displaced millions". After the removal, the remaining words are "Syrian", "conflict", "displaced" and "millions". This makes it easier for the text mining engine, which determines the frequency of news items, to extract the news item.
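The worked example above can be reproduced with a few lines of code. The stop word list here is hypothetical; it includes "ongoing" only so that the output matches the remaining words listed above, since the text does not give the full list.

```python
# Hypothetical stop word list; "ongoing" is included as an assumption so the
# result matches the remaining words given in the text.
EXAMPLE_STOP_WORDS = frozenset({"the", "on", "has", "ongoing"})


def remove_stop_words(headline, stop_words=EXAMPLE_STOP_WORDS):
    """Return the words of a headline that are not stop words, in original order."""
    return [word for word in headline.split() if word.lower() not in stop_words]


remaining = remove_stop_words("the ongoing Syrian conflict has displaced millions")
```

With this list, `remaining` is `["Syrian", "conflict", "displaced", "millions"]`.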

The text mining engine is also a significant tool in the analysis and accounts for the parameter settings of our approach. After sentence splitting, the next step in the experiment is tokenization. This is the stage at which the hot news items targeted by the mining process are generated. Once the hot news items are generated, we set the period over which frequency is measured, for example three hours. With a specific window of three hours, one can determine which patterns occur frequently within it (Miner, 2012). It is on this basis that we can sort the frequent items generated by the text mining engine. In essence, the approach extracts emerging news from an RSS news feed stored as XML in a database, by means of a text mining engine.
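The role of the time window can be sketched as follows. The sketch assumes each news item has already been tokenized and carries a publication time, that a word counts once per item, and that a word is "hot" when it appears in at least a chosen number of items inside the window. The function name, the threshold and the sample timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def hot_words(items, now, window=timedelta(hours=3), min_count=2):
    """Words that occur in at least min_count items published within the window."""
    start = now - window
    counts = Counter()
    for published, words in items:
        if start <= published <= now:
            counts.update(set(words))  # assumption: count once per news item
    return sorted(word for word, count in counts.items() if count >= min_count)


# Made-up tokenized items with publication times.
now = datetime(2014, 1, 6, 12, 0)
tokenized_items = [
    (datetime(2014, 1, 6, 9, 30), ["syrian", "conflict", "displaced"]),
    (datetime(2014, 1, 6, 11, 0), ["syrian", "conflict", "talks"]),
    (datetime(2014, 1, 6, 8, 0), ["syrian", "aid"]),  # older than three hours
    (datetime(2014, 1, 6, 11, 30), ["market", "rally"]),
]

hot = hot_words(tokenized_items, now)
```

Only the items published between 09:00 and 12:00 are counted, so `hot` is `["conflict", "syrian"]`. Changing `window` or `min_count` is how the parameter settings of the approach would be adjusted in this sketch.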

CONCLUSION

The objective of this research paper is to give an in-depth analysis of the text mining process. As mentioned above, the modern world has experienced advances in technology, with more and more data available in digital form. Increased globalization has created demand for emerging news from all parts of the world. With most of this news in unstructured textual form, it is imperative that we design better techniques for extracting emerging and interesting news from bulk textual data (King, 2006). This calls for extensive data pre-processing and post-processing so that the emerging news can be used in the best interest of the community. The data mining process is nevertheless not easy and poses a significant number of challenges. However, with a good approach, a comprehensive text mining process can be completed. To sum up, it is possible to identify hot news items whose frequency exceeds a given threshold in dynamic datasets.

References

Berry, M., & Kogan, J. (2010). Text mining. Chichester, U.K.: Wiley.

Kanellis, P. (2006). Digital crime and forensic science in cyberspace. Hershey, PA: Idea Group Pub.

King, S. (2006). Optimizations and applications of Trie-Tree based frequent pattern mining.

Miner, G. (2012). Practical text mining and statistical analysis for non-structured text data applications. Amsterdam: Elsevier/AP.

Srivastava, A., & Sahami, M. (2009). Text mining. Boca Raton, FL: CRC Press.

Tagarelli, A. (2012). XML data mining. Hershey, PA: Information Science Reference.

Thaicharoen, S. (2009). Text association mining with cross-sentence inference, structure-based document model and multi-relational text mining.

Weiss, S. (2005). Text mining. New York: Springer.

Xu, J., Yasinzai, M., & Lev, B. (2013). Proceedings of the sixth International Conference on Management Science and Engineering Management. London: Springer.
