A Survey on Sentiment Analysis and Opinion Mining

Computer Engg. Department, YMCA University of Science & Technology, Faridabad

ABSTRACT

Sentiment analysis, or opinion mining, is the computational study of people's opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics and their attributes [1]. It is an information extraction task that is technically very challenging but also practically very useful. With the advent of Web 2.0, huge volumes of opinionated text are available on the web, so automated opinion mining systems are needed to extract sentiment about an object from it. Existing techniques for sentiment analysis include machine learning and lexicon-based approaches. This paper presents the existing techniques and work done for sentiment analysis to date, together with the open issues in this field and future research prospects in this area.

1. INTRODUCTION

In today's web world, the available textual information can be broadly categorized into two classes: facts and opinions. Facts are objective statements about events and objects. Opinions, or sentiments, are subjective statements that reflect a person's perception of an event or an object. Sentiment analysis is the technique of extracting subjective information from text and determining the overall contextual polarity, or opinion, of the writer. Bing Liu defined it as follows: given a set of evaluative text documents D that contain opinions (or sentiments) about an object, opinion mining or sentiment analysis aims to extract attributes and components of the object that have been commented on in each document d ∈ D and to determine whether the comments are positive, negative or neutral [1]. Sentiment analysis is technically very challenging but also practically very useful.
For instance, businesses find it useful to know public or consumer opinion about their products and services, so as to understand the reasons for their profits or losses. Likewise, a customer who wants to buy a product would like to know the opinions of its existing users. In general, sentiments can be expressed on almost anything: a product, a service, an event, or an individual. The entity that has been commented on is termed an object. An object O is an entity which can be a product, topic, person, event, or organization. It is associated with a pair O: (T, A), where T is a hierarchy or taxonomy of components (or parts) and sub-components of O, and A is a set of attributes of O; each component has its own set of sub-components and attributes [1]. In this hierarchy, the object is the root, and each non-root node is a component or sub-component of the object. A link between two nodes reflects a part-of relationship, and each node in the tree is associated with a set of attributes. An opinion can be expressed on any node and on any particular attribute of that node. To simplify, both components and attributes are referred to by the word "feature". More specifically, a sentiment or opinion can be defined as a quintuple (o_j, f_jk, so_ijkl, h_i, t_l), where o_j denotes the target object, f_jk denotes the feature f_k of object o_j, h_i denotes the holder of the opinion, t_l is the time when the opinion was expressed, and so_ijkl denotes the sentiment value of the opinion held by h_i on feature f_jk of object o_j at time t_l; so_ijkl can be positive, negative or neutral, or a more granular polarity rating.

International Journal of Innovations & Advancement in Computer Science

In general, to find the sentiment of an object, the sentiment words about the features of that object are first identified.
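As a purely illustrative sketch, the quintuple just defined can be represented directly as a record; all field values below are invented:

```python
from collections import namedtuple

# The sentiment quintuple from the text: (object, feature, sentiment
# value, opinion holder, time). Field names follow the text's notation.
Opinion = namedtuple("Opinion", ["obj", "feature", "so", "holder", "time"])

# A hypothetical opinion: a user praising a phone's battery life.
op = Opinion(obj="phone", feature="battery life",
             so="positive", holder="user_42", time="2015-06-01")
```

In practice each opinionated sentence yields one such quintuple, and downstream aggregation can then group opinions by object and feature.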
Various features are then assigned weights, either on the basis of their polarity alone or on the basis of both polarity and importance. Finally, the polarity weights of all the features of the object are aggregated to obtain the overall opinion about the object. Basic approaches for measuring sentiment are broadly classified into machine learning methods and lexicon-based approaches.

The rest of the paper is organized as follows: Section 2 presents the various levels at which sentiments are classified, and hence the levels at which sentiment analysis can be performed. Section 3 describes the literature survey of sentiment analysis techniques. Section 4 puts forth the challenges in sentiment analysis. Concluding remarks are given in Section 5.

2. SENTIMENT CLASSIFICATION

Sentiment classification, or polarity classification, is a technique for analyzing subjectivity in a large number of texts, wherein each piece of text is labeled positive, negative or neutral according to the overall opinion it expresses. Sentiment classification can be done at any of three levels: the document level, the sentence level, or the feature level. The higher the level, the more difficult sentiment analysis becomes, because a larger unit of text mixes more topics and more opinions.

Document Level Sentiment Classification: In document-level sentiment analysis, the sentiment of the entire document is summarized as positive, negative or objective. The main challenge here is to extract the subjective text from which the overall sentiment of the whole document is inferred [2]. Document-level sentiment classification assumes that each document focuses on a single object and contains opinions from a single opinion holder.

Sentence Level Sentiment Classification: In sentence-level sentiment analysis, the individual sentiment-bearing sentences in the text are classified.
This is a finer-grained level than document-level sentiment classification, in which each sentence can be given any of the three polarities: positive, negative or neutral [2]. Sentence-level polarity identification can be done in either of two ways: a grammatical-syntactic approach or a semantic approach [33]. The grammatical-syntactic approach takes the grammatical structure of the sentence into account by considering part-of-speech tags, whereas the semantic approach uses the frequency of positive and negative words to determine the overall polarity of the sentence.

Feature Level Sentiment Classification: Product attributes or components are referred to as product features, and the analysis of all such features in a document or sentence is called feature-level sentiment analysis. In feature-level sentiment classification, the opinion is determined from the already extracted features [3].

3. SENTIMENT ANALYSIS APPROACHES

The literature survey indicates two major families of techniques for sentiment analysis: machine learning techniques and lexicon-based techniques. Hybrids of the two have also been tried and tested, to obtain the best of both worlds [20] [21] [22]. A summary of the studied articles is shown in Table 1.

3.1. MACHINE LEARNING APPROACH

Just as humans learn from past experience, a computer learns from data, which represents past experience of the application domain. Arthur Samuel (1959) defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. Machine learning methods for sentiment analysis usually rely on supervised classification, in which labeled data is used to train classifiers.
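The supervised setup can be sketched end to end as follows; the labeled reviews, the holdout split, and the stand-in keyword classifier are all invented for illustration:

```python
# Hedged sketch of supervised sentiment classification: labeled examples
# are split into a training portion and a held-out test portion used to
# estimate accuracy. Any classifier fits in this slot; a trivial keyword
# rule stands in for one here.
labeled = [("great movie", "pos"), ("awful plot", "neg"),
           ("loved it", "pos"), ("boring film", "neg")]

train, test = labeled[:3], labeled[3:]   # simple holdout split

def classify(text):
    # stand-in for a trained classifier
    return "neg" if any(w in text for w in ("awful", "boring")) else "pos"

accuracy = sum(classify(t) == y for t, y in test) / len(test)
```

A real system would learn `classify` from `train` rather than hard-code it; the point is only the train/test separation described next.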
In supervised machine learning, two data sets are required: training data and test data. The training data contains a set of training examples, each a pair of an input object and a desired output value; a function is inferred from these labeled examples. The test data is unseen data used to validate the classifier's accuracy. A machine learning pipeline thus starts by aggregating a training dataset and then training the classifier on it. Once a supervised classification approach is selected, the most important decision is feature selection. In sentiment classification, the most commonly used features are the following:

Term presence and frequency: This feature covers unigrams or n-grams and their presence or frequency. For movie review sentiment analysis, Pang et al. [14] claimed that unigrams give better performance than bigrams; however, Dave et al. [16] claimed that bigrams and trigrams give much better polarity classification for product reviews.

Part-of-speech information: Tagging words with their POS tags helps in disambiguating sense, which in turn helps to guide feature selection [17]. In POS tagging, each word in a sentence is assigned a label that represents the role of the word in the grammatical context. For instance, POS tags can be used to identify adjectives and adverbs, which are extensively used as sentiment indicators [13].

Negations: Negation words are an important feature to take into consideration, because negation has the potential to reverse the sentiment [17].

Opinion words and phrases: Opinion words and phrases are those which express positive or negative sentiments. Lexicon-based and statistical methods are the main approaches to identifying the semantic orientation of an opinion word. For example, WordNet was used by Hu and Liu
[4] to determine the polarity of the extracted adjectives.

Machine learning techniques such as Naive Bayes, maximum entropy and support vector machines have achieved tremendous success in text categorization. Other well-known machine learning techniques in the natural language processing area are K nearest neighbors, ID3, C5, the centroid classifier, the winnow classifier and the n-gram model.

A. Naive Bayes Classifier

The Naive Bayes classifier is the simplest and most commonly used classifier for text classification. It computes the posterior probability of a class based on the distribution of the words in the document, ignoring the position of each word. Bayes' theorem is used to predict the probability that a given feature set belongs to a label:

P(label | features) = P(features | label) · P(label) / P(features)    (1)

Here P(label) denotes the prior probability of a label, P(features) denotes the prior probability that the given feature set occurs, and P(features | label) denotes the probability of the feature set given the label. Naive Bayes assumes that all features are independent, so for features f_1, ..., f_m the above equation can be rewritten as:

P(label | features) = P(label) · Π_i P(f_i | label) / P(features)    (2)

B. Maximum Entropy Classifier

The Maximum Entropy classifier (MaxEnt or ME) is an efficient technique which has proven effective in a number of natural language processing applications. Nigam et al. (1999) showed that MaxEnt sometimes, but not always, outperforms Naive Bayes in standard text classification. The Maximum Entropy classifier encodes labeled feature sets as vectors; the encoded vector is then used to compute weights for each feature, which are summed to determine the most likely label for a feature set.
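The decision rule just described, per-feature/class weights summed, exponentiated and normalized into class probabilities, can be sketched as follows; the weights below are invented rather than learned:

```python
import math

# Minimal MaxEnt-style scorer: sum the weights of the active
# (feature, class) pairs, exponentiate, and normalize over classes.
# WEIGHTS is an illustrative hand-set table, not a trained model.
WEIGHTS = {("great", "pos"): 1.2, ("great", "neg"): -0.8,
           ("dull", "pos"): -1.0, ("dull", "neg"): 0.9}

def maxent_probs(features, classes=("pos", "neg")):
    scores = {c: math.exp(sum(WEIGHTS.get((f, c), 0.0) for f in features))
              for c in classes}
    z = sum(scores.values())              # normalization term Z(d)
    return {c: s / z for c, s in scores.items()}

probs = maxent_probs({"great"})
```

The normalization by `z` plays the role of Z(d) in the formal statement that follows.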
Its estimate of P(c | d) takes the exponential form [8]:

P_ME(c | d) = (1 / Z(d)) · exp(Σ_i λ_{i,c} F_{i,c}(d, c))    (3)

where P_ME(c | d) is the probability of instance d being in class c, λ_{i,c} denotes a feature-weight parameter, Z(d) is a normalization function, and F_{i,c} denotes a feature/class function for an extracted feature f_i and class c:

F_{i,c}(d, c') = 1 if f_i occurs in d and c' = c, and 0 otherwise.    (4)

MaxEnt gives better performance when conditional independence assumptions are not met because, unlike Naive Bayes, it makes no assumptions about the relationships between features. The parameter values are set such that the entropy of the induced distribution is maximized (hence the classifier's name), subject to the constraint that the expected values of the feature/class functions with respect to the model equal their expected values with respect to the training data.

C. Support Vector Machines

Support Vector Machines (SVMs) are a supervised machine learning classification method that has shown high performance in traditional text categorization. A kernel function is used to map the input feature space into a new, linearly separable space. The basic idea of the training procedure is to find a maximum-margin hyperplane, represented by a vector, which not only separates the document vectors of one class from those of the other, but for which the separation, or margin, is as large as possible. This corresponds to a constrained optimization problem, letting c_j ∈ {1, -1} (corresponding to positive and negative) be the correct class of document d_j. Joachims (1999) implemented an SVM with a sparse vector representation; thanks to its optimization algorithm, this implementation can handle thousands of features and training instances. Experiments are typically run with default parameter settings, including a linear kernel, in line with the use of SVMs by Pang et al. (2002, 2004).

Performance Summary: The performance comparison done by Pang et al.
[14] for the three classifiers (Naive Bayes, Maximum Entropy and Support Vector Machines) in document-level sentiment classification considered features such as unigrams, bigrams, the combination of both, parts of speech combined with unigrams, adjectives only, adjectives combined with unigrams, and position information. It revealed that feature presence is more important than feature frequency. Also, Naive Bayes performs better than SVM when the feature set is small, whereas SVM performs better when the feature set is large. Maximum Entropy also performs better than Naive Bayes when the feature set size is increased, but it may suffer from overfitting.

3.2. LEXICON-BASED APPROACH

A piece of text contains opinion words, and may also contain opinion phrases and idioms; together these are termed the opinion lexicon and are used for sentiment classification. Lexicon methods vary according to the context in which they were created. The hypothesis is that, to determine the semantic orientation of an entire text, the semantic orientation of each individual word must first be available or be determined. More precisely, the lexicon-based approach assumes that the sum of the sentiment orientations of the individual words or phrases gives the total contextual sentiment orientation.

Advantage: Lexicon-based methods do not need any training data.

Drawback: It is hard to create a unique lexicon-based dictionary for each different context.

There are broadly three main methods for collecting or compiling the opinion word list. Among them, the manual approach is very tedious and time consuming and is therefore not used on its own; it is usually combined with one of the two automated approaches as a validation check, to avoid mistakes resulting from the automated methods.
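The core lexicon-based assumption, that the text's total orientation is the sum of its words' orientations, can be sketched as follows; the tiny lexicon is illustrative only:

```python
# Lexicon-based scoring as described in the text: the orientation of a
# piece of text is the sum of the orientations of its words. Words not
# in the lexicon contribute nothing. LEXICON is an invented toy example.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}

def lexicon_score(tokens):
    return sum(LEXICON.get(t, 0) for t in tokens)

# one positive and one negative word cancel out
score = lexicon_score("the screen is great but the battery is poor".split())
```

A positive total is read as a positive text, a negative total as negative, and zero as neutral.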
The two automated approaches are explained in the following subsections.

A. Dictionary based approach

Lexicon-based dictionary methods use a predefined dictionary in which each word is associated with a specific sentiment polarity and polarity strength. [8, 9] describe the main strategy of the dictionary-based method. A small set of sentiment words with known orientations is composed manually. This seed set is then grown by adding the synonyms and antonyms of its words, looked up in well-known resources such as WordNet [5] or a thesaurus [6]: opinion words exhibit the same orientation as their synonyms and the opposite orientation to their antonyms. After the newly found words are added to the seed set, the next iteration starts; the iterative process ends when no new words are found. Hu and Liu [4] used this approach, starting from a seed list of 30 adjectives, to find the semantic orientation of adjectives. Qiu and He [7] used a dictionary-based approach to extract sentiment sentences in contextual advertising data, proposing a strategy to improve advertisement relevance and user experience.

Drawback: The dictionary-based approach is unable to find sentiment words with domain- and context-specific orientations.

Advantage: This approach eliminates the need to learn from the data, since a pre-listed dictionary of words with their orientations is available.

B. Corpus-based approach

In this approach a huge corpus is used in order to cover all English words. A large domain-specific corpus makes it possible to find domain- and context-specific opinion words. The corpus-based method depends on syntactic patterns, along with a seed set of opinion words, to find other opinion words in the corpus. This method has been explored by Hatzivassiloglou and McKeown [15].
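Before turning to the details of the corpus-based method, the iterative expansion loop of the dictionary-based approach (subsection A) can be sketched as follows; the SYNONYMS and ANTONYMS tables are invented stand-ins for WordNet lookups:

```python
# Iterative dictionary expansion: start from seed words with known
# orientations (+1 / -1), add synonyms with the same orientation and
# antonyms with the opposite one, and repeat until nothing new appears.
SYNONYMS = {"good": ["fine", "nice"], "fine": ["pleasant"], "bad": ["awful"]}
ANTONYMS = {"good": ["bad"], "nice": ["nasty"]}

def expand(seeds):
    lexicon = dict(seeds)          # word -> orientation
    changed = True
    while changed:
        changed = False
        for word, polarity in list(lexicon.items()):
            for syn in SYNONYMS.get(word, []):
                if syn not in lexicon:
                    lexicon[syn] = polarity      # same orientation
                    changed = True
            for ant in ANTONYMS.get(word, []):
                if ant not in lexicon:
                    lexicon[ant] = -polarity     # opposite orientation
                    changed = True
    return lexicon

lex = expand({"good": 1})
```

Starting from the single seed "good", the loop reaches "pleasant" via two synonym hops and "nasty" via a synonym-then-antonym chain.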
In their proposed approach, the process starts with a seed list of opinion adjectives and a set of linguistic constraints used to identify additional adjective opinion words and their orientations. The linguistic constraints concern connectives such as AND, OR, BUT and EITHER-OR. For example, the constraint for the conjunction AND says that conjoined adjectives (i.e. adjectives connected with AND) usually have the same orientation, while adversative expressions such as "but" and "however" indicate a change of opinion. Learning is then used to determine whether two conjoined adjectives have the same or different orientations. Finally, a graph is built from the links between adjectives, and clustering is performed on it to produce two sets: a positive word set and a negative word set.

Drawback: The corpus-based approach, if used alone, is not very effective, since it is difficult to prepare a corpus huge enough to cover all contextual words.

Advantage: This approach finds domain- and context-specific opinion words and their orientations using a domain-specific corpus, which is not possible with the dictionary-based approach.

Paper, Ref. No. | Technique | Language Dependent | Lexicon Usage | Tagged | Review Dataset Used
Pang et al. [14] | NB, ME, SVM | Yes | No | Yes | IMDB
Dave et al. [16] | SVM, NB | Yes | No | Yes | Amazon, CNET
Hu & Liu [4] | Lexicon, tagged reviews | Yes | Yes | Yes | Amazon, CNET
Zhang et al. [19] | Lexicon | Yes | Yes | No | Luce, Yoka
A. Khan et al. [18] | Lexicon | Yes | Yes | No | IMDB, Skytrax, Tripadvisor
M. Ghosh, G. Sanyal | Lexicon | Yes | Yes | No | Amazon.com, Ebay.com, epinion.com
Ji Fang and Bi Chen [22] | ML, Lexicon | Yes | Yes | Yes | Multi-Domain Sentiment Dataset
Moreo A, Romero M [31] | Lexicon, NLP | Yes | Yes | Yes | News
Mohammad SM [29] | Lexicon | Yes | Yes | No | Enron corpus
Rui H., Liu Y., W. Andrew [30] | NB, SVM | Yes | No | Yes | Movie reviews, tweets
K. Fazel, I. Diana [32] | Corpus-based | Yes | Yes | No | Blogs data
Heerschop B [27] | Apriori, NB | Yes | Yes | No | Movie reviews
R. Ghosh, B. Liu [20] | Lexicon, ML | Yes | Yes | Yes | Twitter
A. Khan [26] | NB | No | No | Yes | Movie, airline, airport and hotel reviews
Rudy Prabowo, Mike Thelwall [23] | SVM, Hybrid | Yes | Yes | Yes | Movie reviews, MySpace comments
A. Mudinas, D. Zhang, M. Levene [21] | ML, Lexicon | Yes | Yes | Yes | CNET, IMDB
M. Rushdi [25] | SVM | No | No | Yes | Blogs and product reviews

Table 1: Article Summary

C. Lexicon based and natural language processing techniques

Natural Language Processing (NLP) techniques can be combined with the lexicon-based approaches in order to increase their performance and coverage.
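As a minimal sketch of such a combination, the snippet below pairs a lexicon with one simple NLP rule: a negation word flips the polarity of the word that follows it. The lexicon and the rule are illustrative, not a full NLP pipeline:

```python
# Lexicon scoring plus one linguistic rule: a negation word reverses
# the polarity of the next token only. LEXICON and NEGATIONS are toy
# examples; real systems use POS tags and wider negation scopes.
LEXICON = {"good": 1, "bad": -1, "helpful": 1}
NEGATIONS = {"not", "never", "no"}

def score_with_negation(tokens):
    total, flip = 0, 1
    for tok in tokens:
        if tok in NEGATIONS:
            flip = -1
            continue
        total += flip * LEXICON.get(tok, 0)
        flip = 1          # the flip applies only to the next token
    return total

score = score_with_negation("the staff was not helpful".split())
```

Without the rule, "not helpful" would wrongly score +1; with it, the phrase scores -1.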