By Ronen Feldman
Communications of the ACM, Vol. 56 No. 4, Pages 82-89
Sentiment analysis (or opinion mining) is the task of finding the opinions of authors about specific entities. People's decisions are affected by the opinions of thought leaders and ordinary people alike. When a person wants to buy a product online, he or she will typically start by searching for reviews and opinions written by other people on the various offerings. Sentiment analysis is one of the hottest research areas in computer science. Over 7,000 articles have been written on the topic. Hundreds of startups are developing sentiment analysis solutions, and major statistical packages such as SAS and SPSS include dedicated sentiment analysis modules.
Social media, including Twitter, Facebook, message boards, blogs, and user forums, has produced an explosion of available ‘sentiments.’ These snippets of text are a gold mine for companies and individuals that want to monitor their reputation and get timely feedback about their products and actions. Sentiment analysis lets these organizations monitor the different social media sites in real time and act accordingly. Marketing managers, PR firms, campaign managers, politicians, and even equity investors and online shoppers are the direct beneficiaries of sentiment analysis technology.
It is common to classify sentences into two principal classes with regard to subjectivity: objective sentences that contain factual information and subjective sentences that contain explicit opinions, beliefs, and views about specific entities. Here, I mostly focus on analyzing subjective sentences. However, I refer to the usage of objective sentences when describing a sentiment application for stock picking.
In this review, I will focus on five specific problems within the field of sentiment analysis:
- Document-level sentiment analysis
- Sentence-level sentiment analysis
- Aspect-based sentiment analysis
- Comparative sentiment analysis
- Sentiment lexicon acquisition.
Before explaining each of these problems in detail, let’s review the general architecture of a sentiment analysis system. The architecture is shown in Figure 1.
The input to the system is a corpus of documents in any format (PDF, HTML, XML, Word, among others). The documents in this corpus are converted to text and pre-processed using a variety of linguistic tools such as stemming, tokenization, part-of-speech tagging, entity extraction, and relation extraction. The system may also utilize a set of lexicons and linguistic resources. The main component of the system is the document analysis module, which uses the linguistic resources to add sentiment annotations to the pre-processed documents. The annotations may be attached to whole documents (for document-based sentiment), to individual sentences (for sentence-based sentiment), or to specific aspects of entities (for aspect-based sentiment). These annotations are the output of the system, and they may be presented to the user using a variety of visualization tools.
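The flow just described (convert to text, pre-process, annotate) can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the toy `LEXICON`, the `Annotation` record, and the helper names are hypothetical, tag stripping stands in for real format converters, and summing lexicon values stands in for a real document analysis module. Aspect-level annotation is omitted.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical toy lexicon; a real system would load far larger resources.
LEXICON = {"great": 1, "love": 1, "good": 1, "poor": -1, "terrible": -1, "bad": -1}


@dataclass
class Annotation:
    level: str  # "sentence" or "document"
    text: str   # the span the annotation is attached to
    score: int  # sum of lexicon values found in the span


def to_text(raw: str) -> str:
    """Conversion step: crude tag stripping in place of real converters."""
    return re.sub(r"<[^>]+>", " ", raw)


def split_sentences(text: str) -> List[str]:
    """Naive split after '.', '!' or '?'; real systems use trained segmenters."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Lowercase word tokenizer (no stemming or tagging in this sketch)."""
    return re.findall(r"[a-z']+", sentence.lower())


def analyze(raw: str) -> List[Annotation]:
    """Return one annotation per sentence, then one for the whole document."""
    sentences = split_sentences(to_text(raw))
    annotations = [
        Annotation("sentence", s, sum(LEXICON.get(t, 0) for t in tokenize(s)))
        for s in sentences
    ]
    total = sum(a.score for a in annotations)
    annotations.append(Annotation("document", " ".join(sentences), total))
    return annotations


if __name__ == "__main__":
    for ann in analyze("<p>The screen is great. The battery is terrible!</p>"):
        print(ann.level, ann.score, ann.text)
```

Keeping the annotation level as a field on each record lets the same output feed document-level or sentence-level views without changing the analysis step.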
Document-Level Sentiment Analysis
This is the simplest form of sentiment analysis. It assumes that the document contains an opinion on one main object, expressed by the author of the document. Numerous papers have been written on this topic. There are two main approaches to document-level sentiment analysis: supervised learning and unsupervised learning.
The supervised approach assumes that there is a finite set of classes into which the document should be classified and that training data is available for each class. The simplest case is when there are two classes: positive and negative. Simple extensions can add a neutral class or a discrete numeric scale on which the document should be placed (like the five-star system used by Amazon). Given the training data, the system learns a classification model using one of the common classification algorithms such as SVM, Naïve Bayes, Logistic Regression, or KNN. The learned model is then used to tag new documents with their sentiment classes. When a numeric value in some finite range is to be assigned to the document (for example, in the Amazon five-star ranking system), regression can be used to predict that value. Research has shown that good accuracy is achieved even when each document is represented as a simple bag of words. More advanced representations utilize TF-IDF, POS (part-of-speech) information, sentiment lexicons, and parse structures.
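To make the supervised setting concrete, here is a minimal sketch of a two-class Naïve Bayes classifier over a bag-of-words representation, written with the Python standard library only. The class name `NaiveBayes`, the add-one (Laplace) smoothing, and the four training reviews are assumptions made for illustration; they are not drawn from any dataset or system discussed in this article.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a document as word counts, ignoring order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(bag_of_words(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for word, freq in bag_of_words(doc).items():
                if word in self.vocab:    # skip words never seen in training
                    score += freq * math.log((counts[word] + self.alpha) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training reviews, for illustration only.
TRAIN_DOCS = [
    "great phone, love the screen",
    "excellent battery and great camera",
    "terrible battery, poor screen",
    "awful camera, I hate it",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN_DOCS, TRAIN_LABELS)
    print(model.predict("great camera"))
    print(model.predict("poor battery"))
```

Adding a neutral class only requires more labeled examples; predicting a star rating as a numeric value would instead call for a regression model, as noted above.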
Unsupervised approaches to document-level sentiment analysis are based on determining the semantic orientation (SO) of specific phrases within the document. If the average SO of these phrases is above some predefined threshold, the document is classified as positive; otherwise it is deemed negative. There are two main approaches to selecting the phrases: using a set of predefined POS patterns, or using a lexicon of sentiment words and phrases.
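The lexicon-based variant of this rule can be sketched as follows. The SO values in `SO_LEXICON`, the default threshold of 0.0, and the function name are hypothetical, and returning no label when the document contains no lexicon entries is an added assumption, since the rule above does not cover that case.

```python
import re

# Hypothetical semantic-orientation values; not taken from a published lexicon.
SO_LEXICON = {"excellent": 2.0, "good": 1.0, "poor": -1.0, "awful": -2.0}


def classify_by_so(document, lexicon=SO_LEXICON, threshold=0.0):
    """Average the SO of lexicon words in the document and apply a threshold.

    Returns (label, average_so). When no lexicon word occurs, returns
    (None, None): an assumption of this sketch, not part of the rule.
    """
    tokens = re.findall(r"[a-z']+", document.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return None, None
    average = sum(scores) / len(scores)
    label = "positive" if average > threshold else "negative"
    return label, average


if __name__ == "__main__":
    print(classify_by_so("Excellent screen, good sound, but poor battery."))
    print(classify_by_so("Awful service and poor food."))
```

Selecting phrases by POS patterns instead would replace only the line that builds `scores`; the averaging and threshold steps stay the same.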
The most common application of sentiment analysis is in the area of reviews of consumer products and services. Many websites provide automated summaries of product reviews and of the specific aspects those reviews discuss. A notable example is “Google Product Search.”
Twitter and Facebook are a focal point of many sentiment analysis applications. The most common application is monitoring the reputation of a specific brand on Twitter and/or Facebook. One application that performs real-time analysis of tweets that contain a given term is tweetfeel (http://www.tweetfeel.com).
Sentiment analysis can provide substantial value to candidates running for various positions. It enables campaign managers to track how voters feel about different issues and how they relate to the speeches and actions of the candidates. An analysis of tweets related to the 2010 campaign can be found at http://www.nytimes.com/interactive/us/politics/2010-twitter-candidates.html.
Another important domain for sentiment analysis is the financial markets. There are numerous news items, articles, blogs, and tweets about each public company. A sentiment analysis system can use these various sources to find articles that discuss the companies and aggregate the sentiment about them as a single score that can be used by an automated trading system. One such system is The Stock Sonar (http://www.thestocksonar.com). This system (developed by Digital Trowel) shows graphically the daily positive and negative sentiment about each stock alongside the graph of the price of the stock.
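As a rough illustration of the aggregation step, the sketch below rolls per-article sentiment scores up into daily positive, negative, and net totals for each stock. It is not a description of how The Stock Sonar works: the input format, the score range of [-1, 1], the ticker "ACME", and the function name are all assumptions.

```python
from collections import defaultdict


def aggregate_daily(items):
    """Aggregate (ticker, date, score) triples into daily totals per stock.

    Scores are assumed to lie in [-1, 1]; positive and negative mass are
    summed separately, and "net" is their difference.
    """
    daily = defaultdict(lambda: {"positive": 0.0, "negative": 0.0})
    for ticker, date, score in items:
        bucket = daily[(ticker, date)]
        if score > 0:
            bucket["positive"] += score
        elif score < 0:
            bucket["negative"] += -score
    return {
        key: {**b, "net": b["positive"] - b["negative"]}
        for key, b in daily.items()
    }


if __name__ == "__main__":
    # Hypothetical article-level scores for a made-up ticker.
    articles = [
        ("ACME", "2013-04-01", 0.5),
        ("ACME", "2013-04-01", -0.25),
        ("ACME", "2013-04-02", 0.75),
    ]
    for key, totals in sorted(aggregate_daily(articles).items()):
        print(key, totals)
```

Keeping positive and negative totals separate, rather than storing only the net value, matches the kind of display described above, where both are plotted alongside the stock price.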