Our project, Real-time Emotion Analysis on Twitter, is now complete. Looking back, we have been thinking about how extensible the system is, and we have a few ideas.
We use spouts and bolts to process data. A spout is the source of a stream, much like the input file of a Hadoop job. A bolt is a processing step that runs on the worker nodes, much like a task in Hadoop. In our project, the spout reads the Twitter Streaming data and sends it to several bolts, which play the role of the mappers. A bolt can in turn send its output to another bolt, so the bolts form a chain. Whenever we want to add a new feature to the system, we just need to write a new bolt and add it to the processing chain.
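The chain idea can be illustrated with a minimal sketch. It uses plain Python generators as stand-ins and is not the Storm API; the names `tweet_spout`, `lowercase_bolt`, `word_count_bolt` and `run_chain` are hypothetical and do not come from our code base.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Tweet = Dict[str, object]
Bolt = Callable[[Iterator[Tweet]], Iterator[Tweet]]


def tweet_spout(raw_texts: Iterable[str]) -> Iterator[Tweet]:
    """Stand-in for the spout: emits one tuple per incoming tweet."""
    for text in raw_texts:
        yield {"text": text}


def lowercase_bolt(stream: Iterator[Tweet]) -> Iterator[Tweet]:
    """Hypothetical bolt: normalises the tweet text to lower case."""
    for tweet in stream:
        yield {**tweet, "text": str(tweet["text"]).lower()}


def word_count_bolt(stream: Iterator[Tweet]) -> Iterator[Tweet]:
    """Hypothetical bolt: adds the number of words in the tweet."""
    for tweet in stream:
        yield {**tweet, "words": len(str(tweet["text"]).split())}


def run_chain(spout: Iterator[Tweet], bolts: List[Bolt]) -> List[Tweet]:
    """Feed the spout's output through each bolt in order."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)
```

Adding a feature then amounts to appending one more function to the list passed to `run_chain`, for example `run_chain(tweet_spout(texts), [lowercase_bolt, word_count_bolt])`.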
In our project, there are two kinds of bolts. The first kind analyzes each tweet, extracts the useful information, and adds the emotion value. The data is then sent to the next bolt, which collects the data from several source bolts and publishes it to a Redis channel.
This is where our scalability and extensibility come from. If we also want to analyze hashtags, for example, all we need to do is add another kind of bolt to the chain. That bolt can either send its result to the reducer or publish it directly to another Redis channel.
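As a rough sketch of that extension, the example below adds a hashtag bolt next to an emotion bolt and publishes each result to its own channel. Everything here is an assumption for illustration: the `channels` dictionary is an in-memory stand-in for Redis pub/sub, the channel names `emotions` and `hashtags` are made up, and the emotion rule is a placeholder rather than our real analysis.

```python
import re
from collections import defaultdict
from typing import Dict, List

# In-memory stand-in for Redis pub/sub: channel name -> published messages.
channels: Dict[str, List[dict]] = defaultdict(list)


def publish(channel: str, message: dict) -> None:
    """Stand-in for publishing a message to a Redis channel."""
    channels[channel].append(message)


def emotion_bolt(tweet: dict) -> dict:
    """Existing kind of bolt: attach an emotion value (placeholder rule)."""
    score = 1 if "love" in tweet["text"].lower() else 0
    return {**tweet, "emotion": score}


def hashtag_bolt(tweet: dict) -> dict:
    """New bolt: extract hashtags without changing the other bolts."""
    return {**tweet, "hashtags": re.findall(r"#(\w+)", tweet["text"])}


def process(tweet: dict) -> None:
    """Route one tweet through both bolts and publish each result."""
    publish("emotions", emotion_bolt(tweet))   # existing path
    publish("hashtags", hashtag_bolt(tweet))   # new, separate channel
```

The existing emotion path is untouched; the new bolt only adds a second output, which is the property we rely on for extensibility.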
Susan Zeng, an IBMer and professor at U Missouri, brought us two projects from the Computer Science Department.
One is “Real-time Emotion Analysis On Twitter” – where students will attempt to analyze tweets to produce reports on feelings to help business people enhance services and products.
The other is “Identifying Gene Duplications Across DNA Sequences” – since these genes are associated with cancer, information about this could be useful as a step in helping find cures.
We ran into a problem: the map tasks generate a large amount of intermediate results, which degrades our performance considerably. After searching for solutions online, we decided to set up our Hadoop cluster with the LZO module to compress both the map and reduce results. A couple of good resources and tutorials for setting up LZO on Hadoop are listed below:
Details of setting up the LZO package:
- Download lzo package: git clone git://github.com/toddlipcon/hadoop-lzo.git
- Install required tools: lzo-devel, ant, java and gcc
- Build it with ant: ant clean compile-native tar
- Copy built library to path ~/hadoop-1.2.1/lib/native/Linux-amd64-64/ on all nodes (master and slaves)
- Add the LZO configuration to both core-site.xml and mapred-site.xml
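For the last step, the fragment below is a sketch of the kind of properties typically added for hadoop-lzo on Hadoop 1.x. It is not a copy of our actual files: the exact codec list and whether to compress the final job output are assumptions, so adjust them to your cluster.

```xml
<!-- core-site.xml: register the LZO codecs (assumed codec list) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: compress intermediate map output with LZO -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: optionally compress the final job output as well -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
```

Each `<property>` element goes inside the `<configuration>` element of the file named in the comment above it, and the same files should be present on all nodes.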
We use real-time big data analysis technologies to evaluate tweets about brands and products. We judge the emotion of each tweet and decide whether it expresses positive, negative or no emotion about a specific brand or product. A company or merchant can then observe the trend in customers' emotions towards its brand or products and understand customers' online behavior and reactions, which helps it make better decisions and provide better services. This project will help companies find out what their customers feel and adjust their marketing strategies in a timely manner.
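To make the three-way decision concrete, here is a deliberately simple sketch based on a word list. The lexicon, the scoring rule and the function names are all hypothetical; this only illustrates the positive / negative / no emotion labelling and the trend count, not the semantic analysis the project actually relies on.

```python
import re
from collections import Counter
from typing import Iterable

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"love", "great", "happy", "awesome"}
NEGATIVE = {"hate", "bad", "terrible", "broken"}


def classify_emotion(text: str) -> str:
    """Label a tweet as 'positive', 'negative' or 'none'."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "none"


def emotion_trend(tweets: Iterable[str]) -> Counter:
    """Count how many tweets fall under each label."""
    return Counter(classify_emotion(t) for t in tweets)
```

Calling `emotion_trend` on the tweets that mention a brand during a given time window yields the kind of per-label counts a company could track over time.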
Smarter Planet Information
When a company releases a new product, its managers want to know what people think of it and how customers feel when they use it. With that knowledge, they can gauge people's online reaction quickly and make better decisions going forward. Today, thanks to the development of social networks, this has become possible.
When customers use a product, they usually have some feelings about it, such as happiness or dislike. In the past, they could only share those feelings with the people around them. Today, they can post their feelings on a social network, and their friends around the world can see the post instantly. Some of those friends may re-post it, and then even more people will see it. Social networks have changed the way information spreads and have led us into a new era of information.
Twitter is one of the most popular social networks. People tweet their emotion, feelings, status and locations and share interesting things with their friends.
However, this data is often very large and hard to analyze. Our project uses big data analytics technologies to evaluate this massive amount of data and extract the most useful information: feelings. In addition, we plan to make it work in real time so that a company can see the market reaction instantly. It can also help companies provide better customer service in the future. Investors can use this information to support their investment decisions, for example when predicting stock prices and deciding whether to buy a company's stock.
Our project is in line with the topic “Smarter Commerce” in IBM Smarter Planet. It can help companies optimize their marketing processes by tracking online behavior to inform marketing decisions, and by capturing data from customer interactions for real-time marketing. It gives them a way to do better business.
The following technologies are subject to change during development. They are the ones we may use in our project, but some of them still need further comparison before we make a final decision.
Hadoop: Hadoop is an open-source software framework that supports data-intensive distributed applications, and also supports running applications on large clusters of commodity hardware.
Impala: Cloudera Impala is an open source project created by Cloudera that provides SQL query execution to enable fast interactive ad-hoc queries, on data stored in Hadoop.
Storm: Storm is a free and open source distributed realtime computation system. It has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Semantic Analytics: We need semantic analytics to extract emotion information from the massive number of tweets related to a specific brand or product. We also still need a real-time visualization tool to present the output.
Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Others: As all of our group members are familiar with Java, we decided to use Java as our main programming language. We may still use other scripting languages, such as Bash or Python, to complete some simple tasks.
The architecture of the whole system is sketched below. We will add detail and refine it as the project progresses.
We have thought a lot about the problems we may encounter during development. Here is a short list of the main risks and how we are preparing to face them.
Real-time processing: This is the first time any of our group members has used real-time processing, especially with Hadoop and related tools. The field is very new, but some research has been done in recent years, and open-source products such as Storm and Impala have been released, which will help us a lot. We need to learn how to build a real-time processing system and figure out how to apply it on Hadoop in order to take advantage of the MapReduce computing model and the Hadoop Distributed File System (HDFS).
This risk also includes load balancing. We still need to work out how to deal with many real-time requests and how to balance the usage of each node.
Semantic Analytics: This is another challenge for us because, as with real-time processing, this is the first time we are using it. We are currently working through some papers in this field and trying to apply them to analyze the emotion expressed towards a brand or product in tweets. This is one of the most important parts of the whole system, and we need it to run on multiple nodes using MapReduce.
 “Hadoop” http://hadoop.apache.org/
 “Impala” http://www.cloudera.com/content/cloudera/en/products/cdh/impala.html
 “Storm” http://storm-project.net/
 “Hive” http://hive.apache.org/
 “Flume” http://flume.apache.org/