University of Missouri, US – 2 projects:
“RaW”, led by Chao Fang, finding protein substructures to enhance knowledge of protein function using analytics!
“DORK”, led by Xinjian Yao, using analytics to understand Medicare data to further understand what happens in the system.
University Putra Malaysia: Embedded Systems for Public Water Management, led by BALAMI Emmanuel Luke, toward making water supplies more sustainable.
Lumbini Engineering College, Nepal – event held September 28 to inform students about Smarter Planet, Bluemix, Watson, IBM and Students for a Smarter Planet opportunities. Led by Basant Pandey.
Our project – Real-time Emotion Analysis on Twitter – has been completed. We have been thinking about the extensibility of our project, and we have come up with some ideas.
We use spouts and bolts in our project to process data. The spout runs on the master node and plays a role similar to the data file input in Hadoop. Bolts run on the worker nodes and are similar to the workers in Hadoop. In our project, the spout reads the Twitter Streaming data and sends it to several bolts, which is analogous to the mapper stage. A bolt can also send data on to another bolt, so the bolts form a chain. Whenever we want to add a new feature to our system, we just need to write a new bolt and add it into the processing chain.
In our project, there are two kinds of bolts. The first kind analyzes a tweet, extracts the useful information, and attaches the emotion value. The data is then sent to the next bolt, which is responsible for collecting and gathering the data from several source bolts and publishing it to a Redis channel.
This is where our scalability and extensibility come from. For the tweet analysis, if we also want to analyze the hashtags, all we need to do is add another kind of bolt to the chain. Then we can either send the result on to the reducer or publish it to another Redis channel.
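The spout-and-bolt chain described above can be sketched in plain Python. This is a conceptual sketch, not actual Apache Storm code: the function names and the toy emotion scoring are illustrative, and a real deployment would use Storm's topology API and publish the gathered results to a Redis channel.

```python
def tweet_spout():
    """Stands in for the spout: emits raw tweets from the Twitter stream."""
    for tweet in ["I love this phone", "This service is terrible"]:
        yield tweet

def emotion_bolt(tweets):
    """First kind of bolt: extracts information and attaches an emotion value
    (a toy keyword score here, just to show the shape of the data)."""
    for text in tweets:
        score = 1 if "love" in text else -1 if "terrible" in text else 0
        yield {"text": text, "emotion": score}

def publish_bolt(records):
    """Second kind of bolt: gathers results from the upstream bolts; a real
    implementation would publish each record to a Redis channel."""
    return list(records)

# Wire the chain: spout -> analysis bolt -> publishing bolt.
# Adding a new feature (e.g. hashtag analysis) means inserting one more
# bolt function into this chain.
results = publish_bolt(emotion_bolt(tweet_spout()))
```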
Susan Zeng, an IBMer and professor at the University of Missouri, brought us two projects from the Computer Science Department.
One is “Real-time Emotion Analysis On Twitter” – where students will attempt to analyze tweets to produce reports on feelings to help business people enhance services and products.
The other is “Identifying Gene Duplications Across DNA Sequences” – since these gene duplications are associated with cancer, identifying them could be a useful step toward finding cures.
We encountered an issue where the Map tasks generate a large amount of intermediate results, which degraded our performance significantly. After searching for solutions online, we decided to set up the Hadoop cluster with the LZO module to compress both the map and reduce outputs. A couple of good resources and tutorials for setting up LZO on Hadoop are listed below:
Details of setting up Lzo package:
- Download lzo package: git clone git://github.com/toddlipcon/hadoop-lzo.git
- Install the required tools: lzo-devel, ant, Java, and gcc
- Build it by ant: ant clean compile-native tar
- Copy built library to path ~/hadoop-1.2.1/lib/native/Linux-amd64-64/ on all nodes (master and slaves)
- Add the configuration to both core-site.xml and mapred-site.xml
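The configuration step above can be sketched as follows. The property names are those used with the hadoop-lzo package on Hadoop 1.x; the codec class paths are assumptions to verify against the build you installed.

```xml
<!-- core-site.xml: register the LZO codecs from the hadoop-lzo package -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- mapred-site.xml: compress the intermediate map output with LZO -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

With these settings, the intermediate map output is compressed before being shuffled to the reducers, which is what reduces the pressure from the large intermediate results mentioned above.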