January 20, 2011

Analytics with Twitter Data

Twitter is one of the largest "data producers" on the Web today. I am not sure exactly how much storage the tweets require on a daily basis, but a few TBs would not surprise me; add to that the spikes in volume when there is a controversy or a big event happening. All of this leads to interesting data that needs to be deciphered, and also some awesome research work that can be applied to manage the data efficiently for users and engage them more with Twitter.

When I was looking for possible features that I might actively use, I could actually list quite a few. I am pretty sure the Product Managers at Twitter have some of these features on their TODO list, but it would be interesting to see when they actually get implemented, or the rationale behind not implementing them.

1. Users who follow you, but you don't follow them.
2. Users whom you follow, but they don't follow back. (Items 1 and 2 boil down to simple set differences; see the sketch after this list.)
3. Notification when a user stops following you. I need to research why Twitter does not have this - was it by design?
4. Trend analysis of users who follow/unfollow you - based on the tweets that you post.
5. Show the most active users and the lazy ones - active and lazy being defined by the number of tweets and also their popularity. Popularity can be measured by how much discussion a tweet generates, or how many retweets it gets.
6. Automatic lists and follow suggestions: when we follow a user, Twitter can suggest which list would be the most likely fit for that user based on his tweet patterns. The present suggestion scheme is not all that powerful and needs some tweaking.
7. Discover clusters/groups among the followers. Centrality of users - show a graph in which these relationships can be displayed.
8. Decipher moods/sentiments from the tweets, or apply other natural language processing techniques to the tweets to gather interesting patterns or insights.
9. Usage analysis
  a. Based on the hour of the day, we can find out whether people tweet more often in the mornings or evenings.
  b. Do people prefer the web or mobile devices for tweeting? What % of people use other apps?
  c. Who retweets you often? What categories of your tweets get retweeted often or generate the most discussion?
10. Most famous tweets of the day/week/month - based on retweets, follow-up discussions, celebrity status of the tweeter, and number of followers.
11. Duplicate detection of tweets. Also, automatic compression of tweets that fall in a thread. This would help a lot in reducing information clutter.
12. What is the similarity between two users, based on the nature of their tweets? A corollary: what topics/categories does a user often tweet on?
13. Better trend analysis.
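
For items 1 and 2, the computation is really just a set difference over the follower and friend id lists. A minimal sketch in Python, assuming the REST API's followers/ids and friends/ids endpoints return a plain JSON array of ids ('some_user' is a placeholder; real code would need authentication, cursor-based paging and rate-limit handling):

import json
import urllib2

def id_set(relation, user):
  #relation is 'followers' or 'friends'; assumes the endpoint returns
  #a plain JSON array of user ids (paging/cursors are ignored here)
  url = "http://api.twitter.com/1/%s/ids.json?screen_name=%s" % (relation, user)
  return set(json.load(urllib2.urlopen(url)))

followers = id_set("followers", "some_user")  #placeholder screen name
friends = id_set("friends", "some_user")

print "Follow you, but you don't follow them:", followers - friends
print "You follow, but they don't follow back:", friends - followers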

Firefox 4 around the corner

Mozilla announced that the Firefox 4 browser is ready for beta release and users can download it and check out the cool features being introduced in this version. New features like App Tabs and Panorama are going to make web navigation easier and more efficient; the team has also introduced many new features under the hood which will result in faster page loads and a speedier startup.

It is not just the layman who benefits: developers too can take advantage of the HTML5 features to make the web more engaging - features like WebM, HD video, 3D graphics rendering with WebGL, hardware acceleration and the Mozilla Audio API can be used to create more interesting applications. A full overview of the feature set can be found here.

It looks like other browsers such as Internet Explorer and Chrome will really have to innovate and keep up the momentum to match Mozilla's Firefox.

January 19, 2011

Do you have any of these buzzwords in your resume?

LinkedIn came out with a list of the top 10 most overused buzzwords - the words that people in the USA use most often in their LinkedIn profiles.

Top 10 overused buzzwords in LinkedIn Profiles in the USA – 2010
   1. Extensive experience
   2. Innovative
   3. Motivated
   4. Results-oriented
   5. Dynamic
   6. Proven track record
   7. Team player
   8. Fast-paced
   9. Problem solver
  10. Entrepreneurial

They also did some analytics on these buzzwords and found that the phrase "Extensive experience" appears most often in the profiles of people from Australia, Canada and the USA, whereas people from Brazil and India mostly use the term "Dynamic". 'Innovative' is most often used in the European region; which goes to show why the Dutch have mastered the art of design.



This analysis by the LinkedIn team has led many people to revamp their profiles to avoid the so-called 'clichéd' terms. Do you have any of these buzzwords in your resume? Do you like them? Will you remove them after this study, or will you add them to your profile if you do not have them already?
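
By the way, checking your own resume for these takes only a few lines of Python; a toy sketch, where resume.txt is a hypothetical plain-text copy of your resume:

buzzwords = ["extensive experience", "innovative", "motivated",
             "results-oriented", "dynamic", "proven track record",
             "team player", "fast-paced", "problem solver", "entrepreneurial"]

#naive, case-insensitive substring check against the resume text
text = open("resume.txt").read().lower()
for word in buzzwords:
  if word in text:
    print "found:", word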

January 13, 2011

Analysis of My First Mozilla Open Data Visualization Competition Entries

The First Mozilla Open Data Visualization Competition results are out. I had submitted 3 entries [one] [two] [three] for this competition and, as I had imagined, none of them won, nor did any get a mention (I would have been surprised if one had!).

I kind of expected this, and had realized it during the last few weeks before the winners were announced. I reviewed my submissions and found that I had not done justice to the analysis: there were many open questions, and avenues that could have been handled better.

Self-Analysis and Comments:
1. As soon as I saw the data I jumped on it. I loaded the sample data into sqlite3 tables, started firing queries and started generating charts. THIS was a BIG mistake; I should have taken some more time to study the structure of the data, and probably cleanse and normalize the dataset first. (A sketch of that loading step follows this list.)
I think I was overjoyed at seeing a 'real' dataset and at how I could 'directly' contribute to Firefox with this analysis. The adrenaline rush made me commit this blunder.
2. I also spent quite some time googling the already submitted entries so that mine would be different from the others. Though this helps sometimes, I think it puts more pressure on you and narrows your vision. Treating the data holistically and deriving all possible analyses, or choosing a subset of the data and then analysing it, would have been the way to go.
3. I would like to state again that I did not normalize the data - this was the crucial missed step.
4. I should have spent a weekend developing a dashboard or webpage that people could play around with. Lack of time was the excuse that prevented me from doing this.
5. Verbiage - charts/images are good, but it is always nice to include some text along with them when you do not provide a dashboard kind of interface.
6. Lack of any statistical analysis - most of the analysis I did was pure SQL-query-based manipulation. I am working on this front and learning more statistical techniques, which will help me in the longer run.
7. Some of my charts were pure junk and did not convey the right message!
8. I should have used better charting libraries - ones with better presentation that are pleasant to the eye. In the adrenaline rush, I overlooked this aspect. I thought of moving the charts to Protovis, but I was too lazy once I had submitted my entries (and I also got pulled into other visualizations).
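
For what it's worth, the load-then-query step from point 1 is only a few lines with Python's sqlite3 and csv modules. A rough sketch with hypothetical file, table and column names (not the contest's actual schema), marking where the cleansing/normalization pass should have gone:

import csv
import sqlite3

conn = sqlite3.connect("contest.db")
cur = conn.cursor()
cur.execute("CREATE TABLE events (user_id TEXT, event TEXT, ts INTEGER)")

#the cleansing/normalization pass should have happened here, before loading
for row in csv.reader(open("sample.csv")):
  cur.execute("INSERT INTO events VALUES (?, ?, ?)", row)
conn.commit()

#...and then the query-and-chart loop that I jumped straight into
for row in cur.execute("SELECT event, COUNT(*) FROM events GROUP BY event"):
  print row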

Having understood (and owned up to) the mistakes I made, and with all the learning that happened during and after the contest, I am better prepared for the next visualization/data-analysis challenge. This self-analysis helped me a lot.

Btw... the Mozilla folks are giving away free T-shirts to all participants :)

January 03, 2011

Simple Article Extractor from HTML

The following is a simple article extractor for a given web (HTML) page. Being in Python, it is simple and less than 55 lines of code. I tried it on a few webpages, and was satisfied with the output.
Though I have included comments in the code, the following is a quick HOWTO on making modifications to this article extractor:
1) To extract meta information like author, title, description, keywords etc., extract the meta tags right after the soup object is constructed, but before the tags are stripped. Also, in strip_tags, return a tuple instead of the text alone. (A sketch of this modification follows this list.)
2) Understand how 'unwanted_tags' works; feel free to add the ids/class names that you might encounter. I have mentioned only a few, but more names like "print", "popup", "tools", "socialtools" can be added.
3) Feel free to suggest any other improvements.
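
As a sketch of modification (1), with BeautifulSoup 3 (the same library used below), the meta extraction could look like this; which fields come back depends on what the page actually declares:

from BeautifulSoup import BeautifulSoup

def extract_meta(html):
  #pull the title and interesting meta tags out before any stripping happens
  soup = BeautifulSoup(html)
  meta = {}
  if soup.title:
    meta['title'] = soup.title.string
  for tag in soup.findAll('meta'):
    name = tag.get('name', '').lower()
    if name in ('author', 'description', 'keywords'):
      meta[name] = tag.get('content', '')
  return meta

#strip_tags below could then return (extract_meta(html), soup)
#instead of the soup alone.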

from BeautifulSoup import BeautifulSoup,Comment
import re

invalid_tags = ['b', 'i', 'u', 'link', 'em', 'small', 'span', 'blockquote', 'strong', 'abbr', 'ol', 'h1', 'h2', 'h3', 'h4', 'font', 'tr', 'td', 'center', 'tbody', 'table']
not_allowed_tags = ['script', 'noscript', 'img', 'object', 'meta', 'code', 'pre', 'br', 'hr', 'form', 'input', 'iframe', 'style', 'dl', 'dt', 'sup', 'head', 'acronym']

#substrings checked against a tag's class/id attributes - if any matches, the tag is removed.
unwanted_tags = ["tags", "breadcrumbs", "disqus", "boxy", "popular", "recent", "feature_title", "logo", "leaderboard", "widget", "neighbor", "dsq", "announcement", "button", "more", "categories", "blogroll", "cloud", "related", "tab"]

def unwanted(tag_class):
  for each_class in unwanted_tags:
    if each_class in tag_class:
      return True
  return False

#from http://stackoverflow.com/questions/1765848/remove-a-tag-using-beautifulsoup-but-keep-its-contents
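#replaces a tag with its children: the contents are kept, the wrapper tag is dropped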
def remove_tag(tag):
  for i, x in enumerate(tag.parent.contents):
    if x == tag: break
  else:
    print "Can't find", tag, "in", tag.parent
    return
  for r in reversed(tag.contents):
    tag.parent.insert(i, r)
  tag.extract()

#walks the parse tree: drops disallowed tags, unwraps formatting tags,
#and returns the cleaned soup
def strip_tags(html):
  tags = ""
  soup = BeautifulSoup(html)
 
  #remove doctype
  doctype = soup.findAll(text=re.compile("DOCTYPE"))
  [tree.extract() for tree in doctype]
 
  #remove all links
  links = soup.findAll(text=re.compile("http://"))
  [tree.extract() for tree in links]
 
  #remove all comments
  comments = soup.findAll(text=lambda text:isinstance(text, Comment) )
  [comment.extract() for comment in comments]
 
  for tag in soup.findAll(True):
    #remove all the tags that are not allowed.
    if tag.name in not_allowed_tags :
      tag.extract()
      continue
   
    #replace the tags with the content of the tag
    if tag.name in invalid_tags:     
      remove_tag(tag)
   
    # similar to not_allowed_tags but does a check for the attribute-class/id before removing it
    if unwanted(tag.get('class','')) or unwanted(tag.get('id','')) :
      tag.extract()
      continue
   
    # special case of lists - the lists can be part of navbars/sideheadings too,
    # hence check length before removing them
    if tag.name =='li':
      tagc = strip_tags(str(tag.contents))
      if len(str(tagc).split()) < 3:
        tag.extract()
        continue
   
    #finally remove all empty and spurious tags and replace them with their content
    if tag.name in ['div', 'a', 'p', 'ul', 'li', 'html', 'body']:
      remove_tag(tag)
     
  return soup
#open the file which contains the html
#this step can be replaced with reading directly from the url;
#however, I think it's always better to store the html locally
#for any later processing.
html = open("techcrunch.html").read()
soup = strip_tags(html)
content = str(soup.prettify())

#write the stripped content into another file.
outfile = open("tech.txt","w")
outfile.write(content)
outfile.close()



If the formatting is screwed up, you can access the code here or here.