
Summary
In this article, I describe how I generated a list of most cited books and movies from the profiles of the people who are connected to me at Orkut, Google's online social-networking service. Orkut does not provide any Web-Services at this time that allow for writing such a tool, so I wrote a script in Perl that used the WWW::Mechanize module available from CPAN and parsed the HTML of the webpages that contained the required data. The results were very satisfactory, and I strongly believe that there is a whole wealth of information that can be discovered by aggregating the user-data available in social-networks. Perhaps, Google should consider providing statistics to each user in Orkut, about the most popular books and movies among his/her friends, and link each of them to Amazon, or another shopping website that exposes Web-Services.
The Idea
People generally mention only the movies they enjoyed watching most in their profiles. So for a given set of people, a movie that is mentioned the most by people in their profiles would definitly be more worth watching than the one that is mentioned comparatively lesser number of times.
Hence, I felt that a great way to determine the movies that I must watch would be to aggregate information about the movies from the profile pages of all my friends at Orkut, and all the other people that I am connected to through them. If I can further arrange the obtained set of movies in descending order of the number of times each movie is mentioned, I would be able to easily decide which movie I must go to the DVD store and ask about next!
So, what did I do?
Well, armed with Perl and WWW::Mechanize I just wrote one such hack-toy. So, ladies and gentlemen, spiders and scrappers, lo and behold… I give you the movies with the Top-10 most cited movies in the profiles of my friends and friends-thereof at Orkut.
- Life is Beautiful (20)
- Matrix (20)
- DDLJ (A Bollywood Movie) (15)
- Shawshank Redemption (15)
- Sholay (Another Bollywood Movie) (12)
- Forrest Gump (12)
- Gladiator (11)
- Anand (Yet Another Bollywood Movie) (11)
- Braveheart (9)
- Titanic (8)
How it works
The algorithm to do this is quite simple. You, your friends, friends of your friends, and so-on, can be arranged in a directed graph structure such that the vertices of the graph correspond to the users while the edges correspond to the function has-friend. So, if B is a friend of A, simply draw a directed edge from A to B. Consider the following graph structure:
The graph depicts that I have a friend called A, and A also considers me a friend. A has a friend called C who considers me as a friend. A has another friend called B, who has a friend called D and D considers me as a friend too.
The problem now is to simply visit all the vertices in the graph and build a hash of movie-names and the citation counts by processing the profile page of the user corresponding to the vertex currently being processed.
But hey, there is a slight problem!
Orkut on its home-page tells me that:
You are connected to 5283820 people through 21 friends.
Assuming that I am able to fetch and analyze the profile of each person in 1.5 seconds (Slow Network IO), it would take a total of about 92 days to analyze all the persons that I am connected to at Orkut. 92 days!
Hence, I would want to start at me in the grapth and be able to restrict the analysis to people who are at a maximum distance max_distance from me. Additionaly, I should be able to provide max_distance to the algorithm.
The Algorithm
We can visit all the vertices in the graph by adopting the depth-first-retrieval technique starting at me in the graph, but with a slight modification. We would like to stop the DFS if the distance of the current vertex (user) in the graph is greater than or equal to max_distance (I am at a distance of zero from myself). Hence, we can have a recursive algorithm that would look something like:
sub find_movies {
my $uid = shift;
my $current_distance = shift;
my $max_distance = shift;
process_user($uid);
$visited{$uid} = 1;
# Process friends of current-user, only if his distance is less
# than the maximum allowed distance from the start-user.
if ($current_distance
The process_user function processes the profile page (Profile.aspx) corresponding to the specified user and then updates the hash of movies. The find_friends_of function process the network of friends mentioned at Orkut (FriendsNet.aspx) and returns a list containing the identifiers of the users that are listed as friends of the specified user at Orkut.
find_movies will be initially invoked as find_movies($your_user_id, 0, $max_distance) and the visited hash would be initially empty. After the find_movies function call returns, we will have to sort the values in the movies hash that would contain all the movies mapped to their citation counts and print them out.
Further Work
The next most important thing is to figure out a way to do an approximate comparison of movie names because different people tend to spell the names of non English language movies differently. The text is mostly free-form which brings a number of issues into the picture. E.g. "The Banana", "Banana", "Banana…", "Banana…etc." are all different strings but could be the same movie.
Most Popular Books
I used the same technique to determine the names of the books cited most by my friends and friends thereof in their Orkut profiles. And, here're the Top-5 most-cited books:
- Fountain Head (21)
- Da Vinci Code (16)
- The Godfather (15)
- Catch 22 (13)
- Harry Potter Series (12)
The Cloud of Books
Here is a longer list of books in a cloud-like fashion:
atlas shrugged catch 22 catcher in the rye da vinci code doctors five point someone foo fountain head freedom at midnight god of small things gone with the wind harry potter series inscrutable americans kane and abel lord of the rings lotr love story made in japan midnight's children siddhartha the alchemist the class the godfather to kill a mocking bird
Want an even longer list of movies and books?
A longer list of most-cited books and movies is available here.
Conclusion
There is a lot of precious information that can be discovered by aggregating the information in the profiles of users. This is just one example. I have some other ideas too, so keep watching – Carry On Coding!
Perhaps, Google could give thought to providing statistics about the most popular books and movies among the friends of a given user, and link each of them to Amazon, or another shopping site that exposes Web-Services.
Source Code?!
I have intentionally left out the gory details of my implementation, and not provided the source-code of the script, because a number of people using the script could possibly generate a lot of load on Orkut's servers and I don't want the Orkut people coming after me! :-p
I hope you liked reading this write up…

bheema v
April 26, 2005
Nice, fast work.
And now send me the code.
Sid
April 26, 2005
I just applied the same technique to determine the names of the books most cited by my friends and friends-thereof at Orkut. Check the post for results!
I trust you… I’ll send you the code!
Alpha0
April 27, 2005
Great. Great. Great.
Can you provide an exhaustive list of movies and books?
Few Questions:
1. How did you login using script?
2. How did you make the matches while doing the aggregation of books and movies because people might write ‘The Matrix’ or ‘Matrix’. Some just put “…..” and “all the movies of Arnold”.
How did you tacle this human created chaos?
Sid
April 27, 2005
Great. Great. Great.
Thanks. Thanks. Thanks!
Can you provide an exhaustive list of movies and books?
Check the post for the lists of Top-5 movies and books mentioned in the profiles of my friends and friends-thereof.
Few Questions:
1. How did you login using script?
I used WWW::Mechanize Perl module and logged in by filling out the credentials on the Login.aspx page before accessing any other pages. WWW::Mechanize makes all of it very easy!
2. How did you make the matches while doing the aggregation of books and movies because people might write ‘The Matrix’ or ‘Matrix’. Some just put “…..” and “all the movies of Arnold”.
How did you tacle this human created chaos?
I realized these problems and finding solutions to them falls under “Future Work”. I have some ideas and hope to solve them reasonably in the next version of the script.
Alpha0
April 27, 2005
I mean all books and movies.
Mechanize real bas…d..
Suggestions for matching:
1. remove “^The ”
2.
foreach(word in name1){
if(name2.contains(word))
count1++;
totalNumWords1++;
}
foreach(word in name2){
if(name1.contains(word))
count2++;
totalNumWords2++;
}
matchdegree = (count1+count2)/(totalNumWords1 + totalNumWords2).
Sid
April 28, 2005
BTW I implemented this hack at around 3 AM.
I just updated this post with the section entitled – But hey, there is a slight problem!. And while I was at it, also added the customary Conclusion section.
I just generated a list of all books cited by all the people who are connected to me at Orkut at a maximum distance of 2.
The script generated a CSV file containing the names of the books along with their citation counts and I would mail the 20KB file out to you.
BTW, Get a decent handle “AlphaQ”
Alpha0
April 28, 2005
Cool.
I was just going through some website displaying the top 10 books. And strangely enough “Fountain Head” was nowhere.
Just check it out: http://www.abc.net.au/myfavouritebook/top10/default.htm
Sid
April 28, 2005
I was just going through some website displaying the top 10 books. And strangely enough “Fountain Head” was nowhere.
Well, all the results mean is that “Fountain Head” appeared the most in the list of books mentioned by people in their Orkut profiles who were at a maximum distance of 2 from me. The citation count is the only criteria that decided the order in which my script listed the books, which is to say that, you could arrange them in another order based on another criteria.
The idea is that if most people have mentioned a certain book in their profiles, and those people are linked to me via some friends, those books certainly must be worth reading.
My script just helps you find the set of books listed by people in their profiles, and the number of times each book was cited. If a book is cited more times by another, it must certainly be more worth reading than the other.
I am sure you would be able to find many other lists around where “Fountain Head” would appear first in the list of top books.