Kevin Way

Finding Feld's Finest Thoughts

I wanted to read Brad Feld's best blog posts, but where to begin? He's incredibly prolific, having racked up thousands of posts in the past eight years. I certainly didn't have the time to comb through that many posts in search of the best, so I decided to see if I could hack something together that would use existing, public social proof to help me find the diamonds. My first thought was to use the PostRank API, but it no longer existed, so I decided to roll my own social scoring algorithm. If you just want to see Brad's best posts, scroll to the bottom of this article.

If you want to see how I made the sausage, keep reading. It's a quick-and-dirty implementation, but the general concepts are sound.

Methodology

Step 1: Get a complete list of blog posts

There are a lot of ways to do this, but I decided to start with Brad's sitemap, which I located by looking in robots.txt. I wrote a short script to download that sitemap, uncompress it, fetch the additional sitemaps it referred to, and find the posts I was interested in. I processed the XML with sed and grep to extract the post URLs, filtering out what appeared to be the artifacts of a now-resolved intrusion on Feld.com.

#!/bin/sh
# Download the sitemap index, fetch the child sitemaps it points to,
# and extract the post URLs into feldurls.txt.
TMP=`mktemp -t $0.$$.XXXXXXXXXX`
clean() {
 rm -f "$TMP"
 exit
}
trap clean HUP INT TERM
wget -q -O "$TMP" http://thisurlwillbestalebythetimeyoureadthis.com/web_sitemap_hash.gz
# Pull the <loc> entries out of the index, fetch each child sitemap, and keep
# only URLs that look like dated blog posts.
gzcat "$TMP" | grep loc | sed -e 's/.*<loc>//' -e 's/<\/loc>//' | xargs -n 1 wget -q -O - \
 | gzcat - | grep loc | sed -e 's/.*<loc>//' -e 's/<\/loc>//' \
 | egrep '^http://www.feld.com/wp/archives/2[0-9]{3}/[0-9]{2}/' | egrep 'html$' > feldurls.txt
clean

This generated a file, feldurls.txt, with the URLs of 4,636 posts.

Step 2: Gather social data

I wrote a short Ruby script to count the number of times each of his articles was shared on Twitter, LinkedIn, Delicious, and Facebook, and to save that off into a JSON blob for later processing. Here's the script I used. I started it, and then went for a short hike.

#!/usr/bin/env ruby
require 'rubygems'
require 'curb'
require 'cgi'
require 'json'

urls = {}
ARGF.each_with_index do |url, i|
  $stderr.puts "#{i}: #{url}"
  begin
    urls[url] = {}

    # Pull the publication year out of the archive URL.
    urls[url][:year] = url.match(/archives\/([0-9]{4})\/[0-9]{2}/)[1].to_i

    # Twitter share count
    twitter_url = "http://urls.api.twitter.com/1/urls/count.json?url=#{CGI::escape(url.chomp)}"
    c = Curl::Easy.new(twitter_url) do |curl|
      curl.headers["User-Agent"] = "feldfinder-1.0"
    end
    c.perform
    urls[url][:twitter] = JSON.parse(c.body_str)['count']

    # Facebook share count
    fb_url = "http://graph.facebook.com/#{url.chomp}"
    c = Curl::Easy.new(fb_url) do |curl|
      curl.headers["User-Agent"] = "feldfinder-1.0"
    end
    c.perform
    urls[url][:fb] = JSON.parse(c.body_str)['shares'] || 0

    # LinkedIn share count
    linked_url = "http://www.linkedin.com/countserv/count/share?url=#{CGI::escape(url.chomp)}&format=json"
    c = Curl::Easy.new(linked_url) do |curl|
      curl.headers["User-Agent"] = "feldfinder-1.0"
    end
    c.perform
    urls[url][:linked] = JSON.parse(c.body_str)['count']

    # Delicious bookmark count
    delicious_url = "http://badges.del.icio.us/feeds/json/url/data?url=#{CGI::escape(url.chomp)}"
    c = Curl::Easy.new(delicious_url) do |curl|
      curl.headers["User-Agent"] = "feldfinder-1.0"
    end
    c.perform
    urls[url][:delicious] = JSON.parse(c.body_str)[0]['total_posts'] rescue 0

    $stderr.puts urls[url].to_json
  rescue => e
    $stderr.puts "ERROR: #{e}"
  end
end
puts urls.to_json
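
The script reads URLs from standard input or from a file named on the command line (that's what ARGF buys you), logs progress to stderr, and prints the final JSON blob to stdout. The post doesn't name the file, so assuming it's saved as feldsocial.rb, running it looks roughly like this:

gem install curb
ruby feldsocial.rb feldurls.txt > feldout.json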

Step 3: Convert the data

I arrived home from my hike to find that Step 2 had finished and produced a JSON object containing the year of publication and the number of shares on each site for every URL. I decided to convert this to CSV so I could manipulate it in a spreadsheet.

#!/usr/bin/env ruby
require 'rubygems'
require 'json'

urls = JSON.parse(File.read("feldout.json"))

# Emit one CSV row per URL: url, year, then the four share counts.
puts "url,year,twitter,fb,linked,delicious"
urls.each do |k, v|
  puts "#{k.chomp},#{v['year']},#{v['twitter']},#{v['fb']},#{v['linked']},#{v['delicious']}"
end
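
As before, the output goes straight to stdout, so running it (the filenames here are my own, not from the post) is a one-liner:

ruby feldcsv.rb > feldscores.csv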

That gave me a CSV, which I uploaded to Google Docs.

Step 4: Score the data

With the data in Google Docs, the next step was to create a score for each post. My goal was the simplest method that would work reasonably well for this application. Counting raw shares wouldn't be useful, because posts from 2012 will naturally have been shared far more often than those from 2004, regardless of quality, so each network's count is normalized against the average for its year of publication. The formula I settled on was:

score = (post's raw Twitter shares)   / (average Twitter shares for that year)
      + (post's raw Facebook shares)  / (average Facebook shares for that year)
      + (post's raw LinkedIn shares)  / (average LinkedIn shares for that year)
      + (post's raw Delicious shares) / (average Delicious shares for that year)
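
I did this arithmetic in the spreadsheet, but the same scoring is easy to run directly against the JSON from Step 2. Here's a minimal sketch, assuming the Step 2 output is in feldout.json; the structure and names below are mine, not the original spreadsheet formulas:

#!/usr/bin/env ruby
require 'rubygems'
require 'json'

NETWORKS = %w[twitter fb linked delicious]

urls = JSON.parse(File.read("feldout.json"))
urls.reject! { |_, v| v['year'].nil? }  # drop entries where Step 2 errored out

# Per-year average share count for each network.
averages = Hash.new { |h, k| h[k] = Hash.new(0.0) }
counts   = Hash.new(0)
urls.each do |_, v|
  counts[v['year']] += 1
  NETWORKS.each { |n| averages[v['year']][n] += (v[n] || 0) }
end
averages.each do |year, sums|
  sums.each_key { |n| sums[n] /= counts[year] }
end

# Score = sum over networks of (post's shares / that year's average shares).
scores = urls.map do |url, v|
  score = NETWORKS.reduce(0.0) do |sum, n|
    avg = averages[v['year']][n]
    avg > 0 ? sum + (v[n] || 0) / avg : sum
  end
  [url.chomp, score]  # keys still carry the trailing newline from feldurls.txt
end

scores.sort_by { |_, s| -s }.first(15).each_with_index do |(url, s), i|
  puts "#{i + 1}. #{url} (#{'%.2f' % s})"
end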

Results

Brad Feld's Fifteen Finest posts are:

15. The torturous world of powerpoint [2004]

14. Venture Capital deal algebra [2004]

13. Discovering work life balance [2005]

12. The difference between Christmas and Chanukah [2005]

11. What's the best structure for a pre-VC investment [2006]

10. How convertible debt works [2011]

9. Sample board meeting minutes [2006]

8. The best board meetings [2009]

7. Zynga Texas Holdem Poker on MySpace [2008]

6. Fear is the mindkiller [2007]

5. The Treadputer [2006] 

4. Term sheet series wrap up [2005]

3. Why most VCs don't sign NDAs [2006]

2. CTO vs VP Engineering [2007]

1. Revisiting the term sheet [2008]

And here is the complete Google Doc.

Is that all?

I did this because I wanted to find some good Brad Feld posts, but the approach is useful in any number of situations. I've also used it to help survey industry publications, competitors, and potential partners.