Guessing Genders

I wanted to make a reasonable guess of gender for approximately 6,000,000 names. My first thought was to seek out a ready-made API, but the first few prices I found were astronomical.

Screen Shot 2013-05-17 at 1.33.33 AM

Since this was a one-time project, I stopped researching and started building.

Step 1: Name Data

I looked at data from both the US Census Bureau and the Social Security Administration. The SSA had better (more recent and more detailed) data, so I went with the SSA’s data, available at http://www.ssa.gov/oact/babynames/limits.html.

Step 2: Create a Database

My goto tool is PostgreSQL, but that seemed a little heavy for a one-off, so I used SQLite. If you’re a developer who has never used SQLite, you should really take an afternoon and learn what it can do for you. There’s a reason it’s embedded all over the place these days.


$ sqlite3 namedb
...		
CREATE TABLE fnames(
	name TEXT,
	gender CHAR(1),
	cnt INT,
	yob INT
);

Step 3: Load the database

The SSA distributed this as a ZIP file containing 133 CSV files in a pleasantly straightforward format, so dropping them into the SQLite database was dead easy.


3) Load Database

#!/usr/bin/env ruby

require 'sqlite3'

begin
  dbh = SQLite3::Database.open './namedb'

  Dir.glob('names/yob*.txt') do |fh|
    yob = fh.match(/[12][890][0-9][0-9]/).to_s
    
    File.open(fh).each_with_index do |l,i|
      print '.' if i % 100 == 0
      
      f_name, gender, cnt = l.chomp.split(',')      
      dbh.execute("INSERT INTO fnames (name, gender, cnt, yob) VALUES (?, ?, ?, ?)", [f_name, gender, cnt, yob])
    end
    puts yob
  end
rescue SQLite3::Exception => e 
  puts "SQLITE Exception: "
  puts e
ensure
  dbh.close if dbh
end

Step 4: Create Indexes

The above created a usable database, but I had a couple million queries to do, so I needed to make sure it was reasonably well indexed. I created a simple index on the relevant columns, and used EXPLAIN to ensure that it was actually being used as expected.


CREATE INDEX ng_idx ON fnames (name, gender);

Step 5: Start Querying

I integrated the following query into my app, and got the data I needed.


SELECT 
	x.gender, x.gender_cnt * 1.0 / y.total_cnt AS gender_pct, y.total_cnt
FROM
	(
		SELECT gender, sum(cnt) AS gender_cnt 
		FROM fnames 
		WHERE name=? 
		GROUP BY gender 
		ORDER BY gender_cnt DESC
		LIMIT 1
	) AS x,
	(
	SELECT sum(cnt) AS total_cnt
	FROM fnames
	WHERE name=?) AS y;

Step 6: Goof around with the data for a moment:

How many babies are named Beyonce?


1998        18        
1999        67        
2000        197       
2001        353       
2002        153       
2003        206       
2004        175       
2005        106       
2006        106       
2007        185       
2008        105       
2009        79        
2010        51        
2011        39        
2012        36 

Was there a spike in Baracks after President Obama was elected?


2007        5         
2008        52        
2009        69        
2010        28        
2011        15        
2012        16     

Was there a spike in Mitts?


1919        6         
2004        6         
2008        5         
2012        8 

Total implementation time: 1h15m

Read full story

Exploring Twitter

My twitter followers are mostly strangers. As such, I wanted to know more about them. In particular, I wanted to learn what interests the people who follow me, and how does that differ what interests me personally?

Gathering Data

I started with Klout. Not the famous numeric score, but rather the influencer information that’s available via the API. I decided to measure my own expressed interests by counting to see how many people I followed were designated as influential on each topic.

I setup developer accounts on Twitter and Klout, and got the desired info with the following script:


#!/usr/bin/env ruby
require 'klout'
require 'rubygems'
require 'twitter'

Twitter.configure do |config|
  config.consumer_key = 
  config.consumer_secret = 
  config.oauth_token = 
  config.oauth_token_secret = 
end

Klout.api_key = 

topics = {}

Twitter.friend_ids('kevindway').each do |f_id|
        begin
                klout_id = Klout::Identity.find_by_twitter_id(f_id)
                user = Klout::User.new(klout_id['id'])

                user.topics.each do |x|
                        topic = x['displayName']
                        topics[topic].nil? ? topics[topic] = 1 : topics[topic] += 1
                end
        rescue Exception => e
                next
        end
end

topics.map { |k,v| puts "#{k},#{v}" }

I dropped the output of that script into Excel, and saw the following mostly unsurprising snapshot of my interests.

Business	158
Technology	97
Entrepreneurship	69
Social Media	54
Venture Capital	48
Finance	25
Science & Technology	22
Marketing	21
Economics	19
Education	18
Philadelphia	17
Product Management	16
Journalism	16
Markets	16
Books	12
Software	12
Music	12
Television	11
Movies	11
Videos	10
Money	10
Government	10
Facebook	10
Investing	10
Comedy	9
Angel Investing	8
Silicon Valley	8
Boston	7
NASA	7
Design	7
Management	6
Advertising	6
Internet Startups	6
User Experience Design	6
Creativity	6
Astronomy	6

I then repeated the procedure substituting followers for friends, and dropped that into Excel as well.

Looking For Differences

My first thought was to use Excel’s conditional formatting to identify topics that were in one interest list but not the other.

All Green

This wasn’t particularly informative. Virtually every topic is in both lists of interest. I decided to narrow that down by considering only the top 50 topics, instead of the 400+ topics that I had found.

Some Red

That’s a bit more useful.

Lessons Learned

I learned that my Twitter followers probably won’t appreciate if I nerd out Science or Astronomy.

I also learned that my followers are quite diverse and that this diversity expresses itself in nearly every imaginable dimension, down to musical taste. (A Soulja Boy expert follows me? Really?)

I was a tad disappointed that the disparities weren’t more surprising or actionable, but it was still a worthwhile exploration.

Read full story