I wanted to make a reasonable guess of gender for approximately 6,000,000 names. My first thought was to seek out a ready-made API, but the first few prices I found were astronomical.
Since this was a one-time project, I stopped researching and started building.
Step 1: Name Data
I looked at data from both the US Census Bureau and the Social Security Administration. The SSA had better (more recent and more detailed) data, so I went with the SSA’s data, available at http://www.ssa.gov/oact/babynames/limits.html.
Step 2: Create a Database
My goto tool is PostgreSQL, but that seemed a little heavy for a one-off, so I used SQLite. If you’re a developer who has never used SQLite, you should really take an afternoon and learn what it can do for you. There’s a reason it’s embedded all over the place these days.
$ sqlite3 namedb ... CREATE TABLE fnames( name TEXT, gender CHAR(1), cnt INT, yob INT );
Step 3: Load the database
The SSA distributed this as a ZIP file containing 133 CSV files in a pleasantly straightforward format, so dropping them into the SQLite database was dead easy.
3) Load Database #!/usr/bin/env ruby require 'sqlite3' begin dbh = SQLite3::Database.open './namedb' Dir.glob('names/yob*.txt') do |fh| yob = fh.match(/[0-9][0-9]/).to_s File.open(fh).each_with_index do |l,i| print '.' if i % 100 == 0 f_name, gender, cnt = l.chomp.split(',') dbh.execute("INSERT INTO fnames (name, gender, cnt, yob) VALUES (?, ?, ?, ?)", [f_name, gender, cnt, yob]) end puts yob end rescue SQLite3::Exception => e puts "SQLITE Exception: " puts e ensure dbh.close if dbh end
Step 4: Create Indexes
The above created a usable database, but I had a couple million queries to do, so I needed to make sure it was reasonably well indexed. I created a simple index on the relevant columns, and used EXPLAIN to ensure that it was actually being used as expected.
CREATE INDEX ng_idx ON fnames (name, gender);
Step 5: Start Querying
I integrated the following query into my app, and got the data I needed.
SELECT x.gender, x.gender_cnt * 1.0 / y.total_cnt AS gender_pct, y.total_cnt FROM ( SELECT gender, sum(cnt) AS gender_cnt FROM fnames WHERE name=? GROUP BY gender ORDER BY gender_cnt DESC LIMIT 1 ) AS x, ( SELECT sum(cnt) AS total_cnt FROM fnames WHERE name=?) AS y;
Step 6: Goof around with the data for a moment:
How many babies are named Beyonce?
1998 18 1999 67 2000 197 2001 353 2002 153 2003 206 2004 175 2005 106 2006 106 2007 185 2008 105 2009 79 2010 51 2011 39 2012 36
Was there a spike in Baracks after President Obama was elected?
2007 5 2008 52 2009 69 2010 28 2011 15 2012 16
Was there a spike in Mitts?
1919 6 2004 6 2008 5 2012 8
Total implementation time: 1h15m