I wanted to make a reasonable guess of gender for approximately 6,000,000 names. My first thought was to seek out a ready-made API, but the first few prices I found were astronomical.

Since this was a one-time project, I stopped researching and started building.
Step 1: Name Data
I looked at data from both the US Census Bureau and the Social Security Administration. The SSA had better (more recent and more detailed) data, so I went with the SSA’s data, available at http://www.ssa.gov/oact/babynames/limits.html.
Step 2: Create a Database
My goto tool is PostgreSQL, but that seemed a little heavy for a one-off, so I used SQLite. If you’re a developer who has never used SQLite, you should really take an afternoon and learn what it can do for you. There’s a reason it’s embedded all over the place these days.
$ sqlite3 namedb
...
CREATE TABLE fnames(
name TEXT,
gender CHAR(1),
cnt INT,
yob INT
);
Step 3: Load the database
The SSA distributed this as a ZIP file containing 133 CSV files in a pleasantly straightforward format, so dropping them into the SQLite database was dead easy.
3) Load Database
#!/usr/bin/env ruby
require 'sqlite3'
begin
dbh = SQLite3::Database.open './namedb'
Dir.glob('names/yob*.txt') do |fh|
yob = fh.match(/[12][890][0-9][0-9]/).to_s
File.open(fh).each_with_index do |l,i|
print '.' if i % 100 == 0
f_name, gender, cnt = l.chomp.split(',')
dbh.execute("INSERT INTO fnames (name, gender, cnt, yob) VALUES (?, ?, ?, ?)", [f_name, gender, cnt, yob])
end
puts yob
end
rescue SQLite3::Exception => e
puts "SQLITE Exception: "
puts e
ensure
dbh.close if dbh
end
Step 4: Create Indexes
The above created a usable database, but I had a couple million queries to do, so I needed to make sure it was reasonably well indexed. I created a simple index on the relevant columns, and used EXPLAIN to ensure that it was actually being used as expected.
CREATE INDEX ng_idx ON fnames (name, gender);
Step 5: Start Querying
I integrated the following query into my app, and got the data I needed.
SELECT
x.gender, x.gender_cnt * 1.0 / y.total_cnt AS gender_pct, y.total_cnt
FROM
(
SELECT gender, sum(cnt) AS gender_cnt
FROM fnames
WHERE name=?
GROUP BY gender
ORDER BY gender_cnt DESC
LIMIT 1
) AS x,
(
SELECT sum(cnt) AS total_cnt
FROM fnames
WHERE name=?) AS y;
Step 6: Goof around with the data for a moment:
How many babies are named Beyonce?
1998 18
1999 67
2000 197
2001 353
2002 153
2003 206
2004 175
2005 106
2006 106
2007 185
2008 105
2009 79
2010 51
2011 39
2012 36
Was there a spike in Baracks after President Obama was elected?
2007 5
2008 52
2009 69
2010 28
2011 15
2012 16
Was there a spike in Mitts?
1919 6
2004 6
2008 5
2012 8
Total implementation time: 1h15m



























