November 4, 2018

Recreating the PwnedPasswords API Locally

The PwnedPasswords API is a wonderful project by Troy Hunt that provides, among other things, the ability to check whether a given password has been exposed in a public data breach. For sites with user accounts, the API can be queried during signup to determine whether the user is trying to use an exposed password. While this is invaluable for protecting users, it is not always desirable for a commercial site to query a third-party API as part of the signup process. For this reason, and because it'll be fun, I'm going to walk through how to recreate this API using the publicly available password list so it can be self-hosted.

Choosing an Approach

The first approach that came to mind was to dump every known compromised password hash into a database, let the database index it, and then query the database for matches. This would work fine, but I'm generally averse to dumping 21GiB of text into a database if I can avoid it.
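For the record, the database route might look something like this with sqlite3 (a hypothetical schema; the dump's lines are already in HASH:COUNT form, so they import directly, though importing over half a billion rows would take quite a while):

# A hypothetical sketch of the database approach using sqlite3.
# Strip the \r first if your copy of the dump has CRLF line endings.
sqlite3 pwned.db <<'SQL'
CREATE TABLE hashes (hash TEXT PRIMARY KEY, count INTEGER);
.separator :
.import pwned-passwords-ordered-by-hash.txt hashes
SQL

# A point lookup once the import finishes:
sqlite3 pwned.db "SELECT count FROM hashes WHERE hash = '21BD10018A45C4D1DEF81644B54AB7F969B88D65';"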

An approach that obviously won't work is to grep the file for the target hash on every request. Scanning a 21GiB file means several seconds of pure I/O even on the fastest disks. However, the file is sorted by hash, so we could do a much more efficient search by seeking around in the file and homing in on the hash we want with a binary search. This approach would require a chunk of custom code for searching the file.
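As an aside, if you just want to poke at the sorted file by hand once we have it, the venerable look(1) utility implements exactly this kind of binary search over a sorted file:

$ look 21BD10018A45C4D1DEF81644B54AB7F969B88D65 pwned-passwords-ordered-by-hash.txt
21BD10018A45C4D1DEF81644B54AB7F969B88D65:1

That's handy for spot checks, but it doesn't get us an API on its own.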

What if we preprocessed the file and took advantage of the filesystem to provide a primitive index into our data? We could use a hierarchy of directories, one level per character of the hash up to a predetermined depth. All hashes that start with C6DB8 can live in the file C/6/D/B/8, which we can return wholesale as our API response, exactly mimicking the behavior of the real API. For this approach, we don't need a database or even any code! A little webserver configuration to map the API query to the correct file will be enough. This seems like the simplest option with sufficient performance, so let's use it.

Processing the Hashes

First we need to get the list of hashes. This list is kindly made available via both bittorrent and Cloudflare-sponsored direct download. To be nice, let's fetch the file over bittorrent:

$ aria2c "https://downloads.pwnedpasswords.com/passwords/pwned-passwords-ordered-by-hash.7z.torrent"

And extract it:

$ 7z e pwned-passwords-ordered-by-hash.7z

Now that we have the file full of insecure password hashes, let's decide how we want to break it up. The hashes are all in hexadecimal, so we'll create 16 directories for the first character, 16 subdirectories in each of those for the second character, and so on through the fourth character; the fifth character becomes the filename. By the end, we'll have created 16^5 = 1,048,576 files and 16 + 16^2 + 16^3 + 16^4 = 69,904 directories, for a total usage of about 1.1 million inodes (doable!).
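You can sanity-check that arithmetic in the shell:

$ echo $((16**5)) $((16 + 16**2 + 16**3 + 16**4))
1048576 69904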

To start, let's create all the directories we need.

for i in {{0..9},{A..F}}/{{0..9},{A..F}}/{{0..9},{A..F}}/{{0..9},{A..F}}; do mkdir -p hashes/$i; done
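Once the loop finishes, a quick count confirms we got the expected number of directories (find's total includes the hashes directory itself):

$ find hashes -type d | wc -l
69905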

Now let's run a simple awk script that splits the file based on the first five characters of each line, placing each line (sans those first five characters) in the file whose path they spell out. Note that the empty field separator, which splits each line into one-character fields, is a gawk feature that other awks may not support:

awk -F '' '{ print substr($0,6) > "hashes/"$1"/"$2"/"$3"/"$4"/"$5 }' pwned-passwords-ordered-by-hash.txt

Both of these will probably take some time to run, depending on your disk speed. You can start the awk command while the mkdir loop is still running: brace expansion emits directories in the same order (0-9, then A-F) that the sorted hash file is processed in, and the mkdir loop will almost certainly stay ahead, so you'll never end up trying to write to a file in a directory that doesn't exist.
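One caveat with the awk one-liner: it leaves every output file open as it goes. gawk juggles file descriptors behind the scenes once it hits the system limit, but that juggling isn't free. Since the input is sorted, each output file is written in one contiguous run, so a variant that closes the previous file whenever the prefix changes keeps descriptor usage constant:

awk -F '' '{
  f = "hashes/"$1"/"$2"/"$3"/"$4"/"$5
  # The input is sorted, so once the prefix changes we never revisit a file.
  if (f != prev) { if (prev != "") close(prev); prev = f }
  print substr($0,6) > f
}' pwned-passwords-ordered-by-hash.txt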

Once that's done, we'll have a bunch of files available for serving up from our hashes directory.

$ wc -l hashes/2/1/B/D/1
      497 hashes/2/1/B/D/1
$ head -5 hashes/2/1/B/D/1
0018A45C4D1DEF81644B54AB7F969B88D65:1
00D4F6E8FA6EECAD2A3AA415EEC418D38EC:2
011053FD0102E94D6AE2F8B83D76FAF94F6:1
012A7CA357541F0AC487871FEEC1891C49C:2
0136E006E24E7D152139815FB0FC6A50B15:3

Looking good!

Now all we need to do is serve these files with a little nginx trickery. We'll add this location block to our nginx config, restricting each capture to an uppercase hex digit so malformed requests don't match:

location ~ ^/hashes/([0-9A-F])([0-9A-F])([0-9A-F])([0-9A-F])([0-9A-F])$ {
  default_type text/plain;
  alias /path/to/hashes/$1/$2/$3/$4/$5;
}

Reload nginx and try requesting something like /hashes/21BD1 and you'll get the contents of the file /hashes/2/1/B/D/1: the remainder of every hash that starts with 21BD1, each with its breach count. This behaves just like the official API for valid requests.
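To round things out, here's what a client-side check against our new API might look like, assuming the server is reachable at localhost (the hostname and password here are placeholders):

# Hash the candidate password, query by its 5-character prefix, then
# look for the 35-character remainder in the response.
password='correct horse battery staple'
hash=$(printf '%s' "$password" | sha1sum | awk '{ print toupper($1) }')
if curl -sf "http://localhost/hashes/${hash:0:5}" | grep -q "^${hash:5}:"; then
  echo "password found in a breach"
else
  echo "password not found"
fi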

And with that we're done! We now have a performant API that behaves just like the original without needing to continuously run any code or even use a database. Simplicity at its finest.