Over at College Confidential, we have a lot of data that I'd like to analyze. I can use the Discourse Data Explorer to query the bulk of our data, but there's another data set that's currently sitting on an Amazon EC2 instance as a PostgreSQL data dump. So I'd like to populate the PostgreSQL database on my laptop with that data.

My first thought was to just copy the archive and restore it with psql. But then I counted the cost and noticed the dump would take nearly all of my disk space. Presumably I'd need some more space for PostgreSQL to store that data too. So it looked like I'd need to clear out some unused files first.

After some searching around, I discovered SSHFS, which mounts a remote directory over SSH so it behaves like a local one. The instructions for macOS are fairly straightforward:

  1. Download and install macFUSE.
  2. Download and install SSHFS.
  3. Restart your machine.
  4. Create an empty directory to serve as the mount point. (mkdir /path/to/remote)
  5. Mount the remote directory via SSHFS.
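If you prefer Homebrew for the first two steps, macFUSE is available as a cask; as far as I know SSHFS itself is no longer in Homebrew core, so for that you'll still want the package linked from the macFUSE site:

brew install --cask macfuse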

The last step is the only tricky bit, because it depends on how you access the remote machine. If you use an SSH key (you should), you'll need to tell the command where to find it:

sshfs username@example.com:/path/on/remote /path/to/remote \
-o IdentityFile=~/.ssh/key_for_server
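As an aside, sshfs runs over the regular ssh client, so it should pick up settings from ~/.ssh/config. If you'd rather not pass the key on every mount, something like this (using the same made-up host and key names as above) lets you drop the -o option:

# ~/.ssh/config
Host example.com
    User username
    IdentityFile ~/.ssh/key_for_server

sshfs example.com:/path/on/remote /path/to/remote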

If that works, you'll be able to see the remote files with ls /path/to/remote. If you see the data archive, you can start importing it:

psql database_name < /path/to/remote/dump.sql
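If you'd rather have psql stop at the first error instead of churning through the rest of the file, you can also pass ON_ERROR_STOP (entirely optional; the plain invocation above works too):

psql -v ON_ERROR_STOP=1 database_name < /path/to/remote/dump.sql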

Setting up the database (a prerequisite I should have mentioned earlier)

If you are like me, you probably didn't think to create a database first:

createdb database_name
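If you want to double-check that it's there, psql can list your databases:

psql -l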

Or, horror of horrors, you might not have PostgreSQL running yet. On a Mac, you can install it with Homebrew:

brew install postgresql@14

Next, start the server with:

brew services start postgresql@14
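Before kicking off a long restore, it's worth confirming the server is actually accepting connections. pg_isready, which ships with PostgreSQL, gives a quick answer (it checks the default local socket and port unless you tell it otherwise):

pg_isready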

When the restore fails

Unfortunately, this SSH file system setup is a bit fragile. If you lose your network connection, you'll get an error something like this:

ERROR:  COPY from stdin failed: aborted because of read failure
CONTEXT:  COPY user_profiles, line 479387
could not read from input file: Device not configured

The CONTEXT line is the key: it tells you which table was being copied and which line it had reached when the restore failed. For better or worse, COPY is atomic, so when the command failed, none of that table's data was committed to the database. The exact line number it stopped on doesn't really matter; only the table it was in the middle of copying does.
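You can see this for yourself by counting rows in the table named in the CONTEXT line; after the failed COPY it should come back as zero:

psql database_name -c 'SELECT count(*) FROM user_profiles;'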

You can re-run the command if you like, but there's no reason to expect it to get any further. Fortunately, it's possible to trim the archive down to just the data that hasn't been restored yet. One way to do that is with sed, if you know the line number in the dump where the failed table's COPY command starts. I use less to search for it:

/^COPY

Most likely the first result won't be the one you're looking for. You can search forward with n until you find the right table, or add the table name to the search command. Then I press = to get the line number. Finally, I copy everything from that line onward into another file:

sed -n '1248450,$p' dump.sql > shorter-dump.sql 
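If you'd rather skip paging through the file in less, grep -n reports the line number directly. Note that the dump may qualify the table with a schema, e.g. public.:

grep -n '^COPY public.user_profiles' dump.sql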

All of this is best done on the remote machine rather than over SSHFS from the local one, so that sed isn't waiting for data to be pulled across SSH. Once the sed command has finished, you can run the import again from your local machine:

psql database_name < /path/to/remote/shorter-dump.sql

In any case, after a dropped connection there's a good chance you'll need to remount the directory. First unmount the stale mount point:

diskutil umount /path/to/remote
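and then mount it again with the same sshfs command as before:

sshfs username@example.com:/path/on/remote /path/to/remote \
-o IdentityFile=~/.ssh/key_for_server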

If the unmount fails, you'll probably need to restart your machine.

Conclusion

This is not the best way to move data around (see the problems caused by a dropped connection), but it helped me get the job done without using too much disk space.