Installing PostgreSQL the hard way
Since I joined EDB, I figured it
was time to learn how to install Postgres. I techinically did it when
I installed Discourse. And at
some point I seem to have typed brew intstall postgres
on my
MacBook. But I'm talking installing it the hard way. I'm talking
installing from the source
code. Yeah!
I've always found that the only way to learn new technology is to have a reason to use it. Work-related requirements suffice, but it's even better to have a hobby-related reason. For example, creating a baseball database. If you want to do research on baseball history, a great place to start is Project Retrosheet which has detailed play-by-play files going back to 1916. So if you load this data into a database, you can do all sorts of queries to find out how often (or even whether) certain events have happened. For instance, has anyone ever hit two grand slams in an inning?1
Prerequisites
The instructions for installing Postgres and for setting up a Retrosheet database are incomplete for a very simple reason: prerequisites. People who have correctly set up their environment already have everything they needed. People who are getting started don't know what they are missing. Critically, it takes a lot of effort to figure out which of the many possible prerequisites are actually required. So what's needed is either someone to start from scratch and document everything they needed to install.
I got a PC for my new job and this is the first time in nearly a decade that I've used Windows in earnest. So I have some adjustments to make. I started documenting them over here, but I have a feeling this list will grow over time.2
One of the reasons I'm documenting these steps is that I'm going back and testing the installation process using a pristine Windows virtual machine. That way I can test what things are really needed and what is optional or unnecessary. Or, at least I could in theory and maybe will in practice.
Getting Retrosheet data
Retrosheet's data has an idiosyncratic format, so it helps to have
software to parse it. So the next step is to download the
command-line tools
maintained by the Chadwick Baseball
Bureau. Then I move them to
C:\Users\[username]\bin
and add that to my PATH environment
variable. That way I can use those executables without specifying the
full path. More importantly, so can R scripts.
Since the instructions I'm following use R, the next step is to download a recent version of R. While not strictly necessary, I also grabbed RStudio, which is a fantastic IDE that uses R Markdown. It's a form of literate programming that allows both a data-centric article and the script that generated the data to live in the same document.3
Next you gotta install a bunch of R libraries the script need. Start
with
devtools
and
then run the following commands from the RStudio console window:
install.packages("devtools")
install.packages('h2o')
install.packages('emayili')
install_github("BillPetti/bpettir")
install.packages("purrr")
I made a few tweaks to the script Bill Petti uses to parse and load Retrosheet data. One problem is that several packages must be loaded first:
require(bpettir)
require(future)
require(furrr)
require(purrr)
This can go at the top of the script where the other require
statements live for simplicity. I also changed "~/Desktop/retrosheet"
to "~/retrosheet"
and added setwd
commands
to make sure the script is in the right directory for each input file
it needs. For some reason, RStudio actually puts its output in
~/Documents/retrosheet
. I think it uses a
sandbox for R
that has ~/Documents
for a home directory.
Installing Postgres from source
I checked all the steps right up to populating Postgres database tables. While its certainly possible to use R to analyze the data, my goal was to have a reason to install Postgres. Somewhat annoyingly for the purposes of this post, the short version worked almost perfectly using WSL 2.
The only problem I had was that I don't know the password for the
Administrator account so I couldn't use su
. Thankfully, I found that
I could use sudo su
instead. That
confused me until I read some answers on Ask
Ubuntu. Turns
out su
asks for the password of the account you want to switch to
and sudo
asks for the password of the current account.
Installing prerequisites from the command line
Since I wanted to install Postgres the hard way I decided to try installing it on Windows using the Microsoft toolchain. Since I'm not a big fan of GUI installers, I was pleased to find I could use Chocolatey to install the prerequisites.4 After installing Chocolatey, open a PowerShell window (as an administrator) and run the following command:
choco install ActivePerl git openssl winflexbison python archiver -y
That should install all of the third-party tools you'll need to build
Postgres. You might need to add C:\Perl64\bin
to your PATH
and run
refreshenv
. Then install the Visual Studio tools you'll need to
build:
choco install visualstudio2019community windows-sdk-10.1 -y
(The -y
is optional and simply avoids prompting for confirmation by
automatically saying "yes" to any questions the installer might
ask. You normally want to just say yes.)
While you are at it, you can install all sorts of useful software from the command line:
choco install Emacs vim texlive
And even some of the software I installed by downloading a GUI installer:
choco install R.Studio wsl2
Installing MSVC Build Tools
There is one feature that I haven't figured out how to install via Chocolatey. It's the Visual Studio C++ build tools, which can be installed via the Visual Studio Installer. You can start it up from the Visual Studio 2019 menu system:
Tools => Get Tools and Features...
You can also just run it independently. Either way, you'll want to modify your version of Visual Studio:
Modify => Desktop development with C++ => MSVC v142 - VS 2019 C++ x64/x86 build tools
By the way, I'm including the version numbers I used, but it probably will work fine with newer (and perhaps older) versions. Another constant problem with installing software is that sometimes new releases break stuff that used to work. For instance, Ruby 3 came out recently and it breaks the way I build my site. This is a constant problem with guides about software and closely related to the prerequisites problem. Most of the time "install X" is helpful, but sometimes you need to say "install version Y of X" instead.
Building Postgres
At this point we're finally ready to start building Postgres! I'm
going to show how to do that in PowerShell. But not just any
PowerShell. For this you need to start up Developer PowerShell for VS
2019. Regular PowerShell doesn't work because the build tools aren't
readily available.5 There's also a Developer Command Prompt, but I
prefer PowerShell for a few Unix-like enhancements. The following
commands, create a pg
directory for playing in, move into that
directory, get the Postgres source code, extract the files and move to
directory where we'll build:
mkdir pg
cd pg
wget https://ftp.postgresql.org/pub/source/v13.2/postgresql-13.2.tar.bz2 -OutFile postgresql-13.2.tar.bz2
arc unarchive .\postgresql-13.2.tar.bz2
cd .\postgresql-13.2\src\tools\msvc\
Now PowerShell is really handy, but notice how the wget
command
requires an -OutFile
parameter. If you leave it off, the command
reads the URL, shows some metadata and doesn't save the
file. It's
just an alias for Invoke-WebRequest -Uri
. It's
like they get it, but they don't get it. I'm not sure what's the point
of fetching files from the internet and immediately throwing them away.
Anyway, if you've set everything up right, you should be able to run
the .\build
command. If you haven't installed all the right
prerequisites or aren't running in the Developer PowerShell, you'll
get an error like this:
PS C:\pg\tmp\postgresql-13.2\src\tools\msvc> .\build
Unable to determine Visual Studio version: The nmake command wasn't found. at C:/pg/tmp/postgresql-13.2/src/tools/msvc/Mkvcbuild.pm line 93.
Otherwise, the output will end something like:
FinalizeBuildStatus:
Deleting file ".\Release\euc2004_sjis2004\euc2004_sjis2004.tlog\unsuccessfulbuild".
Touching ".\Release\euc2004_sjis2004\euc2004_sjis2004.tlog\euc2004_sjis2004.lastbuildstate".
Done Building Project "C:\pg\postgresql-13.2\euc2004_sjis2004.vcxproj" (default targets).
Done Building Project "C:\pg\postgresql-13.2\pgsql.sln" (default targets).
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:04:01.81
Installing Postgres
Once the build succeeds, it's time to install:
.\install \pg
By the way, it doesn't much matter where you install Postgres. It can
help to put it somewhere you will remember later. The EDB
installer
puts it in C:\Program Files\PostgreSQL\13
, which probably is a
better choice. Wherever you install it, you'll find a bin
directory
that has all the commands you'll need to start the database. You'll
need to create a data
directory to hold, well, all your data. While
you could run using your own account, it's better to run it under
another user
account. (A
good choice is"postgres".) Either way, make sure your data directories
(but not the executables) are owned by the account that will run Postgres.
In the bin
directory, you can find the commands to initialize a
database directory, start a Postgres server using that data storage
and create a database called "retrosheet":
.\initdb.exe -D \pg\data
.\pg_ctl -D \pg\data -l logfile start
.\createdb.exe retrosheet
Then go ahead and make sure the database exists and is empty:
PS C:\pg\bin> .\psql retrosheet
psql (13.2)
WARNING: Console code page (437) differs from Windows code page (1252)
8-bit characters might not work correctly. See psql reference
page "Notes for Windows users" for details.
Type "help" for help.
retrosheet=# \d
Did not find any relations.
retrosheet=#
Writing Retrosheet data to Postgres
Poor retrosheet database. Let's give it family, shall we?
Now I have to admit something kinda embarrassing here. Building Postgres from source was surprisingly easy, but I made connecting to the database from R incredibly hard. My only excuse is that the last time I had to set up a database connection on my own was over a decade ago. Pretty much the only thing I remembered was that ODBC was involved somehow. When I mentioned how frustrating ODBC still is to install, a colleague kindly pointed out it's not necessary anymore.
What makes this especially embarrassing is that back in 2006, I wrote a Perl script to connect to arbitrary database servers using the DBI module which is exeedingly similar to the R DBI package. Not only do you not need ODBC, DBI is a competitor.6 Connecting to our new database via R is as simple as installing two packages:
install.packages('DBI')
install.packages('RPostgres')
And using the appropriate connection string:
require(DBI)
retrosheet_db <- dbConnect(RPostgres::Postgres(),
dbname = 'retrosheet',
host='localhost',
port='5432',
user='postgres')
By default, you don't need to specify the postgres
password
when connecting via the same machine. If you do need a password, it's
good practice to save it in an environment variable and use something
like password=Sys.getenv('PG_PWD')
so you don't hardcode the
secret. And with that, you can execute the dbWriteTable
and
dbGetQuery
commands list in Bill Petti's
instructions. It
takes a while, but when everything is done, you can see two new tables
have been created:
=retrosheet=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | retro_pbp | table | postgres
public | retro_rosters | table | postgres
And if you are really lucky, they will have data:
retrosheet=# select count(*) from retro_pbp;
count
----------
13792467
(1 row)
retrosheet=# select count(*) from retro_rosters;
count
-------
92727
(1 row)
Has anyone ever hit two grand slams in a single inning?
Now we can start doing some baseball research. For instance, if you know that a grand slam is the only way to get 4 RBIs in a single play,7 it's not hard to find all the batters who have hit at least two grand slams in a game:
select game_id, bat_id, count(*)
from retro_pbp p
where rbi_ct = 4
group by game_id, bat_id
having count(*) > 1;
game_id | bat_id | count |
---|---|---|
BOS199905100 | garcn001 | 2 |
CLE196806240 | nortj101 | 2 |
CLE199808140 | hoilc001 | 2 |
LAN199904230 | tatif001 | 2 |
MIL200907270 | willj004 | 2 |
MIN196105090 | gentj101 | 2 |
PHA193605240 | lazzt101 | 2 |
PHA193907042 | taboj101 | 2 |
SFN196607030 | clont101 | 2 |
SLA194607270 | yorkr101 | 2 |
TEX199509040 | ventr001 | 2 |
TEX200307290 | muelb001 | 2 |
WS2197006260 | robif103 | 2 |
(13 rows) |
While you might not understand the significance of this report, I
did. It meant I had not only managed to install Postgres, but also
load it up with over a 100 years of baseball history. One particular
bat_id
stands out as the player I'm looking for. Next, I queried
for all the times when a batter has hit two grand slams in a single
inning:
select game_id, inn_ct, bat_id, count(*)
from retro_pbp
where rbi_ct = 4
group by game_id, inn_ct, bat_id
having count(*) > 1;
game_id | inn_ct | bat_id | count |
---|---|---|---|
LAN199904230 | 3 | tatif001 | 2 |
(1 row) |
Now you can take that game_id
, plug it into the nearest search
engine and discover St. Louis vs the Dodgers on April 23,
1999. There,
in the third inning, Fernando Tatis slams twice. If you feel like it,
you can watch a crude
animation
of how it happened. Think about the series of events that had to
happen just so in order for Tatis to get up with the bases loaded
twice.
Meanwhile, think of all the little things that needed to be initiated to get a working database server up and running. Thankfully software can be quite a bit easier to manipulate than a baseball game.
-
If you've watched as much baseball as I have, you know the answer. Don't spoil it for everyone else! ↩
-
The #1 revelation was that Windows uses the Control key for most of the stuff Mac uses the Command key for. To make life more complicated, Mac keyboards have an Option key that functions like Alt and it's in the same position where the Windows key is on a PC keyboard. Eventually I realized I just want to be able to use my thumb rather than my pinkie to copy and paste things. So making Alt into another Control key and making the Windows key the new Alt solved my problem. ↩
-
That said, it's a little much in R Studio when it also puts the output of the scripts in the window where the source is. Fortunately you can turn that off by going to Tools => Global Options => R Markdown => Basic => Show output line for all R Markdown documents. ↩
-
A big hat-tip to @JourneymanGeek who pointed this out to me. ↩
-
There is certainly some way of setting up environment variables to make this work, but this post is way too long already and I can't be bothered to figure it out. Just use the Developer PowerShell. ↩
-
Interestingly, DBI can connect with a database connected to ODBC. This fact made my searches misleading since it's entirely possible (but rarely desirable) to connect to Postgres that way. ↩
-
I later doublechecked I got the same result if I included only place that cwevent coded as a home run. ↩
If you want to talk with me about community management schedule a meeting!