Prototyping with shell scripts
I posted a thread on Twitter about how I was adding my Board Game Geek and Steam reviews to this blog. I mentioned how much I enjoy using shell scripts for prototyping ideas and Steve Bennett replied:
Huh, I find bash much worse for prototyping because no good debuggers, no libraries etc. But maybe if you are super familiar with it…
Since my answer doesn’t fit in 280 characters, I decided to write a post.
To start, being familiar with the Unix shell does make a difference in how easy it is to build prototypes. I wouldn’t necessarily say it’s worth learning shell scripting just for that purpose. But it is an incredibly powerful tool that more people (especially programmers) could benefit from using. For instance, I’m writing a series about cleaning up a Git repository and we needed to drop down into the command line many times in order to finish the work. Expertise with shell commands regularly saves me hours of tedious work.1
To demonstrate how I deal with the lack of a debugger, libraries and other niceties, I’ll walk through building a script to import my Steam reviews. The first step is to find my reviews on the web. Luckily, there is a view of just my reviews: https://steamcommunity.com/id/jlericson/recommended. There are several ways to pull data from the internet, but the two most common are:
You might recognize curl from libcurl. Both were created by Daniel Stenberg. It’s reasonable to think of curl as a wrapper for using libcurl on the command line. But it would be equally valid to think of libcurl as a tool for using curl’s functionality within other programming languages. There are many command-line tools that are developed in tandem with libraries. Here’s a sample of commands I’ve used in my shell scripts that mirror functionality often found in libraries:
- tr, sed, awk and perl: String manipulation (and more!).
- pandoc: Markup format conversion.
- ImageMagick: Image manipulation.
- ffmpeg: Audio and video editing.
- jq and yq: JSON and YAML processing.
- bc and dc: Math libraries.
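As a small illustration of commands standing in for library calls, here is a minimal sketch using only tr, sed and awk. The sample line and the variable names are invented for the example; nothing here comes from the import script itself.

```shell
# Sample input invented for this example.
line='The Plan,2018-08-24,Recommended'

# tr: upper-case a string, much as a string library's upper() would.
upper=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]')

# sed: regular-expression substitution.
dashed=$(printf '%s\n' "$line" | sed 's/,/ - /g')

# awk: split on a delimiter and pick the second field.
posted=$(printf '%s\n' "$line" | awk -F, '{ print $2 }')

printf '%s\n' "$upper" "$dashed" "$posted"
```

Each command reads standard input and writes standard output, which is what lets them be chained later on.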
Unix philosophy means these commands interact with each other in ways their creators could never imagine. I’ll also demonstrate how easy it can be to create a script that can be combined with other commands to accomplish quite complex tasks. I’m going to use curl to pull down a list of my reviews. Just passing a URL to curl outputs the contents of the page:
$ curl https://steamcommunity.com/id/jlericson/recommended
If you run that command, you should see an HTML document that shows my reviews. As it happens, the site is down as I write this. So I can show off a critical debugging tool:
$ curl https://steamcommunity.com/id/jlericson/recommended \
| less
Piping2 output to less allows you to examine and search at your leisure. So I was able to search for “error” by typing /error and cycle through the results with n until I found:
<div id="error"><img id="image" src="#" alt="sticker"></div>
<div id="headline">Something Went Wrong</div>
<div id="message">
We were unable to service your request. Please try again later.
</div>
Checking Down for Everyone or Just Me proved the problem wasn’t that I’d violated a rate limit or somesuch.3
Related debugging tools:
- head: Show the first -n lines of input.
- tail: Show the last -n lines of input. Or, with the -f option, monitor lines appended to a file.
- tee: Pipe input to a file and to standard output.
- grep: Filter input using a regular expression.
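Here is a rough sketch of those four tools on generated input, so it can be tried without touching the network. The sample function and the temporary file are made up for the example.

```shell
# Generate 20 numbered sample lines standing in for a command's output.
sample() { seq 1 20 | sed 's/^/line /'; }

sample | head -n 3      # show the first three lines
sample | tail -n 2      # show the last two lines
sample | grep 'line 5'  # keep only lines matching a regular expression

# tee saves a copy of the intermediate data while passing it down the pipe.
tmp=$(mktemp)
matches=$(sample | tee "$tmp" | grep -c 'line 1[0-9]')
saved=$(wc -l < "$tmp" | tr -d ' ')
rm -f "$tmp"
echo "matched $matches of $saved lines"
```

The tee step is handy when a filter further down the pipe returns less than expected: the saved copy shows what the filter was actually given.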
When Steam isn’t down, I can also use less to help me pinpoint the data I want to extract. For now, I just want to get a list of URLs of my reviews. With a little searching, I found:
<div class="title"><a href="https://steamcommunity.com/id/jlericson/recommended/246620/">Not Recommended</a></div>
This markup is a bit more complicated than the read trick can handle. So I’ll need to use one of the dozens of HTML parsers available for the command line. I chose pup, which has a simple, flexible syntax. After a few tries, I extracted just the URLs I need:
$ curl https://steamcommunity.com/id/jlericson/recommended \
| pup '.title > a attr{href}'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 86426 0 86426 0 0 153k 0 --:--:-- --:--:-- --:--:-- 153k
https://steamcommunity.com/id/jlericson/recommended/246620/
https://steamcommunity.com/id/jlericson/recommended/646570/
https://steamcommunity.com/id/jlericson/recommended/951940/
https://steamcommunity.com/id/jlericson/recommended/226960/
https://steamcommunity.com/id/jlericson/recommended/225540/
https://steamcommunity.com/id/jlericson/recommended/357070/
https://steamcommunity.com/id/jlericson/recommended/91200/
https://steamcommunity.com/id/jlericson/recommended/862920/
https://steamcommunity.com/id/jlericson/recommended/374940/
https://steamcommunity.com/id/jlericson/recommended/613120/
The first thing to do is eliminate the progress meter that curl outputs. Typically the option is either -q (for “quiet”) or -s (for “silent”). Checking man curl, the answer in this case is -s.
Meanwhile, the pup command looks for HTML tags with the “title” class (.title) with a child (>) hyperlink (a). Then it prints the “href” attribute (attr{href}). There are other ways to get to this data, but as long as Steam doesn’t change the template for user review listings, this’ll work.
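One of those other ways, for a machine without pup, is a line-based sed filter. This is only a sketch and not a real HTML parser: it assumes each link sits on a single line in exactly the shape shown earlier. The sample markup, including the second review's "Recommended" label, is invented for the example.

```shell
# Sample markup in the shape of the listing page (partly invented).
sample_html() {
cat <<'HTML'
<div class="title"><a href="https://steamcommunity.com/id/jlericson/recommended/246620/">Not Recommended</a></div>
<div class="other"><a href="https://example.com/ignored">Ignored</a></div>
<div class="title"><a href="https://steamcommunity.com/id/jlericson/recommended/646570/">Recommended</a></div>
HTML
}

# Print the href of a link that directly follows a div with class "title".
# -n suppresses normal output; the trailing p prints only lines that matched.
extract_urls() {
    sed -n 's/.*<div class="title"><a href="\([^"]*\)".*/\1/p'
}

sample_html | extract_urls
```

A proper parser such as pup copes with attributes in a different order or markup split across lines, which this filter does not.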
I have 42 reviews, but this command only shows 10. That’s because Steam puts 10 reviews on a page. To get to the second page, I can append ?p=2 on the URL:
$ curl -s https://steamcommunity.com/id/jlericson/recommended?p=2 \
| pup '.title > a attr{href}'
zsh: no matches found: https://steamcommunity.com/id/jlericson/recommended?p=2
EOF
Now we see one of the truly annoying aspects of shell scripting: quote madness. The problem is that zsh (and other Bourne shell variants) interprets ? as a single-character wildcard. Since I don’t have any files that match, the shell rejects the command. Fortunately, it’s easy to fix with judicious use of single-quotation marks:
$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=2' \
| pup '.title > a attr{href}'
Sadly, it’s not always so easy to fix. What if you need to quote a string that includes a quotation mark? It’s enough of a problem when writing perl one-liners that Perl provides an array of quote-like operators. Debugging this sort of problem really is shell hell.
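For the common cases, the standard workarounds look like the sketch below. The variable names are invented for the example; the quoting rules themselves are the same in bash and zsh.

```shell
# Single quotes keep ? and other wildcard characters literal.
url='https://steamcommunity.com/id/jlericson/recommended?p=2'

# A single quote cannot appear inside single quotes, so close the string,
# add a backslash-escaped quote, and reopen the string: '\''
apostrophe='this'\''ll work'

# Inside double quotes, escape embedded double quotes with a backslash.
attribute="class=\"title\""

printf '%s\n' "$url" "$apostrophe" "$attribute"
```

Printing each value with printf '%s\n' is a quick way to check what the shell actually passed along after quote removal.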
Now that I can access each of the 5 pages individually, I’m going to want to run a command to get the links from all 5 pages. I could write a loop in shell, but as it happens, curl has a trick up its sleeve:
$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a attr{href}'
This time I used [1-5], which curl expands to five URLs:
https://steamcommunity.com/id/jlericson/recommended?p=1
https://steamcommunity.com/id/jlericson/recommended?p=2
https://steamcommunity.com/id/jlericson/recommended?p=3
https://steamcommunity.com/id/jlericson/recommended?p=4
https://steamcommunity.com/id/jlericson/recommended?p=5
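For comparison, the shell loop alternative would generate the same five URLs. A minimal sketch, where page_urls is a made-up helper name:

```shell
# Print the same five page URLs that curl's [1-5] range expands to.
page_urls() {
    base='https://steamcommunity.com/id/jlericson/recommended'
    for page in 1 2 3 4 5; do
        printf '%s?p=%s\n' "$base" "$page"
    done
}

page_urls
```

Each printed URL could be handed to curl -s inside the loop. The built-in range does the same job in a single command.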
I can verify that the new command produces 42 URLs by piping the output to the wc (short for “word count”) utility:
$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a attr{href}' \
| wc -l
42
This sort of testing of partial results is extremely common. More formally, the Unix shell implements a read-eval-print loop (REPL). In fact, it’s the world’s tightest read-eval-print loop since each command is evaluated instantly and automatically after the user presses “Return”. Sometimes (such as when using curl to connect to a slow service) the results take a moment to be printed, but the shell doesn’t add any delay. Often I spot my mistake, edit the command and try again within seconds of seeing the output. It’s an ideal environment for building a quick prototype.
Eventually I’ll want to save commands into a file so that I can run them later. It can be helpful to use the history command to find previously executed commands. Then copy them to a file with the appropriate hashbang:
#!/usr/bin/env zsh
I prefer KornShell over bash, but these days Zsh would be my choice. After you save that file and make it executable, it can be run just like any other shell command. To jump right to the chase, I wrote steam_review_import.sh which takes the URL of a Steam review and converts it to a Markdown file in the _posts directory:
$ ./steam_review_import.sh https://steamcommunity.com/id/jlericson/recommended/250600/
https://steamcommunity.com/id/jlericson/recommended/250600/
$ ls -l _posts/2018-08-24-the_plan.md
-rw-r--r--@ 1 jericson staff 499 Aug 29 21:03 _posts/2018-08-24-the_plan.md
The import script constructs the filename from the date the review was posted and the name of the game reviewed. Next, it writes the YAML front matter that Jekyll is looking for. It also converts the HTML body of the review into something more like Markdown.4 I tried to handle spoiler tags, but there are a few oddities I haven’t yet resolved. Good enough for government work, as we used to say when I worked for the government.
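The script itself isn’t reproduced here, so the following is only a sketch of how the filename and front matter steps might look, not the actual steam_review_import.sh. The function names and the choice of front matter fields (title and date) are assumptions; the sample date and title are taken from the listing above.

```shell
# Build a Jekyll-style post filename from a date and a game title.
post_filename() {
    posted=$1
    title=$2
    # Lower-case the title, turn runs of other characters into single
    # underscores, then trim any underscore left at either end.
    slug=$(printf '%s\n' "$title" \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/[^a-z0-9][^a-z0-9]*/_/g; s/^_//; s/_$//')
    printf '_posts/%s-%s.md\n' "$posted" "$slug"
}

# Emit minimal YAML front matter (field names are assumptions).
front_matter() {
    printf '%s\n' '---'
    printf 'title: "%s"\n' "$1"
    printf 'date: %s\n' "$2"
    printf '%s\n' '---'
}

post_filename 2018-08-24 'The Plan'
front_matter 'The Plan' 2018-08-24
```

Keeping each step in its own small function makes it easy to test from the command line before wiring the steps together.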
So I have a command pipeline that spits out the URLs of my reviews and a script that takes a URL of a review and turns it into a post. Again, I might use a for loop, but there’s an even better way:
$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a attr{href}' \
| xargs -n 1 ./steam_review_import.sh
The xargs command takes each line of the output of the curl | pup pipeline and passes it to my shell script. Since the script is written to take just one value,5 the -n 1 option tells xargs to only pass one parameter at a time. So this spawns off one execution of the script per review URL.6
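The effect of -n 1 is easy to see offline. In this sketch, three sample URLs stand in for the pipeline output and an inline sh -c script stands in for steam_review_import.sh; both stand-ins are assumptions made for the example.

```shell
# Three sample URLs standing in for the output of the curl | pup pipeline.
urls() {
    printf '%s\n' \
        'https://steamcommunity.com/id/jlericson/recommended/246620/' \
        'https://steamcommunity.com/id/jlericson/recommended/646570/' \
        'https://steamcommunity.com/id/jlericson/recommended/951940/'
}

# With -n 1, xargs starts the command once per input item (three runs here).
urls | xargs -n 1 sh -c 'echo "run with $# argument: $1"' sh

# Without -n, xargs packs as many arguments as fit into a single run.
urls | xargs sh -c 'echo "run with $# arguments"' sh
```

The trailing sh after the quoted script fills in $0, so the URLs arrive as $1 onward.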
Obviously this is a topic I’m excited about. I’d love to write about how useful the xtrace option is or how it can be useful to create shell commands to pipe into another shell instance. Instead I’ll summarize the things that make shell commands useful for prototyping:
- If you already use the command line regularly, writing scripts doesn’t demand much more from you.
- There’s a command for just about any specialized task you can imagine. Often the command is just a wrapper for a library used in another programming language.
- Having a tight read-eval-print loop makes finding and fixing mistakes much quicker.
- Piping intermediate data into a tool like less simplifies the process of writing code to process that data.
- Pipes make chaining commands easy and potentially quite powerful.
Even with the disadvantage of needing to work around quoting oddities, I find prototyping in shell to be quicker than in other languages. Quite often I find rewriting in another language goes smoother because of insight into my data. For some tasks (editing audio, for instance) I don’t see much advantage to using another language at all.
Will I continue to write prototypes on the command line? Yes I shell.
Footnotes:
1. In particular, text manipulation. It can be quite aggravating to see people struggle with manual tasks that could be solved with a one-liner.
2. The | character is called a pipe. It feeds the output of the first command into the input of the second command. For this post, I use the \ line continuation character to put the pipe command on a new line. It’s typical to just append a pipe to the end of a command, but using a newline is a lot clearer since you don’t have to scroll. Pipes are a useful metaphor that allows commands to be combined without the need for intermediate temporary files.
3. Or I somehow executed a DoS attack, which seems unlikely.
4. Since Markdown is allowed to contain HTML tags, both the original body and the modified result are technically Markdown. As I write this, it occurs to me that I should have used pandoc to do a more comprehensive conversion. Next time!
5. There’s an easyish way around this. Just add a loop to the script:
for f in "$@"; do
    # Use $f instead of $1
done
That way we can drop the -n 1 option and xargs will pass all 42 URLs to the script. Another option is to use stdin as a fallback:
for f in ${*:-`cat`}; do
    # Use $f instead of $1
done
That way we can pipe into the script without using xargs at all.
6. Sorry about all the footnotes, but there are so many cool things you can do with xargs. Since you are spawning all these independent processes, you can do a bit of parallel processing with the -P option. For maximum throughput, use the number of cores your machine has available.
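The arguments-or-stdin fallback described in the footnotes can also be written out more explicitly. This sketch is a variant, not the one-liner above: it reads stdin line by line instead of relying on word splitting, and import_all is a made-up name whose body only prints what it would import.

```shell
# Process every argument, or every line of stdin when no arguments are given.
import_all() {
    if [ "$#" -gt 0 ]; then
        for url in "$@"; do
            printf 'importing %s\n' "$url"
        done
    else
        while IFS= read -r url; do
            printf 'importing %s\n' "$url"
        done
    fi
}

import_all a b                          # arguments on the command line
printf '%s\n' c d | import_all          # no arguments, so read from the pipe
```

Written this way, the same function works at the end of a pipe, after xargs, or called directly with a list of URLs.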