I posted a thread on Twitter about how I was adding my Board Game Geek and Steam reviews to this blog. I mentioned how much I enjoy using shell scripts for prototyping ideas and Steve Bennett replied:

Huh, I find bash much worse for prototyping because no good debuggers, no libraries etc. But maybe if you are super familiar with it...

Since my answer doesn't fit in 280 characters, I decided to write a post.

To start, being familiar with the Unix shell does make a difference in how easy it is to build prototypes. I wouldn't necessarily say it's worth learning shell scripting just for that purpose, but it is an incredibly powerful tool that more people (especially programmers) could benefit from using. For instance, I'm writing a series about cleaning up a Git repository, and I needed to drop down into the command line many times to finish the work. Expertise with shell commands regularly saves me hours of tedious work.1

To demonstrate how I deal with the lack of a debugger, libraries and other niceties, I'll walk through building a script to import my Steam reviews. The first step is to find my reviews on the web. Luckily, there is a view of just my reviews: https://steamcommunity.com/id/jlericson/recommended. There are several ways to pull data from the internet, but the two most common command-line tools are:

  • curl
  • wget

You might recognize curl from libcurl. Both were created by Daniel Stenberg. It's reasonable to think of curl as a wrapper for using libcurl on the command line. But it would be equally valid to think of libcurl as a tool for using curl's functionality within other programming languages. There are many command-line tools developed in tandem with libraries. Here's a sample of commands I've used in my shell scripts that mirror functionality often found in libraries:

  • sqlite3—the command-line shell for the SQLite library.
  • openssl—a command-line interface to the OpenSSL crypto libraries.
  • ffmpeg—a frontend for libavcodec and friends.
  • pandoc—a thin wrapper around the pandoc Haskell library.

The Unix philosophy means these commands can interact with each other in ways their creators never imagined. I'll also demonstrate how easy it is to create a script that can be combined with other commands to accomplish quite complex tasks. I'm going to use curl to pull down a list of my reviews. Just passing a URL to curl outputs the contents of the page:

$ curl https://steamcommunity.com/id/jlericson/recommended

If you run that command, you should see an HTML document that shows my reviews. As it happens, the site is down as I write this. So I can show off a critical debugging tool:

$ curl https://steamcommunity.com/id/jlericson/recommended \
| less

Piping2 output to less allows you to examine and search the page at your leisure. So I was able to search for "error" by typing /error and cycle through the results with n until I found:

    <div id="error"><img id="image" src="#" alt="sticker"></div>
    <div id="headline">Something Went Wrong</div>
    <div id="message">
        We were unable to service your request. Please try again later.
    </div>

Checking Down for Everyone or Just Me proved the problem wasn't that I'd violated a rate limit or some such.3

Related debugging tools (a couple of which appear in the sketch after this list):

  • head—Show the first lines of input (set how many with -n).
  • tail—Show the last lines of input. Or, with the -f option, monitor lines appended to a file.
  • tee—Pipe input to a file and to standard output.
  • grep—Filter input using a regular expression.
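
To see a couple of these in action, here's how I might splice tee and grep into the pipeline to stash a copy of the page while counting the lines that mention the title class (the /tmp path is just a scratch file I picked):

$ curl https://steamcommunity.com/id/jlericson/recommended \
| tee /tmp/reviews.html \
| grep -c 'class="title"'

Afterward, /tmp/reviews.html holds the raw page for a closer look in less.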

When Steam isn't down, I can also use less to help me pinpoint the data I want to extract. For now, I just want to get a list of URLs of my reviews. With a little searching, I found:

<div class="title"><a href="https://steamcommunity.com/id/jlericson/recommended/246620/">Not Recommended</a></div>

This markup is a bit more complicated than the read trick can handle. So I'll need to use one of the dozens of HTML parsers available for the command line. I chose pup, which has a simple, flexible syntax. After a few tries, I extracted just the URLs I need:

$ curl https://steamcommunity.com/id/jlericson/recommended \
| pup '.title > a  attr{href}'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 86426    0 86426    0     0   153k      0 --:--:-- --:--:-- --:--:--  153k
https://steamcommunity.com/id/jlericson/recommended/246620/
https://steamcommunity.com/id/jlericson/recommended/646570/
https://steamcommunity.com/id/jlericson/recommended/951940/
https://steamcommunity.com/id/jlericson/recommended/226960/
https://steamcommunity.com/id/jlericson/recommended/225540/
https://steamcommunity.com/id/jlericson/recommended/357070/
https://steamcommunity.com/id/jlericson/recommended/91200/
https://steamcommunity.com/id/jlericson/recommended/862920/
https://steamcommunity.com/id/jlericson/recommended/374940/
https://steamcommunity.com/id/jlericson/recommended/613120/

The first thing to do is eliminate the progress meter that curl outputs. Typically the option is either -q (for "quiet") or -s (for "silent"). Checking man curl, the answer in this case is -s.
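
One wrinkle worth knowing: -s also silences error messages. Adding -S (for "show error") brings them back, which is handy once a prototype graduates to a script (the hostname here is deliberately bogus):

$ curl -sS https://no-such-host.example/
curl: (6) Could not resolve host: no-such-host.example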

Meanwhile, the pup command looks for HTML tags with the "title" class (.title) with a child (>) hyperlink (a). Then it prints the "href" attribute (attr{href}). There are other ways to get to this data, but as long as Steam doesn't change the template for user review listings, this'll work.
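
If you'd rather not test selectors against the live site, pup is happy to read a scrap of HTML from standard input (the example.com link is a stand-in):

$ echo '<div class="title"><a href="https://example.com/">Recommended</a></div>' \
| pup '.title > a  attr{href}'
https://example.com/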

I have 42 reviews, but this command only shows 10. That's because Steam puts 10 reviews on a page. To get to the second page, I can append ?p=2 to the URL:

$ curl -s https://steamcommunity.com/id/jlericson/recommended?p=2 \
| pup '.title > a  attr{href}'

zsh: no matches found: https://steamcommunity.com/id/jlericson/recommended?p=2

Now we see one of the truly annoying aspects of shell scripting: quote madness. The problem is that zsh (and other Bourne shell variants) interprets ? as a single-character wildcard. Since I don't have any files that match, the shell rejects the command. Fortunately, it's easy to fix with judicious use of single quotation marks:

$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=2' \
| pup '.title > a  attr{href}'

Sadly, it's not always so easy to fix. What if you need to quote a string that includes a quotation mark? It's enough of a problem when writing Perl one-liners that Perl provides an array of quote-like operators. Debugging this sort of problem really is shell hell.
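
The standard escape hatch is to close the single-quoted string, backslash a lone quotation mark, and open a new string, which is every bit as ugly as it sounds:

$ echo 'Steam'\''s reviews'
Steam's reviews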

Now that I can access each of the five pages individually, I want a single command to get the links from all of them. I could write a loop in shell, but as it happens, curl has a trick up its sleeve:

$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a  attr{href}'

This time I used [1-5], which curl expands to five URLs:

https://steamcommunity.com/id/jlericson/recommended?p=1
https://steamcommunity.com/id/jlericson/recommended?p=2
https://steamcommunity.com/id/jlericson/recommended?p=3
https://steamcommunity.com/id/jlericson/recommended?p=4
https://steamcommunity.com/id/jlericson/recommended?p=5
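
For comparison, the loop I avoided would look something like this (a sketch, not what I actually ran):

$ for p in 1 2 3 4 5; do curl -s "https://steamcommunity.com/id/jlericson/recommended?p=$p"; done \
| pup '.title > a  attr{href}'

The double quotes keep zsh from treating ? as a wildcard while still expanding $p.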

I can verify that the new command produces 42 URLs by piping the output to the wc (short for "word count") utility:

$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a  attr{href}' \
| wc -l

      42

This sort of testing of partial results is extremely common. More formally, the Unix shell implements a read-eval-print loop (REPL). In fact, it's the world's tightest read-eval-print loop, since each command is evaluated instantly and automatically after the user presses "Return". Sometimes (such as when using curl to connect to a slow service) the results take a moment to be printed, but the shell doesn't add any delay. Often I spot my mistake, edit the command and try again within seconds of seeing the output. It's an ideal environment for building a quick prototype.

Eventually I'll want to save commands into a file so that I can run them later. It can be helpful to use the history command to find previously executed commands. Then copy them to a file with the appropriate hashbang:

#!/usr/bin/env zsh
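
For instance, everything up to this point could be saved as a file (list_review_urls.sh is just a name I made up):

#!/usr/bin/env zsh
# Print the URL of every review on my Steam profile.
curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a  attr{href}'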

I preferred KornShell over Bash for years, but these days Zsh would be my choice. After you save that file and make it executable, it can be run just like any other shell command. To cut to the chase, I wrote steam_review_import.sh, which takes the URL of a Steam review and converts it to a Markdown file in the _posts directory:

$ ./steam_review_import.sh https://steamcommunity.com/id/jlericson/recommended/250600/
https://steamcommunity.com/id/jlericson/recommended/250600/

$ ls -l _posts/2018-08-24-the_plan.md
-rw-r--r--@ 1 jericson  staff  499 Aug 29 21:03 _posts/2018-08-24-the_plan.md

The import script constructs the filename from the date the review was posted and the name of the game reviewed. Next, it writes the YAML front matter that Jekyll is looking for. It also converts the HTML body of the review into something more like Markdown.4 I tried to handle spoiler tags, but there are a few oddities I haven't yet resolved. Good enough for government work, as we used to say when I worked for the government.
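
The front-matter step amounts to a heredoc along these lines (a sketch; the variable and field names are my stand-ins, not lifted from the script):

# Assume $date, $slug and $title were extracted from the review page.
post="_posts/${date}-${slug}.md"
cat > "$post" <<EOF
---
title: "$title"
date: $date
---
EOF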

So I have a command pipeline that spits out the URLs of my reviews and a script that takes a URL of a review and turns it into a post. Again, I might use a for loop, but there's an even better way:

$ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
| pup '.title > a  attr{href}' \
| xargs -n 1 ./steam_review_import.sh

The xargs command takes each line of the output of the curl | pup pipeline and passes it to my shell script. Since the script is written to take just one value,5 the -n 1 option tells xargs to only pass one parameter at a time. So this spawns off one execution of the script per review URL.6
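
To see what -n 1 does, try it on something harmless:

$ printf '%s\n' one two three | xargs -n 1 echo run:
run: one
run: two
run: three

Without -n 1, xargs would bundle all three words into a single echo run: one two three.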

Obviously this is a topic I'm excited about. I'd love to write about how useful the xtrace option is or how handy it can be to generate shell commands and pipe them into another shell instance. Instead I'll summarize the things that make shell commands useful for prototyping:

  • If you already use the command line regularly, writing scripts doesn't demand much more from you.
  • There's a command for just about any specialized task you can imagine. Often the command is just a wrapper for a library used in another programming language.
  • Having a tight read-eval-print loop makes finding and fixing mistakes much quicker.
  • Piping intermediate data into a tool like less simplifies the process of writing code to process that data.
  • Pipes make chaining commands easy and potentially quite powerful.

Even with the disadvantage of needing to work around quoting oddities, I find prototyping in shell quicker than in other languages. Quite often the rewrite in another language goes more smoothly because of the insight I've gained into my data. For some tasks (editing audio, for instance) I don't see much advantage to using another language at all.

Will I continue to write prototypes on the command line? Yes I shell.


Footnotes:

  1. In particular, text manipulation. It can be quite aggravating to see people struggle with manual tasks that could be solved with a one-liner.
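
     For instance, a one-liner like this fixes a phrase across every post at once (the substitution itself is just an example):

     $ perl -pi -e 's/Steam Community/Steam/g' _posts/*.md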

  2. The | character is called a pipe. It feeds the output of the first command into the input of the second command. For this post, I use the \ line continuation character to put each piped command on a new line. It's typical to just append a pipe to the end of a command, but using a newline is a lot clearer since you don't have to scroll. Pipes are a useful metaphor that allows commands to be combined without the need for intermediate temporary files.

  3. Or I somehow executed a DoS attack, which seems unlikely.

  4. Since Markdown is allowed to contain HTML tags, both the original body and the modified result are technically Markdown. As I write this, it occurs to me that I should have used pandoc to do a more comprehensive conversion. Next time!

  5. There's an easyish way around this. Just add a loop to the script:

    for f in "$@"; do
        # Use $f instead of $1
    done
    

    That way we can drop the -n 1 option and xargs will pass all 42 files to the script. Another option is to use stdin as a fallback:

    for f in ${*:-$(cat)}; do
        # Use $f instead of $1
    done
    

    That way we can pipe into the script without using xargs at all.

  6. Sorry about all the footnotes, but there are so many cool things you can do with xargs. Since you are spawning all these independent processes, you can do a bit of parallel processing with the -P option. For maximum throughput, use the number of cores your machine has available.
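
     A parallel version of the import (a sketch) uses nproc on Linux, or sysctl -n hw.ncpu on a Mac, to discover that number:

     $ curl -s 'https://steamcommunity.com/id/jlericson/recommended?p=[1-5]' \
     | pup '.title > a  attr{href}' \
     | xargs -n 1 -P "$(nproc)" ./steam_review_import.sh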