This week I added some emojis for College Confidential members to use when reacting to other people's posts. To help me decide which emojis to add, I wrote a little query to see which emojis are most commonly used in our Discourse posts:

-- [params]
-- text :emoji_regex = :[a-z]+:|[^A-z]-\)

select emoji,
       count(*)
from (SELECT substring(raw from :emoji_regex) emoji
      from posts
      TABLESAMPLE system(1)
      where length(substring(raw from :emoji_regex)) < 50
            and user_id > 0
     ) emoji_sample
group by emoji
order by count(*) desc;

I'm kinda proud of the query. Let's break it down line-by-line:

-- [params]

This line is interpreted as a comment by PostgreSQL because it starts with --. Everything after the dashes is ignored by the SQL interpreter. But this particular comment is interpreted by the Discourse Data Explorer as the start of a list of parameters to the query. These are variables that can be filled in at run time.

-- text :emoji_regex = :[a-z]+:|[^A-z]-\)

I'm only using one variable—:emoji_regex. It's just the regex I'm using to detect emojis in posts. I'm making it a parameter because I need to use it twice and I didn't want to have to edit it in two places. The regex itself can be divided into two parts separated by |, which works like an or operator. The first part (:[a-z]+:) finds emojis like ":smiley:". The second ([^A-z]-\)) finds a subset of emoticons such as ":-)". It's far from perfect ("{-)" is probably a false positive), but it served my purpose.

select emoji,
       count(*)

When I write SQL, the select clause is often one of the last things I write. Often the first cut uses count(*) to verify that the query is producing some sort of result. Later I frequently select * so that I can get an idea of what the columns look like. In this case, I'm selecting an emoji and the count of rows that contain that emoji. It's possible to infer that there will be a group by emoji later on.

from ( ...

Normally I'd select from a table, but in this case, I'm selecting from a subquery. It's a useful technique because you can pretend that the subquery is just like a regular table, but you can set up the columns any way you need.

SELECT substring(raw from :emoji_regex) emoji

The capital SELECT is pre-populated when creating a new query in Discourse Data Explorer. I typically use lowercase in my queries, so this is an indicator of where I started working. I'm selecting the first emoji from the raw post column using the regex from earlier. If a post has several emoji, this query will only select the first one.

from posts

posts is the only real table I use in this query. Typically a query will join two or more tables. It just happens that this time I only need the raw column from posts.

TABLESAMPLE system(1)

Scanning all posts will cause the query to time out, so I'm just sampling 1% of the data. Initially I used limit 1000 in the subquery. But I wanted a random sample, so I used TABLESAMPLE in the from clause.

where length(substring(raw from :emoji_regex)) < 50

I only want to get posts that have at least one emoji. I added < 50 to eliminate cases where two colons were used as regular punctuation. I could have built this into the regex instead, I suppose.

and user_id > 0

Before I added this restriction, I got a lot of ":gift:" emojis coming from an automated response. Automation frequently screws up data. By convention automated accounts have negative user_ids, so it's easy to filter them out.

) emoji_sample

This ends the subquery and gives it a name. The name doesn't matter this time, but it's required and it can be handy if you are joining another table.

group by emoji

Here's the foreshadowed group by clause.

order by count(*) desc;

Finally, it's always a good idea to order the results. Often when doing a count(*), that will be the order your care about. Often you'll be most interested in the most frequent results, so adding desc is really common.

Since you've read this far, here are the most common emojis I found in my sample:

emoji count
:-) 602
:rolleyes: 188
;-) 177
:eek: 137
:cool: 100
:smile: 83
:confused: 62
:smiley: 33
:wink: 31
:mad: 26
:frowning: 18
:joy: 17
:grinning: 17
@-) 15
:rofl: 13
:grin: 12
:blush: 10
8-) 6
:sob: 6
/-) 5
:disappointed: 5
:cry: 5
:shrug: 5
--) 4
.-) 4
:grimacing: 4
:sigh: 4
:laughing: 4
:relaxed: 3
:thinking: 3
:skull: 3
=-) 3
:lol: 3
:heart: 2
:https: 2
:v: 2
-) 2
:embarrass: 2
:herb: 2
:triumph: 2
:sunglasses: 2
:pensive: 2
:tada: 2
:pray: 2
:p: 1
:grade: 1
:rollseyes: 1
:popcorn: 1
:gift: 1
:mosquito: 1
:women: 1
:expressionless: 1
:confounded: 1
:golf: 1
:wave: 1
>-) 1
:hugs: 1
:weary: 1
:eyes: 1
:sparkles: 1
’-) 1
1-) 1
:refresh: 1
!-) 1
3-) 1
:blossom: 1
:crown: 1
:shrugs: 1
:o: 1
6-) 1
{-) 1
:shruggs: 1
:heartpulse: 1
:worried: 1
4-) 1
:anguished: 1
:fireworks: 1
:smirk: 1
:nonexistence: 1
:neutral: 1
9-) 1
:grumble: 1
:con: 1
:sweat: 1