6 R data.table symbols and operators you should know

Hi. I’m Sharon Machlis at IDG Communications
here with Episode 43 of Do More With R: 6 data.table symbols you want to know – and
a new function in the latest dev version of the package. Data table code becomes more efficient when
you know about its special symbols and functions. Here are the ones I’ll be covering: subsetting symbols
the ‘walrus’ assignment operator a counting symbol
and 3 special data.table operators Let’s get started. For this demo, I’m going to use data from
the 2019 stackoverflow developers survey, with about 90,000 responses. If you want to
follow along, you can download the data at insights dot stackoverflow dot com slash survey. First I’ll load the data dot table package.
Next, I’ll load in the data file. I can read just the first few rows of this data
with data.table’s fread() function and the nrows argument. I’ll read in 10 rows and
take a look. There’s all sorts of interesting data in
there. (If you’re wondering what the columns mean, there’s a separate file in the download
that explains them all). Now let me read in all the data. Next I’ll create a new data.table with just
a few columns to make it easier to work with and see results. A reminder that data.table
uses this basic syntax: The data.table intro tells us to read this
as “take dt, subset or reorder rows using I, calculate j, grouped by by.” Another
way to think of it is: I and j are similar to base R’s bracket ordering: rows first,
columns second. I is for operations you’d do on rows (choosing rows based on row numbers
or conditions); j is what you’d do with columns (select columns or create new columns
from calculations). Since I’m selecting columns, that goes in
the “j” spot, which means the brackets need a comma first to leave the “I” spot
empty. One of the things I like about data.table
is that it’s easy to select columns either quoted or unquoted. Unquoted is often more
convenient (that’s usually the tidyverse way). But quoted is useful if you’re using
data.table inside your own functions, or you want to pass in a vector you created somewhere
else in your code. You can select data.table columns the typical
base R way, with a conventional vector of quoted column names. If you want to use them
unquoted, create a list instead of a vector and you can pass in the unquoted names. And now we come to special symbol #1: Instead
of typing out list(), you can just use a dot. That dot is a shortcut for list() inside data.table
brackets. What if you want to use an already-existing
vector of column names? Putting the vector object name inside data.table brackets won’t
work. Here I’ll create a vector with quoted column names and put that object in a data.table.
Like I said, it doesn’t work. Instead, you need to put dot-dot in front of that vector
object name. Why two dots? That seemed kind of random to me until I read the explanation:
Think of it similar to two dots in a Unix command-line terminal. That moves you up one
directory. Here, you’re moving up one namespace, from the environment inside data.table brackets
up to the global environment. (That really does help me remember it!) On to the next symbol. To count by group,
you can use data.table’s .N symbol. .N stands for “number of rows”. It can be the total
number of rows, or number of rows per group if you’re aggregating in the “by” section. That first line of code is the total number
of rows in the data.table. The second line is the number of rows grouped by one variable:
whether people in the survey also code as a hobby. You can use the plain column name
if there’s just one variable. If you want to group by two or more variables, use the
. list symbol, like in the 3rd line of code. Here I group by hobbyist and whether they
contribute to open source. It might be nice to order that grouped first
by hobbyist answer and then from most to least. Final line of code does that. Notice that
you can chain brackets with data.table — but without a special symbol like a pipe. Just
one set of brackets after another. The .N symbol automatically generates a column named
N (of course you can rename it if you want). So here is the data.table code. As I learn
data.table code, I find it helpful to read it step by step. So I’d read this as “For
all rows in mydt (since there’s nothing in the “I” spot), count number of rows,
grouping by Hobbyist and OpenSourcer. Then order first by Hobbyist and then number of
rows descending.” That’s equivalent to this dplyr code. Next let’s add some columns to the table,
and learn a couple of very handy data.table features. I’d like add columns to see if
each respondent uses R, if they use Python, and if they use both or neither. The LanguageWorkedWith
column has the info of languages used, and it looks like this. Each answer is a single
character string. Most have multiple languages separated by a semicolon. As always, it’s easy to search for Python
but a little more complicated to search for R, since you can’t just search for a capital
R in the string. You can see that Ruby and Rust also contain a capital R. Let’s first
create a column for Python use. Here’s what data.table code looks like to
create a TRUE/FALSE vector that checks if each string in LanguageWorkedWith contains
“Python”. If you know SQL, you’ll recognize that “like” syntax. I, well, like %like%,
it’s a nice streamlined way to check for pattern matching. The function documentation
says it’s meant to be used inside data.table brackets, but actually you can use it in any
of your code, not just with data.tables. I checked with data.table creator Matt Dowle,
he said the advice to use it inside the brackets is because some extra performance optimization
happens there. Now here’s the code to add a column called
PythonUser to the data.table. Notice that “colon equals” operator. Python has an
operator like that, too, and ever since I heard it called the “walrus operator,”
that’s what I call it. I think it’s officially “assignment by reference,” and let me
show you why. Notice at the top right, dt1 has 6 columns. Let me run the code and update
my list of environmental variables by clicking refresh. Now dt1 has 7 columns. The walrus
operator changed the existing object without needing to save it to a new variable.
To search for R I’ll use a regular expression. This says find a pattern that starts with
a word boundary – that’s the slash slash b – then a capital R, then ends with a word
boundary. (I can’t just look for R semicolon because the last item in each string doesn’t
have a semicolon). If you wanted to add both columns at once
and change the original object, you need to turn that walrus operator into a function
by backquoting it, like this second code block. There are a couple of other operators worth
knowing %between% has this syntax. So if you wanted to filter for all responses where compensation
was between 50,000 and 100,000 paid in US dollars here’s the code. The part in red
is the between condition. This operator includes both the lower and upper values when it checks. Another special operator is %chin%. It works
like base R’s %in% but is optimized for speed, and is for character vectors only.
So, if you wanted to filter for all rows where the OpenSourcer column was either “Never”
or “Less than once per year” using c.h.in would look like this. This is pretty similar
to base R except that base R needs to specify the data frame inside the bracket and also
needs a comma after the filter expression. For our final couple of symbols – and the
new dev-version fcase() function – I’m going to start by creating a new data.table
with just people who reported compensation in US dollars. That’s the first line of
code. Next, I’m going to create a new column called
“Language” for whether someone uses just R, just Python, both, or neither. And I’ll
use the new fcase() function. At the time I recorded this, fcase() was only available
in data.table’s development version. Fcase() is similar to SQL’s CASE WHEN statement
and dplyr’s case_when() function. If you look at the second code block, you can see
the syntax: condition, comma, value, comma, next condition, comma, next value, comma,
and so on. I’m setting values for 4 conditions – use R not Python, use Python, not R, use
both, use neither. And I’m using the “walrus operator” to update my existing usd data.table.
When I run that code I get a new column in my data.table. Unlike with a data frame, the
data.table structure doesn’t automatically update in the top right RStudio pane; you
do have to refresh to see changes. In the next line of code, I calculate a new
column, AvgSalary, with a regular equals sign so I’m not altering my existing table. And,
I’m grouping by Language. There are a few other symbols I won’t be
covering in this video. You can find a list of them in the “special symbols” data.table
help file. One of the most useful, .SD, has its own separate video at That’s it for this episode, thanks for watching!
For more R tips, head to the Do More With R page at bit-dot-l-y slash do more with R,
all lowercase except for the R. You can also find the Do More With R playlist on the YouTube
IDG Tech Talk channel — where you can subscribe so you never miss an episode. Hope to see
you next time!

Leave a Reply

Your email address will not be published. Required fields are marked *