Getting Started With R: A Beginner’s Guide, Part 1

Why learn R?

There are a lot of tools available for doing data analytics, data science, or statistical analysis. So why should you choose R? I’ll answer that by contrasting R with some of my other favorite tools.

If you want to create data visualizations, Tableau is an amazing tool. With a few mouse clicks you can create anything from a bar chart to a heat map. The graphics it produces are beautiful and you don’t need to know any programming. Plus, if you don’t mind your work being public, it’s available at no cost. But these advantages come at a price. Tableau is not a full-featured programming language; it is primarily a tool for visualization. You can do some calculations within Tableau, but you will eventually need to solve a problem that it just can’t handle. In addition, everything you do in Tableau is done through mouse clicks. This makes it difficult to create a record of what you have done, and very difficult for someone else to duplicate your work.

SAS is arguably the gold standard of statistical software, but I may be a tiny bit biased in saying that. After all, I’m taking classes that are taught in SAS Hall. Base SAS gives you a very powerful set of analytical tools, and it can be expanded with add-on programs for working in specialized fields or doing complex graphics. Virtually any statistical analysis, analytical model, or visualization that you can conceive of can be created in SAS. The big downside here is the price. A SAS license is not inexpensive, and by the time you include a couple of the add-on packages, it can be very expensive. If you want to learn SAS, there are free courses available and they include access to a web-based version of the software, but to use it for other purposes, you’ll need a license.

Python is another alternative. It’s free, it’s powerful, and there is a lot of support available on the web. The downside here is a steep learning curve. If you aren’t already a programmer, getting started with Python can be difficult. On the other hand, it is ubiquitous in the analytics community. At a Research Triangle Analysts Unconference last spring, at least 75% of the presentations involved the use of Python. After attending that Unconference, I decided it was time for me to learn Python. If you already know another programming language, then it’s really not that hard.

A KDnuggets poll suggests that Python is second only to R in popularity for data science or analytics. But it’s still second. R is a full-featured programming language, it’s easy to learn, it’s powerful, and since it is so popular, there is a ton of support available.

Personally, I’d recommend learning both R and Python. There are some things that are just easier in one of them than they are in the other. For example, I find it much easier to scrape data from a website using Python than R. There are tools available to let you access Python from within an R program, and vice versa. But I’d recommend learning R first.

Installing R and R Studio

So let’s get started. The first thing we will need to do is install the software. You’ll do that by downloading R from the Comprehensive R Archive Network, or CRAN. When you get to that page, select a mirror from which to download R. Just scroll through the list until you find your country, and select a server that is close to you. Once you select a server, you’ll see a window like this one:

[Screenshot: CRAN download page]

In the top section, choose the link that matches your operating system. In the next window, select “base.” Then click on the download link at the top of your screen, and run the installer program after it finishes downloading.

Next, install R Studio. Just click on this link, choose the free version, and follow the prompts. When you are finished, open R Studio and you should see something like this:

[Screenshot: R Studio after first launch]

The layout of R Studio is highly configurable, so after you have used it for a while you might want to change it. But for now, let’s leave it as-is.

R Studio is not R

R Studio is an “IDE,” or Integrated Development Environment, for the R programming language. You can also start R from a command prompt, or open it from the Windows start menu, and you’ll see something like this:

[Screenshot: the R GUI]

Technically, everything we are going to do in this series of posts can be done directly in R. But use R Studio, because a great deal of what we are going to do is much easier in that environment.

Let’s start using R

You should have R Studio open, and your cursor should be in the Console pane, next to a “>” prompt. Type:

> 2 + 2

And you should see:

[1] 4

Now try:

> sqrt(144)

and you should get:

[1] 12

What we are doing here is using R in interactive mode. Essentially, we have just turned your very expensive computer into a $5 calculator!
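If you want to play with the calculator a little longer, here are a few more expressions you could try. This is just a sketch of some common arithmetic operators; type each line at the prompt (the text after the # is a comment and can be left out).

```r
10 / 4        # ordinary division: 2.5
10 %/% 4      # integer division: 2
10 %% 4       # remainder after division: 2
2^10          # exponentiation: 1024
round(pi, 2)  # pi is built in; rounded to two decimal places: 3.14
```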

But we know that R can do a lot more than that. Let’s create a variable, and assign a vector to that variable:

> x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)

You’ll notice that nothing appears in your console pane except for a new “>” prompt, but now there is some information in the Environment pane that wasn’t there before. Let me explain what we just did.

“<-” is the assignment operator in the R language. It assigns a value to a variable. The equal sign also works as an assignment operator in R, but just don’t use it, please. It will work, but it will cause a lot of confusion later on. So get in the habit of using “<-”. In R Studio, you can use Alt-minus (hold down the alt key and press the minus sign key). This keyboard shortcut will insert the “<-” string.
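To give you a feel for where the confusion comes from, here is a minimal sketch. The names a, b, v1, and v2 are made up for this illustration. At the top level the two operators behave the same, but inside a function call the equal sign only labels an argument.

```r
a <- 5               # assigns 5 to a
b = 5                # also works at the top level, but avoid it
identical(a, b)      # TRUE: both variables hold the same value

# Inside a function call the two are not interchangeable:
sum(v1 = c(1, 2, 3))   # "=" only labels the argument; no variable v1 is created
sum(v2 <- c(1, 2, 3))  # "<-" creates v2 as a side effect, then sums it
exists("v1")           # FALSE
exists("v2")           # TRUE
```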

“c()” is an R function which takes multiple arguments and combines them into a vector or a list – in this case, a vector.
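Here is a short sketch of how c() behaves with different kinds of arguments. The variable names are only for illustration and are not used elsewhere in this post.

```r
small <- c(1, 2, 3)             # three numbers combined into a numeric vector
bigger <- c(small, 10, 20)      # c() can also extend an existing vector
length(bigger)                  # 5

mixed <- c(1, "a")              # mixing types: everything is converted to character
class(mixed)                    # "character"

as_list <- c(list(1, "a"), 2)   # if any argument is a list, the result is a list
class(as_list)                  # "list"
```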

When I look at the above line of code, I hear in my head, “x takes the vector 22, 3, …”

In the Environment pane you should see:

[Screenshot: the Environment pane showing the vector x]

This shows that the variable “x” is numeric and contains 10 elements, followed by a list of those elements.

Now try:

> mean(x)

and you should get:

[1] 11.2

which is the average of the values in x.
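If you want to convince yourself of that, you can build the average by hand. A small sketch, using the same vector as above:

```r
x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)  # the same vector we created earlier
sum(x)                # 112, the total of all the values
length(x)             # 10, the number of values
sum(x) / length(x)    # 11.2, the same answer mean(x) gives
```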

Try:

> summary(x)

and see what you get.
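Most of the numbers in that summary can also be requested one at a time. Here is a sketch using a few base R functions; compare what they return with the output of summary(x).

```r
x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)  # the same vector we created earlier
min(x)        # smallest value
max(x)        # largest value
median(x)     # middle value
quantile(x)   # minimum, quartiles, and maximum in one call
```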

Now let’s try a couple of graphs. Enter these commands and see what happens in the “Plots” pane.

> boxplot(x)
> hist(x)
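Both functions accept optional arguments for titles and axis labels. The following is a minimal sketch; the title text, labels, and color are my own choices, so change them to whatever you like.

```r
x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)  # the same vector we created earlier

# main sets the title; xlab and ylab set the axis labels
boxplot(x, main = "Boxplot of x", ylab = "Value")
hist(x, main = "Histogram of x", xlab = "Value", col = "lightblue")
```

Each call replaces the previous plot in the pane, and you can use the arrow buttons there to step back through earlier plots.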

That’s all for today. Next week, we will talk about some of the special features of R Studio, and organizing your work. We will also talk about version control, and tools for sharing your work online.

My first computer

[Photo: the ELF II]
Photo courtesy of Dan Veeneman

Sometime in the late 1970s I bought my first computer, an ELF II from Netronics. It arrived at the house in a big, padded envelope. Inside the envelope was an empty printed circuit board, and several little plastic bags containing the electronics. I spent a Saturday afternoon with a soldering iron assembling and testing, and then proceeded to try to figure out what I would do with it.

This thing was certainly no supercomputer. It had a whopping 256 bytes of memory. Not megabytes, not kilobytes; just bytes. It featured an RCA COSMAC 1802 8-bit microprocessor. This chip’s main claim to fame was that it was designed to be highly resistant to radiation and electrostatic shock, which led to it being selected by NASA for a number of space projects. To a geeky teenager in the ’70s this gave it a lot of appeal, even though my little board was never going to be exposed to cosmic radiation.

The basic $99 kit was strictly a learning project. It included a two-digit hexadecimal display and a 16-key hexadecimal keypad, plus an Interrupt key and three toggle switches. Programming it consisted of using the toggle switches to step through memory, entering a hex code corresponding to a machine-language instruction, and then stepping to the next memory location. This was tedious, to say the least. Creating a program that did anything useful, or even interesting, in 256 bytes of memory was quite a challenge.

One of the most exciting features was the expansion bus. The board had room for five expansion sockets, and came with one socket plus a few SIP headers which could be used to connect to certain input/output signals. My first project with the ELF consisted of connecting a speaker to one of these sockets. By rapidly toggling the state of the output pin, I could produce tones from the speaker. By changing the speed of this toggling, I could produce tones of different frequencies. My crowning achievement used the interrupt feature of the chip to produce a tone only while a key was pressed, with a different tone for each key. That’s right, I created a primitive music synthesizer capable of producing sixteen different notes! Eat your heart out, Moog!

Just to give you an idea of what it was like programming one of these, here’s a code snippet:

 0000 71        RESET   DIS                     ; DISABLE INTERRUPTS  
 0001 00                DC      0  
 0002 F8FF      FINDRAM LDI     0FFH            ; FIND RAM, STARTING AT FFFF  
 0004 B4                PHI     R4  
 0005 F8FF      TRYAGAIN LDI    0FFH            ; REPEAT...  
 0007 A4                PLO     R4              ; - TEST TOP BYTE ON PAGE  
 0008 54                STR     R4              ; - STORE 'FF'  
 0009 04                LDN     R4              ;   READ IT BACK,  
 000A FBFF              XRI     0FFH            ;   COMPARE  
 000C C6                LSNZ                    ; - IF OK, STORE ALL 0'S,  
 000D 54                STR     R4              ;   READ BACK,  
 000E 04                LDN     R4              ;   COMPARE  
 000F 321A              BZ      RAMFOUND        ; - IF OK, THEN RAM FOUND  
 0011 94                GHI     R4              ; - IF NO MORE PAGES TO TEST,  
 0012 32DD              BZ      NORAM           ;      THEN GO TO NORAM  
 0014 A4                PLO     R4              ;      ELSE DEC. PAGE NUMBER  
 0015 24                DEC     R4  
 0016 84                GLO     R4  
 0017 B4                PHI     R4              ; ...UNTIL DONE

Keep in mind that you don’t see all of this on your display. You start at memory location 0000, enter “71” on the keypad, flick a toggle switch to enter, flick another toggle switch to advance to memory location 0001, enter “00”, etc.

The expansion bus allowed much more, though. Netronics offered an expansion board that included an RS-232 serial port, a cassette interface for storing programs on tape, and a system monitor program that would display memory on a TV set or monitor. Other accessories included a prototyping board, a 4k RAM expansion, and a Tiny BASIC interpreter. Users found a wide variety of creative ways to use these, as documented in the Netronics newsletter. In those pre-Internet days, the newsletter was the only way of keeping up with the activities in the ELF community. Those prototyping features made the ELF II a predecessor of products like the Arduino and the Raspberry Pi.

When I went looking for information online to illustrate this post, I was surprised to find that the COSMAC lives on, in the form of an active retro computing community. If you want to learn more, a good starting point is http://www.cosmacelf.com/. If you want to get in on the fun, a modern version of the original ELF computer is available, designed to fit in an Altoids tin.

Dan Veeneman also has a lot of information at his Decode Systems website. I’d like to thank Dan for giving me permission to use the photo at the top of this post.
