R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.
The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first platform to provide an ODBC product, versions now exist for Linux and Macintosh as well. ODBC is built into current versions of Windows. If you are using a different operating system, you'll need to install an ODBC driver manager.
Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:
As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:
Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:
For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.
The “Description” field is optional, and again, you can put in whatever you want.
TNS Service Name is the name that you (or your company database administrator) assigned when configuring the Oracle database. And "User ID" is the ID that you use with the database.
After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.
Now you are ready to connect R to the database.
Here’s the R code that you need:
# Load RODBC package
library(RODBC)
# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")
# Query the database and put the results into the data frame "dataframe"
# (replace the SQL below with your own query)
dataframe <- sqlQuery(channel, "SELECT * FROM MYTABLE")
# When finished, it's a good idea to close the connection
close(channel)
A couple of comments about this code are in order:
First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:
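Here is a minimal sketch of that approach (the "DATABASE" and "USERNAME" values are placeholders; readline() works at the console, and RStudio users could substitute rstudioapi::askForPassword()):

```r
library(RODBC)

# Prompt for the password at run time, so it never appears in the script
pswd <- readline(prompt = "Enter your database password: ")

# Connect using the password entered by the user
channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = pswd)
```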
This will enable the connection to be made without compromising the security of the password.
Second, sqlQuery() will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term "query" includes any valid SQL statement, including table creation, updates, etc., as well as SELECTs.
Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
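As a sketch of that idea (the table and column names here are hypothetical), you might let Oracle do the aggregation and return only the summary rows that R needs:

```r
library(RODBC)

channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = "PASSWORD")

# Oracle does the grouping and averaging; R receives only one row per region
region_summary <- sqlQuery(channel, "
  SELECT region, AVG(sale_amount) AS avg_sale, COUNT(*) AS n_sales
  FROM sales
  GROUP BY region")

close(channel)
```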
Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies
Seven Strategies and Ten Tactics to become a Thought Leader, by F. Annie Pettit, PhD, FMRIA. 64 pages, $5.50 on Amazon.
Don’t be fooled by the small size or the low price of this little book. I consider it one of the most valuable in my collection, and the time I’ve spent reading it (and re-reading it, over and over) has paid dividends far beyond what I would have expected.
What does Dr. Pettit mean by a “Thought Leader?” She explains it in her introduction:
“Being a thought leader means that people have learned to seek out your advice and opinions because you have proven your insights are unique and meaningful, your expertise is trustworthy, you seek to remain at the forefront of knowledge in your field, you are open to being respectfully challenged on your opinions, and you are genuinely happy to share your knowledge with people.”
The Seven Strategies are:
Recognize your expertise
Use Your voice for good
Don’t be a sales pitch
And the Ten Tactics:
Leverage your credentials
Speak in person
I won’t go into more detail on any of these, because you really should read the book. But I will talk about the first Strategy, “Recognize your expertise.” You may be thinking, “I’m no expert.” Pettit has an answer for that, in large, bold, letters on the page after the title page:
You are an expert.
Everyone is an expert on something, according to Pettit, and she devotes three pages to helping you discover where your expertise lies.
Becoming a thought leader is good for your personal growth and it’s good for business. Read the book. Follow the advice. Become a thought leader.
But be warned: As a thought leader, you’ll have to think twice before you say the stupid stuff you now say so freely. If you’re like me, expect to spend some time deleting stuff from your social media after you read this book!
The combined bankruptcy of the three largest banks in Iceland in October 2008 is the 3rd largest bankruptcy in world history, behind Lehman Brothers and Washington Mutual.
Ms. Johnsen started with a history lesson on the Pujo investigation into the American “money trusts” in 1912-1913. This investigation revealed a system of overlapping financial networks used to dominate utilities, railroads, banking, and financial infrastructure. While the committee’s work resulted in the passage of the Federal Reserve Act and the Clayton Antitrust Act, it was severely hampered by insufficient access to data.
Fast forward to 2008, when the banking system in Iceland collapsed. The Icelandic Parliament’s Special Investigation Commission (SIC) did not have the problem of insufficient access to data. Parliament lifted all confidentiality from bank employees, government officials, and others. The SIC was given the power to issue subpoenas, and the authority to walk into any bank and examine or seize any records, in any form.
What they uncovered was astounding. The banks had grown 20-fold in size in just seven years, to the point where their outstanding loans were 20 times the country's GDP. The SIC also discovered a web of ownership, related-party lending, market manipulation, and flawed incentives. A bank would purchase a corporation, lend money to that corporation, and the corporation would then invest the money in the bank. The end result was that the bank's own shares were pledged as collateral for loans made by the bank. A number of holding companies were created to prevent any firm from having more than 50% control, which would trigger consolidation under Icelandic law. There were circular arrangements where Company A owned Company B, which owned Company C, which owned Company A.
So all of this money was being lent and borrowed, by entities who had no “skin in the game.” While the American banks were “too big to fail,” the Icelandic banks had grown so big, so fast, that they were “too big to save.”
Johnsen concluded her talk as she concludes her book: with the results of the investigations (top management at all three banks have been sent to prison), and some ideas for future research.
I greatly enjoyed her talk, and I’m looking forward to reading the book.
This week, we return to our “Getting Started With R” series. Today we are going to look at some tools from the “dplyr” package. Hadley Wickham, the creator of dplyr, calls it “A Grammar of Data Manipulation.”
Use filter() for subsetting data by rows. It takes logical expressions as inputs, and returns all rows of your data for which those expressions are true.
To demonstrate, let’s start by loading the tidyverse library (which includes dplyr), and we’ll also load the gapminder data.
Here’s how filter() works:
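A call along these lines (a sketch, assuming the tidyverse and the gapminder package are installed; the threshold of 29 years is chosen to match the rows shown):

```r
library(dplyr)
library(gapminder)

# Keep only the rows where life expectancy is below 29 years
filter(gapminder, lifeExp < 29)
```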
Produces this output:
# A tibble: 2 × 6
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Rwanda Africa 1992 23.599 7290203 737.0686
The pipe operator
The pipe operator is one of the great features of the tidyverse. In base R, you often find yourself calling functions nested within functions nested within… you get the idea. The pipe operator %>% takes the object on the left-hand side, and “pipes” it into the function on the right hand side.
> gapminder %>% head()
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
This is the equivalent of saying “head(gapminder).” So far, that doesn’t seem a lot easier… but wait a bit and you’ll see the beauty of the pipe.
We talked about using filter() to subset data by rows. We can use select() to do the same thing for columns:
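For instance (a sketch, again assuming dplyr and gapminder are loaded), we can keep just the year and lifeExp columns:

```r
library(dplyr)
library(gapminder)

# Keep only the year and lifeExp columns, and peek at the first rows
gapminder %>%
  select(year, lifeExp) %>%
  head()
```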
We are going to be making some changes to the gapminder data, so let’s start by creating a copy of the data. That way, we don’t have to worry about changing the original data.
new_gap <- gapminder
mutate() is a function that defines a new variable and inserts it into your tibble. For example, gapminder has GDP per capita and population; if we multiply these we get the GDP.
new_gap %>% mutate(gdp = pop * gdpPercap)
Note that the above code creates the new field and displays the resulting tibble; to actually save the new field in our tibble, we would need to assign the result with the "<-" operator.
arrange() reorders the rows in a data frame. The gapminder data is currently arranged by country, and then by year. But what if we wanted to look at it by year, and then by country?
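One way to do that (a sketch; the copy of the data is re-created here so the block stands on its own):

```r
library(dplyr)
library(gapminder)

new_gap <- gapminder

# Sort by year first, then by country within each year
new_gap %>%
  arrange(year, country)
```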
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fctr> <fctr> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Albania Europe 1952 55.230 1282697 1601.0561
3 Algeria Africa 1952 43.077 9279525 2449.0082
4 Angola Africa 1952 30.015 4232095 3520.6103
5 Argentina Americas 1952 62.485 17876956 5911.3151
6 Australia Oceania 1952 69.120 8691212 10039.5956
7 Austria Europe 1952 66.800 6927772 6137.0765
8 Bahrain Asia 1952 50.939 120447 9867.0848
9 Bangladesh Asia 1952 37.484 46886859 684.2442
10 Belgium Europe 1952 68.000 8730405 8343.1051
# ... with 1,694 more rows
group_by() and summarize()
The group_by() function adds grouping information to your data, which then allows you to do computations by groups. The summarize() function is a natural partner for group_by(). summarize() takes a dataset with n observations, calculates the requested summaries, and returns a dataset with 1 observation per group:
The functions you'll apply within summarize() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember, they are functions that take n inputs and distill them down into 1 output.
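For example (a sketch; the column name avg_lifeexp is my own choice), we can compute the mean life expectancy by continent:

```r
library(dplyr)
library(gapminder)

# One row per continent, holding the mean life expectancy across all rows
gapminder %>%
  group_by(continent) %>%
  summarize(avg_lifeexp = mean(lifeExp))
```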
# A tibble: 5 × 2
1 Africa 48.86533
2 Americas 64.65874
3 Asia 60.06490
4 Europe 71.90369
5 Oceania 74.32621
A wondrous example
To fully appreciate the wonders of the pipe operator and the dplyr data manipulation commands, take a look at this example. It comes from Jenny Bryan's excellent course, STAT 545, at the University of British Columbia (to whom I owe a debt for much of the information included in this series of blog posts).
gapminder %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>%
  arrange(worst_le_delta)
Source: local data frame [5 x 3]
Groups: continent 
continent country worst_le_delta
<fctr> <fctr> <dbl>
1 Africa Rwanda -20.421
2 Asia Cambodia -9.097
3 Americas El Salvador -1.511
4 Europe Montenegro -1.464
5 Oceania Australia 0.170
To quote Jenny: “Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.”
Today we are going to digress from our ongoing “Intro to R” series, and talk about a subject that’s been on my mind lately: sample sizes.
An important question when designing an experiment is “How big a sample do I need?” A larger sample will give more accurate results, but at a cost. Use too small a sample, and you may get inconclusive results; too large a sample, and you’re wasting resources.
To calculate the required sample size, you’ll need to know four things:
The size of the response you want to detect
The variance of the response
The desired significance level
The desired power
Suppose you are comparing a treatment group to a placebo group, and you will be measuring some continuous response variable which, you hope, will be affected by the treatment. We can consider the mean response in the treatment group, μ1, and the mean response in the placebo group, μ2. We can then define Δ = μ1 – μ2. The smaller the difference you want to detect, the larger the required sample size.
Of the four variables that go into the sample size calculation, the variance of the responses can be the most difficult to determine. Usually, before you do your experiment, you don’t know what variance to expect. Investigators often conduct a pilot study to determine the expected variance, or information from a previous published study can be used.
The effect size combines the minimal relevant difference and the variability into one measurement, Δ/σ.
Significance is equal to 1 – α, where α is the probability of making a Type 1 Error. That is, alpha represents the chance of falsely rejecting H0 and picking up a false-positive effect. Alpha is usually set at 0.05, for 95% significance.
The power of a test is 1-β, where beta is the probability of a Type 2 error (failing to reject the null hypothesis when the alternative hypothesis is true). In other words, if you have a 20% chance of failing to detect a real difference, then the power of your test is .8.
Sample Size Calculation
The calculation for the sample size in each group is:

n = 2(Zα + Zβ)² × σ² / Δ²

For a two-sided test, we use Zα/2 instead of Zα.
For example, suppose we want to be able to detect a difference of 20 units, with 90% power using a two-sided t-test, and a .05 significance level. We are expecting, based on previous research, that the standard deviation of the responses will be about 60 units.
In this example, α=.05, β=.10, Δ=20, and σ=60. Zα/2=1.96, and Zβ=1.28. So we have:

n = 2 × (1.96 + 1.28)² × 60² / 20² = 2 × 10.4976 × 9 ≈ 188.96

or, about 189 for each treatment group.
Sample Size in R
You could write a function in R to do the above calculation, but fortunately, you don’t need to. The pwr library has done it for you. In this case, we will use the pwr.t.test() function.
pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))
In this case, we will leave out the “n=” parameter, and it will be calculated by R. If we fill in a sample size, and use “power = NULL”, then it will calculate the power of our test.
In this equation, d is the effect size, so we will calculate that from our delta and sigma values. In R, it looks like this:
> delta <- 20
> sigma <- 60
> d <- delta/sigma
> pwr.t.test(d=d, sig.level=.05, power = .90, type = 'two.sample')
Two-sample t test power calculation
n = 190.0991
d = 0.3333333
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Sample Size in SAS
In SAS, we can use PROC POWER to do the same calculations. One difference is that PROC POWER requires us to enter a value for the mean of each group. Since what we are really interested in is the difference, we can enter 0 for group 1 and 20 for group 2, so that the difference in means will be 20. We also need to enter the standard deviation, unlike R, where we calculated the effect size separately. The significance level defaults to .05, so we don't need to enter it.
The alert reader has, by now, noticed a discrepancy: when we manually calculated the desired sample size, we got 189 per group. R gave us a result of 190.0991, and SAS says it's 191. Why? The simple answer is that neither program is using the above formula. pwr.t.test in R uses the uniroot() function to calculate n, and SAS uses a different formula. Furthermore, SAS and R are actually giving the same result, but SAS rounds up to 191. You can't have 0.0991 of a test subject, and you don't want to underpower the test, so it's proper to round up. If you really want the details, the source code for pwr.t.test is on GitHub, and the method SAS uses to calculate n is on page 4964 of the SAS/STAT User Guide.
The data frame is the primary structure for working with data in R. Whenever you have data that is arranged in a spreadsheet-like fashion, the default receptacle for that data in R is the data frame. In a data frame, each column contains measurements on one variable, and each row contains measurements on one case. All of the data in a column must be of the same type (numeric, character, or logical).
R has been around for more than 20 years now, and some things that worked well 20 years ago are less than ideal now. Consider how your mobile phone has changed over the last 20 years:
Making changes to things as basic as data frames in R is difficult. If you change the definition of a data frame, then all of the existing R programs that use data frames would have to be re-written to use the new definition. To avoid this kind of problem, most development in R takes place in packages.
The R package “tibble” provides tools for working with an alternative version of the data frame. A tibble is a data frame, but some things have been changed to make using them a little bit easier. The tibble package is part of the tidyverse, a set of packages that provide a useful set of tools for data cleaning and analysis. The tidyverse is extensively documented in the book R For Data Science. In keeping with the open-source nature of R, that book is available free online: http://r4ds.had.co.nz/.
You can load tibble, along with the rest of the tidyverse tools, like this:
library(tidyverse)
The first time you do this, you will probably get an error message.
Error in library(tidyverse) : there is no package called ‘tidyverse’
In that case, you need to install tidyverse:
install.packages("tidyverse")
You only need to do this installation once, but when you start a new R session you will need to reload the package with the library() command.
Tibbles are one of the unifying features of the tidyverse, but most other R packages produce data frames. You can use the “as_tibble()” command to convert a data frame to a tibble:
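For example (a sketch using the built-in iris data frame; iris_tbl is a name of my own choosing):

```r
library(tibble)

# Convert the built-in iris data frame to a tibble
iris_tbl <- as_tibble(iris)
```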
There are some things that happen when you load a normal data frame that don't happen when you load a tibble. On the plus side, tibble() doesn't change the structure of your data. The data.frame() command will convert character strings to factors, unless you remember to tell it not to do that. Tibble won't create row names. Tibble also won't change the names of your variables.
This last feature can seem like a bug if you aren’t expecting it. One very common way to get data into R is to import it from a CSV file. CSV files are often created from Excel spreadsheets, and the column headings on Excel spreadsheets often don’t conform to the R standards for variable names. Since tibble doesn’t change variable names, you can end up with column names that are not proper R variable names. For example, they might include spaces or not start with a letter. To refer to these names, you’ll need to enclose them in backticks. For example:
`Feb Data` #contains space
Tibbles have a nice print method that, by default, shows only the first ten rows of data, and the number of columns that will fit on a screen. This keeps you from flooding your console with data.
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
To specify a single variable within a data frame or tibble, use the dollar sign $. R has another way of doing this, using column numbers, but using the dollar sign will make it much easier to understand your code if someone else needs to use it, or if you come back to look at it months after writing it.
> head(iris$Sepal.Length)
[1] 5.1 4.9 4.7 4.6 5.0 5.4
> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
Use data frames, and in particular, use the tidyverse and tibbles.
Always know the dimensions of your data frame: the number of rows and columns.
Understand what type of variables you have in your columns.
Refer to your columns by name, using $, to make your code more readable.
This is Part 3 of our “Getting Started with R Programming” series. For previous articles in the series, click here: Part 1, Part 2.
This week, we are going to talk about using git and GitHub with RStudio to manage your projects.
Git is a version control system, originally designed to help software developers work together on big projects. Git works with a set of files, which it calls a “repository,” to manage changes in a controlled manner. Git also works with websites like GitHub, GitLab, and BitBucket, to provide a home for your git-based projects on the internet.
If you are a hobbyist, and aren't working on projects with other programmers, why would you want to bother with any of this? Incorporating version control into your workflow might be more trouble than it's worth if you never have to collaborate with others or share your files. But most of us will, eventually, need to do this. It's a lot easier if it's built into your workflow from the start.
More importantly, there are tremendous advantages to using the web-based sites like GitHub. At the very minimum, GitHub serves as an off-site backup for your precious program files.
In addition, GitHub makes it easy to share your files with others. GitHub users can fork or clone your repository. People who don’t have GitHub accounts can still browse your shared files online, and even download the entire repository as a zip file.
And finally, once you learn Markdown (which we will be doing here, very soon) you can easily create a webpage for your project, hosted on GitHub, at no cost. This is most commonly used for documentation, but it’s a simple and easy way to get on the web. Just last week, I met a young programmer who showed me his portfolio, hosted on GitHub.
OK, let’s get started!
Register a GitHub Account
First, register a free GitHub account: https://github.com. For now, just use the free service. You can upgrade to a paid account, create private repositories, join organizations, and other things, later. But one thing you should think about at the very beginning is your username. I would suggest using some variant of your real name. You’ll want something that you feel comfortable revealing to a future potential employer. Also consider that things change; don’t include your current employer, school, or organization as part of your user name.
If you’ve been following along in this series, you’ve already installed R and R Studio. Otherwise, you should do that now. Instructions are in Part 1 of this series.
Installing and Configuring Git
Next, you’ll need to install git. If you are a Windows user, install Git for Windows. Just click on the link and follow the instructions. Accept any default settings that are offered during installation. This will install git in a standard location, which makes it easy for RStudio to find it. And it installs a BASH shell, which is a way to use git from a command line. This may come in handy if you want to use git outside of R/RStudio.
Now let’s tell git who you are. Go to a command prompt (or, in R Studio, go to Tools > Shell) and type:
git config --global user.name 'Your Name'
For Your Name, substitute your own name, of course. You could use your GitHub user name, or your actual first and last name. It should be something recognizable to your collaborators, as your commits will be tagged with this name.
git config --global user.email 'email@example.com'
The email address you put here must be the same one you used when you signed up for GitHub.
To make sure this worked, type:
git config --global --list
and you should see your name and email address in the output.
Connect Git, GitHub, and RStudio
Let’s run through an exercise to make sure you can pull from, and push to, GitHub from your computer.
Go to https://github.com and make sure you are logged in. Then click the green "New Repository" button. Give your repository a name. You can call it whatever you want; we are going to delete it shortly. For demonstration purposes, I'm calling mine "demo." You have the option of adding a description. You should click the checkbox that says "Initialize this repository with a README." Then click the green "Create Repository" button. You've created your first repository!
Click the green “Clone or download” button, and copy the URL to your clipboard. Go to the shell again, and take note of what directory you are in. I’m going to create my repository in a directory called “tmp,” so at the command prompt I typed “mkdir ~/tmp” followed by “cd ~/tmp”.
To clone the repository on your local computer, type “git clone” followed by the url you copied from GitHub. The results should look something like this:
Make this your working directory, list its files, look at the README file, and check how it is connected to GitHub. It should look something like this:
geral@DESKTOP-0HM18A3 MINGW64 ~/tmp
$ cd demo
geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ ls
README.md

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ head README.md
# demo
geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ git remote show origin
* remote origin
Fetch URL: https://github.com/gbelton/demo.git
Push URL: https://github.com/gbelton/demo.git
HEAD branch: master
Local branch configured for 'git pull':
master merges with remote master
Local ref configured for 'git push':
master pushes to master (up to date)
Let’s make a change to a file on your local computer, and push that change to GitHub.
echo "This is a new line I wrote on my computer" >> README.md
And you should see something like this:
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")
Now commit the changes, and push them to GitHub:
git add -A
git commit -m "A commit from my local computer"
git push
Git will ask you for your GitHub username and password if you are a new user. Provide them when asked.
The -m flag on the commit is important. If you don’t include it, git will prompt you for it. You should include a message that will tell others (or yourself, months from now) what you are changing with this commit.
Now go back to your browser, and refresh. You should see the line you added to your README file. If you click on commits, you should see the one with the message "A commit from my local computer."
Now let’s clean up. You can delete the repository on your local computer just by deleting the directory, as you would any other directory on your computer. On GitHub, (assuming you are still on your repository page) click on “settings.” Scroll down until you see the red “Danger Zone” flag, and click on “Delete This Repository.” Then follow the prompts.
Connecting GitHub to RStudio
We are going to repeat what we did above, but this time we are going to do it using RStudio.
Once again, go to GitHub, click “New Repository,” give it a name, check the box to create a README, and create the repository. Click the “clone or download” button and copy the URL to your clipboard.
In RStudio, start a new project: File > New Project > Version Control > Git
In the “Repository URL” box, paste in the URL that you copied from GitHub. Put something (maybe “demo”) in the box for the Directory Name. Check the box marked “Open in New Session.” Then click the “Create Project” button.
And, just that easy, you’ve cloned your repository!
In the file pane of RStudio, click README.md, and it should open in the editor pane. Add a line, perhaps one that says “This line was added in R Studio.” Click the disk icon to save the file.
Now we will commit the changes and push them to GitHub. In the upper right pane, click the “Git” tab. Click the “staged” box next to README.md. Click “Commit” and a new box will pop up. It shows you the staged file, and at the bottom of the box you can see exactly what changes you have made. Type a commit message in the box at the top right, something like “Changes from R Studio.” Click the commit button. ANOTHER box pops up, showing the progress of the commit. Close it after it finishes. Then click “Push.” ANOTHER box pops up, showing you the progress of your push. It may ask you for a user name and password. When it’s finished, close it. Now go back to GitHub in your web browser, refresh, and you should see your changed README file.
Congratulations, you are now set up to use git and GitHub in R Studio!
Last week, we installed R and R Studio, and we tried out a few simple R commands in the console. But using R in interactive mode, while powerful, has some limits. Today we are going to learn how to use R as a programming language, and we will write our first R Script. But first, let’s look at how we can use R Studio to keep our work organized.
A lot of tutorials introduce these topics much later, if at all. I think it’s very important to learn how to use these organizational tools from the very beginning. Eventually, you are going to need to leave R to go do something else, and you’ll want to be able to come back to R and continue what you were doing. You will have multiple R projects going at the same time, and you’ll want to be able to keep them separated.
You've probably closed R Studio since last week's lesson. When you quit R, a box popped up asking "Save workspace image to ~/.RData?" If you chose "Yes" at this prompt, when you restart R Studio, you will see in the Environment pane the objects you created in your previous session. In that same pane, you can select the "History" tab, and see all of the commands you ran in that last session. This is not the ideal way to start, stop, and re-start your work in R, but it's a start.
Your "working directory" is where R will look (by default) for any files you want to load, and where R (again, by default) will save any files that you write to disk. You can check your working directory with:
getwd()
It’s also displayed at the top of the R Studio console.
You can change your working directory directly with the command:
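That command is setwd(). A minimal sketch — it uses tempdir() here so the example runs on any machine; in practice you would pass your own path, such as “~/MyNewDirectory”:

```r
old <- getwd()     # remember where we started
setwd(tempdir())   # the directory you pass must already exist
getwd()            # confirm the change
setwd(old)         # switch back
```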
The above command assumes that there is already a directory called “MyNewDirectory,” and that it is a subdirectory of your home directory. You can also change your working directory by navigating to it in the Files pane of R Studio, and then selecting “More” and “Set as Working Directory” from the Files menu.
Note well that I said you can do these things, not that you should do them. As we will see, there is a better way:
R Studio Projects
As a general rule, it’s a very good idea to keep all the files associated with a project in one place. That would include data files, R scripts, figures, analytical results, etc. And R Studio makes it very easy to accomplish this via its support for projects.
To demonstrate, let’s make a project to use for the rest of this series of tutorials. In the menu bar at the top of R Studio, click “File” then “New Project.” You’ll see this:
As you can see, you can create a new directory, or choose one that already exists on your computer. The third option, Version Control, is something we will talk about later.
If you choose “New Directory,” you will get an additional menu with three choices: Empty Project, R Package, and Shiny Web Application. Choose Empty Project. Then give your new project a name. I called mine “tutorials.”
Now let’s create an R script. An R script is a file containing a series of commands that can be executed by R; in other words, a computer program.
In R Studio, click the File menu item at the top left of the screen, then select New File, and then R Script. Or you can use the keyboard shortcut, Ctrl-Shift-N. Now the console window no longer takes up the entire left side of your window; it has been split in half. The top left pane is now labeled “Untitled1.” Click on the little picture of a floppy disk, and a dialogue box will pop up, allowing you to name your script. Let’s name this one “iris.R.” By convention, the file names of R scripts end with “.R” or “.r,” and you should follow this convention unless you have a good reason to do otherwise.
Since R is primarily a tool for analyzing data, we are going to need some data! Fortunately, there are a lot of ways to get data into R, and we will look at those later. But R also has some very convenient datasets built-in. For this project, we are going to use the iris dataset which is included in R. This dataset contains four measurements of 150 flowers representing three different species of iris.
Let’s inspect the data. Type “iris” in the console window, and press Enter. You’ll see… well, you’ll see a bunch of data scroll by faster than you can tell what it is. Try this instead:
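That is, use head(), which prints only the first six rows:

```r
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# ...
```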
That’s better, now we can see how the iris data is organized. Each row is an observation, and each column is a variable.
Since “head” shows us the first six rows of our data, what do you suppose would happen if you typed “tail(iris)?” Try it and see!
You can learn more about the iris data by typing “?iris”, and you will learn that iris is a data frame containing a famous dataset created by a researcher named Edgar Anderson.
But wait… we typed these commands in the console, not in our new R script. Let’s fix that! Look at your Environment window, and you’ll see another tab labeled “History.” Click that, and you’ll see all of the commands you have run during this R session, in the order that you ran them. You can select a command by clicking on it, and you can select multiple commands using Ctrl-click. Select “head(iris)” and all of the subsequent commands, then click “To Source” in the menu bar. Now the commands are there in your “iris.R” script.
Let’s plot the iris data. In the iris.R window, type plot(iris$Petal.Length, iris$Petal.Width, main="Edgar Anderson's Iris Data") (or copy and paste it from here). When you hit Enter, the cursor moves to a new line and… nothing happens. That’s because you’ve edited the script, but not sent the command to R to be executed. To execute the command, you can put the cursor anywhere in that line and press Ctrl-Enter, or put the cursor in that line and click “Run” at the top of the window. You can also use your mouse to select multiple commands and then click Run, and the commands will execute in order.
Once you’ve executed that command, you’ll see the File window (in the bottom right corner of R Studio) change to the Plot window. Depending on your screen settings, you might need to click the “Zoom” button to get a good look at your plot. It’s a simple scatter plot, with petal length on the x axis, and petal width on the y axis. You can already see that there seems to be some clustering of the data. Let’s make the plots for each iris species a different color:
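The command from the original post isn’t preserved in this copy; a sketch consistent with the description — a single plot() call split across two lines, with point colors indexed by species — might look like:

```r
plot(iris$Petal.Length, iris$Petal.Width, main = "Edgar Anderson's Iris Data",
     pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
```

Here pch = 21 gives filled circles whose fill color comes from bg, and unclass(iris$Species) turns the species factor into the numbers 1 through 3, which index into the three-color vector.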
We’ve added some stuff to our basic plot, but don’t worry about those details right now; we are going to go in depth on plotting later. But do notice that the color-coding allows us to instantly see the relationship of petal width to length for the three different species of iris. Also notice that the above two lines are a single command. R doesn’t mind if a command is broken across multiple lines in a script, it uses the () to know when it gets to the end. It’s generally a good idea to break very long commands into multiple lines to make your code easier to read.
Let’s do one more thing before we call it a day. We’ll output our nice plot to a pdf file:
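The commands aren’t shown in this copy of the post; a sketch, wrapping the same plot in a pdf()/dev.off() pair (the file name matches the one mentioned in the next paragraph):

```r
pdf("iris_plot.pdf")   # open a PDF graphics device; plots now go to this file
plot(iris$Petal.Length, iris$Petal.Width, main = "Edgar Anderson's Iris Data",
     pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
dev.off()              # close the device, which finishes writing the file
</```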
You’ll see some cryptic text in the console screen, and if you click the tab to change the Plots window to the Files window, you’ll see that there is a new file called “iris_plot.pdf” in that window. Make sure your script file is saved. Now you can exit R Studio, and when you come back, you can easily re-run the same script to recreate the same plot. Even better, you have your input data, your processing script, and your output, all in the same folder. This could be very helpful when you come back to a project months later, look at the plot, and say to yourself, “Self, how did you make that plot?”
I strongly recommend you adopt this workflow for all of your projects:
Create an R Project.
Keep your inputs in the project folder.
Keep your processing scripts there, and run them in pieces or all at once.
Save your outputs in that folder.
You can do things in R Studio using your mouse, such as importing a data file by clicking on it, or saving a plot using the menu in the plot window. Don’t do that! Get in the habit of doing all of your loading, processing, and saving in your script file. You’ll make it much easier for someone else (or even yourself, months later) to understand how a table was created, how a figure was generated, and what transformations and calculations were done to your data.
Last week I said we would get to version control, and how to share your data and code, but we didn’t quite get there. So that will be our topic for next week.
There are a lot of tools available for doing data analytics, data science, or statistical analysis. So why should you choose R? I’ll answer that by contrasting R to some of my other favorite tools.
If you want to create data visualizations, Tableau is an amazing tool. With a few mouse clicks you can create anything from a bar chart to a heat map. The graphics it produces are beautiful and you don’t need to know any programming. Plus, if you don’t mind your work being public, it’s available at no cost. But these advantages come at a price. Tableau is not a full-featured programming language, it is primarily a tool for visualization. You can do some calculations within Tableau, but you will eventually need to solve a problem that it just can’t handle. In addition, everything you do in Tableau is done through mouse clicks. This makes it difficult to create a record of what you have done, and very difficult for someone else to duplicate your work.
SAS is arguably the gold standard of statistical software, but I may be a tiny bit biased in saying that. After all, I’m taking classes that are taught in SAS Hall. Base SAS gives you a very powerful set of analytical tools, and it can be expanded with add-on products for working in specialized fields or doing complex graphics. Virtually any statistical analysis or visualization you can conceive of can be created in SAS. The big downside here is the price. A SAS license is not inexpensive, and by the time you include a couple of the add-on packages, it can be very expensive. If you want to learn SAS, there are free courses available and they include access to a web-based version of the software, but to use it for other purposes, you’ll need a license.
Python is another alternative. It’s free, it’s powerful, and there is a lot of support available on the web. The downside here is a steep learning curve. If you aren’t already a programmer, getting started with Python can be difficult. On the other hand, it is ubiquitous in the analytics community. At a Research Triangle Analysts Unconference last spring, at least 75% of the presentations involved the use of Python. After attending that Unconference, I decided it was time for me to learn Python. If you already know another programming language, then it’s really not that hard.
A KDnuggets poll suggests that Python is second only to R in popularity for data science and analytics. But it’s still second. R is a full-featured programming language, it’s easy to learn, it’s powerful, and since it is so popular, there is a ton of support available.
Personally, I’d recommend learning both R and Python. There are some things that are just easier in one of them than they are in the other. For example, I find it much easier to scrape data from a website using Python than R. There are tools available to let you access Python from within an R program, and vice-versa. But I’d recommend learning R first.
Installing R and R Studio
So let’s get started. The first thing we will need to do is install the software. You’ll do that by downloading R from the Comprehensive R Archive Network, or CRAN. When you get to that page, select a mirror from which to download R. Just scroll through the list until you find your country, and select a server that is close to you. Once you select a server, you’ll see a window like this one:
In the top section, choose the link that matches your operating system. On the next window, select “base.” Then click on the download link at the top of your screen, and run the installer program after it finishes downloading.
Next, install R Studio. Just click on this link, choose the free version, and follow the prompts. When you are finished, open R Studio and you should see something like this:
The layout of R Studio is highly configurable, so after you have used it for a while you might want to change this. But for now, let’s leave it as-is.
R Studio is not R
R Studio is an “IDE,” or Integrated Development Environment, for the R programming language. You can also start R from a command prompt, or open it from the Windows start menu, and you’ll see something like this:
Technically, everything we are going to do in this series of posts can be done directly in R. But use R Studio, because a great deal of what we are going to do is much easier in that environment.
Let’s start using R
You should have R Studio open, and your cursor should be in the Console panel, next to a “>” prompt. Type:
> 2 + 2
And you should see:
[1] 4
What we are doing here is using R in interactive mode. Essentially, we have just turned your very expensive computer into a $5 calculator!
But we know that R can do a lot more than that. Let’s create a variable, and assign a vector to that variable:
> x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)
You’ll notice that nothing appears in your console pane except for a new “>” prompt, but now there is some information in the Environment pane that wasn’t there before. Let me explain what we just did.
“<-” is the assignment operator in the R language. It assigns a value to a variable. The equal sign also works as an assignment operator in R, but just don’t use it, please. It will work, but it will cause a lot of confusion later on. So get in the habit of using “<-”. In R Studio, you can use Alt-minus (hold down the Alt key and press the minus-sign key). This keyboard shortcut will insert the “<-” string.
“c()” is an R function which takes multiple arguments and combines them into a vector or a list – in this case, a vector.
When I look at the above line of code, I hear in my head, “x takes the vector 22, 3, …”
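A few quick console experiments with c() (outputs shown in the comments):

```r
c(1, 2, 3)            # → 1 2 3 (a numeric vector)
length(c(22, 3, 5))   # → 3
c(1, "a")             # → "1" "a" (mixed types are coerced to character)
```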
In the Environment pane you should see:
This shows that the variable “x” is a numeric variable, it contains 10 elements, and it lists each of those elements.
Now type mean(x), and you should get:
[1] 11.2
which is the average of the values in x.
Try some other functions on x, and see what you get.
Now let’s try a couple of graphs. Enter these commands and see what happens in the “Plot” window.
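The specific commands from the original post aren’t preserved here; as a sketch, two simple base-R graphs of the vector x would be:

```r
x <- c(22, 3, 5, 6, 25, 12, 15, 8, 9, 7)
plot(x)   # scatter plot: each value against its position in the vector
hist(x)   # histogram showing how the values are distributed
```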
That’s all for today. Next week, we will talk about some of the special features of R Studio, and organizing your work. We will also talk about version control, and tools for sharing your work online.
Sometime in the late 1970’s I bought my first computer, an ELF II from Netronics. It arrived at the house in a big, padded envelope. Inside the envelope was an empty printed circuit board, and several little plastic bags containing the electronics. I spent a Saturday afternoon with a soldering iron assembling and testing, and then proceeded to try to figure out what I would do with it.
This thing was certainly no supercomputer. It had a whopping 256 bytes of memory. Not megabytes, not kilobytes; just bytes. It featured an RCA COSMAC 1802 8-bit microprocessor. This chip’s main claim to fame was that it was designed to be highly resistant to radiation and electrostatic shock, which led to it being selected by NASA for a number of space projects. To a geeky teenager in the 70’s this gave it a lot of appeal, even though my little board was never going to be exposed to cosmic radiation.
The basic $99 kit was strictly a learning project. It included a two-digit hexadecimal display and a 16-key hexadecimal keypad, plus an Interrupt key and three toggle switches. Programming it consisted of using the toggle switches to step through memory, entering a hex code corresponding to a machine-language instruction, and then stepping to the next memory location. This was tedious, to say the least. Creating a program that did anything useful, or even interesting, in 256 bytes of memory was quite a challenge.
One of the most exciting features was the expansion bus. The board had room for five expansion sockets, and came with one socket plus a few SIP headers which could be used to connect to certain input/output signals. My first project with the ELF consisted of connecting a speaker to one of these sockets. By rapidly toggling the state of the output pin, I could produce tones from the speaker. By changing the speed of this toggling, I could produce tones of different frequencies. My crowning achievement used the interrupt feature of the chip to produce a tone only while a key was pressed, with a different tone for each key. That’s right, I created a primitive music synthesizer capable of producing sixteen different notes! Eat your heart out, Moog!
Just to give you an idea of what it was like programming one of these, here’s a code snippet:
0000 71 RESET DIS ; DISABLE INTERRUPTS
0001 00 DC 0
0002 F8FF FINDRAM LDI 0FFH ; FIND RAM, STARTING AT FFFF
0004 B4 PHI R4
0005 F8FF TRYAGAIN LDI 0FFH ; REPEAT...
0007 A4 PLO R4 ; - TEST TOP BYTE ON PAGE
0008 54 STR R4 ; - STORE 'FF'
0009 04 LDN R4 ; READ IT BACK,
000A FBFF XRI 0FFH ; COMPARE
000C C6 LSNZ ; - IF OK, STORE ALL 0'S,
000D 54 STR R4 ; READ BACK,
000E 04 LDN R4 ; COMPARE
000F 321A BZ RAMFOUND ; - IF OK, THEN RAM FOUND
0011 94 GHI R4 ; - IF NO MORE PAGES TO TEST,
0012 32DD BZ NORAM ; THEN GO TO NORAM
0014 A4 PLO R4 ; ELSE DEC. PAGE NUMBER
0015 24 DEC R4
0016 84 GLO R4
0017 B4 PHI R4 ; ...UNTIL DONE
Keep in mind that you don’t see all of this on the display. You start at memory location 0000, enter “71” on the keypad, flick a toggle switch to enter it, flick another toggle switch to advance to memory location 0001, enter “00”, and so on.
The expansion bus allowed much more, though. Netronics offered an expansion board that included an RS-232 serial port, a cassette interface for storing programs on tape, and a system monitor program that would display memory on a TV set or monitor. Other accessories included a prototyping board, a 4k RAM expansion, and a Tiny BASIC interpreter. Users found a wide variety of creative ways to use these, as documented in the Netronics newsletter. In those pre-Internet days, the newsletter was the only way of keeping up with the activities in the ELF community. Those prototyping features made the ELF II a predecessor of products like the Arduino and the Raspberry Pi.