Loading R Packages: library() or require()?

When I was an R newbie, I was taught to load packages by using the command library(package). In my Linear Models class, the instructor likes to use require(package). This made me wonder, are the commands interchangeable? What’s the difference, and which command should I use?

Interchangeable commands . . .

The way most users will use these commands, most of the time, they are actually interchangeable. That is, if you are loading a library that has already been installed, and you are using the command outside of a function definition, then it makes no difference if you use “require” or “library.” They do the same thing.

… Well, almost interchangeable

There are, though, a couple of important differences. The first one, and the most obvious, is what happens if you try to load a package that has not previously been installed. If you use library(foo) and foo has not been installed, your program will stop with the message “Error in library(foo) : there is no package called ‘foo’.” If you use require(foo), you will get a warning, but not an error. Your program will continue to run, only to crash later when you try to use a function from the library “foo.”

This can make troubleshooting more difficult, since the error is going to happen in, say, line 847 of your code, while the actual mistake was in line 4 when you tried to load a package that wasn’t installed.

I have come to prefer using library(package) for this reason.

There is another difference, and it makes require(package) the preferred choice in certain situations. Unlike library, require returns a logical value. That is, it returns (invisibly) TRUE if the package is available, and FALSE if the package is not.  This can be very valuable for sharing code. If you are loading a package on your own machine, you may know for sure that it is already installed. But if you are loading it in a program that someone else is going to run, it’s a good idea to check! You can do this with, for example:

if (!require(package)) install.packages('package')
library(package)

This will install “package” if it isn’t already installed, and then load it.
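Because the returned value is invisible, it helps to capture it in a variable if you want to branch on it. A minimal sketch, using ggplot2 purely as an example package name:

ok <- require("ggplot2")  # TRUE if the package loaded, FALSE (with a warning) if not
if (!ok) {
  message("ggplot2 is not installed; skipping the plotting code.")
}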

One final note: if you look at the source code, you can see that require calls library. This suggests that it is more efficient to use library. As a practical matter, it seems unlikely that this would make a difference on any reasonably modern computer.

 

An early (1753) clinical trial

Before there were pharmaceutical companies, CROs, or IRBs, a Scottish physician named James Lind conducted what is probably the first documented clinical trial. He published his results as A Treatise of the Scurvy in 1753. As an educated man of his time, Lind would no doubt have been familiar with Francis Bacon’s ideas on experimentation, and he begins his treatise by describing the setup for his experiment:

On the 20th May, 1747, I took twelve patients in the scurvy on board the Salisbury at sea. Their cases were as similar as I could have them. They all in general had putrid gums, the spots and lassitude, with weakness of their knees. They lay together in one place, being a proper apartment for the sick in the fore-hold; and had one diet in common to all, viz., water gruel sweetened with sugar in the morning; fresh mutton broth often times for dinner; at other times puddings, boiled biscuit with sugar etc.; and for supper barley, raisins, rice and currants, sago and wine, or the like.

Lind divided his patients into six cohorts: two sailors were given a quart of cider per day, two were dosed with sulfuric acid, which he refers to as “vitriol.” Two were given vinegar, and two were put on “a course of sea water.” Two were given “…the bigness of a nutmeg three times a day of an electuray [a medicinal paste] recommended by an hospital surgeon made of garlic, mustard seed, rad. raphan. [dried radish root], balsam of Peru and gum myrrh…” And finally, two patients were each given two oranges and a lemon every day for six days, when they ran out.

The consequence was that the most sudden and visible good effects were perceived from the use of the oranges and lemons; one of those who had taken them being at the end of six days fit for duty. . . The other was the best recovered of any in his condition, and being now deemed pretty well was appointed nurse to the rest of the sick.

In spite of these results, Lind continued to recommend fresh air, dry conditions, and exercise to treat scurvy. Needless to say, none of these things were available on a ship at sea in the 18th century. It would be another 42 years before the British navy supplied its ships with lemon juice.

Dr. Michael Bartholomew argues that “our modern understanding of scurvy and vitamin C has hindered our understanding of Lind’s own conception of his work and of the place within it of his clinical trials.” Lind had no idea, before or after his experiment, that there was some essential substance in citrus fruits that worked against scurvy. By taking five paragraphs of Lind’s treatise out of context, and projecting our own knowledge, we can make Lind look very modern indeed. But the sad truth is that Lind never truly understood the disorder to which he devoted his career.

 

Connecting R to an Oracle Database

R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.

The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first platform to offer an ODBC product, versions now exist for Linux and Macintosh as well. ODBC is built into current versions of Windows. If you are using a different operating system, you’ll need to install an ODBC driver manager.

Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:

As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:

Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:

For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional, and again, you can put in whatever you want.

TNS Service Name is the name that you (or your company’s database administrator) assigned when configuring the Oracle database. And “User ID” is the user ID you use to log in to the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.

Now you are ready to connect R to the database.

Here’s the R code that you need:

# Load RODBC package
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")

# Query the database and put the results into the data frame
# "dataframe"

dataframe <- sqlQuery(channel, "
  SELECT *
  FROM SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)

A couple of comments about this code are in order:

First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:

pswd <- readline("Input Password: ")
channel <- odbcConnect("DATABASE", uid="USERNAME",  pwd=pswd)

This will enable the connection to be made without compromising the security of the password.
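Another option, if you are running your code inside RStudio, is to pop up a password dialog box instead of typing the password at the console. This is a sketch that assumes the rstudioapi package is installed:

# Prompt for the password in a dialog box (RStudio only)
pswd <- rstudioapi::askForPassword("Database password")
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd=pswd)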

Second, sqlQuery() will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term “query” includes any valid SQL statement, including table creation, updates, and so on, as well as SELECTs.

Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
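For example, rather than pulling an entire table into R and summarizing it there, you can let Oracle do the aggregation and return only the summary rows. A sketch, with made-up schema, table, and column names:

# Oracle does the grouping; R receives just one row per region
sales_by_region <- sqlQuery(channel, "
  SELECT region, SUM(amount) AS total_sales
  FROM SCHEMA.SALES
  GROUP BY region")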

Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies

 

Book Review: Seven Strategies and Ten Tactics to become a Thought Leader

Seven Strategies and Ten Tactics to become a Thought Leader, by F. Annie Pettit, PhD, FMRIA. 64 pages, $5.50 on Amazon.

Don’t be fooled by the small size or the low price of this little book. I consider it one of the most valuable in my collection, and the time I’ve spent reading it (and re-reading it, over and over) has paid dividends far beyond what I would have expected.

What does Dr. Pettit mean by a “Thought Leader?” She explains it in her introduction:

“Being a thought leader means that people have learned to seek out your advice and opinions because you have proven your insights are unique and meaningful, your expertise is trustworthy, you seek to remain at the forefront of knowledge in your field, you are open to being respectfully challenged on your opinions, and you are genuinely happy to share your knowledge with people.”

The Seven Strategies are:

  1. Recognize your expertise
  2. Focus
  3. Be genuine
  4. Be clickbait
  5. Use Your voice for good
  6. Don’t be a sales pitch
  7. Start now

And the Ten Tactics:

  1. Leverage your credentials
  2. Learn
  3. Write
  4. Speak virtually
  5. Speak in person
  6. Meet
  7. Volunteer
  8. Mentor
  9. Share
  10. Find ideas

I won’t go into more detail on any of these, because you really should read the book. But I will talk about the first Strategy, “Recognize your expertise.” You may be thinking, “I’m no expert.” Pettit has an answer for that, in large, bold, letters on the page after the title page:

You are an expert.

Everyone is an expert on something, according to Pettit, and she devotes three pages to helping you discover where your expertise lies.

Becoming a thought leader is good for your personal growth and it’s good for business. Read the book. Follow the advice. Become a thought leader.

But be warned: As a thought leader, you’ll have to think twice before you say the stupid stuff you now say so freely. If you’re like me, expect to spend some time deleting stuff from your social media after you read this book!

 

Bringing Down the Banking System: Lessons From Iceland

I was very fortunate today to attend the STOR Colloquium at UNC. Gudrun Johnsen of the University of Iceland gave a talk, “If you look close enough you can see a whole lot: Data collection and analysis of the Parliamentary Investigation Commission looking into the Icelandic Banking Collapse in 2008.”

The combined bankruptcy of the three largest banks in Iceland in October 2008 is the 3rd largest bankruptcy in world history, behind Lehman Brothers and Washington Mutual.

Ms. Johnsen started with a history lesson on the Pujo investigation into the American “money trusts” in 1912-1913. This investigation revealed a system of overlapping financial networks used to dominate utilities, railroads, banking, and financial infrastructure. While the committee’s work resulted in the passage of the Federal Reserve Act and the Clayton Antitrust Act, it was severely hampered by insufficient access to data.

Fast forward to 2008, when the banking system in Iceland collapsed. The Icelandic Parliament’s Special Investigation Commission (SIC) did not have the problem of insufficient access to data. Parliament lifted all confidentiality from bank employees, government officials, and others. The SIC was given the power to issue subpoenas, and the authority to walk into any bank and examine or seize any records, in any form.

What they uncovered was astounding. The banks had grown 20-fold in size in just seven years, to the point where their outstanding loans were 20 times the country’s GDP. The SIC also discovered a web of ownership, related-party lending, market manipulation, and flawed incentives. A bank would purchase a corporation, lend money to that corporation, and the corporation would then invest the money in the bank. The end result was that the bank’s own shares were pledged as collateral for loans made by the bank. A number of holding companies were created to prevent any firm from having more than 50% control, which would trigger consolidation under Icelandic law. There were circular arrangements where Company A owned Company B, which owned Company C, which owned Company A.

So all of this money was being lent and borrowed, by entities who had no “skin in the game.” While the American banks were “too big to fail,” the Icelandic banks had grown so big, so fast, that they were “too big to save.”

Johnsen concluded her talk as she concludes her book: with the results of the investigations (top management at all three banks have been sent to prison), and some ideas for future research.

I greatly enjoyed her talk, and I’m looking forward to reading the book.

More tidyverse: using dplyr functions

This week, we return to our “Getting Started With R” series. Today we are going to look at some tools from the “dplyr” package. Hadley Wickham, the creator of dplyr, calls it “A Grammar of Data Manipulation.”

filter()

Use filter() for subsetting data by rows. It takes logical expressions as inputs, and returns all rows of your data for which those expressions are true.

To demonstrate, let’s start by loading the tidyverse library (which includes dplyr), and we’ll also load the gapminder data.

library(tidyverse)
library(gapminder)

Here’s how filter() works:

filter(gapminder, lifeExp<30)

Produces this output:

# A tibble: 2 × 6
      country continent  year lifeExp     pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
1 Afghanistan      Asia  1952  28.801 8425333  779.4453
2      Rwanda    Africa  1992  23.599 7290203  737.0686
> 

The pipe operator

The pipe operator is one of the great features of the tidyverse. In base R, you often find yourself calling functions nested within functions nested within… you get the idea. The pipe operator %>% takes the object on the left-hand side, and “pipes” it into the function on the right hand side.

For example:

> gapminder %>% head()
# A tibble: 6 × 6
      country continent  year lifeExp      pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
1 Afghanistan      Asia  1952  28.801  8425333  779.4453
2 Afghanistan      Asia  1957  30.332  9240934  820.8530
3 Afghanistan      Asia  1962  31.997 10267083  853.1007
4 Afghanistan      Asia  1967  34.020 11537966  836.1971
5 Afghanistan      Asia  1972  36.088 13079460  739.9811
6 Afghanistan      Asia  1977  38.438 14880372  786.1134
> 

This is the equivalent of saying “head(gapminder).” So far, that doesn’t seem a lot easier… but wait a bit and you’ll see the beauty of the pipe.
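To see where it starts to pay off, here is the same filter() call from above written both ways:

# Nested calls read inside-out ...
head(filter(gapminder, lifeExp < 30))

# ... while the piped version reads left to right
gapminder %>%
  filter(lifeExp < 30) %>%
  head()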

select()

We talked about using filter() to subset data by rows. We can use select() to do the same thing for columns:

> select(gapminder, year, lifeExp)
# A tibble: 1,704 × 2
    year lifeExp
   <int>   <dbl>
1   1952  28.801
2   1957  30.332
3   1962  31.997
4   1967  34.020
5   1972  36.088
6   1977  38.438
7   1982  39.854
8   1987  40.822
9   1992  41.674
10  1997  41.763
# ... with 1,694 more rows

Here’s the same thing, but using pipes, and sending it through “head()” to make the display more compact:

> gapminder %>%
+   select(year, lifeExp) %>%
+   head(4)
# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1952  28.801
2  1957  30.332
3  1962  31.997
4  1967  34.020
> 

We are going to be making some changes to the gapminder data, so let’s start by creating a copy of the data. That way, we don’t have to worry about changing the original data.

new_gap <- gapminder

mutate()

mutate() is a function that defines a new variable and inserts it into your tibble. For example, gapminder has GDP per capita and population; if we multiply these we get the GDP.

new_gap %>%
  mutate(gdp = pop * gdpPercap)

Note that the above code creates the new column and displays the resulting tibble; to actually keep the new column, we need to assign the result back with the “<-” operator, as shown below.
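Saving the new column back into new_gap looks like this:

new_gap <- new_gap %>%
  mutate(gdp = pop * gdpPercap)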

arrange()

arrange() reorders the rows in a data frame. The gapminder data is currently arranged by country, and then by year. But what if we wanted to look at it by year, and then by country?

new_gap %>%
  arrange(year, country)
# A tibble: 1,704 × 6
       country continent  year lifeExp      pop  gdpPercap
        <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
1  Afghanistan      Asia  1952  28.801  8425333   779.4453
2      Albania    Europe  1952  55.230  1282697  1601.0561
3      Algeria    Africa  1952  43.077  9279525  2449.0082
4       Angola    Africa  1952  30.015  4232095  3520.6103
5    Argentina  Americas  1952  62.485 17876956  5911.3151
6    Australia   Oceania  1952  69.120  8691212 10039.5956
7      Austria    Europe  1952  66.800  6927772  6137.0765
8      Bahrain      Asia  1952  50.939   120447  9867.0848
9   Bangladesh      Asia  1952  37.484 46886859   684.2442
10     Belgium    Europe  1952  68.000  8730405  8343.1051
# ... with 1,694 more rows

group_by() and summarize()

The group_by() function adds grouping information to your data, which then allows you to do computations by groups. The summarize() function is a natural partner for group_by(). summarize() takes a dataset with n observations, calculates the requested summaries, and returns a dataset with one row per group:

new_gap %>%
  group_by(continent) %>%
  summarize(n = n())

The functions you’ll apply within summarize() include classical statistical summaries, like mean(), median(), var(), sd(), mad(), IQR(), min(), and max(). Remember that they are functions that take n inputs and distill them down into 1 output.

new_gap %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))
# A tibble: 5 × 2
  continent avg_lifeExp
     <fctr>       <dbl>
1    Africa    48.86533
2  Americas    64.65874
3      Asia    60.06490
4    Europe    71.90369
5   Oceania    74.32621

A wondrous example

To fully appreciate the wonders of the pipe operator and the dplyr data manipulation commands, take a look at this example. It comes from Jenny Bryan’s excellent course, STAT545, at the University of British Columbia (to whom I owe a debt for much of the information included in this series of blog posts).

new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>% 
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>% 
  arrange(worst_le_delta)
Source: local data frame [5 x 3]
Groups: continent [5]

  continent     country worst_le_delta
     <fctr>      <fctr>          <dbl>
1    Africa      Rwanda        -20.421
2      Asia    Cambodia         -9.097
3  Americas El Salvador         -1.511
4    Europe  Montenegro         -1.464
5   Oceania   Australia          0.170

To quote Jenny: “Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.”

 

Calculating required sample size in R and SAS

Today we are going to digress from our ongoing “Intro to R” series, and talk about a subject that’s been on my mind lately: sample sizes.

An important question when designing an experiment is “How big a sample do I need?” A larger sample will give more accurate results, but at a cost. Use too small a sample, and you may get inconclusive results; too large a sample, and you’re wasting resources.

To calculate the required sample size, you’ll need to know four things:

  1. The size of the response you want to detect
  2. The variance of the response
  3. The desired significance level
  4. The desired power

Delta

Suppose you are comparing a treatment group to a placebo group, and you will be measuring some continuous response variable which, you hope, will be affected by the treatment. We can consider the mean response in the treatment group, μ1, and the mean response in the placebo group, μ2. We can then define Δ = μ1 – μ2. The smaller the difference you want to detect, the larger the required sample size.

Variance

Of the four variables that go into the sample size calculation, the variance of the responses can be the most difficult to determine. Usually, before you do your experiment, you don’t know what variance to expect. Investigators often conduct a pilot study to determine the expected variance, or information from a previous published study can be used.

The effect size combines the minimal relevant difference and the variability into one measurement, Δ/σ.

Significance

Significance is equal to 1 – α, where α is the probability of making a Type 1 error. That is, alpha represents the chance of falsely rejecting H0 and picking up a false-positive effect. Alpha is usually set at 0.05, for 95% significance.

Power

The power of a test is 1-β, where beta is the probability of a Type 2 error (failing to reject the null hypothesis when the alternative hypothesis is true). In other words, if you have a 20% chance of failing to detect a real difference, then the power of your test is .8.

Sample Size Calculation

The calculation for the total sample size is:

 n=\frac{4(Z_{\alpha}+Z_{\beta})^2\sigma^2}{\Delta^2}

For a two-sided test, we use Zα/2 instead of Zα.

For example, suppose we want to be able to detect a difference of 20 units, with 90% power using a two-sided t-test, and a .05 significance level. We are expecting, based on previous research, that the standard deviation of the responses will be about 60 units.

 

In this example, α=.05, β=.10, Δ=20, and σ=60. Zα/2=1.96, and Zβ=1.28. So we have:

n=\frac{4(1.96+1.28)^2(60)^2}{20^2}\approx 378

or, about 189 for each treatment group.
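As a check, here is the same hand calculation done in base R:

alpha <- 0.05; beta <- 0.10
delta <- 20;  sigma <- 60
z_a <- qnorm(1 - alpha/2)  # 1.96 for a two-sided test
z_b <- qnorm(1 - beta)     # 1.28
4 * (z_a + z_b)^2 * sigma^2 / delta^2  # about 378 in total, or 189 per group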

Sample Size in R

You could write a function in R to do the above calculation, but fortunately, you don’t need to. The pwr library has done it for you. In this case, we will use the pwr.t.test() function.

pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))

In this case, we will leave out the “n=” parameter, and it will be calculated by R. If we fill in a sample size, and use “power = NULL”, then it will calculate the power of our test.

In this equation, d is the effect size, so we will calculate that from our delta and sigma values. In R, it looks like this:

> library(pwr)
> delta <- 20
> sigma <- 60
> d <- delta/sigma
> pwr.t.test(d=d, sig.level=.05, power = .90, type = 'two.sample')

    Two-sample t test power calculation

                      n = 190.0991
                      d = 0.3333333
           sig.level = 0.05
               power = 0.9
        alternative = two.sided

NOTE: n is number in *each* group
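As mentioned above, the same function can be run in the other direction: supply a sample size and leave power as NULL, and pwr.t.test() will return the power instead. A quick sketch:

# With 190 subjects per group, how much power does the test have?
pwr.t.test(n = 190, d = 20/60, sig.level = .05, power = NULL,
           type = 'two.sample')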

Sample Size in SAS

In SAS, we can use PROC POWER to do the same calculation. One difference is that PROC POWER requires us to enter a value for the mean of each group. Since what we are really interested in is the difference, we can enter 0 for group 1 and 20 for group 2, so that the difference in means will be 20. We also need to enter the standard deviation, unlike in R, where we calculated the effect size separately. The significance level defaults to .05, so we don’t need to enter it.

proc power;
   twosamplemeans test=diff
   groupmeans = 0 | 20
   stddev = 60
   npergroup = .
   power = 0.9;
 run;

And here is the output:

 

The alert reader has, by now, noticed a discrepancy: when we manually calculated the desired sample size, we got 189 per group. R gave us a result of 190.0991, and SAS says it’s 191. Why? The simple answer is that neither program is using the above formula. pwr.t.test in R uses the uniroot() function to calculate n, and SAS uses a different formula. Furthermore, SAS and R are actually giving the same result, but SAS rounds up to 191. You can’t have 0.0991 of a test subject, and you don’t want to underpower the test, so it’s proper to round up. If you really want the details, the source code for pwr.t.test is on GitHub, and the method SAS uses to calculate n is on page 4964 of the SAS/STAT User Guide.

 

 

 

Dataframes and the tidyverse

The data frame is the primary structure for working with data in R. Whenever you have data that is arranged in a spreadsheet-like fashion, the default receptacle for that data in R is the data frame. In a data frame, each column contains measurements on one variable, and each row contains measurements on one case. All of the data in a column must be of the same type (numeric, character, or logical).

R has been around for more than 20 years now, and some things that worked well 20 years ago are less than ideal now. Consider how your mobile phone has changed over the last 20 years:

Making changes to things as basic as data frames in R is difficult. If you change the definition of a data frame, then all of the existing R programs that use data frames would have to be re-written to use the new definition. To avoid this kind of problem, most development in R takes place in packages.

The R package “tibble” provides tools for working with an alternative version of the data frame. A tibble is a data frame, but some things have been changed to make using them a little bit easier. The tibble package is part of the tidyverse, a set of packages that provide a useful set of tools for data cleaning and analysis. The tidyverse is extensively documented in the book R For Data Science. In keeping with the open-source nature of R, that book is available free online: http://r4ds.had.co.nz/.

You can load tibble, along with the rest of the tidyverse tools, like this:

library(tidyverse)

The first time you do this, you will probably get an error message.

> library(tidyverse)
Error in library(tidyverse) : there is no package called ‘tidyverse’

In that case, you need to install tidyverse:

install.packages('tidyverse')

You only need to do this installation once, but when you start a new R session you will need to reload the package with the library() command.

Tibbles are one of the unifying features of the tidyverse, but most other R packages produce data frames. You can use the “as_tibble()” command to convert a data frame to a tibble:

> as_tibble(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows

There are some things that happen when you create a normal data frame that don’t happen when you create a tibble. On the plus side, tibble() doesn’t change the structure of your data. The data.frame() command will convert character strings to factors, unless you remember to tell it not to do that. Tibble won’t create row names. Tibble also won’t change the names of your variables.
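Here is a small sketch of that difference. (Note: this assumes an older version of R, where data.frame() still defaulted to stringsAsFactors = TRUE.)

df <- data.frame(name = c("setosa", "versicolor"))
str(df$name)  # a factor, unless you set stringsAsFactors = FALSE

tb <- tibble(name = c("setosa", "versicolor"))
str(tb$name)  # still a plain character vector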

This last feature can seem like a bug if you aren’t expecting it. One very common way to get data into R is to import it from a CSV file. CSV files are often created from Excel spreadsheets, and the column headings on Excel spreadsheets often don’t conform to the R standards for variable names. Since tibble doesn’t change variable names, you can end up with column names that are not proper R variable names. For example, they might include spaces or not start with a letter. To refer to these names, you’ll need to enclose them in backticks. For example:

`Feb Data` #contains space
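For instance, a minimal sketch with a made-up column name that contains a space:

feb <- tibble(`Feb Data` = c(10, 12, 9))
feb$`Feb Data`  # backticks are required to refer to this column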

Tibbles have a nice print method that, by default, shows only the first ten rows of data, and the number of columns that will fit on a screen. This keeps you from flooding your console with data.

> irises <- as_tibble(iris)
> irises
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with 140 more rows

You can control the number of rows and the width of the displayed data by explicitly calling ‘print.’ ‘width = Inf’ will print all of the columns.

irises %>%
 print(n=5, width = Inf)

You can look at the structure of an object, and get an overview of it, with the str() command:

> str(irises)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Here are some more ways to look at basic information about a tibble (or a regular data frame):

> names(irises)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species" 
>  ncol(irises)
[1] 5
> length(irises)
[1] 5
> dim(irises)
[1] 150 5
> nrow(irises)
[1] 150

summary() provides a statistical overview of a data set:

> summary(irises)
  Sepal.Length    Sepal.Width     Petal.Length 
 Min.   :4.300   Min.   :2.000   Min.   :1.000 
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600 
 Median :5.800   Median :3.000   Median :4.350 
 Mean   :5.843   Mean   :3.057   Mean   :3.758 
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100 
 Max.   :7.900   Max.   :4.400   Max.   :6.900 
  Petal.Width          Species  
 Min.   :0.100   setosa    :50  
 1st Qu.:0.300   versicolor:50  
 Median :1.300   virginica :50  
 Mean   :1.199                  
 3rd Qu.:1.800                  
 Max.   :2.500

To specify a single variable within a data frame or tibble, use the dollar sign $. R has another way of doing this, using column numbers, but using the dollar sign will make it much easier to understand your code if someone else needs to use it, or if you come back to look at it months after writing it.

> head(irises$Sepal.Length)
[1] 5.1 4.9 4.7 4.6 5.0 5.4
> summary(irises$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900

To recap:

  1. Use data frames, and in particular, use the tidyverse and tibbles.
  2. Always understand the dimensions of your data frame: the number of rows and columns.
  3. Understand what type of variables you have in your columns.
  4. Refer to your columns by name, using $, to make your code more readable.
  5. When in doubt, use str() on an object.

Version Control, File Sharing, and Collaboration Using GitHub and RStudio

This is Part 3 of our “Getting Started with R Programming” series. For previous articles in the series, click here: Part 1, Part 2.

This week, we are going to talk about using git and GitHub with RStudio to manage your projects.

Git is a version control system, originally designed to help software developers work together on big projects. Git works with a set of files, which it calls a “repository,” to manage changes in a controlled manner. Git also works with websites like GitHub, GitLab, and BitBucket, to provide a home for your git-based projects on the internet.

If you are a hobbyist, and aren’t working on projects with other programmers, why would you want to bother with any of this? Incorporating version control into your workflow might be more trouble than it’s worth if you never have to collaborate with others or share your files. But most of us will, eventually, need to do this. It’s a lot easier if it’s built into your workflow from the start.

More importantly, there are tremendous advantages to using the web-based sites like GitHub. At the very minimum, GitHub serves as an off-site backup for your precious program files.



In addition, GitHub makes it easy to share your files with others. GitHub users can fork or clone your repository. People who don’t have GitHub accounts can still browse your shared files online, and even download the entire repository as a zip file.

And finally, once you learn Markdown (which we will be doing here, very soon) you can easily create a webpage for your project, hosted on GitHub, at no cost. This is most commonly used for documentation, but it’s a simple and easy way to get on the web. Just last week, I met a young programmer who showed me his portfolio, hosted on GitHub.

OK, let’s get started!

Register a GitHub Account

First, register a free GitHub account: https://github.com. For now, just use the free service. You can upgrade to a paid account, create private repositories, join organizations, and other things, later. But one thing you should think about at the very beginning is your username. I would suggest using some variant of your real name. You’ll want something that you feel comfortable revealing to a future potential employer. Also consider that things change; don’t include your current employer, school, or organization as part of your user name.

If you’ve been following along in this series, you’ve already installed R and R Studio. Otherwise, you should do that now. Instructions are in Part 1 of this series.

Installing and Configuring Git

Next, you’ll need to install git. If you are a Windows user, install Git for Windows. Just click on the link and follow the instructions. Accept any default settings that are offered during installation. This will install git in a standard location, which makes it easy for RStudio to find it. And it installs a BASH shell, which is a way to use git from a command line. This may come in handy if you want to use git outside of R/RStudio.

LINUX users can install git through their distro’s package manager. Mac users can install git from https://git-scm.com/downloads.

Now let’s tell git who you are. Go to a command prompt (or, in R Studio, go to Tools > Shell) and type:

git config --global user.name 'Your Name'

For Your Name, substitute your own name, of course. You could use your GitHub user name, or your actual first and last name. It should be something recognizable to your collaborators, as your commits will be tagged with this name.

git config --global user.email 'you@whatever.com'

The email address you put here must be the same one you used when you signed up for GitHub.

To make sure this worked, type:

git config --global --list

and you should see your name and email address in the output.

Connect Git, GitHub, and RStudio

Let’s run through an exercise to make sure you can pull from, and push to, GitHub from your computer.

Go to https://github.com and make sure you are logged in. Then click the green “New Repository” button. Give your repository a name. You can call it whatever you want; we are going to delete it shortly. For demonstration purposes, I’m calling mine “demo.” You have the option of adding a description. You should click the checkbox that says “Initialize this repository with a README.” Then click the green “Create Repository” button. You’ve created your first repository!

Click the green “Clone or download” button, and copy the URL to your clipboard. Go to the shell again, and take note of what directory you are in. I’m going to create my repository in a directory called “tmp,” so at the command prompt I typed “mkdir ~/tmp” followed by “cd ~/tmp”.

To clone the repository on your local computer, type “git clone” followed by the url you copied from GitHub. The results should look something like this:

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp
$ git clone https://github.com/gbelton/demo.git
Cloning into 'demo'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.

Make this your working directory,  list its files, look at the README file, and check how it is connected to GitHub. It should look something like this:

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp
$ cd demo

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ ls
README.md

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ head README.md
# demo
geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ git remote show origin
* remote origin
  Fetch URL: https://github.com/gbelton/demo.git
  Push URL: https://github.com/gbelton/demo.git
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)

Let’s make a change to a file on your local computer, and push that change to GitHub.

echo "This is a new line I wrote on my computer" >> README.md

git status

And you should see something like this:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
 (use "git add <file>..." to update what will be committed)
 (use "git checkout -- <file>..." to discard changes in working directory)

 modified: README.md

no changes added to commit (use "git add" and/or "git commit -a")

Now commit the changes, and push them to GitHub:

git add -A
git commit -m "A commit from my local computer"
git push

Git will ask you for your GitHub username and password if you are a new user. Provide them when asked.

The -m flag on the commit is important. If you don’t include it, git will prompt you for it. You should include a message that will tell others (or yourself, months from now) what you are changing with this commit.

Now go back to your browser, and refresh. You should see the line you added to your README file. If you click on commits, you should see the one with the message “A commit from my local computer.”

Now let’s clean up. You can delete the repository on your local computer just by deleting the directory, as you would any other directory on your computer. On GitHub, (assuming you are still on your repository page) click on “settings.” Scroll down until you see the red “Danger Zone” flag, and click on “Delete This Repository.” Then follow the prompts.

Connecting GitHub to RStudio

We are going to repeat what we did above, but this time we are going to do it using RStudio.

Once again, go to GitHub, click “New Repository,” give it a name, check the box to create a README, and create the repository. Click the “clone or download” button and copy the URL to your clipboard.

In RStudio, start a new project: File > New Project > Version Control > Git

In the “Repository URL” box, paste in the URL that you copied from GitHub. Put something (maybe “demo”) in the box for the Directory Name. Check the box marked “Open in New Session.” Then click the “Create Project” button.

And, just that easy, you’ve cloned your repository!

In the file pane of RStudio, click README.md, and it should open in the editor pane. Add a line, perhaps one that says “This line was added in R Studio.” Click the disk icon to save the file.

Now we will commit the changes and push them to GitHub. In the upper right pane, click the “Git” tab. Click the “staged” box next to README.md. Click “Commit” and a new box will pop up. It shows you the staged file, and at the bottom of the box you can see exactly what changes you have made. Type a commit message in the box at the top right, something like “Changes from R Studio.” Click the commit button. ANOTHER box pops up, showing the progress of the commit. Close it after it finishes. Then click “Push.” ANOTHER box pops up, showing you the progress of your push. It may ask you for a user name and password. When it’s finished, close it. Now go back to GitHub in your web browser, refresh, and you should see your changed README file.

Congratulations, you are now set up to use git and GitHub in R Studio!

Getting Started With R: A Beginner’s Guide, Part 2

Last week, we installed R and R Studio, and we tried out a few simple R commands in the console. But using R in interactive mode, while powerful, has some limits. Today we are going to learn how to use R as a programming language, and we will write our first R Script. But first, let’s look at how we can use R Studio to keep our work organized.

A lot of tutorials introduce these topics much later, if at all. I think it’s very important to learn how to use these organizational tools from the very beginning. Eventually, you are going to need to leave R to go do something else, and you’ll want to be able to come back to R and continue what you were doing. You will have multiple R projects going at the same time, and you’ll want to be able to keep them separated.

You’ve probably closed R studio since last week’s lesson. When you quit R, a box popped up asking “Save workspace image to ~/.Rdata?” If you choose “Yes” at this prompt, when you restart R Studio, you will see in the Environment pane the objects you created in your previous session. In that same pane, you can select the “History” tab, and see all of the commands you ran in that last session. This is not the ideal way to start, stop, and re-start your work in R, but it’s a start.

Working Directory

Your “working directory” is where R will look (by default) for any files you want to load, and where R (again, by default) will save any files that you write to disk. You can check your working directory with:

> getwd()

It’s also displayed at the top of the R Studio console.

You can change your working directory directly with the command:

> setwd("~/MyNewDirectory")

The above command assumes that there is already a directory called “MyNewDirectory,” and that it is a subdirectory of your home directory. You can also change your working directory by navigating to it in the Files pane of R Studio, and then selecting “More” and “Set as Working Directory” from the Files menu.

Note well that I said you can do these things, not that you should do them. As we will see, there is a better way:

R Studio Projects

As a general rule, it’s a very good idea to keep all the files associated with a project in one place. That would include data files, R scripts, figures, analytical results, etc. And R Studio makes it very easy to accomplish this via its support for projects.

To demonstrate, let’s make a project to use for the rest of this series of tutorials. In the menu bar at the top of R Studio, click “File” then “New Project.” You’ll see this:

As you can see, you can create a new directory, or choose one that already exists on your computer. The third option, Version Control, is something we will talk about later.

If you choose “New Directory,” you will get an additional menu with three choices: Empty Project, R Package, and Shiny Web Application. Choose Empty Project. Then give your new project a name. I called mine “tutorials.”

Now let’s create an R script. An R script is a file containing a series of commands that can be executed by R; in other words, a computer program.

In R Studio, click the File menu item at the top left of the screen, then select New File, and then R Script. Or you can use the keyboard shortcut, Ctrl-Shift-N. Now the console window no longer takes up the entire left side of your window; it has been split in half. The top left pane is now labeled “Untitled1.” Click on the little picture of a floppy disk, and a dialogue box will pop up, allowing you to name your script. Let’s name this one “iris.R.” By convention, the file names of R scripts end with “.R” or “.r,” and you should follow this convention unless you have a good reason to do otherwise.

Since R is primarily a tool for analyzing data, we are going to need some data! Fortunately, there are a lot of ways to get data into R, and we will look at those later. But R also has some very convenient datasets built-in. For this project, we are going to use the iris dataset which is included in R. This dataset contains four measurements of 150 flowers representing three different species of iris.

Let’s inspect the data. Type “iris” in the console window, and press Enter. You’ll see… well, you’ll see a bunch of data scroll by faster than you can tell what it is. Try this instead:

> head(iris)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         5.1         3.5          1.4         0.2  setosa
2         4.9         3.0          1.4         0.2  setosa
3         4.7         3.2          1.3         0.2  setosa
4         4.6         3.1          1.5         0.2  setosa
5         5.0         3.6          1.4         0.2  setosa
6         5.4         3.9          1.7         0.4  setosa

That’s better, now we can see how the iris data is organized. Each row is an observation, and each column is a variable.

Since “head” shows us the first six rows of our data, what do you suppose would happen if you typed “tail(iris)”? Try it and see!

You can learn more about the iris data by typing “?iris”, and you will learn that iris is a data frame containing a famous dataset created by a researcher named Edgar Anderson.

But wait… we typed these commands in the console, not in our new R script. Let’s fix that! Look at your Environment window, and you’ll see another tab labeled “History.” Click that, and you’ll see all of the commands you have run during this R session, in the order that you ran them. You can select a command by clicking on it, and you can select multiple commands using Ctrl-click. Select “head(iris)” and all of the subsequent commands, then click “To Source” in the menu bar. Now the commands are there in your “iris.R” script.

Let’s plot the iris data. In the iris.R window, type plot(iris$Petal.Length, iris$Petal.Width, main="Edgar Anderson's Iris Data") (or copy and paste it from here). When you hit Enter, the cursor moves to a new line and… nothing happens. That’s because you’ve edited the script, but not sent the command to R to be executed. To execute the command, put the cursor anywhere in that line and press Ctrl-Enter, or click “Run” at the top of the window. You can also use your mouse to select multiple commands and then click Run, and the commands will execute in order.

Once you’ve executed that command, you’ll see the File window (in the bottom right corner of R Studio) change to the Plot window. Depending on your screen settings, you might need to click the “Zoom” button to get a good look at your plot. It’s a simple scatter plot, with petal length on the x axis, and petal width on the y axis. You can already see that there seems to be some clustering of the data. Let’s make the plots for each iris species a different color:

plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")
  [unclass(iris$Species)], main="Edgar Anderson's Iris Data")

We’ve added some stuff to our basic plot, but don’t worry about those details right now; we are going to go in depth on plotting later. But do notice that the color-coding allows us to instantly see the relationship of petal width to length for the three different species of iris. Also notice that the above two lines are a single command. R doesn’t mind if a command is broken across multiple lines in a script; it uses the parentheses to tell when the command is complete. It’s generally a good idea to break very long commands into multiple lines to make your code easier to read.

Let’s do one more thing before we call it a day. We’ll output our nice plot to a pdf file:

dev.print(pdf, "iris_plot.pdf")

You’ll see some cryptic text in the console screen, and if you click the tab to change the Plots window to the Files window, you’ll see that there is a new file called “iris_plot.pdf” in that window. Make sure your script file is saved. Now you can exit R Studio, and when you come back, you can easily re-run the same script to recreate the same plot. Even better, you have your input data, your processing script, and your output, all in the same folder. This could be very helpful when you come back to a project months later, look at the plot, and say to yourself, “Self, how did you make that plot?”
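As an aside, another common way to save a plot (a sketch, not part of the script above) is to open a pdf graphics device first, draw the plot, and then close the device:

pdf("iris_plot2.pdf")  # open a pdf graphics device
plot(iris$Petal.Length, iris$Petal.Width,
     main = "Edgar Anderson's Iris Data")
dev.off()              # close the device and write the file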

I strongly recommend you adopt this workflow for all of your projects:

  1. Create an R Project.
  2. Keep your inputs in the project folder.
  3. Keep your processing scripts there, and run them in pieces or all at once.
  4. Save your outputs in that folder.

You can do things in R studio using your mouse, such as importing a data file by clicking on it, or saving a plot using the menu in the plot window. Don’t do that! Get in the habit of doing all of your loading, processing, and saving, in your script file. You’ll make it much easier for someone else (or even yourself, months later) to understand how a table was created, how a figure was  generated, and what transformations and calculations were done to your data.

Last week I said we would get to version control, and how to share your data and code, but we didn’t quite get there. So that will be our topic for next week.
