Programming with R
Data types and structures
Learning Objectives {.objectives}
- Expose learners to the different data types in R
- Learn how to create vectors of different types
- Be able to check the type of vector
- Learn about missing data and other special values
- Become familiar with the different data structures (lists, matrices, data frames)
Understanding Basic Data Types in R
To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on those.
This is very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.
Everything in R is an object.
R has 6 atomic vector types (although we will not discuss the raw type in this workshop).
- character
- numeric (real or decimal)
- integer
- logical
- complex
By atomic, we mean the vector only holds data of a single type.
- character: "a", "swc"
- numeric: 2, 15.5
- integer: 2L (the L tells R to store this as an integer)
- logical: TRUE, FALSE
- complex: 1+4i (complex numbers with real and imaginary parts)
R provides many functions to examine features of vectors and other objects, for example
- class() - what kind of object is it (high-level)?
- typeof() - what is the object’s data type (low-level)?
- length() - how long is it? What about two dimensional objects?
- attributes() - does it have any metadata?
# Example
x <- "dataset"
typeof(x)
## [1] "character"
attributes(x)
## NULL
y <- 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
typeof(y)
## [1] "integer"
length(y)
## [1] 10
z <- as.numeric(y)
z
## [1] 1 2 3 4 5 6 7 8 9 10
typeof(z)
## [1] "double"
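As a small additional sketch of the difference between the high-level and low-level views, the following compares class() and typeof() on the same value, and length() with dim() on a two dimensional object. The variable names are chosen only for illustration.

```r
# class() gives the high-level kind, typeof() the low-level storage type
num <- 15.5
class(num)
## [1] "numeric"
typeof(num)
## [1] "double"

int <- 2L
class(int)
## [1] "integer"

# For a two dimensional object, length() counts every element,
# while dim() reports the number of rows and columns
m <- matrix(1:6, nrow = 2)
length(m)
## [1] 6
dim(m)
## [1] 2 3
```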
R has many data structures. These include
- atomic vector
- list
- matrix
- data frame
- factors
Atomic Vectors
A vector is the most common and basic data structure in R and is pretty much the workhorse of R. Technically, vectors can be one of two types:
- atomic vectors
- lists
although the term “vector” most commonly refers to the atomic types, not to lists.
The Different Vector Modes
A vector is a collection of elements that are most commonly of mode character,
logical, integer or numeric.
You can create an empty vector with vector(). (By default the mode is
logical. You can be more explicit as shown in the examples below.) It is more
common to use direct constructors such as character(), numeric(), etc.
vector() # an empty 'logical' (the default) vector
## logical(0)
vector("character", length = 5) # a vector of mode 'character' with 5 elements
## [1] "" "" "" "" ""
character(5) # the same thing, but using the constructor directly
## [1] "" "" "" "" ""
numeric(5) # a numeric vector with 5 elements
## [1] 0 0 0 0 0
logical(5) # a logical vector with 5 elements
## [1] FALSE FALSE FALSE FALSE FALSE
You can also create vectors by directly specifying their content. R will then guess the appropriate mode of storage for the vector. For instance:
x <- c(1, 2, 3)
will create a vector x of mode numeric. These are the most common kind, and
are treated as double precision real numbers. If you want to explicitly create
integers, you need to add an L to each element (or coerce to the integer
type using as.integer()).
x1 <- c(1L, 2L, 3L)
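The two routes to an integer vector mentioned above can be compared side by side. This is a minimal sketch; x2 is a name introduced here only for illustration.

```r
x1 <- c(1L, 2L, 3L)            # integer literals, using the L suffix
x2 <- as.integer(c(1, 2, 3))   # doubles coerced to integer

typeof(x1)
## [1] "integer"
typeof(x2)
## [1] "integer"
identical(x1, x2)
## [1] TRUE

# Without the L suffix or coercion, the values are stored as doubles
typeof(c(1, 2, 3))
## [1] "double"
```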
Using TRUE and FALSE will create a vector of mode logical:
y <- c(TRUE, TRUE, FALSE, FALSE)
While using quoted text will create a vector of mode character:
z <- c("Sarah", "Tracy", "Jon")
Examining Vectors
The functions typeof(), length(), class() and str() provide useful
information about your vectors and R objects in general.
typeof(z)
## [1] "character"
length(z)
## [1] 3
class(z)
## [1] "character"
str(z)
## chr [1:3] "Sarah" "Tracy" "Jon"
Challenge - Finding commonalities {.challenge}
Do you see a property that’s common to all these vectors above?
Adding Elements
The function c() (for combine) can also be used to add elements to a vector.
z <- c(z, "Annette")
z
## [1] "Sarah" "Tracy" "Jon" "Annette"
z <- c("Greg", z)
z
## [1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
Vectors from a Sequence of Numbers
You can create vectors as a sequence of numbers.
series <- 1:10
seq(10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(from = 1, to = 10, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
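A few further variations may be useful. The sketch below uses the length.out argument of seq() and the related function seq_len(), and checks how the resulting vectors are stored.

```r
# Ask for a given number of evenly spaced values instead of a step size
seq(from = 0, to = 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# seq_len(n) gives the integers 1, 2, ..., n
seq_len(5)
## [1] 1 2 3 4 5

# The colon operator produces integers; a fractional step produces doubles
typeof(1:10)
## [1] "integer"
typeof(seq(from = 1, to = 10, by = 0.1))
## [1] "double"
```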
Missing Data
R supports missing data in vectors. They are represented as NA (Not Available)
and can be used for all the vector types covered in this lesson:
x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)
The function is.na() indicates the elements of the vectors that represent
missing data, and the function anyNA() returns TRUE if the vector contains
any missing values:
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
## [1] FALSE TRUE FALSE FALSE TRUE
is.na(y)
## [1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
## [1] TRUE
anyNA(y)
## [1] FALSE
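Because is.na() returns a logical vector, it can be combined with other functions to count, locate, or drop missing values. A short sketch, reusing the same x as above:

```r
x <- c("a", NA, "c", "d", NA)

sum(is.na(x))     # how many values are missing
## [1] 2
which(is.na(x))   # where the missing values are
## [1] 2 5
x[!is.na(x)]      # keep only the values that are not missing
## [1] "a" "c" "d"
```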
Other Special Values
Inf is infinity. You can have either positive or negative infinity.
1/0
## [1] Inf
NaN means Not a Number. It’s an undefined value.
0/0
## [1] NaN
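The following sketch shows negative infinity and a few base R functions for detecting these special values. Note that is.na() also reports TRUE for NaN, whereas is.nan() does not report TRUE for NA.

```r
-1/0
## [1] -Inf

is.finite(c(1, Inf, -Inf, NaN, NA))
## [1]  TRUE FALSE FALSE FALSE FALSE

is.nan(c(1, NaN, NA))
## [1] FALSE  TRUE FALSE

is.na(c(1, NaN, NA))
## [1] FALSE  TRUE  TRUE
```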
What Happens When You Mix Types Inside a Vector?
R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This conversion between modes of storage is called “coercion”. When R converts the mode of storage based on its content, it is referred to as “implicit coercion”. For instance, can you guess what the following do (without running them first)?
xx <- c(1.7, "a")
xx <- c(TRUE, 2)
xx <- c("a", TRUE)
You can also control how vectors are coerced explicitly using the
as.<class_name>() functions:
as.numeric("1")
## [1] 1
as.character(1:2)
## [1] "1" "2"
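Explicit coercion does not always succeed. As a rough illustration, values that cannot be converted become NA (with a warning for numeric conversion), and converting a decimal number to an integer drops the fractional part:

```r
# "three" cannot be read as a number, so it becomes NA
# (R also prints a warning: NAs introduced by coercion)
as.numeric(c("1", "2.5", "three"))
## [1] 1.0 2.5  NA

# The fractional part is discarded, not rounded
as.integer(3.9)
## [1] 3

# Only a few spellings such as "TRUE" and "false" are recognised
as.logical(c("TRUE", "false", "yes"))
## [1]  TRUE FALSE    NA
```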
Object Attributes
Objects can have attributes. Attributes are part of the object. These include:
- names
- dimnames
- dim
- class
- attributes (contain metadata)
You can also glean other attribute-like information such as length (works on vectors and lists) or number of characters (for character strings).
length(1:10)
## [1] 10
nchar("Software Carpentry")
## [1] 18
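To see attributes in action, here is a minimal sketch using a named vector. The vector scores and the "units" attribute are made up for this example; names(), attr() and attributes() are base R functions.

```r
scores <- c(a = 1, b = 2, c = 3)   # a hypothetical named vector
names(scores)
## [1] "a" "b" "c"

# Attach an arbitrary, user-defined piece of metadata
attr(scores, "units") <- "points"
attr(scores, "units")
## [1] "points"

# attributes() returns all attributes as a list
names(attributes(scores))
## [1] "names" "units"
```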
Matrix
In R, matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions: the number of rows and columns.
m <- matrix(nrow = 2, ncol = 2)
m
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
dim(m)
## [1] 2 2
Matrices in R are filled column-wise.
m <- matrix(1:6, nrow = 2, ncol = 3)
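Printing the matrix makes the column-wise filling visible: the values 1 to 6 run down the first column, then the second, then the third.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m[2, 1]   # row 2, column 1 holds the second value of 1:6
## [1] 2
m[1, 2]   # row 1, column 2 holds the third value of 1:6
## [1] 3
```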
Other ways to construct a matrix
m <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
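A quick check of the result, as a sketch: after the dim attribute is set, the object is reported as a matrix and is again filled column-wise.

```r
m <- 1:10
is.matrix(m)
## [1] FALSE

dim(m) <- c(2, 5)
is.matrix(m)
## [1] TRUE
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```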
Another way is to bind columns or rows using cbind() and rbind().
x <- 1:3
y <- 10:12
cbind(x, y)
## x y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
## [,1] [,2] [,3]
## x 1 2 3
## y 10 11 12
You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:
mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE)
mdat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 11 12 13
List
In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.
A list is a special type of vector. Each element can be a different type.
Create lists using list() or coerce other objects using as.list(). An empty
list of the required length can be created using vector():
x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
x <- vector("list", length = 5) ## empty list
length(x)
## [1] 5
x[[1]]
## NULL
x <- 1:10
x <- as.list(x)
length(x)
## [1] 10
- What is the class of x[1]?
- What about x[[1]]?
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
## $a
## [1] "Karthik Ram"
##
## $b
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
- What is the length of this object? What about its structure?
Lists can be extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.
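As a minimal sketch of this idea, the function below returns several results of different types in one list. summarise_values() is a hypothetical helper written for this example, not part of base R.

```r
# Hypothetical helper: bundle several summaries of a numeric vector
summarise_values <- function(v) {
  list(
    n = length(v),            # a single integer
    mean = mean(v),           # a single double
    range = range(v),         # a vector of two values
    has_missing = anyNA(v)    # a single logical
  )
}

res <- summarise_values(c(2, 4, 6, 8))
res$n
## [1] 4
res$mean
## [1] 5
res$range
## [1] 2 8
res$has_missing
## [1] FALSE
```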
A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
Elements are indexed by double brackets. Single brackets will still return a(nother) list.
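The difference between the two kinds of brackets can be checked directly. In this sketch, small lists are built only for illustration; named elements can also be reached with $ or with a name inside double brackets.

```r
x <- list(1, "a", TRUE)

class(x[2])     # single brackets: still a list, of length 1
## [1] "list"
class(x[[2]])   # double brackets: the element itself
## [1] "character"

xlist <- list(a = "Karthik Ram", b = 1:10)
xlist[["a"]]
## [1] "Karthik Ram"
xlist$b[3]      # third value of the element named b
## [1] 3
```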
Data Frame
A data frame is a very important data type in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.
A data frame is a special type of list where every element of the list has the same length.
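The equal-length requirement can be seen with a small sketch. The data here are made up; lengths() reports the length of each list element, and tryCatch() is used only to capture the error message instead of stopping.

```r
dat <- data.frame(id = c("a", "b", "c"), x = 1:3)
lengths(dat)   # every column has the same length
## id  x 
##  3  3 

# Columns of length 3 and 2 cannot be combined, so data.frame() fails
result <- tryCatch(
  data.frame(id = c("a", "b", "c"), x = 1:2),
  error = function(e) conditionMessage(e)
)
is.character(result)   # TRUE: we captured an error message, not a data frame
## [1] TRUE
```

Note that data.frame() will recycle a shorter column when its length divides the longer one exactly, so not every length mismatch is an error.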
Data frames can have additional attributes such as rownames(), which can be
useful for annotating data, like subject_id or sample_id. But most of the
time they are not used.
Some additional information on data frames:
- Usually created by read.csv() and read.table().
- Can convert to matrix with data.matrix() (preferred) or as.matrix().
- Coercion will be forced and not always what you expect.
- Can also create with data.frame() function.
- Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
- Rownames are usually 1, 2, …, n.
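The point about forced coercion is easiest to see on a small example. This sketch uses a made-up data frame with one character column and two numeric columns, and compares the two conversion functions.

```r
dat <- data.frame(id = c("a", "b", "c"), x = 1:3, y = c(1.5, 2.5, 3.5))
nrow(dat)
## [1] 3
ncol(dat)
## [1] 3

# as.matrix(): one character column forces every value to character
m1 <- as.matrix(dat)
typeof(m1)
## [1] "character"

# data.matrix(): the result is numeric; character columns are
# converted to factors and then to their integer codes
m2 <- data.matrix(dat)
typeof(m2)
## [1] "double"
m2[, "id"]
## [1] 1 2 3
```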
Creating Data Frames by Hand
To create data frames by hand:
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
## id x y
## 1 a 1 11
## 2 b 2 12
## 3 c 3 13
## 4 d 4 14
## 5 e 5 15
## 6 f 6 16
## 7 g 7 17
## 8 h 8 18
## 9 i 9 19
## 10 j 10 20
Useful data frame functions {.callout}
- head() - shows first 6 rows
- tail() - shows last 6 rows
- dim() - returns the dimensions
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of each column
- names() - shows the names attribute for a data frame, which gives the column names.
See that it is actually a special list:
is.list(iris)
## [1] TRUE
class(iris)
## [1] "data.frame"
| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1-D | atomic vector | list |
| 2-D | matrix | data frame |
Column Types in Data Frames {.challenge}
Knowing that data frames are lists, can columns be of different types?
What type of structure do you expect on the iris data frame? Hint: Use str()