00. R <- Data Types

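R has many object types for storing data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in the type of data they can hold, how they are created, their structural complexity, and the notation used to identify and access individual elements. The figure below sketches these data structures.

[Figure: diagram of R data structures]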

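In R, an object is anything that can be assigned to a variable, including constants, data structures, functions, and even graphs. Every object has a mode, which describes how the object is stored, and a class, which tells generic functions such as print how to handle it.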
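As a minimal sketch of the difference between mode and class, the snippet below inspects a few objects with the base functions mode() and class(). The object names x, m, df, and f are illustrative only and do not come from the text.

```r
# Illustrative objects (hypothetical names, chosen for this sketch)
x  <- c(1.5, 2, 3)             # numeric vector
m  <- matrix(1:4, nrow = 2)    # integer matrix
df <- data.frame(id = 1:2)     # data frame
f  <- function(n) n + 1        # function

mode(x)    # "numeric"
class(x)   # "numeric"
mode(m)    # "numeric"
class(m)   # "matrix" "array" in R 4.0.0 and later; "matrix" before that
mode(df)   # "list": a data frame is stored as a list of columns
class(df)  # "data.frame"
mode(f)    # "function"
class(f)   # "function"
```

The data frame is the interesting case: its mode is "list", while its class is what makes print and summary treat it as a table.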

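A data frame is a structure in R that holds data, similar to the datasets found in standard statistical packages such as SAS, SPSS, and Stata: columns are variables and rows are observations. Variables of different types (for example, numeric and character) can be stored in the same data frame. Data frames are the main structure you will use to store datasets.

Factors are nominal or ordinal variables.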

1. Vectors

Vectors are one-dimensional arrays that hold numeric, character, or logical data. The combine function c() is used to create a vector.

Creation

Example:

> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
> a
[1]  1  2  5  3  6 -2  4
> b
[1] "one"   "two"   "three"
> c
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
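Because a vector holds a single mode, c() coerces mixed inputs to the most general mode present. The sketch below illustrates this, and also shows that a scalar is a vector of length one. The names mixed, nums, and scalar are made up for the example.

```r
# Mixing numbers, strings, and logicals: everything becomes character
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
nums <- c(1, TRUE, FALSE)
nums           # 1 1 0
class(nums)    # "numeric"

# A scalar is a one-element vector
scalar <- 5
length(scalar)     # 1
is.vector(scalar)  # TRUE
```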

Indexing

Elements of a vector are accessed by giving their positions within square brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a.

> a <- c("k", "j", "h", "a", "c", "m")
> a[3]
[1] "h"
> a[c(1, 3, 5)]
[1] "k" "h" "c"
> a[2:6]
[1] "j" "h" "a" "c" "m"
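Positions are not the only kind of subscript. As a hedged sketch of other indexing forms that base R supports, the snippet below uses negative positions to exclude elements, a logical condition to filter, and names to look up elements. The vectors v and scores are hypothetical.

```r
v <- c(1, 2, 5, 3, 6, -2, 4)

v[-1]         # everything except the first element
v[-c(1, 2)]   # everything except the first two elements
v[v > 3]      # elements greater than 3: 5 6 4

# Named vectors can be indexed by name
scores <- c(math = 90, art = 75)
scores["art"]
```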

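2. Matrices

Creation

A matrix is a two-dimensional array in which every element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function. The general format is:

Object <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                 byrow=logical_value,
                 dimnames=list(char_vector_rownames, char_vector_colnames))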

Usage:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)

as.matrix(x, ...)
## S3 method for class 'data.frame'
as.matrix(x, rownames.force = NA, ...)

is.matrix(x)

Arguments:

data: the elements of the matrix. An optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.
nrow: the desired number of rows.
ncol: the desired number of columns.
byrow: logical. If FALSE (the default) the matrix is filled by columns (byrow=FALSE), otherwise the matrix is filled by rows (byrow=TRUE).
dimnames: optional row and column names, given as character vectors. A dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions. Typical use: dimnames=list(rnames, cnames).
x: an R object.
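The usage above also lists as.matrix() and is.matrix(). As a small sketch of how they behave, the snippet below converts two made-up data frames (df and df_mixed are hypothetical) and checks the result. Because a matrix has one mode, a data frame with a character column becomes a character matrix.

```r
# All-numeric data frame: converts to a numeric matrix
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
m  <- as.matrix(df)

is.matrix(df)  # FALSE
is.matrix(m)   # TRUE
dim(m)         # 3 2
mode(m)        # "numeric"

# Mixed data frame: every element is coerced to character
df_mixed <- data.frame(id = 1:2, name = c("a", "b"))
m_mixed  <- as.matrix(df_mixed)
mode(m_mixed)  # "character"
```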

Example 1, creating a 5×4 matrix:

> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Example 2, creating a 2×2 matrix filled by row and then by column:

> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")

# Fill by row
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                     dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 26
R2 24 68

# Fill by column
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE,
                     dimnames=list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68

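Indexing

Rows, columns, and elements of a matrix are selected with subscripts in square brackets. X[i,] refers to the ith row of matrix X, X[,j] refers to the jth column, and X[i, j] refers to the element in row i and column j. To select multiple rows or columns, the subscripts i and j can be numeric vectors.

Example:

> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

> x[,2]
[1] 3 4
> x[2,]
[1]  2  4  6  8 10
> x[1, c(2,4)]
[1] 3 7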

Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, use an array.
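Two details of matrix indexing are easy to miss: selecting a single row or column returns a plain vector unless drop = FALSE is given, and a matrix with dimnames can be indexed by name. The sketch below assumes a small matrix with made-up row and column names.

```r
# Hypothetical 2x5 matrix with row names r1, r2 and column names c1..c5
x <- matrix(1:10, nrow = 2,
            dimnames = list(c("r1", "r2"), paste0("c", 1:5)))

x["r2", "c4"]                  # index by name: 8

col   <- x[, 2]                # a vector; the matrix structure is dropped
col_m <- x[, 2, drop = FALSE]  # still a 2x1 matrix

dim(col)    # NULL
dim(col_m)  # 2 1
```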

3. Arrays

Creation

Arrays are similar to matrices but can have more than two dimensions. They are created with the array() function, in the following form:

Object <- array(vector, dimensions, dimnames)

Usage:

array(data = NA, dim = length(data), dimnames = NULL)
as.array(x, ...)
is.array(x)

Arguments:

data: the data in the array. A vector (including a list or expression vector) giving data to fill the array. Non-atomic classed objects are coerced by as.vector.
dim: a numeric vector giving the maximal index for each dimension. The dim attribute for the array to be created, that is an integer vector of length one or more giving the maximal indices in each dimension.
dimnames: an optional list of name labels for the dimensions. Either NULL or the names for the dimensions. This must be a list (or it will be ignored) with one component for each dimension, either NULL or a character vector of the length given by dim for that dimension. The list can be named, and the list names will be used as names for the dimensions. If the list is shorter than the number of dimensions, it is extended by NULLs to the length required.
x: an R object.

Example 1, creating an array:

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24

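Indexing

Array elements are selected the same way as matrix elements, with one subscript per dimension. In the array z above, for example, z[1,2,3] is 15.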
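The text gives no array indexing transcript, so here is a minimal sketch that rebuilds the 2×3×4 array z from the creation example and selects single elements and whole slices from it.

```r
# Same array as in the creation example
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

z[1, 2, 3]            # one element, by position: 15
z["A2", "B3", "C4"]   # one element, by name: 24

slice1 <- z[, , 1]    # first slice along the third dimension (2x3 matrix)
row2   <- z[2, , ]    # everything for A2 (3x4 matrix)

dim(slice1)  # 2 3
dim(row2)    # 3 4
```

Leaving a subscript empty keeps that whole dimension, just as X[i,] keeps every column of a matrix.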

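4. Data Frames

Creation

A data frame is more general than a matrix because different columns can contain different modes of data (numeric, character, and so on).

Data frames are created with the data.frame() function:

Object <- data.frame(col1, col2, col3,...)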

Usage:

data.frame(..., row.names = NULL,
           check.rows = FALSE,
           check.names = TRUE,
           fix.empty.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())

default.stringsAsFactors() # << this is deprecated !

Arguments:

...: column vectors of any type (such as character, numeric, or logical). These arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.
row.names: NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
check.rows: if TRUE then the rows are checked for consistency of length and names.
check.names: logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
fix.empty.names: logical indicating if arguments which are “unnamed” (in the sense of not being formally called as someName = arg) get an automatically constructed name or rather name “”. Needs to be set to FALSE even when check.names is false if “” names should be kept.
stringsAsFactors: logical: should character vectors be converted to factors? The ‘factory-fresh’ default has been TRUE previously but has been changed to FALSE for R 4.0.0. Only as short time workaround, you can revert by setting options(stringsAsFactors = TRUE) which now warns about its deprecation.
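Since the default for stringsAsFactors changed in R 4.0.0, one defensive habit is to pass it explicitly rather than rely on the installed version. The sketch below uses hypothetical data frames as_chr and as_fct to show the two outcomes.

```r
# Character column kept as character
as_chr <- data.frame(diabetes = c("Type1", "Type2"),
                     stringsAsFactors = FALSE)

# Character column converted to a factor
as_fct <- data.frame(diabetes = c("Type1", "Type2"),
                     stringsAsFactors = TRUE)

class(as_chr$diabetes)  # "character"
class(as_fct$diabetes)  # "factor"
```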

Column names can be set with the names() function.

Example, creating a data frame and renaming its first column:

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")

# Create the data frame
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

# Rename the first column
> names(patientdata)[1] <- "Patients ID Number"
> patientdata
  Patients ID Number age diabetes    status
1                  1  25    Type1      Poor
2                  2  34    Type2  Improved
3                  3  28    Type1 Excellent
4                  4  52    Type1      Poor
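A name such as "Patients ID Number" contains spaces, so it is not a syntactically valid R name. As a sketch of the consequence, such a column has to be quoted with backticks after $ or accessed with [[ ]]. The base function make.names() shows the valid form that check.names = TRUE would have produced. The two-column data frame here is a shortened, illustrative version of patientdata.

```r
patientdata <- data.frame(patientID = 1:4, age = c(25, 34, 28, 52))
names(patientdata)[1] <- "Patients ID Number"

# Two ways to reach a column whose name contains spaces
patientdata$`Patients ID Number`
patientdata[["Patients ID Number"]]

# The syntactically valid equivalent of the name
make.names("Patients ID Number")  # "Patients.ID.Number"
```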

Indexing

There are several ways to select the elements of a data frame. You can use the subscript notation described earlier (as for matrices), or you can specify column names directly.

# Using subscript notation
> patientdata[1:2]
  patientID age
1         1  25
2         2  34
3         3  28
4         4  52

# Using column names
> patientdata[c("diabetes", "status")]
  diabetes    status
1    Type1      Poor
2    Type2  Improved
3    Type1 Excellent
4    Type1      Poor

> patientdata$age
[1] 25 34 28 52
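The transcript above selects columns only. As a complementary sketch, the matrix-style [row, column] form also selects rows, and a logical condition on a column filters observations. The data frame is rebuilt here so the snippet runs on its own, and the name older is made up.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

patientdata[2, ]         # second observation, all columns
patientdata[2, "age"]    # a single cell: 34

# Patients older than 30, keeping two columns
older <- patientdata[patientdata$age > 30, c("patientID", "age")]
older
```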

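5. Factors

Creation

Nominal variables are categorical variables without an inherent order. The diabetes type variable Diabetes (Type1, Type2) is an example of a nominal variable. Even if Type1 is coded as 1 and Type2 as 2 in the data, no order is implied.

Ordinal variables imply order but not amount. The variable Status (poor, improved, excellent) is a good example of an ordinal variable.

Categorical (nominal) and ordered categorical (ordinal) variables are called factors in R. Factors matter a great deal in R because they determine how data is analyzed and presented visually.

The factor() function stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.

For example, assume you have the vector:

diabetes <- c("Type1", "Type2", "Type1", "Type1")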

The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it internally with 1 = Type1 and 2 = Type2 (the assignment is alphabetical):

> diabetes <- factor(diabetes)
> diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
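The integer codes and the level labels can be inspected directly. The sketch below uses the base functions as.integer() and levels() on the same diabetes factor to make the internal mapping visible.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

codes  <- as.integer(diabetes)  # the stored integer codes: 1 2 1 1
labels <- levels(diabetes)      # the level labels: "Type1" "Type2"

# Mapping the codes back through the labels recovers the original values
labels[codes]                   # "Type1" "Type2" "Type1" "Type1"
```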

To represent an ordinal variable, pass ordered=TRUE to factor(). Given the vector:

status <- c("Poor", "Improved", "Excellent", "Poor")

The statement status <- factor(status, ordered=TRUE) encodes the vector as (3, 2, 1, 3) and associates these values internally with 1 = Excellent, 2 = Improved, and 3 = Poor:

> status <- factor(status, ordered=TRUE)
> status
[1] Poor      Improved  Excellent Poor
Levels: Excellent < Improved < Poor

The default ordering can be overridden with the levels option; the levels are then assigned as 1 = Poor, 2 = Improved, and 3 = Excellent:

> status <- factor(status, ordered=TRUE,
                   levels=c("Poor", "Improved", "Excellent"))
> status
[1] Poor      Improved  Excellent Poor
Levels: Poor < Improved < Excellent
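One practical effect of an ordered factor is that comparison operators and max() respect the level order rather than alphabetical order. The sketch below assumes the same status values with the explicit Poor < Improved < Excellent ordering; the name better_than_poor is made up.

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)

as.integer(status)                    # 1 2 3 1

# Comparisons follow the declared level order
better_than_poor <- status > "Poor"   # FALSE TRUE TRUE FALSE
max(status)                           # Excellent
```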

How factors affect data analysis

Factors are used throughout data analysis, most often to compute frequency counts.

Without factors, summary() reports no frequency counts for the categorical columns:

# Enter the data as vectors
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)

# Display the structure of the object
> str(patientdata)
'data.frame':   4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : chr  "Type1" "Type2" "Type1" "Type1"
 $ status   : chr  "Poor" "Improved" "Excellent" "Poor"

# Display a statistical summary of the object
> summary(patientdata)
   patientID         age          diabetes            status
 Min.   :1.00   Min.   :25.00   Length:4           Length:4
 1st Qu.:1.75   1st Qu.:27.25   Class :character   Class :character
 Median :2.50   Median :31.00   Mode  :character   Mode  :character
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00

With factors, summary() reports the count for each level:

# Enter the data as vectors
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)

# Display the structure of the object
> str(patientdata)
'data.frame':   4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

# Display a statistical summary of the object
> summary(patientdata)
   patientID         age         diabetes       status
 Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1
 1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1
 Median :2.50   Median :31.00             Poor     :2
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00
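Frequency counts can also be requested directly with the base function table(), which is not used in the text above. The sketch below builds a one-way and a two-way table from the same two factors. The names status_counts and cross are illustrative.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))
status   <- factor(c("Poor", "Improved", "Excellent", "Poor"), ordered = TRUE)

# One-way frequency table: how many patients per status level
status_counts <- table(status)
status_counts

# Two-way table: diabetes type by status
cross <- table(diabetes, status)
cross
```

The levels appear in the table in factor order, so an ordered factor with explicit levels also controls how the counts are laid out.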

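6. Lists

Creation

Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects (components). A list lets you gather a variety of (possibly unrelated) objects under one name.

A list may contain a combination of vectors, matrices, data frames, and even other lists. Lists are created with the list() function:

# Default (unnamed) components
mylist <- list(object1, object2, ...)

# Named components
mylist <- list(name1=object1, name2=object2, ...)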

Usage:

list(...)
pairlist(...)

as.list(x, ...)
## S3 method for class 'environment'
as.list(x, all.names = FALSE, sorted = FALSE, ...)
as.pairlist(x)

is.list(x)
is.pairlist(x)

alist(...)

Arguments:

...: objects, possibly named.
x: object to be coerced or tested.
all.names: a logical indicating whether to copy all values or (default) only those whose names do not begin with a dot.
sorted: a logical indicating whether the names of the resulting list should be sorted (increasingly). Note that this is somewhat costly, but may be useful for comparison of environments.

Example 1, creating a list:

> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"
Indexing

Elements of a list are accessed by giving a component number or name within double square brackets. Named components can also be reached with $, as in mylist$title.

Example:

# Build the list
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"

> mylist[[4]]
[1] "one"   "two"   "three"
> mylist$title
[1] "My First List"
> mylist[["ages"]]
[1] 25 26 18 39
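Single and double brackets behave differently on lists, which is a common source of confusion. As a minimal sketch on a shortened version of mylist: single brackets return a sub-list, while double brackets return the component itself, which can then be indexed further.

```r
mylist <- list(title = "My First List", ages = c(25, 26, 18, 39))

sub  <- mylist["ages"]     # a list of length 1 that contains ages
comp <- mylist[["ages"]]   # the numeric vector itself

class(sub)    # "list"
class(comp)   # "numeric"

# Index into the extracted component
mylist[["ages"]][2]   # 26
mylist$ages[2]        # 26
```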

https://zhenyumi.github.io/posts/8087115f/
Author: 向海
Published: July 11, 2020
Updated: August 10, 2020