
Simulating Similar Datasets with the simstudy Package (Unfinished)

Basics

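Simulating with the simstudy package takes two basic steps. First, the user defines the data elements of the dataset, either externally in a csv file or internally through a series of repeated definition statements. Second, the user generates the data from those definitions.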
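The external csv route can be sketched as follows. This is a minimal sketch, assuming simstudy's defRead() accepts a csv with the columns varname, formula, variance, dist, and link; the temporary file and its two rows are made up for illustration.

```r
library(simstudy)

# Hypothetical definition file, written to a temp location for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "varname,formula,variance,dist,link",
  "nr,7,0,nonrandom,identity",
  "x1,10;20,0,uniform,identity"
), csv_path)

# Step 1: read the definitions (column layout is an assumption, see above)
def_file <- defRead(csv_path, id = "idnum")

# Step 2: generate 10 records from those definitions
dt_file <- genData(10, def_file)
dt_file
```

Keeping definitions in a file can be convenient when the same design is reused across scripts; for a one-off simulation the in-code definition statements are usually enough.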

Data generation can be as simple as a cross-sectional design or a prospective cohort design, or it can be somewhat more complex.

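Simulations can include observational or randomized treatment assignment/exposure, survival data, longitudinal/panel data, multi-level/hierarchical data, datasets with correlated variables based on a specified covariance structure, and datasets with missing data arising from any type of missingness mechanism.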

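  • Longitudinal data: repeated observations of the same group of subjects at different points in time.
  • Panel data: sample data formed by taking several cross-sections along a time series and observing the sampled units at each of those cross-sections.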
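As one illustration of the correlated-variable case listed above, the sketch below draws three correlated normal variables. It assumes simstudy's genCorData() with a compound-symmetry structure (corstr = "cs"); the means, standard deviations, correlation, and seed are arbitrary values chosen for the example.

```r
library(simstudy)

set.seed(2024)

# Three correlated normal variables; all numbers are illustrative
mu <- c(4, 12, 3)     # means
sigma <- c(1, 2, 3)   # standard deviations
dcor <- genCorData(1000, mu = mu, sigma = sigma, rho = 0.7, corstr = "cs")

# Drop the id column and inspect the empirical correlation matrix;
# off-diagonal entries should sit near the requested 0.7
round(cor(as.data.frame(dcor)[, -1]), 2)
```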

The key to simulating data with simstudy is creating a series of data definition tables, as shown below:

data definition tables

The code used to generate the definitions above:

library(simstudy)
library(data.table)

def <- defData(varname = "nr", dist = "nonrandom", formula = 7, id = "idnum")
def <- defData(def, varname = "x1", dist = "uniform", formula = "10;20")
def <- defData(def, varname = "y1", formula = "nr + x1 * 2", variance = 8)
def <- defData(def, varname = "y2", dist = "poisson", formula = "nr - 0.2 * x1", link = "log")
def <- defData(def, varname = "xCat", formula = "0.3;0.2;0.5", dist = "categorical")
def <- defData(def, varname = "g1", dist = "gamma", formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary", formula = "-3 + xCat", link = "logit")

To create a simple dataset from these definitions, all one needs to do is run a single genData command.

In this example, we generate 500 records based on the definitions in the def table:

dt <- genData(500, def)

     idnum nr       x1       y1  y2 xCat          g1 a1
  1:     1  7 11.14037 34.04765 138    3   637.50317  1
  2:     2  7 11.92208 30.44868 100    1    79.01242  0
  3:     3  7 16.18471 39.71205  37    3    73.57740  0
  4:     4  7 15.99059 38.68228  37    1   115.78951  0
  5:     5  7 10.61884 30.63140 137    3 11407.90428  1
 ---
496:   496  7 16.95754 38.20793  38    3   954.34293  0
497:   497  7 17.38688 46.24585  41    1    23.14826  0
498:   498  7 13.10723 38.21072  69    3  3799.37787  0
499:   499  7 13.40657 33.17099  78    1  1512.06894  0
500:   500  7 11.72582 36.90359 106    3  3903.30936  1
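
To see how the link argument in the definitions shapes the generated values, one can compute the implied means by hand. The sketch below uses base R only and assumes the formula gives the mean on the link scale (log for y2, logit for a1); the value x1 = 15 is an arbitrary illustration.

```r
nr <- 7
x1 <- 15                       # arbitrary value inside the uniform range [10, 20]

# y2: Poisson with a log link, so the mean is exp(formula)
mean_y2 <- exp(nr - 0.2 * x1)  # exp(4), about 54.6

# a1: binary with a logit link, so P(a1 = 1) is the inverse logit of the formula
p_a1 <- plogis(-3 + 1:3)       # about 0.12, 0.27, 0.50 for xCat = 1, 2, 3

mean_y2
p_a1
```

This is consistent with the printed records: rows with larger x1 show smaller y2 counts, and every record with a1 equal to 1 has xCat equal to 3.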

Simulating a prospective cohort study with repeated measures

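The question is whether we can simulate a two-arm study (a control group and a treatment group) with repeated measurements at three time points: baseline, 1 month later, and 2 months later. The answer is yes.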

library(ggplot2)

# Define the outcome

ydef <- defDataAdd(varname = "Y", dist = "normal",
                   formula = "5 + 2.5*period + 1.5*T + 3.5*period*T",
                   variance = 3)

# Generate a 'blank' data.table with 24 observations and assign them to groups

set.seed(1234)

indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")

# Create a longitudinal data set of 3 records for each id

longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)

longData[, `:=`(T, factor(T, labels = c("No", "Yes")))]

# Let's look at the data

ggplot(data = longData, aes(x = factor(period), y = Y)) +
  geom_line(aes(color = T, group = id)) +
  scale_color_manual(values = c("#e38e17", "#8e17e3")) +
  xlab("Time")
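
As a rough check on what the plot should show, the expected value of Y in each arm and period can be tabulated directly from the formula in ydef. This sketch uses base R only and assumes that addPeriods numbers the periods 0, 1, 2 and that trtAssign codes the two arms as 0 and 1; the column name trt is a stand-in for T, chosen here to avoid masking R's shorthand for TRUE.

```r
# Expected Y for each combination of period and arm, from the ydef formula
grid <- expand.grid(period = 0:2, trt = 0:1)
grid$expected_Y <- with(grid, 5 + 2.5 * period + 1.5 * trt + 3.5 * period * trt)
grid
```

Under these assumptions the control arm rises by 2.5 units per period (5, 7.5, 10) and the treated arm by 6 units per period (6.5, 12.5, 18.5), with individual observations scattered around those means with a standard deviation of about 1.7 (the square root of the variance of 3).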