## Prepare Data for Linear Mixed-Effects Models

### Tables and Dataset Arrays

To fit a linear mixed-effects model, you must store your data in a table or dataset array with one column for each variable, including the response variable. More specifically, the table or dataset array, say `tbl`, must contain the following:

• A response variable `y`

• Predictor variables `Xj`, which can be continuous or grouping variables

• Grouping variables `g1`, `g2`, ..., `gR`,

where r = 1, 2, ..., R, and the grouping variables in `Xj` and `gr` can be categorical arrays, logical arrays, character arrays, string arrays, or cell arrays of character vectors.

You must organize your data so that each row represents one observation. Each row must contain the values of the variables and the levels of the grouping variables for that observation. For example, if you have data from an experiment with four treatment options applied to five different types of individuals chosen randomly from a population of individuals (blocks), the table or dataset array must look like this.

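| Block | Treatment | Response |
|-------|-----------|----------|
| 1 | 1 | y11 |
| 1 | 2 | y12 |
| 1 | 3 | y13 |
| 1 | 4 | y14 |
| ... | ... | ... |
| 5 | 1 | y51 |
| 5 | 2 | y52 |
| 5 | 3 | y53 |
| 5 | 4 | y54 |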
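As a minimal sketch of this layout, the following code builds a table with one row per observation. The variable names match the column headers above. The response values are random placeholders (an assumption made only for illustration), not data from any experiment.

```matlab
% Layout sketch: 5 blocks x 4 treatments, one row per observation.
nBlocks = 5;
nTreatments = 4;

% Each block index repeats once per treatment: 1,1,1,1,2,2,2,2,...
Block = repelem((1:nBlocks)', nTreatments);
% The treatment index cycles within each block: 1,2,3,4,1,2,3,4,...
Treatment = repmat((1:nTreatments)', nBlocks, 1);

% Placeholder responses (assumption: random values stand in for y11..y54).
rng(0);
Response = randn(nBlocks*nTreatments, 1);

% Store the grouping variables as categorical so that they are treated
% as levels rather than as continuous numbers.
tbl = table(categorical(Block), categorical(Treatment), Response, ...
    'VariableNames', {'Block','Treatment','Response'});
```

Storing `Block` and `Treatment` as categorical arrays is one of the accepted grouping-variable types listed earlier in this section.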

Now, consider a split-plot experiment that studies the effect of four different types of fertilizers on the yield of tomato plants. The soil where the tomato plants are planted is divided into three blocks based on the soil type: sandy, silty, and loamy. Each block is divided into five plots, and five types of tomato plants (cherry, heirloom, grape, vine, and plum) are randomly assigned to these plots. Then, the tomato plants in the plots are divided into subplots, where each subplot is treated with one of the four fertilizers. The data from this experiment looks like this:

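| Soil | Tomato | Fertilizer | Yield |
|------|--------|------------|-------|
| 'Sandy' | 'Plum' | 1 | 104 |
| 'Sandy' | 'Plum' | 2 | 136 |
| 'Sandy' | 'Plum' | 3 | 158 |
| 'Sandy' | 'Plum' | 4 | 174 |
| 'Sandy' | 'Cherry' | 1 | 57 |
| 'Sandy' | 'Cherry' | 2 | 86 |
| ... | ... | ... | ... |
| 'Sandy' | 'Vine' | 3 | 99 |
| 'Sandy' | 'Vine' | 4 | 117 |
| 'Silty' | 'Plum' | 1 | 120 |
| 'Silty' | 'Plum' | 2 | 115 |
| ... | ... | ... | ... |
| 'Loamy' | 'Vine' | 3 | 111 |
| 'Loamy' | 'Vine' | 4 | 105 |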

You must specify the model you want to fit using the `formula` input argument to `fitlme`.

In general, a formula for model specification is a character vector or string scalar of the form `'y ~ terms'`. For linear mixed-effects models, this formula is in the form `'y ~ fixed + (random1|grouping1) + ... + (randomR|groupingR)'`, where `fixed` contains the fixed-effects terms and `random1, ..., randomR` contain the random-effects terms. For example, for the previous fertilizer experiment, consider the following mixed-effects model

$$y_{imjk} = \beta_{0} + \sum_{m=2}^{4} \beta_{1m} I[F]_{im} + \sum_{j=2}^{5} \beta_{2j} I[T]_{ij} + b_{0k} S_{k} + b_{0jk} (S*T)_{jk} + \epsilon_{imjk},$$

where $i = 1, 2, \ldots, 60$ indexes the observations, the index $m$ corresponds to the fertilizer types, $j$ corresponds to the tomato types, and $k = 1, 2, 3$ corresponds to the blocks (soil types). $S_k$ represents the $k$th soil type, and $(S*T)_{jk}$ represents the $j$th tomato type nested in the $k$th soil type. $I[F]_{im}$ is the dummy variable representing level $m$ of the fertilizer. Similarly, $I[T]_{ij}$ is the dummy variable representing level $j$ of the tomato type.

You can fit this model using the formula `'Yield ~ 1 + Fertilizer + Tomato + (1|Soil) + (1|Soil:Tomato)'`.
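The following sketch shows how this formula might be passed to `fitlme`. It assumes Statistics and Machine Learning Toolbox is available. The table is rebuilt with the same 60-row layout as the fertilizer experiment, but the `Yield` values are random placeholders (an assumption for illustration), so the estimates carry no meaning and the fit may report near-zero variance components.

```matlab
% Factor levels taken from the experiment description.
soils    = {'Sandy'; 'Silty'; 'Loamy'};
tomatoes = {'Plum'; 'Cherry'; 'Heirloom'; 'Grape'; 'Vine'};

% Full crossing: 4 fertilizers x 5 tomato types x 3 soils = 60 rows.
[f, t, s] = ndgrid(1:4, 1:5, 1:3);

rng(0);  % reproducible placeholder data
tbl = table(categorical(soils(s(:))), categorical(tomatoes(t(:))), ...
    categorical(f(:)), 100 + 20*randn(60, 1), ...
    'VariableNames', {'Soil','Tomato','Fertilizer','Yield'});

% Fixed effects: intercept, Fertilizer, and Tomato.
% Random intercepts: Soil, and Tomato nested within Soil.
lme = fitlme(tbl, ...
    'Yield ~ 1 + Fertilizer + Tomato + (1|Soil) + (1|Soil:Tomato)');
disp(lme)
```

`Fertilizer` is stored as a categorical variable here so that it enters the model through dummy variables, as in the model equation, rather than as a single continuous slope.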

For detailed information on how to specify your model using a formula, see Relationship Between Formula and Design Matrices.

### Design Matrices

If you cannot easily describe your model using a formula, you can create design matrices to define the fixed and random effects, and fit the model using `fitlmematrix(X,y,Z,G)`. You must create your design matrices as follows.

Fixed-effects and random-effects design matrices `X` and `Z`:

• Enter a column of 1s for the intercept using `ones(n,1)`, where n is the total number of observations.

• If `X1` is a continuous variable, then enter `X1` as it is in a separate column.

• If `X1` is a categorical variable with m levels, then there must be m – 1 dummy variables for m – 1 levels of `X1` in `X`.

For example, consider an experiment where you want to study the impact of the quality of raw materials from four different providers on the productivity of a production line. If you fit a linear mixed-effects model with the intercept and provider as the fixed-effects terms and the intercept as the random-effects term, and you use reference contrast coding, then you must construct your fixed- and random-effects design matrices as follows.

```
D = dummyvar(provider);   % Create dummy variables
X = [ones(n,1) D(:,2) D(:,3) D(:,4)];
Z = [ones(n,1)];
```

Because reference contrast coding uses the first provider as the reference, and the model has an intercept, you must use the dummy variables for only the last three providers.
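To see why the first dummy variable is dropped, the sketch below compares the rank of the design matrix with and without it. The `provider` vector is a small hypothetical example, and `Xfull` is a name introduced here only for the comparison.

```matlab
% Hypothetical provider codes for n = 8 observations.
provider = [1; 2; 3; 4; 1; 2; 3; 4];
n = numel(provider);

D = dummyvar(provider);   % n-by-4, one indicator column per provider

% Intercept plus all four indicators: the indicator columns sum to the
% intercept column, so this matrix is rank deficient.
Xfull = [ones(n,1) D];

% Dropping the first provider's column (the reference level) removes
% the redundancy.
X = [ones(n,1) D(:,2:4)];

fprintf('rank(Xfull) = %d with %d columns\n', rank(Xfull), size(Xfull,2));
fprintf('rank(X)     = %d with %d columns\n', rank(X), size(X,2));
```

With all four dummy variables, the matrix has five columns but rank 4. After dropping the reference column, the four remaining columns are linearly independent.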

• If there is an interaction term of predictor variables `X1` and `X2`, then you must enter a column formed by the elementwise product of the vectors `X1` and `X2`.

For example, suppose that in a longitudinal study you want to fit a model with an intercept, a continuous treatment factor, a continuous time factor, and their interaction as the fixed effects, and with time as the random-effects term. Then your fixed- and random-effects design matrices should look like this.

```
X = [ones(n,1),treatment,time,treatment.*time];
y = response;
Z = [time];
```

Grouping variables `G`:

`G` has one column for each grouping variable and, in the case of nesting, a column that is the elementwise product of the nested grouping variables.

For example, if you want to group plots (`plot`) within blocks (`block`), then you must add a column that is the elementwise product of `plot` and `block`. More specifically, suppose that in a split-block experiment you want to fit a model with an intercept and a continuous treatment factor as the fixed effects, and with the intercept and treatment grouped by the plots nested within blocks. Then the design matrices should look like this.

```
X = [ones(n,1),treatment];
y = response;
Z = [ones(n,1),treatment];
G = [block.*plot];
```
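The elementwise product separates nested groups only if no two distinct (block, plot) pairs share the same product. The sketch below is an illustration rather than part of the procedure above. It shows a collision with hypothetical numeric codes, and one possible alternative, `findgroups`, which labels each distinct pair with its own integer. The name `plotNum` is used in place of `plot` only to avoid shadowing the built-in `plot` function.

```matlab
% Hypothetical numeric codes for six observations.
block   = [2; 2; 3; 3; 1; 1];
plotNum = [3; 3; 2; 2; 6; 6];

% Elementwise product: the pairs (2,3), (3,2), and (1,6) all give 6.
Gprod = block .* plotNum;

% findgroups assigns one integer label per distinct (block, plot) pair.
Gnested = findgroups(block, plotNum);

fprintf('distinct groups from product:    %d\n', numel(unique(Gprod)));
fprintf('distinct groups from findgroups: %d\n', numel(unique(Gnested)));
```

Here the product collapses three distinct nested groups into one, while `findgroups` keeps them apart. If the plot codes are already unique within the whole experiment, or the coding otherwise guarantees distinct products, the two approaches identify the same groups.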

Suppose that, in the earlier example on the quality of raw materials, the raw materials arrive in bulks, and the bulks are nested within providers. If you want to fit a linear mixed-effects model where the intercept is grouped by the bulks within providers, then your design matrices should look like this.

```
D = dummyvar(provider);
X = [ones(n,1) D(:,2) D(:,3) D(:,4)];
y = response;
Z = ones(n,1);
G = [provider.*bulks];
```

In the earlier longitudinal study example, if you want to add random effects for the intercept and time, grouped by the subjects that participated in the study, then your design matrices should look like this.

```
X = [ones(n,1),treatment,time,treatment.*time];
y = response;
Z = [ones(n,1),time];
G = subject;
```
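The following sketch runs the longitudinal design end to end on simulated data. The sample sizes, coefficient values, and noise levels are all assumptions chosen for illustration, and the predictor names passed to `fitlmematrix` are labels made up for the display. The grouping variable is converted to a categorical array, which is one of the grouping-variable types named earlier in this section.

```matlab
rng(1);                       % reproducible simulation
nSubjects = 20;
nTimes = 5;
n = nSubjects * nTimes;

subject   = repelem((1:nSubjects)', nTimes);       % subject index per row
time      = repmat((0:nTimes-1)', nSubjects, 1);   % visit times 0..4
treatment = repelem(rand(nSubjects,1), nTimes);    % continuous, per subject

% Simulated subject-specific intercepts and time slopes (assumed values).
b0 = 0.8*randn(nSubjects,1);
b1 = 0.3*randn(nSubjects,1);
response = 2 + 1.5*treatment + 0.5*time + 0.4*treatment.*time ...
    + b0(subject) + b1(subject).*time + 0.5*randn(n,1);

% Design matrices, as described above.
X = [ones(n,1), treatment, time, treatment.*time];
y = response;
Z = [ones(n,1), time];
G = categorical(subject);

lme = fitlmematrix(X, y, Z, G, ...
    'FixedEffectPredictors', {'Intercept','treatment','time','treatment_time'}, ...
    'RandomEffectPredictors', {{'Intercept','time'}}, ...
    'RandomEffectGroups', {'subject'});
disp(lme)
```

Naming the predictors is optional, but it makes the displayed coefficient table easier to match against the columns of `X` and `Z`.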

### Relation of Matrix Form to Tables and Dataset Arrays

`fitlme(tbl,formula)` and `fitlmematrix(X,y,Z,G)` are equivalent in functionality, such that

• `y` is the n-by-1 response vector.

• `X` is an n-by-p fixed-effects design matrix. `fitlme` constructs this from the expression `fixed` in `formula`.

• `Z` is an R-by-1 cell array with `Z{r}` being an n-by-q(r) random-effects design matrix constructed from the rth expression in `random` in `formula`, r = 1, 2, ..., R.

• `G` is an R-by-1 cell array with `G{r}` being an n-by-1 grouping variable, `gr`, in `formula` with M(r) levels or groups.

For example, suppose `tbl` is a table or dataset array containing the response variable `y`, the continuous variables `X1` and `X2`, and the grouping variable `g`. To fit the linear mixed-effects model that corresponds to the formula `'y ~ X1 + X2 + (X1*X2|g)'` using `fitlmematrix(X,y,Z,G)`, the input arguments must correspond to the following, where n is the number of rows in `tbl`:

```
y = tbl.y;
X = [ones(n,1), tbl.X1, tbl.X2];
Z = [ones(n,1), tbl.X1, tbl.X2, tbl.X1.*tbl.X2];
G = tbl.g;
```
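One way to check this correspondence is to fit the same simulated data through both interfaces and compare the results. The sketch below assumes Statistics and Machine Learning Toolbox. The data, group sizes, and coefficient values are invented for illustration.

```matlab
rng(2);                       % reproducible simulation
nGroups = 15;
nPer = 12;
n = nGroups * nPer;

gIdx = repelem((1:nGroups)', nPer);   % integer group index per row
X1 = randn(n,1);
X2 = randn(n,1);

% Design matrices written out by hand.
X = [ones(n,1), X1, X2];
Z = [ones(n,1), X1, X2, X1.*X2];

% Simulated group-level effects for all four random-effects columns.
B = 0.5*randn(nGroups, 4);
y = X*[1; 2; -1] + sum(Z .* B(gIdx,:), 2) + 0.5*randn(n,1);

g = categorical(gIdx);
tbl = table(y, X1, X2, g);

% Formula interface.
lmeFormula = fitlme(tbl, 'y ~ X1 + X2 + (X1*X2|g)');

% Matrix interface.
lmeMatrix = fitlmematrix(X, tbl.y, Z, tbl.g);

% Check that fitlme built the same fixed-effects design matrix.
disp(isequal(designMatrix(lmeFormula, 'Fixed'), X))

% Largest difference between the two sets of fixed-effects estimates.
disp(max(abs(fixedEffects(lmeFormula) - fixedEffects(lmeMatrix))))
```

Because both calls describe the same model, the fixed-effects estimates are expected to agree up to numerical tolerance.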