R – Exclude entire subjects if they lack data in one condition

27 February 2017

msauter

It can sometimes be tricky to identify and remove participants who produce missing data in an analysis of interest. Most often you want to run an ANOVA with 2-3 factors and you receive the error message “One or more cells are missing data”. This happens when not all variables are experimentally controlled but some depend on the participant's performance. For example, you want to see how many errors (true/false) participants make over the course of an experiment (blocks 1-10). If a participant did not make any errors in a given block, you will run into missing data problems in follow-up ANOVAs or paired t-tests.

I found that an easy and very understandable way of approaching this is the following:

So let’s say we want to investigate whether the number of errors a participant made changed over the course of an experiment (which is divided into 2 blocks) with the ANOVA: Error X Block.

S	error	block
1	FALSE	1
1	TRUE	1
1	FALSE	1
1	FALSE	2
1	TRUE	2
1	FALSE	2
2	FALSE	1
2	TRUE	1
2	FALSE	1
2	FALSE	2
2	TRUE	2
2	FALSE	2
3	FALSE	1
3	TRUE	1
3	FALSE	1
3	FALSE	2
3	FALSE	2
3	FALSE	2
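To follow along, the table above can be recreated as a data frame. This is just a sketch for reproducibility; it uses the name data because that is what the script below expects, and the column types (integer S and block, logical error) are an assumption.

```r
# Recreate the example data: 3 subjects, 2 blocks, 3 trials per block.
# error is TRUE when the participant made an error on that trial.
data <- data.frame(
  S     = rep(1:3, each = 6),
  block = rep(rep(1:2, each = 3), times = 3),
  error = c(FALSE, TRUE, FALSE,  FALSE, TRUE,  FALSE,   # subject 1
            FALSE, TRUE, FALSE,  FALSE, TRUE,  FALSE,   # subject 2
            FALSE, TRUE, FALSE,  FALSE, FALSE, FALSE)   # subject 3: no errors in block 2
)
```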

First, we need to summarize the data, which we do using the ddply command.

# First we need to aggregate the data
library(plyr)
agg <- ddply(data, ~ S + block + error, .fun=summarise, count = length(error))

For each subject, we now get the number of error and non-error trials in each block.

# Aggregated data
S block error count
1     1 FALSE     2
1     1  TRUE     1
1     2 FALSE     2
1     2  TRUE     1
2     1 FALSE     2
2     1  TRUE     1
2     2 FALSE     2
2     2  TRUE     1
3     1 FALSE     2
3     1  TRUE     1
3     2 FALSE     3

What you can easily spot is that there are only three rows of data for participant 3. This is because they never made an error in block 2. In larger files, we might miss that. This is why we then count the rows of data (number of conditions) for each participant.

agg_conditions <- ddply(agg, ~ S, .fun=summarise, cond = length(count))

You will get the following output:

# Number of conditions per subject
S cond
1    4
2    4
3    3

Next, instead of excluding participant 3 manually, we make a list of all subjects that have a lower value of cond (fewer conditions) than the maximum number of conditions.

exclude <- unique(agg_conditions$S[agg_conditions$cond < max(agg_conditions$cond)])

You can now print exclude to see a list of every subject ID that will be excluded (because you will mention that in the article, of course!). We then remove these subjects from the dataset with the following command.

agg <- subset(agg, !(agg$S %in% exclude))

That's it. Easy, wasn’t it? And it works just as well for large numbers of conditions and subjects (when the manual approach is far too much work).
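One caveat: max(agg_conditions$cond) only equals the full number of conditions if at least one participant has data in every cell. A minimal sketch of an alternative is to derive the expected number of cells from the design itself. It assumes a fully crossed design in which every level of block and error appears somewhere in the data; the names n_expected, n_observed and agg_complete are introduced here for illustration, and the aggregated table is typed in by hand so the snippet runs without plyr.

```r
# Aggregated counts as shown above (subject 3 lacks the block 2 / TRUE cell)
agg <- data.frame(
  S     = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  block = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2),
  error = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
            FALSE, TRUE, FALSE),
  count = c(2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 3)
)

# Expected number of cells per subject, taken from the design:
# (number of block levels) x (number of error levels)
n_expected <- length(unique(agg$block)) * length(unique(agg$error))

# Observed number of cells per subject (base R counterpart of agg_conditions)
n_observed <- table(agg$S)

# Subjects with fewer cells than the design requires
exclude <- as.numeric(names(n_observed)[as.vector(n_observed) < n_expected])

# Keep only subjects with a complete set of cells
agg_complete <- agg[!(agg$S %in% exclude), ]
```

If no participant at all made an error in some block, that cell is still counted in n_expected, so everyone would be flagged instead of nobody, which makes the problem visible rather than hiding it.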

Below is the full script:

# Full script: How to exclude participants that lack data in any one condition

library(plyr)
agg <- ddply(data, ~ S + block + error, .fun=summarise, count = length(error))
agg_conditions <- ddply(agg, ~ S, .fun=summarise, cond = length(count))
exclude <- unique(agg_conditions$S[agg_conditions$cond < max(agg_conditions$cond)])
agg <- subset(agg, !(agg$S %in% exclude))
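If plyr is not available, roughly the same check can be done in base R by cross-tabulating the trial-level data. This is a sketch of a different technique (table() plus apply()), not the script above; the names cells, incomplete and data_complete are made up for this example, and the data frame repeats the example data so the snippet runs on its own.

```r
# Example trial-level data with the same layout as in the post
data <- data.frame(
  S     = rep(1:3, each = 6),
  block = rep(rep(1:2, each = 3), times = 3),
  error = c(FALSE, TRUE, FALSE,  FALSE, TRUE,  FALSE,
            FALSE, TRUE, FALSE,  FALSE, TRUE,  FALSE,
            FALSE, TRUE, FALSE,  FALSE, FALSE, FALSE)
)

# Cross-tabulate subject x block x error; empty cells show up as 0
cells <- table(data$S, data$block, data$error)

# A subject is incomplete if any of their block x error cells is empty
incomplete <- apply(cells == 0, 1, any)
exclude <- as.numeric(names(incomplete)[incomplete])

# Drop all trials of the incomplete subjects
data_complete <- data[!(data$S %in% exclude), ]
```

Because table() lists every combination of the observed levels, this version does not rely on any single participant having complete data.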

R, Script, Statistics

One response

Ondrej Havlicek

28 February 2017

Your code max(agg_conditions$cond) assumes that at least one participant has data from all the combinations of conditions, no?;-)

I was recently trying to do something similar using tidyverse:
library(tidyverse)
#Assumes S, block, error are factors

exclude <- agg %>% group_by(S, block, error) %>% summarize(N=n()) %>% complete(block, error) %>% ungroup() %>% filter(is.na(N)) %>% select(S) %>% unique()

Not sure it will work universally, or that it is the most elegant solution..

I am also struggling with the actual exclusion of rows where S is in exclude, if I use S as factor (which makes sense, it is not a numeric variable), because the %in% comparison can compare the underlying factor values instead of factor labels.. Then one has to do something ugly like:

agg.selected <- agg %>% filter(!(as.numeric(levels(S))[S] %in% exclude))
