R – Exclude entire subjects if they lack data in one condition

It can sometimes be tricky to identify and exclude participants that produce missing data in an analysis of interest. Most often you want to do an ANOVA with 2-3 factors and you receive the error message “One or more cells are missing data”. This happens when not all variables are experimentally controlled but instead depend on the participants’ performance. For example, you want to see how many errors (true/false) participants make over the course of an experiment (blocks 1-10). If some participant did not make any errors in a given block, you will run into missing data problems in follow-up ANOVAs or paired t-tests.

I found that an easy and very understandable way of approaching this is the following:

So let’s say we want to investigate whether the number of errors a participant made changed over the course of an experiment (which is divided into 2 blocks) with the ANOVA: Error X Block.

First, we need to summarize the data, which we do with the ddply function from the plyr package.
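The original call is not reproduced here, so the following is only a minimal sketch of what this summarizing step could look like. The data frame dat, its columns S (subject ID), block and error, the toy values, and the exact ddply arguments are all assumptions made for illustration.

```r
library(plyr)  # provides ddply() and summarise()

# Toy trial-level data (made up for illustration): 3 subjects, 2 blocks.
# Subject 3 makes no errors in block 2, so one cell will be empty.
dat <- data.frame(
  S     = rep(1:3, each = 8),
  block = rep(rep(1:2, each = 4), times = 3),
  error = c(TRUE, FALSE, FALSE, FALSE,  TRUE, TRUE, FALSE, FALSE,   # subject 1
            FALSE, TRUE, FALSE, FALSE,  FALSE, FALSE, TRUE, FALSE,  # subject 2
            TRUE, FALSE, FALSE, FALSE,  FALSE, FALSE, FALSE, FALSE) # subject 3
)

# One row per subject x block x error combination that actually occurs,
# with the number of trials in that cell (column name N is assumed).
agg <- ddply(dat, .(S, block, error), summarise, N = length(error))
print(agg)
```

Combinations that never occur in the data produce no row at all, which is exactly what causes the missing cells later on.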

We now have, for each subject, the count of errors and non-errors in each block.

What you can easily spot is that there are only three rows of data for participant 3. This is because that participant never made an error in one of the blocks, so that condition is simply absent from the summary. In larger files, we might miss that. This is why we then count the rows of data (number of conditions) for each participant.
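As a rough sketch of this counting step, assuming the summary lives in a data frame called agg with a subject column S (hypothetical names and toy values), one could again use ddply. The names agg_conditions and cond follow the names used elsewhere in this post.

```r
library(plyr)  # provides ddply() and summarise()

# Summary table as it might look after the previous step (toy values):
# subject 3 has only three rows because one block/error cell is empty.
agg <- data.frame(
  S     = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  block = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2),
  error = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
            FALSE, TRUE, FALSE),
  N     = c(3, 1, 2, 2, 3, 1, 3, 1, 3, 1, 4)
)

# Count the rows (= non-empty conditions) per subject.
agg_conditions <- ddply(agg, .(S), summarise, cond = length(S))

# Subjects 1 and 2 have cond = 4, subject 3 has cond = 3.
print(agg_conditions)
```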

You will get the following output:

Next, instead of excluding participant 3 manually, we make a list of all subjects that have a lower value of cond (fewer conditions) than the maximum number of conditions.
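A possible way to build that list in base R is sketched below. The toy counts and the column name S are assumptions; the comparison against max(agg_conditions$cond) is the one described in the text.

```r
# Per-subject condition counts (toy values; column name S is assumed).
agg_conditions <- data.frame(S = c(1, 2, 3), cond = c(4, 4, 3))

# IDs of all subjects with fewer conditions than the best-covered subject.
# Note: this only works if at least one subject has data in every cell.
exclude <- agg_conditions$S[agg_conditions$cond < max(agg_conditions$cond)]
print(exclude)
```

If no subject covers every condition, the maximum underestimates the full design, so it may be safer to compare against the known number of cells (here 2 blocks x 2 error levels = 4) instead.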

You can now print out exclude to see a nice list of every subject ID you excluded (because you will mention that in the article of course!). We then automatically remove these subjects from the dataset with the following command.
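The removal itself could look roughly like this in base R. The data frame dat, its columns and the result name dat_clean are hypothetical, and this is one way to do it rather than necessarily the original command.

```r
# Toy trial-level data and exclusion list (both assumed for illustration).
dat <- data.frame(
  S     = rep(1:3, each = 4),
  block = rep(c(1, 1, 2, 2), times = 3),
  error = c(TRUE, FALSE, TRUE, FALSE,
            TRUE, FALSE, FALSE, TRUE,
            TRUE, FALSE, FALSE, FALSE)
)
exclude <- 3

# Keep only rows whose subject ID is not on the exclusion list.
dat_clean <- dat[!(dat$S %in% exclude), ]

# If S is stored as a factor, drop the now-unused levels so the excluded
# subjects do not reappear as empty groups in later analyses.
dat_clean <- droplevels(dat_clean)
print(unique(dat_clean$S))
```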

That’s it. Easy, wasn’t it? It also works for large numbers of conditions and subjects (when the manual approach would be far too much work).

Below is the full script:


1 Response

  1. Your code max(agg_conditions$cond) assumes that at least one participant has data from all the combinations of conditions, no?;-)

    I was recently trying to do something similar using tidyverse:
    #Assumes S, block, error are factors

    exclude <- agg %>% group_by(S, block, error) %>% summarize(N=n()) %>% complete(block, error) %>% ungroup() %>% filter(is.na(N)) %>% select(S) %>% unique()

    Not sure it will work universally, or that it is the most elegant solution..

    I am also struggling with the actual exclusion of rows where S is in exclude, if I use S as factor (which makes sense, it is not a numeric variable), because the %in% comparison can compare the underlying factor values instead of factor labels.. Then one has to do something ugly like:

    agg.selected <- agg %>% filter(!(as.numeric(labels(S))[S] %in% exclude))
