Cluster mean centering in tidyverse

I'm re-analyzing some old datasets (e.g. from pilots for my dissertation that I ran in 2015) and find myself wanting to re-run some multilevel models. However, the first time I did this, I used grand mean centering. That means I combined the within-cluster and between-cluster effects into a single parameter estimate (Curran and Bauer have a great summary).

Instead, I want to cluster mean center. That means calculating the mean of the variable within each cluster, then subtracting each cluster's mean from the individual scores in that cluster. Then you include both the cluster means and the cluster mean centered scores in the regression. The coefficient on the cluster means gives an estimate of the between-cluster effect. The coefficient on the cluster mean centered scores gives an estimate of the within-cluster effect. It's important to separate these because they can be different, and might even be in opposite directions! (see Simpson's Paradox)
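To make the within/between distinction concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data are made up, the slopes (+2 between clusters, -1 within clusters) are built in by hand, and it uses plain `lm()` rather than a multilevel model to keep dependencies down.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the simulation is repeatable

# Hypothetical data: 10 clusters with 30 observations each.
n_clusters <- 10
n_per <- 30
g <- rep(seq_len(n_clusters), each = n_per)  # cluster ids 1..10
d <- rnorm(length(g))                        # within-cluster deviations

sim <- data.frame(
  g = g,
  x = g + d,                                 # cluster j sits near x = j
  y = 2 * g - d + rnorm(length(g), sd = 0.5) # +2 between, -1 within
)

# Cluster mean centering by hand: cluster mean plus deviation from it
centered <- sim %>%
  group_by(g) %>%
  mutate(x_m = mean(x), x_s = x - x_m) %>%
  ungroup()

# One pooled slope versus separate between/within slopes
naive <- coef(lm(y ~ x, data = centered))
separated <- coef(lm(y ~ x_m + x_s, data = centered))

print(naive)
print(separated)
```

In this setup the pooled slope on `x` comes out positive, because between-cluster variation dominates, while the coefficient on `x_s` is negative. That is the pattern a single combined estimate would hide.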

I looked around and couldn't find a simple solution that didn't require installing other packages, so I wrote my own function. It uses just base R and tidyverse (i.e. dplyr). The function returns a new data.frame/tibble, so it's suitable for chaining together in pipes.

You can copy it from below, or download it here

#' Perform cluster mean centering by calculating the mean of column x
#' within each cluster defined by column g. 
#'
#' A new data.frame is returned with two additional columns,
#'   *_m the mean of x within each cluster
#'   *_s the individual values of x minus their cluster mean
#'       (so they average 0 within each cluster)
#'
#' @param df data.frame containing the variables
#' @param x char; label of the column to cluster mean center
#' @param g char; label of the column with cluster ids
#' @param label char; to use for new columns (default=x)
#' @param clobber logical; to control behavior when a column <label>_m or <label>_s already exists. TRUE: overwrite existing column. FALSE: exit function (default=FALSE)
#'
#' @return data.frame,tibble
#'
#' @examples
#' df <- tibble(x=rnorm(10), g=rep(c(0,1),5))
#' df <- cluster_mean(df, 'x', 'g')
#' df
#' # A tibble: 10 x 4
#'           x     g    x_m     x_s
#'       <dbl> <dbl>  <dbl>   <dbl>
#'  1 -0.918       0 -0.565 -0.353 
#'  2  0.355       1  0.280  0.0753
#'  3  0.0313      0 -0.565  0.596 
#'  4  0.858       1  0.280  0.578 
#'  5 -0.201       0 -0.565  0.364 
#'  6  1.88        1  0.280  1.60  
#'  7 -0.907       0 -0.565 -0.342 
#'  8 -1.69        1  0.280 -1.97  
#'  9 -0.830       0 -0.565 -0.265 
#' 10 -0.00634     1  0.280 -0.286 
#'
cluster_mean <- function(df, x, g, label=x, clobber=FALSE) {
  require(dplyr)
  # set up new column names (paste0 is base R, so stringr is not needed)
  m_col <- paste0(label, '_m')
  s_col <- paste0(label, '_s')
  # temporary names used internally; mapped to the final names at the end
  new_cols <- c('mreservedkey', 'sreservedkey')
  names(new_cols) <- c(m_col, s_col)

  # check for pre-existing columns to prevent accidental clobbering
  for (col in names(new_cols)) {
    if (col %in% names(df)) {
      if (clobber) {
        warning(paste0('Replacing column: ', col))
        # all_of() treats col as a column name held in a variable
        df <- df %>% dplyr::select(-all_of(col))
      } else {
        stop(paste0("Created column [", col, "] already exists. Set clobber=TRUE to overwrite"))
      }
    }
  }
  # calculate cluster means
  # all_of() avoids picking up columns literally named "g" or "x" by mistake
  dm <- df %>%
    dplyr::select(all_of(c(g, x))) %>%
    group_by(!!as.name(g)) %>%
    summarize(mreservedkey = mean(!!as.name(x), na.rm=TRUE))
  # add means to df, calculate cluster-centered values, then rename new columns
  d <- df %>%
    left_join(dm, by=g) %>%
    mutate(sreservedkey = !!as.name(x) - mreservedkey) %>%
    rename(all_of(new_cols))

  return(d)
}
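
If you want to sanity-check the two new columns, one option is to compute them independently with base R's `ave()`. This is only a sketch on a made-up toy data frame; the names `toy`, `x_m` and `x_s` are chosen here to mirror the function's naming and are not part of the function itself.

```r
# Toy data: two clusters whose means are 2 and 20
toy <- data.frame(
  x = c(1, 2, 3, 10, 20, 30),
  g = c("a", "a", "a", "b", "b", "b")
)

# ave() returns each group's mean, repeated for every row in that group
toy$x_m <- ave(toy$x, toy$g, FUN = function(v) mean(v, na.rm = TRUE))

# cluster mean centered scores: raw value minus its cluster mean
toy$x_s <- toy$x - toy$x_m

print(toy$x_m)  # 2 2 2 20 20 20
print(toy$x_s)  # -1 0 1 -10 0 10
```

Running the function on the same data frame and comparing the columns with `all.equal()` should then show agreement.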
