Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing and Prediction

*IMS Monograph 1, Bradley Efron (2010)*

Datasets and Programs

The datasets are RData files (extension .RData) and must be loaded into R with the "load" command. To use alpha, for example, right-click on its link and select "Save link as...", which opens a dialog prompting you to save the file to your computer. Then, from the directory where you saved it, load it in R with

`load("alpha.RData")`

The programs are also RData files and are loaded by the same procedure.
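Because "load" restores objects under the names they were saved with, it can help to capture its return value, which is a character vector of the names it created. The sketch below writes a small stand-in file to a temporary directory; the object name `demo_matrix` and the file `demo.RData` are illustrative and are not among the files listed here. The same pattern should apply to a downloaded file such as alpha.RData.

```r
# Write a small stand-in .RData file; `demo_matrix` is a made-up object name.
demo_matrix <- matrix(1:6, nrow = 3, ncol = 2)
demo_file <- file.path(tempdir(), "demo.RData")
save(demo_matrix, file = demo_file)
rm(demo_matrix)

# load() returns (invisibly) the names of the objects it restored.
loaded_names <- load(demo_file)
print(loaded_names)

# Inspect what was loaded before using it.
str(get(loaded_names[1]))
print(dim(demo_matrix))
```

Checking the returned names first avoids guessing what a file calls its contents.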

**Datasets**

Kidney Data: 157x2 matrix with ages and kidney scores

Prostate Data: 6033x102 matrix, labeled 1 and 2 for control and treatment; also z-values in Figures 2.1 and 4.6

DTI Data: 15443x4 matrix; first 3 columns give (x,y,z) coordinates, fourth column gives z-values

Leukemia Data: 7128x72 matrix, columns labeled 0 and 1 for AML and ALL; also z-values shown in Figures 4.6 and 6.2

Chi-square Data: 301010x3 matrix, column 1 gives genes, columns 2 and 3 give conditions 1 and 2 as in Table 6.1 (which is gene 23 except one of the entries for condition 2 has been changed from 1 to 0); also z-values for the 16882 of the 18399 genes having at least 3 sites

Police Data: vector of 2749 z-values from Figure 6.1

HIV Data: 7680x8 matrix, columns labeled 0 and 1 for healthy controls and HIV patients; also z-values obtained from two-sample t-tests on logged and standardized versions of hivdata

Cardio Data: 20436x63 matrix, first 44 columns are healthy controls, last 19 are cardiovascular patients; the columns are standardized but the matrix is not doubly standardized

p53 Data: 10100x50 matrix, first 33 columns mutated cell lines, last 17 unmutated; also z-values

Brain Data: 12625 t-values and (x,y) coordinates for smoothing spline

Michigan Data: 5217x86 matrix, first 24 columns bad outcomes, last 62 columns good outcomes

Shakespeare Data: 10x10 matrix
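Several entries above mention z-values obtained from two-sample comparisons. As a rough illustration of how such z-values can be computed from a two-group matrix, the sketch below uses simulated data, not any of the datasets listed, and a generic pooled-variance t-statistic mapped to the normal scale. The names `X`, `labels`, `n1`, and `n2` are assumptions for this example, and any dataset-specific preprocessing (such as the logging and standardizing mentioned for the HIV data) is not shown.

```r
set.seed(1)

# Simulated stand-in for an N x n matrix; not one of the datasets above.
N <- 200
n1 <- 10  # columns labeled 1
n2 <- 12  # columns labeled 2
X <- matrix(rnorm(N * (n1 + n2)), nrow = N)
labels <- rep(c(1, 2), times = c(n1, n2))

# Row-wise two-sample t-statistics with pooled variance (group 2 minus group 1).
X1 <- X[, labels == 1, drop = FALSE]
X2 <- X[, labels == 2, drop = FALSE]
df <- n1 + n2 - 2
pooled_var <- ((n1 - 1) * apply(X1, 1, var) + (n2 - 1) * apply(X2, 1, var)) / df
t_stat <- (rowMeans(X2) - rowMeans(X1)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Map each t-statistic to a z-value that is N(0,1) under the null hypothesis.
z <- qnorm(pt(t_stat, df = df))

print(summary(z))
```

The resulting vector `z` has one entry per row of `X`.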

**Programs**

locfdr (CRAN): Produces estimates of the local false discovery rate, both assuming the theoretical null hypothesis as in (5.5), and using an empirical null estimate as in Sections 6.2 and 6.3. Only needs a vector of z-values, but see the help file for options and possible adjustments. The histogram plot with the estimated null and mixture densities is particularly valuable; it should always be inspected for goodness of fit.
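A minimal usage sketch follows, assuming the locfdr package has been installed from CRAN. The z-values are simulated for illustration, and the 0.2 cutoff is an arbitrary choice for this example. Plotting is switched off here so the code runs unattended; in practice, leaving `plot` at its default draws the histogram recommended above.

```r
# Assumes the locfdr package is installed: install.packages("locfdr")
if (requireNamespace("locfdr", quietly = TRUE)) {
  set.seed(1)
  # Simulated z-values: mostly null N(0,1), plus a small shifted non-null group.
  z <- c(rnorm(1900), rnorm(100, mean = 3))

  # nulltype = 0 uses the theoretical N(0,1) null; nulltype = 1 (the default)
  # fits an empirical null by maximum likelihood.
  fit_theoretical <- locfdr::locfdr(z, nulltype = 0, plot = 0)
  fit_empirical <- locfdr::locfdr(z, nulltype = 1, plot = 0)

  # Estimated local fdr for each z-value; small values flag likely non-null cases.
  print(head(fit_empirical$fdr))
  print(sum(fit_empirical$fdr < 0.2))
}
```

Comparing the two fits is one way to see how much the choice of null affects which cases are flagged.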

Ebay.RData: Produces the empirical Bayes estimated effect size as in Figure 11.1. All that need be provided is the training data: the N x n predictor matrix X and the n-vector Y of zero-one responses. The help file lists a range of user modifications, including the "folds" option for cross-validation.

simz.RData: Generates an N x n matrix X of correlated z-values. Each z is N(0,1), but they have root mean square correlation approximately equaling a target value. See Section 8.2, and comments at the beginning of "simz" for more information. Non-null versions of X can be obtained by adding constants to selected entries, for example the top 57 rows of the last n/2 columns.
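To make the idea concrete, here is a small stand-in that is not the simz program and does not follow its interface. It uses a one-factor model, chosen for this sketch, in which every pair of rows has correlation plus or minus `alpha`, so the root mean square correlation between rows equals the target. The function name `sim_correlated_z` and the shift `delta` are hypothetical.

```r
set.seed(1)

# Hypothetical stand-in for simz: a one-factor model, not Efron's program.
sim_correlated_z <- function(N, n, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  # Row loadings of +/- sqrt(alpha): any two rows have correlation +/- alpha,
  # so the root mean square correlation between rows is alpha.
  loading <- sqrt(alpha) * sample(c(-1, 1), N, replace = TRUE)
  common <- rnorm(n)                       # one shared factor value per column
  noise <- matrix(rnorm(N * n), nrow = N)  # independent N(0,1) noise
  # Each entry is N(0,1) because loading^2 + (1 - loading^2) = 1.
  outer(loading, common) + sqrt(1 - loading^2) * noise
}

N <- 500
n <- 20
X <- sim_correlated_z(N, n, alpha = 0.1)

# Non-null version: shift the top 57 rows of the last n/2 columns.
# The shift size `delta` is an illustrative choice.
delta <- 1
X_nonnull <- X
last_half <- (n / 2 + 1):n
X_nonnull[1:57, last_half] <- X_nonnull[1:57, last_half] + delta
```

Only the selected block of `X_nonnull` differs from `X`; all other entries remain null.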

alpha.RData: Takes an N x n matrix X and outputs estimate (8.18) of the root mean square correlation. It is best to first remove potential systematic effects from X. For instance, with the prostate data of Section 2.1, one might separately subtract the gene-wise means from the control and cancer groups (or more simply, apply alpha to the 50 control or 52 cancer columns of X).
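The preprocessing step described above can be sketched as follows, on simulated data rather than the prostate matrix. The function `rms_cor_sketch` is a rough stand-in written for this example, not the alpha program: it averages squared sample correlations between rows and subtracts 1/(n - 1), the expected squared sample correlation for independent normal rows. The exact estimate (8.18) is given in the book and is not reproduced here.

```r
set.seed(1)

# Simulated stand-in for a two-group matrix (rows = genes, columns = subjects).
N <- 300
n_control <- 25
n_cancer <- 26
X <- matrix(rnorm(N * (n_control + n_cancer)), nrow = N)
group <- rep(c(1, 2), times = c(n_control, n_cancer))

# Subtract each gene's mean separately within each group.
center_by_group <- function(X, group) {
  for (g in unique(group)) {
    cols <- group == g
    X[, cols] <- X[, cols] - rowMeans(X[, cols, drop = FALSE])
  }
  X
}

# Rough rms correlation between rows; a stand-in, not the alpha program.
rms_cor_sketch <- function(X) {
  n <- ncol(X)
  R <- cor(t(X))        # N x N sample correlations between rows
  r <- R[upper.tri(R)]  # distinct pairs only
  # For independent normal rows, E[r^2] = 1/(n - 1); subtract that noise floor.
  sqrt(max(0, mean(r^2) - 1 / (n - 1)))
}

Xc <- center_by_group(X, group)
est_centered <- rms_cor_sketch(Xc)
est_one_group <- rms_cor_sketch(X[, group == 1])  # simpler: one group only
print(c(centered = est_centered, one_group = est_one_group))
```

Forming the full N x N correlation matrix is feasible here because N is small; for matrices with thousands of rows, sampling pairs of rows may be more practical.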

Contact Email: brad@stat.stanford.edu
