Metadata & Standards

LINCS Metadata Standards

Working with the LINCS Data Working Group (DWG), HMS LINCS actively helps develop standards for the metadata describing LINCS reagents, assays, and experiments. Up-to-date versions of the metadata standards developed by the LINCS DWG, along with use cases that describe potential applications for LINCS data, can be found on the Data Standards page of the LINCS Project website.

Additional HMS LINCS data formats and file types

MIDAS/CSV

MIDAS is a tabular (spreadsheet) representation of multi-factorial, multi-dimensional experimental datasets. Each row in a MIDAS table corresponds to a unique sample in the experiment, and each column corresponds to a unique experimental variable (factor) such as a treatment condition or assay readout. Column names prefixed with TR: denote treatment conditions (independent variables); the DA: prefix denotes “data acquisition”, i.e. readout timepoints (dependent variables); and the DV: prefix denotes “data values” or actual readout values. The intersecting cells store the values for each sample-variable combination. MIDAS was defined in conjunction with the DataRail software project; for more information, refer to the publication describing DataRail:

Saez-Rodriguez et al. (2008) Flexible Informatics for Linking Experimental Data to Mathematical Models via DataRail. Bioinformatics. 24, 840-847.

Although the format can be stored in any spreadsheet or text-delimited file format, we have chosen to use the CSV (comma separated value) format for all our MIDAS files.
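As a rough illustration of the column-prefix convention described above, the following Python sketch reads a small MIDAS-style CSV and groups its columns by prefix. The table contents, the column names after each prefix, and the helper names (split_midas_columns, read_midas) are invented for this example and are not taken from any HMS LINCS dataset or from DataRail.

```python
import csv
import io

# Hypothetical MIDAS-style table. Column names and values are invented
# for illustration; only the TR:/DA:/DV: prefixes come from the text above.
MIDAS_CSV = """TR:CellLine,TR:EGF,TR:Inhibitor,DA:pERK,DV:pERK
MCF7,0,0,10,0.12
MCF7,100,0,10,0.85
MCF7,100,1,10,0.31
"""


def split_midas_columns(header):
    """Group column names by prefix: TR (treatment), DA (acquisition), DV (value)."""
    groups = {"TR": [], "DA": [], "DV": [], "other": []}
    for name in header:
        prefix, sep, rest = name.partition(":")
        if sep and prefix in ("TR", "DA", "DV"):
            groups[prefix].append(rest)
        else:
            # Keep unrecognized columns rather than failing on them.
            groups["other"].append(name)
    return groups


def read_midas(text):
    """Return (column groups, list of sample rows) for a MIDAS-style CSV string."""
    reader = csv.DictReader(io.StringIO(text))
    groups = split_midas_columns(reader.fieldnames)
    samples = list(reader)  # one dict per row, i.e. per sample
    return groups, samples


groups, samples = read_midas(MIDAS_CSV)
print(groups["TR"])            # treatment-condition columns
print(samples[1]["DV:pERK"])   # readout value for the second sample
```

Values are left as strings here; a real reader would likely convert the DA: and DV: columns to numbers.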

DataPflex/CSV

DataPflex is a tabular format very similar to MIDAS, defined alongside the software project of the same name. Like MIDAS, DataPflex maps rows to samples, but it uses a more compact column mapping for the independent variables (the Data Description Block, in DataPflex terms). For more information, refer to the DataPflex paper:

Hendriks et al. (2010) DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets. Bioinformatics. 26, 432-433.

Like MIDAS, DataPflex can be stored in any tabular file format, but we have standardized on CSV.

Factor dictionary

Factor names in a data file (i.e. MIDAS column headings and DataPflex Data Description Block values) are intended to be short handles which uniquely and efficiently identify a given cell line, gene, small molecule, etc. within the scope of the given dataset. However, these short names alone are not intended to exhaustively encode reagent identities nor to be globally unique across datasets. The factor dictionary is an Excel .xls spreadsheet that aims to define all factor names in a dataset with sufficient resolution to enable correct biological interpretation of the data. Each worksheet corresponds to a different type of factor, each of which requires a more or less consistent set of annotations.
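One way to picture the role of the factor dictionary is as a lookup that every short factor name in a data file should resolve against. The sketch below models the worksheets as plain Python dictionaries instead of reading an actual .xls file. The worksheet names, the "Name" column, the annotation fields, and the undefined_factors helper are all assumptions made for illustration, not the actual layout of an HMS LINCS factor dictionary.

```python
# Hypothetical factor dictionary: one entry per worksheet (factor type),
# each holding rows of annotations. Sheet names and fields are assumed.
factor_dictionary = {
    "Cell lines": [
        {"Name": "MCF7", "Organism": "Homo sapiens"},
    ],
    "Ligands": [
        {"Name": "EGF", "Full name": "Epidermal growth factor"},
    ],
}


def undefined_factors(factor_names, dictionary):
    """Return the factor names that have no entry in any worksheet."""
    defined = {row["Name"] for rows in dictionary.values() for row in rows}
    return sorted(set(factor_names) - defined)


# Short handles as they might appear in a data file's column headings.
used_in_dataset = ["MCF7", "EGF", "InhibitorX"]
print(undefined_factors(used_in_dataset, factor_dictionary))
```

A check like this flags short names that cannot be interpreted from the dictionary alone, which is the situation the factor dictionary is meant to prevent.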

Protocol

A full experimental protocol is necessary for a complete understanding of an experiment and especially for the ability to reproduce it. A full protocol is provided for all datasets in the HMS LINCS Database in the Detail tab accompanying each dataset.

Salt codes

The salts table in the HMS LINCS Database lists the various salts and counter-ions added to compound formulations, generally to improve solubility. These salts are usually not relevant to a compound’s activity, but HMS LINCS still tracks this information for completeness. The small molecule perturbagen metadata field “HMS LINCS Salt Form ID” is a reference to the “HMS LINCS ID” column of this table.
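The reference from a small molecule record to the salts table works like a foreign-key lookup, which the following Python sketch illustrates. Only the field names "HMS LINCS Salt Form ID" and "HMS LINCS ID" come from the text above; the ID values, salt names, compound record, and the resolve_salt helper are invented for this example.

```python
# Hypothetical salts table rows. The IDs and names are invented.
salts_table = [
    {"HMS LINCS ID": "S-001", "Name": "Example salt A"},
    {"HMS LINCS ID": "S-002", "Name": "Example salt B"},
]

# Index the salts table by its "HMS LINCS ID" column for direct lookup.
salts_by_id = {row["HMS LINCS ID"]: row for row in salts_table}

# Hypothetical small molecule perturbagen record.
small_molecule = {
    "Name": "Compound A",
    "HMS LINCS Salt Form ID": "S-002",
}


def resolve_salt(molecule, salts_index):
    """Return the salt record a small molecule refers to, or None if absent."""
    salt_id = molecule.get("HMS LINCS Salt Form ID")
    return salts_index.get(salt_id)


salt = resolve_salt(small_molecule, salts_by_id)
print(salt["Name"] if salt else "no salt record found")
```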

ISA-Tab

HMS LINCS is also exploring the use of the ISA-Tab file format for storing, organizing, and sharing metadata for experimental and data analysis protocols.