Wednesday, October 12, 2011

Summary of Data Integration Session at NIH Human Proteome Meeting

I recently attended the NIH Human Proteome Meeting and chaired the session below. I was asked to write a summary, which I append here, as others might find it interesting.

Subsession 2:   Integrating Proteomics with Other Omics
9:15 – 9:35 a.m. Mark Gerstein, Yale University
9:35 – 9:55 a.m. Rolf Apweiler, EMBL Outstation European Bioinformatics Institute (EBI)  
9:55 – 10:15 a.m. Joel Bader, The Johns Hopkins University  
10:15 – 10:35 a.m. Robert Gerszten, Massachusetts General Hospital
10:35 – 10:55 a.m. Zhiping Weng, University of Massachusetts Medical School

The second session in the Human Proteome Meeting was devoted to data integration. The speakers noted a number of key themes.

1 * Doing Direct Integration: Gene Expression v. Protein Abundance

Dr. Gerstein talked about integrating quantitative proteomics data, specifically protein abundance levels, with mRNA expression levels. This is in a sense one of the most direct and obvious forms of integration. It can be done in two ways: first, in a simplified form dating from around 2005, by directly comparing the two sets of levels; and second, in a more elaborate form that will become possible in the future, in relation to allelic expression, where one can compare the maternal and paternal alleles for both gene expression and protein abundance using the exact sequences that come from mass spectrometry and transcriptome sequencing. The "future" case allows the effect of specific mutations on gene expression to be examined in detail, since the maternal and paternal alleles are perfectly matched controls. For both the simpler comparison of levels and the allelic case, the key requirement identified was protein abundance datasets, particularly for human, precisely matched against RNA-Seq sets.
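The simpler form of integration described above can be sketched in a few lines: match the two datasets on gene identifiers and compute a correlation between the levels. The gene names and values here are purely illustrative, not real measurements.

```python
# Illustrative sketch: correlating mRNA expression with protein abundance.
# Gene names and values are made up for demonstration.
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length value lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-gene measurements (log scale): mRNA level from RNA-Seq,
# protein abundance from quantitative mass spectrometry.
mrna    = {"GENE1": 5.2, "GENE2": 2.1, "GENE3": 7.8, "GENE4": 4.0, "GENE5": 6.3}
protein = {"GENE1": 4.8, "GENE2": 2.5, "GENE3": 8.1, "GENE4": 3.2, "GENE5": 6.9}

# Match the two datasets on gene identifiers before comparing levels.
shared = sorted(set(mrna) & set(protein))
r = pearson([mrna[g] for g in shared], [protein[g] for g in shared])
print(f"{len(shared)} genes, correlation r = {r:.2f}")
```

The matching step is the crux in practice: the proteomics and RNA-Seq sets must be keyed to the same gene (or, in the allelic case, the same haplotype) before any comparison of levels is meaningful.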

2 * Connecting Proteomics Data to the Huge Amount of Variation Data

The idea of connecting the proteome data, particularly in the form of networks, with the huge amount of variation data coming from personal genome sequencing was highlighted by Joel Bader and Mark Gerstein. There was also some discussion from a number of the speakers about the importance of connecting the complex aspects of real proteins to the variation data. In particular, it was observed that a single nucleotide polymorphism could differentially affect different transcripts. Moreover, it can potentially have a stronger effect than one might imagine by hitting a site of posttranslational modification. This could be addressed by developing large datasets of protein isoforms and linking these against gene annotations.
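The isoform-linking idea above can be illustrated with a toy lookup: given the protein-level position of a coding variant, check which annotated isoforms carry a posttranslational-modification (PTM) site at that position. The isoform names and PTM positions below are hypothetical.

```python
# Toy sketch of linking variation data to protein annotations. Isoform
# names and PTM residue positions are hypothetical examples.

# Per-isoform PTM sites (residue positions), as might be drawn from a
# protein-isoform annotation set.
ptm_sites = {
    "ISOFORM-1": {15, 87, 203},
    "ISOFORM-2": {15, 120},   # alternative splicing shifts or drops sites
    "ISOFORM-3": {87},
}

def isoforms_hit(variant_position):
    """Isoforms whose annotated PTM site coincides with the variant."""
    return sorted(iso for iso, sites in ptm_sites.items()
                  if variant_position in sites)

# A variant at residue 15 differentially affects the isoforms: it hits an
# annotated modification site in two of them but not the third.
print(isoforms_hit(15))
```

This is, of course, only the bookkeeping layer; the substantive work the speakers called for is building the large isoform and annotation datasets that such a lookup would draw on.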

3 * The Potential for Integrating Diverse Information with Proteins

All the speakers felt the proteomics data could be integrated with diverse biomedical information. Robert Gerszten discussed the importance of connecting proteomics data with clinical measurements and metabolomics. Dr. Gerstein discussed the idea of connecting the protein networks with three-dimensional structures. This latter integration opportunity could be further pursued by solving co-crystal structures of proteins and using these to provide molecular details for interaction networks.

4 * The Complexities and Subtleties of Detailed Integration

The difficulties in achieving data integration in the framework of a working database system were underscored by Rolf Apweiler. He pointed out that in many instances, while one can get most of the integration done, a small number of lingering cases lead to pathological examples. The example he focused on was connecting the genomic data in Ensembl to the proteomic information in UniProt.

Zhiping Weng provided a very good summary of how detailed data integration could be done in a genomics context. She highlighted how chromatin marks might be used to predict gene expression in the framework of an integrative model. This represents perhaps the highest form of data integration, moving beyond merely putting the datasets together to exploring how one dataset might be used to "predict" another.
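The "predict one dataset from another" idea can be sketched at its simplest as a regression: fit a least-squares line relating a chromatin-mark signal (say, a hypothetical promoter signal) to expression on training genes, then predict expression for a held-out gene from its signal alone. All values here are illustrative, not from the talk.

```python
# Minimal sketch of predicting one dataset from another: ordinary
# least-squares regression of expression on a chromatin-mark signal.
# All numbers are illustrative.
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ a*x + b."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical training data: (mark signal, expression level) per gene.
signal     = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = [2.1, 3.9, 6.2, 8.0, 9.9]

a, b = fit_line(signal, expression)
# Predict expression for a held-out gene from its chromatin signal alone.
predicted = a * 3.5 + b
print(f"slope={a:.2f}, intercept={b:.2f}, prediction={predicted:.2f}")
```

Real integrative models of this kind use many marks and more sophisticated learners, but the structure is the same: train on genes where both datasets are measured, then predict where only one is.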