Simplified and unified access to cancer proteogenomic data

Published: Nov. 17, 2020, 1:02 a.m.

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.11.16.385427v1?rss=1

Authors: Lindgren, C. M., Adams, D. W., Kimball, B., Boekweg, H., Tayler, S., Pugh, S. L., Payne, S. H.

Abstract: Comprehensive cancer datasets recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) offer great potential for advancing our understanding of how to combat cancer. These datasets include DNA, RNA, protein, and clinical characterization for tumor and normal samples from large cohorts in many different cancer types. The raw data are publicly available at various Cancer Research Data Commons. However, widespread re-use of these datasets is also facilitated by easy access to the processed quantitative data tables. We have created a Python package, cptac, which is a data API that distributes the finalized processed CPTAC datasets in a consistent, up-to-date format. This consistency makes it easy to integrate the data with common graphing, statistical, and machine learning packages for advanced analysis. Additionally, consistent formatting across all cancer types promotes the investigation of pan-cancer trends. The data API structure of directly streaming data within a programming environment enhances reproducibility. Finally, with the accompanying tutorials, this package provides a novel resource for cancer research education.

Copyright belongs to the original authors. Visit the link for more info.
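To illustrate why a consistent table format eases integration, here is a minimal sketch in plain Python. It does not call the cptac package, whose function names the abstract does not give. It assumes each processed table is keyed by sample ID with one value per gene. The sample IDs, the numeric values, and the helper `join_tables` are all hypothetical and invented for illustration; they are not CPTAC data.

```python
# Toy stand-ins for two processed data tables, keyed by sample ID.
# All IDs and values are invented for illustration; NOT real CPTAC data.
proteomics = {
    "S001": {"TP53": 0.42, "EGFR": -1.10},
    "S002": {"TP53": -0.35, "EGFR": 0.87},
    "S003": {"TP53": 1.05, "EGFR": 0.12},
}
transcriptomics = {
    "S001": {"TP53": 7.9, "EGFR": 5.2},
    "S002": {"TP53": 6.4, "EGFR": 8.8},
    "S004": {"TP53": 7.1, "EGFR": 6.0},
}


def join_tables(left, right, left_suffix, right_suffix):
    """Inner-join two sample-indexed tables (hypothetical helper).

    Only samples present in both tables are kept. Column names get a
    suffix so that the same gene from each data type stays distinguishable.
    """
    joined = {}
    for sample in sorted(left.keys() & right.keys()):
        row = {f"{gene}{left_suffix}": value for gene, value in left[sample].items()}
        row.update(
            {f"{gene}{right_suffix}": value for gene, value in right[sample].items()}
        )
        joined[sample] = row
    return joined


joined = join_tables(proteomics, transcriptomics, "_protein", "_rna")
for sample, row in joined.items():
    print(sample, row)
```

Because both tables share the same sample-ID keys and gene names, the join needs no table-specific cleanup. The same function would work unchanged on tables from another cancer type laid out the same way, which is the property the abstract credits for easier pan-cancer analysis.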