Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step...
Main Authors: | Peng, Hui; Wang, He; Kong, Weijia; Li, Jinyan; Goh, Wilson Wen Bin |
---|---|
Other Authors: | Lee Kong Chian School of Medicine (LKCMedicine) |
Format: | Article |
Language: | English |
Published: | 2024 |
Subjects: | Medicine, Health and Life Sciences; Proteomics; Ensemble inference |
Online Access: | https://hdl.handle.net/10356/178809 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-178809 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-178809 2024-07-14T15:37:33Z Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference Peng, Hui Wang, He Kong, Weijia Li, Jinyan Goh, Wilson Wen Bin Lee Kong Chian School of Medicine (LKCMedicine) School of Biological Sciences Center for Biomedical Informatics, NTU Center of AI in Medicine, NTU Medicine, Health and Life Sciences Proteomics Ensemble inference Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
Ministry of Education (MOE) National Research Foundation (NRF) Published version This research/project is supported by the National Research Foundation, Singapore, under its Industry Alignment Fund-Prepositioning (IAF-PP) Funding Initiative (W.W.B.G.). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore. This work was partly supported by the National Innovation Fellow Program of the MOST of China (J.L., Grant No. E327130001). W.W.B.G. also acknowledges the support from an MOE Tier 1 award (RS08/21). J.L. acknowledges the support from his start-up funding grant at Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences. 2024-07-08T01:51:59Z 2024-07-08T01:51:59Z 2024 Journal Article Peng, H., Wang, H., Kong, W., Li, J. & Goh, W. W. B. (2024). Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference. Nature Communications, 15(1), 3922-. https://dx.doi.org/10.1038/s41467-024-47899-w 2041-1723 https://hdl.handle.net/10356/178809 10.1038/s41467-024-47899-w 38724498 2-s2.0-85192527116 1 15 3922 en RS08/21 IAF-PP Nature Communications © 2024 The Author(s). Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. 
If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Medicine, Health and Life Sciences Proteomics Ensemble inference |
description |
Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference to integrate results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows. |
author2 |
Lee Kong Chian School of Medicine (LKCMedicine) |
author_facet |
Lee Kong Chian School of Medicine (LKCMedicine) Peng, Hui Wang, He Kong, Weijia Li, Jinyan Goh, Wilson Wen Bin |
format |
Article |
author |
Peng, Hui Wang, He Kong, Weijia Li, Jinyan Goh, Wilson Wen Bin |
author_sort |
Peng, Hui |
title |
Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference |
title_sort |
optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/178809 |
_version_ |
1806059813805752320 |
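
The description above reports F1 scores, Matthews correlation coefficients, and G-mean without defining them. As a minimal sketch, assuming the common textbook definitions (the record does not say exactly how the article computes them), the three scores can be derived from confusion-matrix counts of proteins called differentially expressed. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from the article.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts.

    Assumed definitions (standard, not confirmed by the record):
      F1     = harmonic mean of precision and recall
      MCC    = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
      G-mean = sqrt(recall * specificity)
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    g_mean = math.sqrt(recall * specificity)
    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}


if __name__ == "__main__":
    # Hypothetical spike-in benchmark: 10 true differential proteins
    # among 100, with 8 recovered and 2 false calls.
    print(confusion_metrics(tp=8, fp=2, tn=88, fn=2))
```

G-mean and MCC both account for true negatives, which matters when differentially expressed proteins are a small minority of the quantified proteome, as in this hypothetical example.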