PathUTRNet: Prediction of signaling pathways and microRNA non-coding regions with deep learning
PathUTRNet was created as part of my MSc thesis during my master's studies at Queen Mary University of London.
Summary
PathUTRNet consists of two deep learning models, which are applied sequentially to accomplish three interconnected tasks:
- The first model identifies whether a binding site exists between a miRNA and a non-coding region (binary classification).
- Provided there is a binding site between a miRNA and its non-coding region, the second model predicts both the signal transduction pathway that the concatenated sequence (miRNA + UTR) belongs to (150 classes, multi-class classification) and the untranslated region involved in the binding (3'UTR or 5'UTR, binary classification).
Both models combine CNN and RNN layers, since this hybrid architecture outperforms the two simpler counterparts (CNN-based, RNN-based) considered in the paper.
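The sequential use of the two models can be illustrated with a minimal sketch. It is not the project's actual code: the `predict` function, the 0.5 thresholds, the return format, and the stub models are all assumptions made for illustration. The trained models are stood in for by plain Python callables so that the control flow can be read on its own.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical model interfaces (assumptions, not the project's real API):
# - the binding model maps (miRNA, UTR) to a binding probability
# - the second model maps the concatenated sequence to
#   (pathway probabilities, probability that the region is a 5'UTR)
BindingModel = Callable[[str, str], float]
PathwayUtrModel = Callable[[str], Tuple[Sequence[float], float]]


def predict(
    mirna: str,
    utr: str,
    binding_model: BindingModel,
    pathway_utr_model: PathwayUtrModel,
    pathway_names: Sequence[str],
    threshold: float = 0.5,  # assumed decision threshold
) -> Dict[str, Optional[object]]:
    """Run the two models in sequence, as described in the summary."""
    # Step 1: binary classification, is there a binding site at all?
    if binding_model(mirna, utr) < threshold:
        # No binding site: the second model is never called.
        return {"binding": 0, "pathway": None, "utr": None}

    # Step 2: pathway (multi-class) and UTR type (binary) on miRNA + UTR.
    pathway_probs, prob_5utr = pathway_utr_model(mirna + utr)
    best = max(range(len(pathway_probs)), key=lambda i: pathway_probs[i])
    return {
        "binding": 1,
        "pathway": pathway_names[best],
        "utr": "5'UTR" if prob_5utr >= threshold else "3'UTR",
    }


# Stub models with fixed outputs, used only to demonstrate the control flow.
def stub_binding(mirna: str, utr: str) -> float:
    return 0.9


def stub_pathway_utr(sequence: str) -> Tuple[Sequence[float], float]:
    return [0.1, 0.7, 0.2], 0.8


names = ["pathway A", "pathway B", "pathway C"]  # placeholder labels
result = predict("TGTAAC", "TCCTGC", stub_binding, stub_pathway_utr, names)
print(result)
```

In a real setting, the stubs would be replaced by wrappers around the trained models, and `names` would hold the 150 pathway labels.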
Scientific Publication:
Scientific paper to be written together with Professor Rob Krams
The current version of the paper is available at:
Note: The paper's content will be updated frequently until a final draft is reached.
Data:
- Signal transduction pathways data were acquired from Reactome
- Genes (Gene symbols) related to these pathways were obtained through a web-based NCBI tool
- DIANA TarBase v.8 was queried to retrieve positive and negative gene-miRNA pairs.
- For these pairs, only the mature miRNA and the largest transcript's coding region, 3'UTR, and 5'UTR were considered. These sequences were retrieved using BioMart and Bioconductor.
Additional information about the data acquisition and preprocessing process can be found in the paper.
Sample of results during inference:
Input
- miRNA sequence (mmu-miR-194-5p): TGTAACAGCAACTCCATGTGGA
- target sequence (5'UTR): TCCTGCGCAGTTCTCCGCCGCAGCCTCAGCGGGCAAGCGCCGGGGCTGCTCTCAATCTCCTGGCTGCGAGGAGGCAGCCCCGGCGAGCTGTCGTGCGCCCCGTCCAGAGTTACTGAGTGCGGGGCACAGCGTAACTGACAGCGCGTCTGCTCACAGTTCCCGTCGCCTGGACTTAGCTTTCCAACCCCGGCTTCTCGTGGGCATCATGTCAAGAGCCGTCGCCGCTGCAACCGCCGCCGCCACCCGGGGAAGAGCCGCAGCCTCGGCAGCCGCGCGCGCAGGAGGGCAATAAACCGAATCACTCCGGGCTCAAAGTGGCAGGGGACCGTCGCGGTGCTCTCTGTTCCGGCGGGACTCCTGCCATGTGCTGAGCCATGCCCCTGGCCGCGCCCGCGGGCCGCGT
Output
- binding label: 1 (Indication that there is a binding site)
- Pathway predicted: HS-GAG biosynthesis (matches the true pathway label)
- UTR: 5'UTR (matches the true UTR label)
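As a rough illustration of how input like the sample above could be prepared for a sequence model, the sketch below concatenates a miRNA with a target sequence and one-hot encodes the result. The encoding scheme is an assumption: this page does not state how PathUTRNet encodes its inputs, and the `NUCLEOTIDES` ordering and the `one_hot_encode` helper are hypothetical names introduced here.

```python
NUCLEOTIDES = "ACGT"  # assumed channel order, not taken from the project


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element one-hot rows.

    Characters outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


# miRNA sequence from the sample above (mmu-miR-194-5p).
mirna = "TGTAACAGCAACTCCATGTGGA"
# First 10 bases of the sample 5'UTR, shortened here for readability.
utr_prefix = "TCCTGCGCAG"

# Concatenate miRNA + UTR, then encode position by position.
encoded = one_hot_encode(mirna + utr_prefix)
print(len(encoded))  # number of encoded positions
print(encoded[0])    # one-hot row for the first base, T
```

Each row has one column per nucleotide, so the encoded input is a matrix of shape (sequence length, 4), which is a common input layout for 1D convolutional layers.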
Code & Installation Process are available at:
Written with:
Python, TensorFlow, Keras, Plotly, pandas, scikit-learn, NumPy
Since:
April 2021 - Present