Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

lncRScan-SVM: A Tool for Predicting Long Non-Coding RNAs Using Support Vector Machine

  • Lei Sun ,

    leisuncumt@yahoo.com

    Affiliations School of Information Engineering, Yangzhou University, Yangzhou, Jiangsu Province, China, Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonoses, Yangzhou University, Yangzhou, Jiangsu Province, China

  • Hui Liu,

    Affiliation School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, JiangSu Province, China

  • Lin Zhang,

    Affiliation School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, JiangSu Province, China

  • Jia Meng

    Affiliation Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu Province, China

Abstract

Functional long non-coding RNAs (lncRNAs) have been bringing novel insight into biological study, however it is still not trivial to accurately distinguish the lncRNA transcripts (LNCTs) from the protein coding ones (PCTs). As various information and data about lncRNAs are preserved by previous studies, it is appealing to develop novel methods to identify the lncRNAs more accurately. Our method lncRScan-SVM aims at classifying PCTs and LNCTs using support vector machine (SVM). The gold-standard datasets for lncRScan-SVM model training, lncRNA prediction and method comparison were constructed according to the GENCODE gene annotations of human and mouse respectively. By integrating features derived from gene structure, transcript sequence, potential codon sequence and conservation, lncRScan-SVM outperforms other approaches, which is evaluated by several criteria such as sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC) and area under curve (AUC). In addition, several known human lncRNA datasets were assessed using lncRScan-SVM. LncRScan-SVM is an efficient tool for predicting the lncRNAs, and it is quite useful for current lncRNA study.

Introduction

Recently tens of thousands of long non-coding RNAs (lncRNAs) have been discovered in the transcriptome using biotechnology such as cDNA cloning [1, 2], tiling mircoarray [35] and high throughput sequencing [6, 7]. Studies also reveal that the lncRNAs are extensively involved numerous cellular processes, such as embryonic stem cell (ESC) pluripotency, erythropoiesis, cell-cycle regulation and diseases [811]. However, current lncRNA function studies can be hampered by lack of complete and high-quality lncRNA gene annotations, especially when conducting analysis on the genome scale. Although there appear several lncRNA data sources, such as lncRNAdb [12], NONCODE [13] and GENCODE [14], they seldom perform well-matched intersection between each other [15], which implies that the lncRNA catalogue needs to be developed. Meanwhile, with the widespread usage of deep sequencing in life science, more and more novel lncRNAs can be found due to their tissue-specific expression characteristic. The newly-discovered lncRNAs are always compared with previous annotations for guaranteeing the quality of further analysis [8, 9].

Either for lncRNA gene annotation or novel lncRNA discovery, it is crucial to evaluate the protein coding potential of a transcript. As similar as protein-coding genes, lncRNAs are RNA polymerase II products, and can be capped and polyadenylated [16], and also present similar histone-modification profiles, splicing signals and exon/intron lengths [15]. Due to the similarities between mRNAs and lncRNAs, it is challenging to separate the lncRNA transcripts (LNCTs) from the protein coding ones (PCTs) [17].

Thanks to the advance of bioinformatics, discriminating LNCTs from PCTs can be modelled as a binary classification problem, which has been solved by several computational methods, such as CONC [18], CPC [19], CNCI [20], iSeeRNA [21] and RNAcon [22]. CONC integrates various features in its classification model, but it is slow for large datasets and also performs less accurate than other newer methods such as CPC [19]. CPC uses features derived from open reading frame (ORF) and sequence alignment, and the developers also provide users with a web interface. CNCI distinguishes protein-coding and non-coding sequences by profiling adjoining nucleotide triplets, however it cannot work on large datasets. iSeeRNA outperforms previous methods and it also provides users with a program for training a new classification model based on custom dataset. RNAcon applies features of tri-nucleotide composition to the classification, but it does not show better performance than other methods in our experiment. Compared to these support vector machine (SVM) based methods inspecting the entire transcript, a comparative genomics method named PhyloCSF [23] focuses on classifying protein-coding and non-coding regions, and it has been frequently used for lncRNA identification [7, 24]. In addition, CPAT [25] is another tool for assessing coding potential of a transcript using an alignment-free logistic regression model. Based on these computational methods, lncRNA function studies usually build a pipeline to obtain a set of confident lncRNAs [16, 24, 26]. For example, a simple pipeline consisting of length filtering (> 200nt), CPC and BLAST [27] can make a stringent lncRNA dataset. However, it is still not convenient to master such non-standard filtering workflow.

As various information and data about lncRNAs are preserved by previous studies, it is appealing to develop new methods to identify the lncRNAs more accurately. By integrating features derived from gene structure, transcript sequence, potential codon sequence and conservation, a novel computational method is proposed here for solving the problem.

Methods

Gold-standard datasets

Since a reliable dataset is of importance to model training and testing, we adopted the high-quality manually-curated GENCODE gene annotations (version 19/v19 of human and version M2/vM2 of mouse) [14, 15] for constructing the gold-standard datasets, which are composed of gene annotations in Gene Transfer Format (GTF) and corresponding genome sequences in FASTA format for human and mouse respectively. Specifically, the human gene annotation (GENCODEv19) includes 81814 PCTs and 23898 LNCTs, while the counts are 47394 and 6053 respectively for mouse (GENCODEvM2). In addition, the genome sequences of human (GRCh37/hg19) and mouse (GRCm38/mm10) were downloaded from University of California Santa Cruz (UCSC) genome browser respectively [28]. For model training and testing, we created Training-A, Testing-A and Testing-B (S1 Dataset) by splitting the gold-standard datasets (See Table 1). All transcripts in training and testing sets were randomly sampled from the complete dataset, and there are no overlapping areas between the training and testing sets. Since our prediction model was trained on Training-A, in which a proportion of transcripts may have similar sequences to that in Testing-A, which could lead to an unfair comparison between our method and the others though our method does not only use sequence features. To make a fairer comparison, Testing-B datasets were created by removing similar transcripts from Testing-A using CD-HIT (v4.6.1, cut-off 0.8) [29], compared to Training-A sequences. To alleviate the effect of imbalanced-classes, we constructed the training and testing sets with equal or similar sizes.

Feature selection

Since each feature selected can affect the overall classification performance, we conducted feature selection (FS) (See S1 File) based on a set of candidate features, which were derived from our current knowledge about lncRNAs. The candidate features can be classified into three categories. The first category includes features extracted from nucleotide sequences, and they are 14 tri-nucleotides attributes, namely ACG, CCG, CGA, CGC, CGG, CGT, CTA, GCG, GGG, GTA, TAA, TAC, TAG and TCG [22], and standard deviation of stop codons (TAG, TAA and TGA) between three frames, and GC content. The second category was extracted from the output of a program called txCdsPredict from UCSC, which is used to predict a codon sequence with the most likelihood from an input nucleotide sequence, and it includes features like txCdsPredict score, CDS length and CDS percentage (CDS_length divided by transcript_length) [21]. The third category of features was extracted from gene structure of the transcript, and they are transcript length, exon count and average exon length.

Then six features (See Table 2) were selected from the total candidates by FS. First, transcript length was selected since the length of PCTs and LNCTs can be differentially distributed [7, 15, 24]. Second, with an assumption that a true PCT may have a long ORF in one of the three frames translated, we presume that the stop codons in the frame where the ORF emerges are fewer than that randomly appear in the other two frames. Therefore, we selected the standard deviation (SCS) of stop codon counts (SCC) between three frames translated as the second feature since the standard deviation of PCT should be smaller than that of LNCT (See Eqs (1) and (2)). (1) (2)

where denotes the mean of stop codon counts of three frames (SCC0, SCC1 and SCC2). Then SCS can be calculated using Eq (2). Similarly, since the PCT may have ORF potentially, we used txCdsPredict to predict the codon sequence (CDS) of the candidate transcript, and the score output by txCdsPredict was selected as the third feature. By analysing the gene structures of transcripts, the exon count and average exon length were selected as features because the lncRNA genes may have fewer exons and be shorter than the protein-coding ones. The last feature we selected is the conservation score calculated by averaging PhastCons scores of the transcript.

lncRScan-SVM

The six features selected above were fed into the SVM framework to build a tool named lncRScan-SVM (v1.0.0), which also depends on several third-part packages/programs such as BioPython [30], LIBSVM [31] and txCdsPredict. Of the programs packaged by lncRScan-SVM, two scripts can commonly be used. One is lncRScan-SVM-train.py for training a SVM model and the other one is lncRScan-SVM-predict.py for predicting PCTs/LNCTs (See Fig 1). The lncRScan-SVM package is freely available on http://sourceforge.net/projects/lncrscansvm/?source=directory.

thumbnail
Fig 1. lncRScan-SVM.

(a) Model training. A GTF file of positive samples, a GTF file of negative samples and a configure file are input to lncRScan-SVM-train.py for training a SVM model and getting scaling parameters. (b) lncRNA prediction. A GTF file of query transcripts and a configure file are input to lncRScan-SVM-predict.py for predicting PCTs or LNCTs.

https://doi.org/10.1371/journal.pone.0139654.g001

Model training.

A SVM model for prediction can be generated using lncRScan-SVM-train.py (See Fig 1a). Specifically, the lncRScan-SVM-train.py program first extracts nucleotide sequences according to the gene models of training samples. Then a script extract_features.py is used to extract transcript features. After transferring features to a standard format of LIBSVM by feature2svm.py, a program svm-scale is used to scale the features. Finally, a prediction model can be generated.

lncRNA prediction.

Users can use lncRScan-SVM-predict.py to predict LNCTs or PCTs (See Fig 1b). Specifically, the main input of lncRScan-SVM-predict.py is a GTF file, which lists all transcripts for classification. Then the sequence of the query transcript is extracted using gffread [6]. After feature extracting, formatting and scaling, svm-predict classifies PCTs and LNCTs using the prediction model trained previously. And the prediction models for hg19 and mm10 have been packaged in lncRScan-SVM.

Evaluation

The performance of lncRScan-SVM is compared to other methods, such as CPC (v0.9-r2), CPAT (v1.2.2), iSeeRNA (v1.2.1) and RNAcon (v1.0), by several criteria, such as sensitivity (SES) or true positive rate (TPR) or recall, specificity (SPC), accuracy (ACC), Matthews correlation coefficient (MCC). (3) (4) (5) (6)

where TP, P, TN, N, FP, FN denote true positive, positive, true negative, negative, false positive and false negative respectively. In this paper, ‘positive’ and ‘negative’ correspond to PCT and LNCT respectively. And area under curve (AUC) of receiver operation characteristic (ROC) are also used as another indicator [32]. In addition, PhyloCSF is not took into account as it focuses on distinguishing protein-coding regions from non-coding ones, which is slightly different from the objective of the methods for classifying PCTs and LNCTs.

Ethics Statement

N/A

Results and Discussion

Performance of single feature

Since every candidate feature can contribute to the overall classification performance of a predictor, the performance of each feature was evaluated by AUC scores on the hg19 and mm10 Training-A datasets respectively (See Fig 2). As seen from the ranking result, the top eight features are CDS length, consv, CDS score, CDS percentage, exon count, stop codon std, transcript length and CGA, which show slightly different orders between hg19 and mm10. Particularly, five of the six lncRScan-SVM features, namely transcript length, stop codon std, CDS score, exon count and consv, are included in the top eight ranked features, which indicates that most of the lncRScan-SVM features can contribute positively to the predictor. In addition, the features of hg19 and mm10 represent similar performance as seen from the the Pearson correlation coefficient (0.9484604) of the AUC scores, which shows the robustness of classification performance of the candidate features between the species.

thumbnail
Fig 2. Feature ranking by AUC scores.

(a) The candidate features are ranked by AUC scores calculated on hg19 Training-A; (b) The candidate features are ranked by AUC scores calculated on mm10 Training-A. Each feature ID is labelled in brackets.

https://doi.org/10.1371/journal.pone.0139654.g002

Comparison of prediction methods

The prediction performance of lncRScan-SVM was evaluated by comparing to several other methods, such as CPC, CPAT, iSeeRNA and RNAcon, on Testing-A and Testing-B. When testing iSeeRNA, we used two kinds of models. One is the default iSeeRNA prediction model and the other one (iSeeRNA2) is generated by re-training iSeeRNA on Training-A. As a result, lncRScan-SVM outperforms other methods in most aspects, such as SPC, ACC, MCC and AUC (See Tables 3 and 4).

thumbnail
Table 3. Method comparison on Testing-A and Testing-B of hg19.

https://doi.org/10.1371/journal.pone.0139654.t003

thumbnail
Table 4. Method comparison on Testing-A and Testing-B of mm10.

https://doi.org/10.1371/journal.pone.0139654.t004

As seen from the testing result on Testing-A for either hg19 or mm10, lncRScan-SVM shows slightly smaller SPC, but higher SES and overall ACC, MCC and AUC scores than CPAT. In contrast, RNAcon presents the worst performance as it obtains the lowest score in most aspects. CPC obtains the highest SES (97.62% for hg19 and 98.371% for mm10), but it shows poor performance in other aspects, e.g. the lowest SPC score (67.23% for hg19 and 75.457% for mm10), which means CPC may ignore a considerable proportion of true lncRNAs. By training on Training-A, iSeeRNA2 shows better performance than the default one, but it is still less accurate than lncRScan-SVM.

Meanwhile, similar performance of these methods can be observed from the testing result on Testing-B of hg19 and mm10, where our lncRScan-SVM predictor consistently shows better performance than other methods. As seen from the result on mm10 Testing-B (Table 4), lncRScan-SVM is ranked the highest in almost all aspects, including SPC, ACC, MCC and AUC. As indicated by SES, CPC still has a bias towards predicting PCTs, but has poor overall performance. Although the re-trained iSeeRNA2 shows similar overall performance to lncRScan-SVM, its SPC scores are much lower than the latter one, which means iSeeRNA2 would perform less accurate for identifying lncRNAs than lncRScan-SVM. In addition, the AUC scores of these methods are visualized by ROC curves (Figs 3, 4, 5 and 6).

thumbnail
Fig 3. ROC curves of lncRNA prediction on hg19 Testing-A.

The prediction performance of CPC, CPAT, lncRScan-SVM, iSeeRNA, iSeeRNA2 and RNAcon on hg19 Testing-A is illustrated by ROC curves with colors of blue, orange, red, green, purple and pink respectively. The definitions of the sensitivity for x axis and specificity for y axis are the same as Formulas 3 and 4. The top-left area is zoomed in on for distinguishable observation.

https://doi.org/10.1371/journal.pone.0139654.g003

thumbnail
Fig 4. ROC curves of lncRNA prediction on mm10 Testing-A.

The description of ROC curves for tests on mm10 Testing-A is the same as Fig 3.

https://doi.org/10.1371/journal.pone.0139654.g004

thumbnail
Fig 5. ROC curves of lncRNA prediction on hg19 Testing-B.

The description of ROC curves for tests on hg19 Testing-B is the same as Fig 3.

https://doi.org/10.1371/journal.pone.0139654.g005

thumbnail
Fig 6. ROC curves of lncRNA prediction on mm10 Testing-B.

The description of ROC curves for tests on mm10 Testing-B is the same as Fig 3.

https://doi.org/10.1371/journal.pone.0139654.g006

In addition to the accuracy evaluation, the computation time of these methods was also compared on the same platform (Linux, Ubuntu 12.04.4 LTS 64bit, 800 MHz × 4 processors and 4GB RAM). As seen from Table 5, lncRScan-SVM shows the longest computation time (Nearly twice slower than iSeeRNA, four times slower than RNAcon and ten times slower than CPAT) when conducting lncRNA prediction on Testing-A of hg19 and mm10 respectively. The computation time difference can be caused by several reasons, such as different algorithms and programming languages. As seen from the table, either of the methods with query file in FASTA, namely CPAT and RNAcon, represents shorter running time than the other ones with GTF is mainly because the latter one needs longer time to extract sequence information according to the gene annotation in GTF. CPAT is faster than RNAcon because the logistic regression model used by the former one could be faster than SVM used by RNAcon. And our lncRScan-SVM is slower than iSeeRNA is mainly because we used Python scripts to implement our method, which can be optimized and accelerated using C programming in the future. Despite of that, the computation time of current lncRScan-SVM could be acceptable since its running time for precessing thousands of transcripts is still limited in several minutes or seconds. It is worth noting that CPC was not taken into account in the comparison because it is quite slow when running on the same platform. For example, it takes about 3 minutes to process only one sequence, which is much slower than the other methods.

thumbnail
Table 5. Computation time comparison on Testing-A of hg19 and mm10.

https://doi.org/10.1371/journal.pone.0139654.t005

Prediction on known human lncRNA datasets

In addition to the high-quality GENCODE lncRNA dataset, previous studies have published several other lncRNA datasets, such as the NONCODEv4 dataset containing 92343 human lncRNAs [13] and the Cabili et al. dataset containing 14353 human long intergenic ncRNAs (lincRNAs) [7], which can be evaluated by lncRScan-SVM. Before conducting lncRNA prediction, we checked the overlap between these datasets. As seen from Fig 7 plotted by eulerAPE v3 [33], the human lncRNA annotated by NONCODEv4 contains most of the lncRNAs annotated by Cabili et al. (21106, ∼ 88.32%) and GENCODEv19 (13546, ∼ 94.38%) due to the fact that NONCODEv4 published on 2014 collects a large number of ncRNAs from various literatures and databases [13]. On the other hand, the overlapping area between Cabili et al. and GENCODEv19 lncRNAs is small (3874, ∼ 16.36% of GENCODEv19 or ∼ 27.24% of Cabili et al.).

thumbnail
Fig 7. Overlap between three human lncRNA datasets.

The lncRNA counts of the three human lncRNA datasets are proportional to the area of the ellipses. The largest ellipse denotes NONCODEv4 lncRNAs. The right-down and right-up ones are for GENCODEv19 and Cabili et al. lncRNAs respectively.

https://doi.org/10.1371/journal.pone.0139654.g007

Then we applied lncRScan-SVM to the Cabili et al. and NONCODEv4 datasets respectively, compared with other methods (See Table 6). As a result, lncRScan-SVM successfully identified 14069 (98.52%) of 14281 lincRNAs annotated by Cabili et al., which were generated by a complex filtering pipeline mentioned in their paper [7]. On the NONCODEv4 lncRNA dataset, lncRScan-SVM successfully predicted 77435 (85.63%) of the totalling 90429 lncRNAs, which is much lower than that on the Cabili et al. dataset due to the fact that the lncRScan-SVM prediction model trained on GENCODEv19 that contains 12152 (50.85%) lincRNAs might have a bias towards predicting lincRNAs rather than non-intergenic lncRNAs. To test this hypothesis, we conducted lncRScan-SVM on the 43730 lincRNAs (the lncRNAs located in the intergenic regions of GENCODEv19 protein coding transcripts) extracted from the NONCODEv4 dataset. As a result, 41984 (96.01%) lincRNA were predicted successfully. Similarly, the prediction ratios of the other methods all decline as test sets change from lncRNA to lincRNA, which means they all have a bias towards predicting lincRNAs. In contrast, most of the other tools represent lower prediction ratios than lncRScan-SVM on the three datasets except CPAT, which has the highest prediction ratio (90.29%) on NONCODEv4 lncRNA dataset since CPAT has a bias towards predicting the lncRNAs (See previous section). However, CPAT still represents poorer performance on either Cabili et al. or NONCODEv4 lincRNAs. Overall, the prediction of these methods on the general lncRNA datasets is consistent with that on our gold-standard sets.

In addition, the remaining 72 lincRNAs of Cabili et al. dataset and 1964 lncRNAs or 1914 lincRNAs of NONCODEv4 were not took into account analysis because they are not annotated on the main reference chromosomes, such as chr1, chr2 and so on, but are annotated on on the other regions such as patches, scaffolds and haplotypes, and they can not be processed by several methods.

Discussion

The lncRNA prediction problem can be solved by several computational methods mentioned above, especially our lncRScan-SVM performs better than the others in several aspects by integrating information extracted from gene structure, sequence composition and conservation in the SVM framework. However several problems remain to be specified. First, novel features and models can be developed to improve the prediction performance. For example, features can be extracted from secondary structure of RNA since functional lncRNAs may have special patterns in the secondary structure. Second, current lncRNA prediction methods present a bias towards predicting lincRNAs, so new distinguishable features should be extracted for predicting those non-intergenic lncRNAs, thereby helping solve the third problem, that is a more detailed catalogue of lncRNAs needs to be created by classifying the lncRNAs into subclasses corresponding to various functions. Last but not least, it is worth noting that classifying PCTs and LNCTs might be meaningless since some lncRNAs can be bifunctional, that is some noncoding RNAs can be translated into peptides in particular circumstances [16, 34].

Therefore, identifying PCTs or LNCTs is not only a simple classification problem, but also fundamental for further analysis, e.g. lncRNA function study. It is hopeful that various datasets and novel hypothesis can help improve the prediction accuracy and further deepen our understanding of lncRNA functions.

Conclusions

Current lncRNA studies have been accelerated by various datasets and efficient bioinformatic tools. Here we proposed lncRScan-SVM, which performs better than several popular methods in predicting the lncRNAs. lncRScan-SVM is quite useful for lncRNA gene annotation and can be integrated into pipelines of lncRNA study.

Supporting Information

S1 Dataset. A compressed file containing GTF files of Training-A, Testing-A and Testing-B transcripts.

https://doi.org/10.1371/journal.pone.0139654.s001

(ZIP)

S1 File. Comparison of three feature selection strategies.

To select a set of good features for model training and testing, we compared three feature selection (FS) strategies.

https://doi.org/10.1371/journal.pone.0139654.s002

(DOC)

Acknowledgments

The authors would like to acknowledge Dr. Lina Ma from Beijing Institute of Genomics, Chinese Academy of Sciences for helpful suggestions.

Author Contributions

Conceived and designed the experiments: LS. Performed the experiments: LS HL LZ JM. Analyzed the data: LS. Contributed reagents/materials/analysis tools: LS. Wrote the paper: LS.

References

  1. 1. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002;420(6915):563–573. pmid:12466851
  2. 2. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, et al. The Transcriptional Landscape of the Mammalian Genome. Science. 2005;309(5740):1559–1563. (Genome Network Project Core Group). . pmid:16141072
  3. 3. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, et al. Global Identification of Human Transcribed Sequences with Genome Tiling Arrays. Science. 2004;306(5705):2242–2246. pmid:15539566
  4. 4. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009;458(7235):223–227. pmid:19182780
  5. 5. Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Morales DR, et al. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proceedings of the National Academy of Sciences. 2009;106(28):11667–11672.
  6. 6. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech. 2010;28(5):511–515.
  7. 7. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & Development. 2011;25(18):1915–1927.
  8. 8. Alvarez-Dominguez JR, Hu W, Yuan B, Shi J, Park SS, Gromatzky AA, et al. Global discovery of erythroid long noncoding RNAs reveals novel regulators of red cell maturation. Blood. 2013;123(4):570–581. pmid:24200680
  9. 9. Sánchez Y, Segura V, Marín-Béjar O, Athie A, Marchese FP, González J, et al. Genome-wide analysis of the human p53 transcriptional network unveils a lncRNA tumour suppressor signature. Nat Commun. 2014;5.
  10. 10. Loewer S, Cabili MN, Guttman M, Loh YH, Thomas K, Park IH, et al. Large intergenic non-coding RNA-RoR modulates reprogramming of human induced pluripotent stem cells. Nat Genet. 2010;42(12):1113–1117. pmid:21057500
  11. 11. Hung T, Wang Y, Lin MF, Koegel AK, Kotake Y, Grant GD, et al. Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters. Nat Genet. 2011;43(7):621–629. pmid:21642992
  12. 12. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Research. 2011;39(suppl 1):D146–D151. pmid:21112873
  13. 13. Xie C, Yuan J, Li H, Li M, Zhao G, Bu D, et al. NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic Acids Research. 2014;42(D1):D98–D103. pmid:24285305
  14. 14. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome research. 2012;22(9):1760–1774. pmid:22955987
  15. 15. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome research. 2012;22(9):1775–1789. pmid:22955988
  16. 16. Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, and mechanisms. Cell. 2013;154(1):26–46. pmid:23827673
  17. 17. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, et al. Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences. 2007;104(49):19428–19433.
  18. 18. Liu J, Gough J, Rost B. Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS genetics. 2006;2(4):e29. pmid:16683024
  19. 19. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, et al. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research. 2007;35(suppl 2):W345–W349. pmid:17631615
  20. 20. Sun L, Luo H, Bu D, Zhao G, Yu K, Zhang C, et al. Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Research. 2013;41(17):e166–e166. pmid:23892401
  21. 21. Sun K, Chen X, Jiang P, Song X, Wang H, Sun H. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC genomics. 2013;14(Suppl 2):S7. pmid:23445546
  22. 22. Panwar B, Arora A, Raghava GP. Prediction and classification of ncRNAs using structural information. BMC genomics. 2014;15(1):127. pmid:24521294
  23. 23. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011;27(13):i275–i282. pmid:21685081
  24. 24. Sun L, Zhang Z, Bailey T, Perkins A, Tallack M, Xu Z, et al. Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study. BMC Bioinformatics. 2012;13(1):331. pmid:23237380
  25. 25. Wang L, Park HJ, Dasari S, Wang S, Kocher JP, Li W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Research. 2013;41(6):e74. pmid:23335781
  26. 26. Pauli A, Valen E, Lin MF, Garber M, Vastenhouw NL, Levin JZ, et al. Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis. Genome Research. 2012;22(3):577–591. pmid:22110045
  27. 27. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–3402. pmid:9254694
  28. 28. UCSC genome browser;. Available from: http://genome.ucsc.edu.
  29. 29. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. pmid:16731699
  30. 30. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–1423. pmid:19304878
  31. 31. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  32. 32. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12(1):77. pmid:21414208
  33. 33. Micallef L, Rodgers P. eulerAPE: Drawing Area-proportional 3-Venn Diagrams Using Ellipses. PLoS ONE. 2014;9(7):e101717. pmid:25032825
  34. 34. Dinger ME, Pang KC, Mercer TR, Mattick JS. Differentiating Protein-Coding and Noncoding RNA: Challenges and Ambiguities. PLoS Comput Biol. 2008;4(11):e1000176. pmid:19043537