题名

探勘生物序列循序樣式的高效率方法

并列篇名

Efficient Methods for Mining Sequential Patterns of Biological Sequences

DOI

10.6342/NTU.2013.02647

作者

廖強棋

关键词

循序樣式 ; 結構模體 ; 樣式探勘 ; 資料探勘 ; 生物資訊 ; Sequential patterns ; Structured motifs ; Pattern mining ; Data mining ; Bioinformatics

期刊名称

國立臺灣大學電機工程學系學位論文

卷期/出版年月

2013年

学位类别

博士

导师

陳銘憲

内容语文

英文

英文摘要

Scientific progress in recent years has led to the generation of huge amounts of biological data, most of which remains unanalyzed. Mining the data may provide insights into various realms of biology, such as finding co-occurring biosequences, which are essential for biological data mining and analysis. Data mining techniques like sequential pattern mining may reveal implicitly meaningful patterns among the DNA or protein sequences. Pattern mining for biological sequences is an important problem in bioinformatics and computational biology. Sequential pattern mining can reveal all-length motifs in biological sequences. Performing sequential pattern mining on biological sequences helps reveal implicit motifs/patterns, which are usually of functional significance and have specific structures. If biologists hope to uncover the potential of sequential pattern mining in their field, it is necessary to move away from traditional sequential pattern mining algorithms since these algorithms have difficulty in handling small alphabets and long sequence lengths in biological data, such as gene and protein sequences. To tackle the problem, this dissertation proposes an approach called Depth-First SPelling (DFSP) algorithm for mining sequential patterns in biological sequences. DFSP is a general model for mining sequential patterns of biological sequences. The algorithm’s processing speed is faster than that of PrefixSpan, its leading competitor, and DFSP is superior to other sequential pattern mining algorithms for biological sequences. Furthermore, gap constraints are important in computational biology since they cope with irrelative regions, which are not conserved in evolution. An approach is devised to efficiently mine sequential patterns (motifs) of biological sequences with gap constraints in this dissertation. The approach is called the Depth-First Spelling algorithm for mining sequential patterns with Gap constraints in biological sequences (referred to as DFSG). DFSG is a general gap constraint model in sequential pattern mining of biological sequences. GenPrefixSpan is a method based on PrefixSpan with gap constraints, and therefore this dissertation compares DFSG with GenPrefixSpan. In biological sequences, DFSG’s runtime is substantially shorter than that of GenPrefixSpan. The intra- and inter-block gap constraints are shown to effectively cope with the substitutions, insertions, loops, and deletions involved in evolution, but induce even higher computation cost. Hence this dissertation presents an approach called Depth-first spelling algorithm for mining structured motifs with Intra- and inter-Block gap constraints in biological sequences (referred to as DIB) to tackle this challenge. When mining biological sequences, DIB’s runtime is shown to be much shorter than a previously developed projection-based sequential pattern mining algorithm, MAGIIC.

主题分类 電機資訊學院 > 電機工程學系
工程學 > 電機工程
参考文献
  1. [1] A. Achar, S. Laxman, and P. S. Sastry, A unified view of the apriori-based algorithms for frequent episode discovery. Knowledge and Information Systems, pages 223-250, 2011.
    連結:
  2. [4] C. Antunes and A. Oliveira, Generalization of Pattern-growth Methods for Sequential Pattern Mining with Gap Constraints, Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM’03), Lecture Notes in Computer Science, Leipzig, Germany, pages 239–251, 2003.
    連結:
  3. [5] S. Aseervatham, A. Osmani, and E. Viennet. bitSPADE: A latticebased sequential pattern mining algorithm using bitmap representation. Proceedings of 6th International Conference on Data Mining, pages 792-797, 2006.
    連結:
  4. [8] I. Batal, H. Valizadegan, G. F. Cooper and M. Hauskrecht, A Pattern Mining Approach for Classifying Multivariate Temporal Data, Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2011.
    連結:
  5. [10] A. Califano, SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics, 16(4): 341-357, 2000.
    連結:
  6. [12] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee, CEMiner- An Effcient Algorithms for Mining Closed Patterns from Interval-based Data, Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), pages 121-130, Vancouver, Canada, Dec. 11-14, 2011.
    連結:
  7. [14] T.P. Exarchos, C. Papaloukas, C. Lampros, and D.I. Fotiadis, Mining sequential patterns for protein fold recognition. Journal of Biomedical Informatics, 44 (1), pp. 165–179, 2008.
    連結:
  8. [16] T.P. Exarchos, M.G. Tsipouras, C. Papaloukas, and D.I. Fotiadis, A two-stage methodology for sequence classification based on sequential pattern mining and optimization. Data & Knowledge Engineering, 66 (3), pp. 467–487, 2008.
    連結:
  9. [18] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
    連結:
  10. [21] M. Hirosawa, Y. Totoki, M. Hoshida, M. Ishikawa, Comprehensive study on iterative algorithms of multiple sequence alignment, Bioinformatics, Vol. 11, pages 13-18, 1995.
    連結:
  11. [23] C.-M. Hsu, C.-Y. Chen, C.-C. Hsu, and B.-J. Liu, Efficient discovery of structural motifs from protein sequences with combination of flexible intra- and inter-block gap constraints, Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 530–539, 2006.
    連結:
  12. [24] C.-M. Hsu, C.-Y. Chen, and B.-J. Liu. MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res., W356-W361, 2006.
    連結:
  13. [25] C.-.M. Hsu, C.Y. Chen, B.J. Liu, C.C. Huang, M.H. Laio, C.C. Lin, and T.L. Wu. Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics, 8(Suppl. 5), S8. doi:10.1186/1471-2105-8-S5-S8, 2007.
    連結:
  14. [26] J.-W. Huang, C.-Y. Tseng, J.-C. Ou, M.-S. Chen, A General Model for Sequential Pattern Mining with a Progressive Database, IEEE Transactions on Knowledge and Data Engineering, pages 1153-1167, 11 Feb 2008.
    連結:
  15. [28] I. Jonassen, Efficient discovery of conserved patterns using a pattern graph. Computer Applications in the Biosciences, 13(5): 509-522, 1997.
    連結:
  16. [29] N. Jones, and P. Pevzner, An introduction to Bioinformatics Algorithms. MIT Press, Cambridge, MA, 2004.
    連結:
  17. [30] V. C.-C. Liao and M.-S. Chen, An efficient sequential pattern mining algorithm for motifs with gap constraints. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2012.
    連結:
  18. [31] V. C.-C. Liao and M.-S. Chen, Efficient Mining Structural Motifs for Biosequences with Intra- and Inter-block Gap Constraints. Proceeding of IEEE International Conference on Intelligence and Security Informatics, 2012.
    連結:
  19. [32] V. C.-C. Liao and M.-S. Chen, DFSP: a Depth-First Spelling algorithm for sequential pattern mining of biological sequences. Knowledge and Information Systems, 2013.
    連結:
  20. [33] M.-Y. Lin and S.-Y. Lee. Incremental update on sequential patterns in large databases by implicit merging and efficient counting. Information System, 29(5):385 – 404, Jul. 2004.
    連結:
  21. [34] M. Lin, S. Lee, and S. Wang, DELISP: Efficient Discovery of Generalized Sequential Patterns by Delimited Pattern-Growth Technology. Advances in Knowledge Discovery and Data Mining, 198-209, 2002.
    連結:
  22. [35] C. Luo and S. M. Chung. Efficient mining of maximal sequential patterns using multiple samples. Proceedings of the 5th SIAM International Conference on Data Mining (SDM’05), pages 415-426, 2005.
    連結:
  23. [37] A. Marascu and F. Masseglia. Mining sequential patterns from data streams: a centroid approach. Journal of Intelligent Information Systems (JIIS), 27(3):291 – 307, Nov. 2006.
    連結:
  24. [38] S. Nguyen, X. Sun, and M. Orlowska. Improvements of IncSpan: Incremental mining of sequential patterns in large database. Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 442-451, 2005.
    連結:
  25. [44] J. Pei, J. Han, and W. Wang. Constraint-based sequential pattern mining: the pattern-growth methods. Journal of Intelligent Information Systems, 28(2):133--160, 2007.
    連結:
  26. [45] C. Pizzuti and S. E. Rombo, A Coclustering Approach for Mining Large Protein-Protein Interaction Networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), v.9 n.3, p.717-730, 2012
    連結:
  27. [46] D. Rajpathak, R. Chougule, and P. Bandyopadhyay, A domain-specific decision support system for knowledge discovery using association and text mining. Knowledge and Information Systems, pages 405-432, 2011.
    連結:
  28. [47] I. Rigoutsos and A. Floratos, Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14(1): 55-67, 1998.
    連結:
  29. [48] A. Salam and M. Sikandar Hayat Khayal, Mining top-k frequent patterns without minimum support threshold. Knowledge and Information Systems, pages 57-86, 2011.
    連結:
  30. [49] P. Senkul and S. Salin, Improving pattern quality in web usage mining by using semantic information. Knowledge and Information Systems, pages 527-541, 2011.
    連結:
  31. [51] J. Wang and J. Han. Bide: Efficient mining of frequent closed sequences. Proceedings of the 20th International Conference on Data Engineering, pages 79 – 91, 2004.
    連結:
  32. [52] X. Wang, G. Li, G. Jiang, and Z. Shi, Semantic trajectory-based event detection and event pattern mining. Knowledge and Information Systems, pages 1-25, 2011.
    連結:
  33. [53] K. Wang, Y. Xu, J. Yu, Scalable Sequential Pattern Mining for Biological Sequences, Proceedings of Conference Information and Knowledge Management, pages 178-187, 2004.
    連結:
  34. [54] X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in large datasets. Proceedings of the 3rd SIAM International Conference on Data Mining (SDM’03), pages 166 – 177, May. 2003.
    連結:
  35. [57] M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning, Vol. 42, No. 1/2, pages 31-60, 2001.
    連結:
  36. [2] R. Agrawal and R. Srikant, Mining Sequential Patterns, Proceedings of the 11th International Conference on Data Engineering, pages 3-14, Feb. 1995.
  37. [3] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, 215(3):403-410, 1990.
  38. [6] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, Sequential pattern mining using a bitmap representation, Proceedings of 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 429 – 435, Jul. 2002.
  39. [7] P. Bajcsy, J. Han, L. Liu and J. Young, Survey of Biodata Analysis from a Data Mining Perspective, Chapter 2 of Jason T. L. Wang, Mohammed J. Zaki, Hannu T. T. Toivonen, and Dennis Shasha (eds.), Data Mining in Bioinformatics, Springer Verlag, pages9-39, 2004.
  40. [9] BLAST, http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/
  41. [11] M.-S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 5(1):866 – 883, Dec. 1996.
  42. [13] H. Cheng, X. Yan, and J. Han. Incspan: Incremental mining of sequential patterns in large database. Proceedings of 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 527 – 532, 2004.
  43. [15] T.P. Exarchos, C. Papaloukas, C. Lampros, and D.I. Fotiadis, Protein classification using sequential pattern mining. Proceedings of the IEEE Engineering in Medicine and Biology Conference, New York, USA, pp. 5814–5817, 2006.
  44. [17] J. Han, How can data mining help bio-data analysis? Proceedings of the Workshop on Data Mining in Bioinformatics (BIOKDD'02 with SIGKDD’02 Conference, Edmonton, Canada), pages 1-4, 2002.
  45. [19] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 355 – 359, 2000.
  46. [20] J. Han, J. Pei, and X. Yan, Sequential Pattern Mining by Pattern-Growth: Principles and Extensions, Recent Advances in Data Mining and Granular Computing, W.W. Chu and T.Y. Lin, eds. Springer, pages 183-220, 2005.
  47. [22] C.-C. Ho, H.-F. Li, F.-F. Kuo, and S.-Y. Lee. Incremental mining of sequential patterns over a stream sliding window. Proceedings of IEEE International Workshop on Mining Evolving and Streaming Data (IWMESD-2006), pages 677-681, Dec. 2006.
  48. [27] N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P.-S. Langendijk-Genevaux, M. Pagni, and C.-J. Sigrist, The PROSITE database. Nucleic Acids Res., 34 (Database issue): D227-30, 2006.
  49. [36] M. Mampaey, N. Tatti, and J. Vreeken. Tell me what I need to know: Succinctly summarizing data with itemsets. Proceedings of 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 573-581, 2011.
  50. [39] S. Orlando, R. Perego, and C. Silvestri, A new algorithm for gap constrained sequence mining. In SAC ‘04: Proceedings of the ACM symposium on Applied computing. Nicosia, Cyprus edition. Edited by: Anonymous. New York, NY, USA: ACM; 540-547, 2004.
  51. [40] S. Parthasarathy, M. J. Zaki, M. Ogihara, and S. Dwarkadas. Incremental and interactive sequence mining. Proceedings of the 8th International Conference on Information and Knowledge Management, pages 251 – 258, 1999.
  52. [41] J. Pei, J. Han, B. Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach, IEEE Transactions on Knowledge and Data Engineering, pages 1424-1440, Oct 2004.
  53. [42] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, H-Mine: Hyper-structure Mining of Frequent Patterns in Large Databases. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01), pages 441-448, San Jose, California, November 29-December 2, 2001.
  54. [43] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu, Mining Access Patterns efficiently from Web logs. In Proceedings of the 2000 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), pages 396-407, Kyoto, Japan, April 2000.
  55. [50] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, Proceedings of 5th International Conference Extending Database Technology (EDBT), Vol. 1057, Springer-Verlag, pages 3-17, 1996.
  56. [55] J. Yang, W. Wang, P.S. Yu, and J. Han, Mining Long Sequential Patterns in a Noisy Environment, Proceedings 2002 ACM-SIGMOD I International Conference. Management of Data (SIGMOD '02), pages 406-417, June 2002.
  57. [56] M. J. Zaki. Efficient enumeration of frequent sequences. Proceedings of the 7th International Conference on Information and Knowledge Management, pages 68-75, Nov. 1998.