Abstract: DNA sequence classification is a basic task in bioinformatics that aims to predict the category of a DNA sequence from its structural or functional similarity to other sequences. Effective classification requires mapping sequences into a feature vector space while retaining, as far as possible, the sequential relationships hidden in them, which remains difficult. Existing methods represent the nucleotides in a DNA sequence incompletely, which tends to reduce classification accuracy. To address this problem, this paper proposes a new feature representation method for DNA sequences. First, each sequence is used to train a Hidden Markov Model (HMM); then, the DNA sequences are projected onto a vector space spanned by the eigenvectors of the HMM state transition probability matrix. Based on the new feature representation, a K-Nearest Neighbour classifier is constructed to classify DNA sequences over this vector space. Experimental results show that the new feature representation captures the sequential relationships between the nucleotides in a DNA sequence more completely. Consequently, the structural information hidden in the sequences is reflected more fully, which in turn improves classification accuracy.
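
The pipeline outlined in the abstract (per-sequence transition model, eigenvector-based features, K-Nearest Neighbour classification) can be illustrated with a minimal sketch. It is not the method of the paper: as a simplifying assumption, the trained HMM is replaced by a first-order Markov chain whose transition matrix is estimated by counting adjacent nucleotides, and only the dominant left eigenvector (obtained by power iteration) is used alongside the flattened matrix. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def transition_matrix(seq, pseudocount=1.0):
    """Row-stochastic 4x4 matrix of nucleotide-to-nucleotide transitions.

    Assumption: a first-order Markov chain estimated by counting stands in
    for the trained HMM; the pseudocount keeps every entry positive.
    """
    idx = {c: i for i, c in enumerate(ALPHABET)}
    counts = [[pseudocount] * 4 for _ in range(4)]
    seq = seq.upper()
    for a, b in zip(seq, seq[1:]):
        if a in idx and b in idx:  # skip ambiguous symbols such as N
            counts[idx[a]][idx[b]] += 1
    return [[c / sum(row) for c in row] for row in counts]

def dominant_left_eigenvector(P, iters=200):
    """Power iteration for the vector v with v P = v, normalised to sum to 1."""
    v = [0.25] * 4
    for _ in range(iters):
        v = [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]
        total = sum(v)
        v = [x / total for x in v]
    return v

def features(seq):
    """Feature vector: dominant eigenvector followed by the flattened matrix."""
    P = transition_matrix(seq)
    return dominant_left_eigenvector(P) + [p for row in P for p in row]

def knn_predict(train, query, k=3):
    """Majority vote among the k training sequences nearest in feature space."""
    q = features(query)
    nearest = sorted(train, key=lambda item: math.dist(features(item[0]), q))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data, invented purely for illustration.
train = [
    ("ATATATATTATAATAT", "AT-rich"),
    ("TTATATAATATATTAA", "AT-rich"),
    ("GCGCGGCCGCGCCGGC", "GC-rich"),
    ("CCGCGGCGCGCCGCGG", "GC-rich"),
]
print(knn_predict(train, "ATTATATAATATAT"))
```

Because every entry of the smoothed matrix is positive, the power iteration converges to a unique stationary vector. A fuller version would fit an HMM per sequence and use the complete eigenvector basis of its state transition matrix, as the abstract describes.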