Getting started with PETER as a user-defined index in Oracle DB
PETER is a prefix-tree-based indexing algorithm supporting approximate search and approximate joins. It combines an efficient implementation of compressed prefix trees with advanced pruning techniques (length filtering, frequency filtering, q-gram filtering). PETER is written in C++ and can be used as a:
- UNIX command line tool
- user-defined index in Oracle DB
- shared library for individual programs
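To make two of the pruning techniques named above more concrete, here is a minimal sketch of a length filter and a frequency filter for an edit distance threshold k. It is an illustration only, not PETER's actual code: the function names (base_counts, passes_length_filter, passes_frequency_filter) are hypothetical, and the sketch checks a single pair of strings rather than pruning subtrees of a prefix tree. Both checks rely on standard lower bounds for edit distance: the length difference of two strings and the frequency distance of their character counts can never exceed their edit distance.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of PETER's API:
// counts occurrences of A, C, G and T; any other character is ignored.
std::array<std::size_t, 4> base_counts(const std::string& s) {
    std::array<std::size_t, 4> counts{};  // zero-initialized
    for (char c : s) {
        switch (c) {
            case 'A': ++counts[0]; break;
            case 'C': ++counts[1]; break;
            case 'G': ++counts[2]; break;
            case 'T': ++counts[3]; break;
            default: break;
        }
    }
    return counts;
}

// Length filter: if the edit distance is at most k, the lengths
// differ by at most k. Returns false when the pair can be pruned.
bool passes_length_filter(const std::string& a, const std::string& b,
                          std::size_t k) {
    const std::size_t diff =
        a.size() > b.size() ? a.size() - b.size() : b.size() - a.size();
    return diff <= k;
}

// Frequency filter: one edit operation changes the surplus and the
// deficit of character counts by at most 1 each, so
// max(surplus, deficit) is a lower bound on the edit distance.
bool passes_frequency_filter(const std::string& a, const std::string& b,
                             std::size_t k) {
    const auto fa = base_counts(a);
    const auto fb = base_counts(b);
    std::size_t surplus = 0;  // characters a has more of than b
    std::size_t deficit = 0;  // characters b has more of than a
    for (std::size_t i = 0; i < fa.size(); ++i) {
        if (fa[i] > fb[i]) {
            surplus += fa[i] - fb[i];
        } else {
            deficit += fb[i] - fa[i];
        }
    }
    return std::max(surplus, deficit) <= k;
}
```

A pair that passes both filters is only a candidate and still has to be verified with an exact distance computation; a pair that fails either filter can be discarded without computing the distance at all.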
PETER supports Hamming distance and edit distance as similarity measures. Our tool has been evaluated on various collections of Expressed Sequence Tags (ESTs) from dbEST with up to 5,000,000 entries of lengths up to 3,500 characters. PETER is orders of magnitude faster than agrep and nrgrep for search queries, and orders of magnitude faster than user-defined functions for similarity joins inside Oracle DB. For a detailed evaluation, see:
Rheinländer, A., Knobloch, M., Hochmuth, N. and Leser, U.: Prefix Tree Indexing for Similarity Search and Similarity Join on Genomic Data. Int. Conf. on Statistical and Scientific Databases, Heidelberg, Germany, 2010.
PETER was developed for indexing Expressed Sequence Tags and currently only deals with strings consisting of the letters A, C, G and T. We are working on removing this restriction.
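For readers unfamiliar with the two similarity measures mentioned above, the following is a minimal textbook sketch of Hamming distance and edit (Levenshtein) distance on plain strings. It is not PETER's implementation, which evaluates these measures against a prefix tree; the function names hamming_distance and edit_distance are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hamming distance: number of positions at which two strings of
// equal length differ. Throws if the lengths do not match.
std::size_t hamming_distance(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("Hamming distance needs equal lengths");
    }
    std::size_t distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++distance;
        }
    }
    return distance;
}

// Edit distance: minimum number of single-character insertions,
// deletions and substitutions turning a into b. Classic dynamic
// programming, keeping only two rows of the table in memory.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1);
    std::vector<std::size_t> curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning the empty prefix of a into b[0..j)
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // deleting the first i characters of a
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // match or substitution
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

For example, hamming_distance("ACGT", "ACCT") is 1 (one mismatching position), and edit_distance("ACGT", "AGT") is also 1 (one deletion), although Hamming distance is undefined for that second pair because the lengths differ.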
Installing PETER
Downloads
Source code version 0.3 (2010-04-25)
EST flat files taken from NCBI dbEST and prepared for indexing with PETER