Build a Kmer Database from a Table of Sequences

p3-build-kmer-db.pl [options] idCol outFile

This script creates a kmer database. The basic model of the database is that we have groups of incoming sequences, and each group has an ID and a name. For example, a group could be a whole genome in which each sequence is a contig, or a group could be a single protein with only one sequence (the protein itself), where the name is the protein’s role. Names are entirely optional.

The database will map each kmer to a list of the groups to which it belongs. Command-line options allow you to specify that common kmers be eliminated or that the kmers be discriminating (that is, unique to only one group). The kmer database can then be used as input to various other scripts (such as p3-closest-seqs).
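
As an illustration of the mapping described above, the following Python sketch builds a dictionary from each kmer to the set of groups in which it occurs. It is not the script's actual implementation: the function name build_kmer_map, the in-memory dictionary layout, and the sample sequences are all assumptions made for this example.

```python
from collections import defaultdict


def build_kmer_map(groups, kmer_size=15):
    """Map each kmer to the set of group IDs whose sequences contain it.

    `groups` is a hypothetical structure: {group_id: [sequence, ...]}.
    """
    kmer_map = defaultdict(set)
    for group_id, sequences in groups.items():
        for seq in sequences:
            seq = seq.upper()
            # Slide a window of kmer_size characters along the sequence.
            for i in range(len(seq) - kmer_size + 1):
                kmer_map[seq[i:i + kmer_size]].add(group_id)
    return dict(kmer_map)


# Two toy groups; a small kmer size keeps the example readable.
example_groups = {
    "g1": ["ACGTAC", "TTGACG"],
    "g2": ["GTACGG"],
}
example_map = build_kmer_map(example_groups, kmer_size=4)
```

Here the kmer GTAC occurs in both groups, so it maps to both IDs, while ACGT maps only to g1.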


The positional parameters are the column identifier for the column containing the group ID and the name of the output file into which the kmer database is to be stored. The constant string fasta can be used for the group ID column if a FASTA file is input. In that case, the sequence ID is the group ID and the comment is the group name.
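
To make the FASTA convention above concrete, here is a minimal Python sketch that splits each header line into a sequence ID (used as the group ID) and a comment (used as the group name). It assumes the common convention that the ID is the first whitespace-delimited token after the ">" and the comment is the rest of the line; the function name read_fasta_groups and the sample records are hypothetical.

```python
def read_fasta_groups(lines):
    """Return a list of (group_id, group_name, sequence) tuples.

    Assumes headers look like ">ID optional comment text".
    """
    records = []
    group_id, name, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if group_id is not None:
                records.append((group_id, name, "".join(chunks)))
            header = line[1:].split(None, 1)
            group_id = header[0] if header else ""
            name = header[1].strip() if len(header) > 1 else ""
            chunks = []
        elif line.strip() and group_id is not None:
            # Sequence text may span several lines; join them later.
            chunks.append(line.strip())
    if group_id is not None:
        records.append((group_id, name, "".join(chunks)))
    return records


sample_fasta = [
    ">prot1 DNA polymerase I\n",
    "MKRIST\n",
    "TITTTG\n",
    ">prot2\n",
    "MSLNAA\n",
]
fasta_records = read_fasta_groups(sample_fasta)
```

The second record has no comment, so its group name is empty, consistent with names being optional.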

The standard input can be overridden using the options in Input Options.

The options in Column Options can be used to specify the input column containing the sequence text. The default is the last input column.

Additional command-line options are the following.


The size of a kmer. The default is 15.


The maximum number of times a kmer can appear. A kmer appearing more than the specified number of times is considered common and discarded. A value of 0 indicates all kmers should be kept. The default is 10.


The index (1-based) or name of the input column containing the group names.


If specified, only discriminating kmers (that is, kmers unique to a single group) are kept. In this case, the --max option is ignored.
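
The two filtering rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name filter_kmers and its arguments are hypothetical, and the text does not say whether "times a kmer can appear" counts total occurrences or distinct groups, so this sketch assumes total occurrences supplied in a separate count table.

```python
def filter_kmers(kmer_groups, kmer_counts, max_count=10, discriminating=False):
    """Apply the common-kmer and discriminating-kmer rules.

    kmer_groups: {kmer: set of group IDs}      (hypothetical layout)
    kmer_counts: {kmer: number of occurrences} (assumed meaning of "times")
    """
    kept = {}
    for kmer, groups in kmer_groups.items():
        if discriminating:
            # Keep only kmers unique to a single group; max_count is ignored.
            if len(groups) == 1:
                kept[kmer] = groups
        elif max_count == 0 or kmer_counts[kmer] <= max_count:
            # A max_count of 0 disables the common-kmer filter.
            kept[kmer] = groups
    return kept


toy_groups = {"AAAA": {"g1", "g2"}, "ACGT": {"g1"}, "GGCC": {"g2"}}
toy_counts = {"AAAA": 12, "ACGT": 3, "GGCC": 11}

by_max = filter_kmers(toy_groups, toy_counts, max_count=10)
keep_all = filter_kmers(toy_groups, toy_counts, max_count=0)
unique_only = filter_kmers(toy_groups, toy_counts, discriminating=True)
```

With these toy values, the default limit of 10 keeps only ACGT, a limit of 0 keeps all three kmers, and the discriminating mode keeps ACGT and GGCC regardless of their counts.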