Compute Cluster Signatures

   < peg.data
   > cluster.signatures

The standard input file normally contains [n/a, n/a, family_id, feature_id, contig, start, end, strand, function](the first two columns are ignored). Such a file is created from a family signatures file (output of p3-signature-familes) using the following command

p3-signature-peginfo --gs1=FileOfGenomeIds < family.data > peg.data

However, any file containing the appropriately named columns in the headers (see below) will work.

This script produces a file containing entries of the form

famId1 peg1 func1
famId2 peg2 func2

if the –verbose flag is used. Else, you get 1 line per cluster of the form

genome1 peg11,peg12,...,peg1n1
genome1 peg13,peg14,...,peg1n2
genome2 peg21,peg22,...,peg2n1

that is, you get the genome containing the cluster followed by a comma-separated list of peg ids. There is often a number of clusters for a single genome.

In the non-verbose mode you get a header line containing

genome_id pegs_in_cluster


There are no positional parameters.

The standard input is specified using the options in Input Options. It should be a tab-delimited file with headers, containing the following fields at minimum.


The ID of a protein family.


The ID of a feature in the protein family.


The ID of the contig containing the feature.


The index of the leftmost location for the feature on the contig.


The index of the rightmost location for the feature on the contig.


The strand containing the feature (+ or -).


The function assigned to the feature.

The additional command-line options are as follows.


In normal mode, clusters are written in a readable format, and the family id and the peg function are included for each member of a cluster. In terse mode, each cluster is written on a single line.


Maximum base-pair distance between the midpoints of two features in order for them to be considered close. The default is 2000.