Accepted short paper in IEEE International Conference on Data Mining (ICDM 2008)
| Software download |
Here you can download the executable of the SMA family of algorithms compiled on a Intel centrino Duo 1.66Ghz Processor T2300.
This is a unique binary file. You can invoke the different algorithms by means of command line parameters.
SMA Windows Binary
We also implemented the Spirit algorithm from the paper:
Minos N. Garofalakis, Rajeev Rastogi, Kyuseok Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999.
Please find below the Windows binary compiled on a Intel centrino Duo 1.66Ghz Processor T2300.
Spirit Windows Binary
| Dataset download |
| Dataset format |
SMA takes as input an ASCII file which contains for each line a transaction where each character is a symbol.
Example:
MKLDVNALRYLSKDDFRVLTAVEMGMRNHEIVPAELVDRIAGLKHGGTYKVLRNLL
KNKLVHHDATKYDGYRLTYLGYDFLAIKTLVNRGVFASVGRQIGV
GKESDIFEVATEDGTVLAM
KLHRLGRTSFRAVKSKRDYLAHRRSFNWLYLSRLAALKEFAFMKALGDHGFPVPTAVDCNR
HCVIMSLVQGYPLVQVK
ELQNPDDVFDTILGLVVRLAEHGLIHCDFN
All the ASCII characters are considered admissible.
| Regular Expression format and download |
A RE is passed to the software in a file containing the corresponding automa. The automa is described by means of states definition and edges definition.
Together with the automata also the cuts, to be used by the Full-check and Border-Check methods, are possibly passed.
The symbol # in the edges definition represent a wildcard.
Example:
| 4 | The number of automa states |
| 0,0,1,0 1,0,0,1 2,0,0,1 3,1,0,2 |
The description of the states: |
| 0 A 0 0 B 1 1 B 2 1 C 2 2 D 2 2 E 3 |
The description of the edge: |
| Usage and output information |
Usage: sma pattern_type dataset_file automata_file supp method data_reduction [output file, verbose_level, limit]
Mandatory parameters:
dataset_file: input file name
dataset_type: T Trajectories, I Synthetic IBM, P Protein(fasta)
automata_file: name of the file containing the automata
supp: minimum support threshold
method:0 = ONE PASS, 1 = FULL CHECK
data reduction: 0 = DISABLED, 1 = ENABLED
Optional parameters:
output file name
verbose_level: send to standard output a line for every k number of sequences processed.
0 = no verbose, k=the frequency
limit: the maximum transactions. 0 = no limit
To see the usage syntax and a short description of all options, invoke the program without parameters.
When it is launched with the proper parameters, the software will
print on the standard output the status of the computation and some
statistics of the dataset.