Sequence Mining Automata:
a new techniques for mining frequent sequences under regular expressions

Roberto Trasarti
Pisa KDD Laboratory
ISTI - CNR, Italy
Francesco Bonchi
Yahoo! Research
Barcelona, Spain
Bart Goethals
Mathematics and Computer Science Department
University of Antwerp, Belgium

 

Accepted short paper in IEEE International Conference on Data Mining (ICDM 2008)


Software download

Here you can download the executable of the SMA family of algorithms compiled on a Intel centrino Duo 1.66Ghz Processor T2300. This is a unique binary file. You can invoke the different algorithms by means of command line parameters.

SMA Windows Binary

We also implemented the Spirit algorithm from the paper:
Minos N. Garofalakis, Rajeev Rastogi, Kyuseok Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999.
Please find below the Windows binary compiled on a Intel centrino Duo 1.66Ghz Processor T2300.

Spirit Windows Binary

Dataset download

San_Francisco dataset
Protein dataset
Syntetic dataset pack

Dataset format

SMA takes as input an ASCII file which contains for each line a transaction where each character is a symbol.

Example:

MKLDVNALRYLSKDDFRVLTAVEMGMRNHEIVPAELVDRIAGLKHGGTYKVLRNLL
KNKLVHHDATKYDGYRLTYLGYDFLAIKTLVNRGVFASVGRQIGV
GKESDIFEVATEDGTVLAM
KLHRLGRTSFRAVKSKRDYLAHRRSFNWLYLSRLAALKEFAFMKALGDHGFPVPTAVDCNR
HCVIMSLVQGYPLVQVK
ELQNPDDVFDTILGLVVRLAEHGLIHCDFN

All the ASCII characters are considered admissible.

Regular Expression format and download


A RE is passed to the software in a file containing the corresponding automa. The automa is described by means of states definition and edges definition. Together with the automata also the cuts, to be used by the Full-check and Border-Check methods, are possibly passed.
The symbol # in the edges definition represent a wildcard.

Example:

4 The number of automa states
0,0,1,0
1,0,0,1
2,0,0,1
3,1,0,2

The description of the states:
ID, is_output, is_start, cut_id

0 A 0
0 B 1
1 B 2
1 C 2
2 D 2
2 E 3

The description of the edge:
origin ID, Symbol, destination ID

Below you can download the files containing the Regular Expressions used in the paper experimentation:

RE pack

Usage and output information

Usage: sma pattern_type dataset_file automata_file supp method data_reduction [output file, verbose_level, limit]


Mandatory parameters:

dataset_file: input file name

dataset_type: T Trajectories, I Synthetic IBM, P Protein(fasta)

automata_file: name of the file containing the automata

supp: minimum support threshold

method:0 = ONE PASS, 1 = FULL CHECK

data reduction: 0 = DISABLED, 1 = ENABLED

Optional parameters:

output file name

verbose_level: send to standard output a line for every k number of sequences processed.

0 = no verbose, k=the frequency

limit: the maximum transactions. 0 = no limit


To see the usage syntax and a short description of all options, invoke the program without parameters.
When it is launched with the proper parameters, the software will print on the standard output the status of the computation and some statistics of the dataset.