Research Group: Learning with Neural Methods on Structured Data (LNM)
Department of Mathematics and Computer Science, University of Osnabrück, Germany

Minidoc for srng package with general metrics
=============================================

Day of creation: Fri Feb 22 15:11:49 2002
Last update:     Mon Mar 10 12:07:53 2003

Intro
=====
Learning Vector Quantization (LVQ) is a prototype-based supervised neural
method for data classification. Extensions are LVQ 2.1, GLVQ, and GRLVQ.
Supervised relevance neural gas (srng) is a combination of GRLVQ and the
neural gas method of Ritter and Martinetz. This package provides an
implementation of srng that may be used for experimental purposes and
research activity only. Don't use the implementation for any
security-related tasks.

The algorithmic background of supervised relevance neural gas (SRNG) can be
found in: Barbara Hammer, Marc Strickert and Thomas Villmann: Learning Vector
Quantization for Multimodal Data. (http://www.inf.uos.de/lnm)

Have fun,
Marc Strickert (email: mstricke@uos.de)

Overview of this manual:
1. Requirements
2. Quick start
3. srng executable
   3.A. TRAIN MODE
   3.B. TEST MODE
4. Exemplary srng initfile
5. Format of the data files
6. awk - helper files

Requirements
============
The contents of this package have been tested for
+ Win2K/cygwin (1.3.3)
+ Linux (SuSE 7)
Standard Unix environments should work, too, but gnu-awk (awk <= gawk) and a
current shell version are strongly recommended. Below, visualization refers
to the gnuplot program; if You don't have it, just apply Your favorite
plotter to the output files. Also good to have is a lot of patience, because
the usage of the program is not quite intuitive, due to the polymorphic
semantics of its arguments. Why do much checking and data processing in
clumsy C if You can write small awk wrappers? Therefore, I recommend using
the trainer.awk program for training purposes.

Quick start
===========
Type:
$> make test                       # compiles binary and produces chkbrd_small_cb_sort.dat
$> gnuplot
gnuplot> load 'chkbrd_small.plt'   # plot test data and prototype trajectories

Other make targets are:
make test_h10             # 10-dim. data set;       gnuplot -> load 'h_10data.plt'
make test_h10_lambda      # 10-D relevance factors; gnuplot -> load 'h_10data_lambda.plt'
make test_h10_recfield    # classification demo;    gnuplot -> load 'h10_recfield.plt'
make mushroom_test        # invokes trainer.awk in test mode. Watch output.
make mushroom_train       # invokes trainer.awk in training mode
make mushroom_train_2file # like mushroom_train, but writing stats into files

The most instructive way to get used to the package is
1) looking at the makefile,
2) looking at the headers of the .awk scripts,
3) looking at the srng.[ch] files in an editor with un*x file format support.

srng executable
===============
The executable 'behaves' according to the metric (distance_xxx.c) included at
compile time. The compiled file should thus carry an appropriate name, such
as srng_euclidian{.exe} or the like. Depending on the included metric, a
corresponding additional string parameter "x1 x2 ... xN" is passed to the
program.
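For orientation, the sketch below shows the kind of computation such a metric
performs in the relevance-weighted Euclidean case,
d_lambda(x,w) = sum_i lambda_i * (x_i - w_i)^2, where the relevance factors
lambda correspond to the metric weights of the initialization file described
below. Function name and signature are illustrative only; the real interface
is defined by the distance_xxx.c files of the package.

/* Sketch of a relevance-weighted squared Euclidean distance,
   d_lambda(x,w) = sum_i lambda_i * (x_i - w_i)^2.
   Name and signature are illustrative; see the distance_xxx.c
   files in the package for the actual plug-in interface. */
double weighted_sqr_euclidean(const double *x,      /* data vector       */
                              const double *w,      /* prototype vector  */
                              const double *lambda, /* relevance factors */
                              int dim)              /* data dimension    */
{
    double d = 0.0;
    int i;
    for (i = 0; i < dim; i++) {
        double diff = x[i] - w[i];
        d += lambda[i] * diff * diff;
    }
    return d;
}

With all relevance factors set equal (the -1 initialization of the lambda
line, see below), this reduces to the ordinary squared Euclidean distance up
to a constant factor.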
Two major modes are available for running the srng executable: the training
mode and the testing mode.

A. TRAIN MODE

A typical invocation of the Euclidean srng executable looks like this:

> cat chkbrd_small.dat|./srng chkbrd_small_init.dat 250 5 "" > chkbrd_small_pt.dat

srng *always* reads the training data from the standard input channel, here
provided by the cat command. In the above case, chkbrd_small_init.dat is
taken as initialization file (see description below), a number of 250 cycles,
i.e. presentations of the whole training file, is used, and every 5th cycle
produces a snapshot of parameters, prototypes, and metric weights. The empty
string "" could have been omitted because the standard metric does not
require additional parameters.

Equivalently,

> cat chkbrd_small.dat chkbrd_small_init.dat|./srng - 250 5 >chkbrd_small_pt.dat

the training data can be concatenated with the initialization file, where the
'-' indicates that the initialization is to be read from standard input.

B. TEST MODE

The output of the srng snapshots can be used for testing:

> cat chkbrd_small.dat chkbrd_small_init.dat|./srng - 0 -1

This call uses the two files available in the package. Usually, of course,
one would use a test set like chkbrd_small_tst.dat instead of
chkbrd_small.dat and a corresponding trained prototype file
chkbrd_small_pt.dat instead of the initialization file. So don't expect a
good result from the line above.

First srng parameter (cycles) after '-':
  0  indicates test mode
 -1  indicates test mode with elimination of prototypes with bad (<=50%)
     classification accuracy.

Second srng parameter (validate steps = moduloprint) after '-':
 Select record number n. If n < 0, do not select a special record, but print
 all of them.

If You add | grep '^-1 ' to the above command line, only the overall
classification accuracy is printed: the grep '^-1 ' filters the line
containing the overall accuracy on the test set. Removing the grep provides
information about the correct classification of each prototype. NaN (not a
number) is displayed for prototypes with empty receptive fields for the given
test set.

The command

> cat chkbrd_small_tst.dat chkbrd_small_pt.dat|./srng - 0 42 "" | grep '^-1 '

would print the 42nd record of the chkbrd_small_pt.dat prototypes. After
stripping the preceding prototype statistic lines, these lines can be used as
an initialization file for further training.

The general appearance of the initialization file is given now.

Exemplary srng initfile
=======================
Comments below in parentheses must not appear in the original file.

**************** chkbrd_small_init.dat **************************************
(i) 9(number_prototypes) 0.25(coord-adapt-rate) 1e-6(lambda-adapt-rate) 0.0(lambda-weight-decay) 0.95(neighborhood-decay-rate) -1(neighborhood-initial-size) 0.5(ng2grlvq-fader-start) 0.5(ng2grlvq-fader-end) 0(random-seed)
(p) 0.655123 0.572721 1    (first prototype for class 1)
(p) 0.0654028 0.287699 1
(p) 0.486939 0.462719 1
(p) 0.124859 0.431446 1
(p) 0.908377 0.998554 1
(p) 0.815472 0.159003 0    (prototype order doesn't matter)
(p) 0.252102 0.160793 0    (class numbers should start with 0)
(p) 0.211522 0.975037 0    (class numbering should not contain gaps)
(p) 0.653517 0.400613 0
(l) -1
*****************************************************************************

(i) is the parameter initialization:

number_prototypes:          the number of prototype (p) lines that follow
coord-adapt-rate:           usually between 0.0001 and 0.5
lambda-adapt-rate:          usually between 0.0 and about coord-adapt-rate/1000
lambda-weight-decay:        experimental; between 0 and about 1e-6
neighborhood-decay-rate:    controls the shrinking of the neighborhood size per cycle
neighborhood-initial-size:  0 means the number of prototypes, which seems useful
ng2grlvq-fader-start:       initial amount of wrong-prototype pushing (0 => neural gas)
ng2grlvq-fader-end:         final amount of influence (1 => closest wrong = closest correct amount)
random-seed:                if 0, a random seed is generated; otherwise the given number is taken as seed.

(p) prototype lines

(l) lambdas / metric weights (either a single value or one per data dimension)
    -1 means initialize all dimensions equally. Above example: -1 => 0.5 0.5
    After training, this line can be used for analysis of dimension relevances.

Note that, in contrast to the previous program version, there is no line with
block weights any more.
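If You want to create such an initialization file programmatically rather
than with protosfromdata.awk, a starting point could look like the sketch
below. It assumes 2-D data scaled to [0,1] (cf. cnorm.awk), reuses the
parameter values of the example, and writes the parameter block as a single
space-separated line; the exact layout expected by srng should be checked
against the shipped chkbrd_small_init.dat. The program is an illustration
only, not part of the package.

/* Sketch: write an srng initialization file like chkbrd_small_init.dat.
   Assumes 2-D data in [0,1] and the parameter values of the example above;
   compare the output with the shipped chkbrd_small_init.dat before use. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int dim = 2;        /* data dimension (without the class label column) */
    int classes = 2;    /* class labels 0 .. classes-1, no gaps            */
    int per_class = 5;  /* prototypes per class                            */
    int c, p, d;

    srand((unsigned)time(NULL));

    /* parameter block: number_prototypes coord-adapt-rate lambda-adapt-rate
       lambda-weight-decay neighborhood-decay-rate neighborhood-initial-size
       ng2grlvq-fader-start ng2grlvq-fader-end random-seed */
    printf("%d 0.25 1e-6 0.0 0.95 -1 0.5 0.5 0\n", classes * per_class);

    /* prototype lines: dim random coordinates in [0,1], then the class label */
    for (c = 0; c < classes; c++)
        for (p = 0; p < per_class; p++) {
            for (d = 0; d < dim; d++)
                printf("%g ", (double)rand() / RAND_MAX);
            printf("%d\n", c);
        }

    /* lambda line: -1 lets srng initialize all relevance factors equally */
    printf("-1\n");
    return 0;
}

Redirect the output to a file and pass that file as the first argument of
srng, or concatenate it behind the training data as shown above.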
Format of the data files
========================
The first line of a data file contains

#number_of_lines dimension

dimension excludes the class label dimension; hence there are dimension+1
columns. Feature vectors are arranged as space (ascii 32) separated real
numbers. The class label is in the last column. Class labels start with 0,
and the numbering 0,1,2,...,K must not contain gaps.

Example from chkbrd_small.dat:

#199 2
0.0833333 0.458333 0
0.0833333 0.5 0
0.0833333 0.541667 0
[...]
0.916667 0.833333 1
0.916667 0.875 1

some awk - helper files
=======================
augment.awk          generate a training file with equal class distribution
cnorm.awk            normalize each column to the interval [0,1]
discrete2unary.awk   generate a real data vector representation of symbolic data
protosfromdata.awk   obtain prototypes from a training file
shuffle.awk          shuffle the lines of a (data) file
toinput.awk          convert a data table into the correct srng format
trainer.awk          recommended for batch mode processing (X-validation, etc.)
ztrans.awk           calculate a z-transform for the columns of given data
ztrans_inv.awk       calculate an inverse z-transform for the columns of given data
ztrans_apply.awk     z-transform the columns of given data using the result of ztrans.awk

See the file headers.
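Finally, if You prefer C to awk for sanity checking, a minimal reader for the
data format described above might look like the following sketch. It is an
illustration only; the actual reader built into srng.c may differ.

/* Sketch: read an srng data file ("#N dim" header, then N lines of
   dim features plus a trailing integer class label). Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *f = (argc > 1) ? fopen(argv[1], "r") : stdin;
    int n, dim, i, d, label;
    double *vec;

    if (!f || fscanf(f, "#%d %d", &n, &dim) != 2) {
        fprintf(stderr, "bad or missing '#N dim' header\n");
        return 1;
    }
    vec = malloc(dim * sizeof *vec);
    if (!vec)
        return 1;

    for (i = 0; i < n; i++) {
        for (d = 0; d < dim; d++)
            if (fscanf(f, "%lf", &vec[d]) != 1) {
                fprintf(stderr, "record %d: expected %d features\n", i + 1, dim);
                return 1;
            }
        if (fscanf(f, "%d", &label) != 1) {
            fprintf(stderr, "record %d: missing class label\n", i + 1);
            return 1;
        }
        /* a real program would store vec and label here */
    }
    printf("read %d records of dimension %d\n", n, dim);
    return 0;
}

#EOF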