Research Group: Learning with Neural Methods on Structured Data (LNM)
Department of Mathematics and Computer Science, University of Osnabrück, Germany

Minidoc for srng package with general metrics
=============================================

Day of creation: Fri Feb 22 15:11:49 2002
Last update:     Mon Mar 10 12:07:53 2003

Intro
=====
Learning Vector Quantization (LVQ) is a prototype-based supervised neural
method for data classification. Extensions are LVQ 2.1, GLVQ, and GRLVQ.
Supervised relevance neural gas (srng) is a combination of GRLVQ and the
neural gas method of Ritter and Martinetz. This package provides an
implementation of srng that may be used for experimental purposes and
research activity only. Don't use the implementation for any
security-related tasks.

The algorithmic background of supervised relevance neural gas (SRNG) can be
found in: Barbara Hammer, Marc Strickert and Thomas Villmann: Learning Vector
Quantization for Multimodal Data. (http://www.inf.uos.de/lnm)

Have fun,
Marc Strickert (email: mstricke@uos.de)

Overview of this manual:
1. Requirements
2. Quick start
3. srng executable
   3.A. TRAIN MODE
   3.B. TEST MODE
4. Exemplary srng initfile
5. Format of the data files
6. awk - helper files

Requirements
============
The contents of this package have been tested for
+ Win2K/cygwin (1.3.3)
+ Linux (SuSE 7)
Standard Unix environments should work, too, but gnu-awk (awk <= gawk) and a
current shell version are strongly recommended. Below, visualization refers
to the gnuplot program; if You don't have it, just apply Your favorite
plotter to the output files. Also good to have is a lot of patience, because
the usage of the program is not quite intuitive, due to the polymorphic
semantics of its arguments. Why do much checking and data processing in
clumsy C if You can write small awk wrappers? Therefore, I recommend using
the trainer.awk program for training purposes.

Quick start
===========
Type:
$> make test                       # compiles binary and produces chkbrd_small_cb_sort.dat
$> gnuplot
gnuplot> load 'chkbrd_small.plt'   # plot test data and prototype trajectories

Other make targets are:
make test_h10             # 10-dim. data set;       gnuplot -> load 'h_10data.plt'
make test_h10_lambda      # 10-D relevance factors; gnuplot -> load 'h_10data_lambda.plt'
make test_h10_recfield    # classification demo;    gnuplot -> load 'h10_recfield.plt'
make mushroom_test        # invokes trainer.awk in test mode. Watch output.
make mushroom_train       # invokes trainer.awk in training mode
make mushroom_train_2file # like mushroom_train, but writing stats into files

The most instructive way to get used to the package is
1) looking at the makefile,
2) looking at the headers of the .awk scripts,
3) looking at the srng.[ch] files in an editor with un*x file format support.

srng executable
===============
The executable 'behaves' according to the metric (distance_xxx.c) included at
compile time. The compiled file should thus carry an appropriate name, such
as srng_euclidian{.exe} or the like. Depending on the included metric, a
corresponding additional string parameter "x1 x2 ... xN" is passed to the
program.
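For orientation, the sketch below shows the kind of computation such a metric
performs in the relevance-weighted Euclidean case,
d_lambda(x,w) = sum_i lambda_i * (x_i - w_i)^2, where the relevance factors
lambda correspond to the metric weights of the initialization file described
below. Function name and signature are illustrative only; the real interface
is defined by the distance_xxx.c files of the package.

/* Sketch of a relevance-weighted squared Euclidean distance,
   d_lambda(x,w) = sum_i lambda_i * (x_i - w_i)^2.
   Name and signature are illustrative; see the distance_xxx.c
   files in the package for the actual plug-in interface. */
double weighted_sqr_euclidean(const double *x,      /* data vector       */
                              const double *w,      /* prototype vector  */
                              const double *lambda, /* relevance factors */
                              int dim)              /* data dimension    */
{
    double d = 0.0;
    int i;
    for (i = 0; i < dim; i++) {
        double diff = x[i] - w[i];
        d += lambda[i] * diff * diff;
    }
    return d;
}

With all relevance factors set equal (the -1 initialization of the lambda
line, see below), this reduces to the ordinary squared Euclidean distance up
to a constant factor.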
Two major modes are available for running the srng executable: the training
mode and the testing mode.

A. TRAIN MODE

A typical invocation of the Euclidean srng executable looks like this:

> cat chkbrd_small.dat|./srng chkbrd_small_init.dat 250 5 "" > chkbrd_small_pt.dat

srng *always* reads the training data from the standard input channel, here
provided by the cat command. In the above case, chkbrd_small_init.dat is
taken as initialization file (see description below), a number of 250 cycles,
i.e. presentations of the whole training file, is used, and every 5th cycle
produces a snapshot of parameters, prototypes, and metric weights. The empty
string "" could have been omitted because the standard metric does not
require additional parameters.

Equivalently,

> cat chkbrd_small.dat chkbrd_small_init.dat|./srng - 250 5 >chkbrd_small_pt.dat

the training data can be concatenated with the initialization file, where the
'-' indicates that the initialization is to be read from standard input.

B. TEST MODE

The output of the srng snapshots can be used for testing:

> cat chkbrd_small.dat chkbrd_small_init.dat|./srng - 0 -1

This call uses the two files available in the package. Usually, of course,
one would use a test set like chkbrd_small_tst.dat instead of
chkbrd_small.dat and a corresponding trained prototype file
chkbrd_small_pt.dat instead of the initialization file. So don't expect a
good result from the line above.

First srng parameter (cycles) after '-':
  0  indicates test mode
 -1  indicates test mode with elimination of prototypes with bad (<=50%)
     classification accuracy.

Second srng parameter (validate steps = moduloprint) after '-':
 Select record number n. If n < 0, do not select a special record, but print
 all of them.

If You add | grep '^-1 ' to the above command line, only the overall
classification accuracy is printed: the grep '^-1 ' filters the line
containing the overall accuracy on the test set. Removing the grep provides
information about the correct classification of each prototype. NaN (not a
number) is displayed for prototypes with empty receptive fields for the given
test set.

The command

> cat chkbrd_small_tst.dat chkbrd_small_pt.dat|./srng - 0 42 "" | grep '^-1 '

would print the 42nd record of the chkbrd_small_pt.dat prototypes. After
stripping the preceding prototype statistic lines, these lines can be used as
an initialization file for further training.

The general appearance of the initialization file is given now.

Exemplary srng initfile
=======================
Comments below in parentheses must not appear in the original file.

**************** chkbrd_small_init.dat **************************************
(i) 9(number_prototypes) 0.25(coord-adapt-rate) 1e-6(lambda-adapt-rate) 0.0(lambda-weight-decay) 0.95(neighborhood-decay-rate) -1(neighborhood-initial-size) 0.5(ng2grlvq-fader-start) 0.5(ng2grlvq-fader-end) 0(random-seed)
(p) 0.655123 0.572721 1    (first prototype for class 1)
(p) 0.0654028 0.287699 1
(p) 0.486939 0.462719 1
(p) 0.124859 0.431446 1
(p) 0.908377 0.998554 1
(p) 0.815472 0.159003 0    (prototype order doesn't matter)
(p) 0.252102 0.160793 0    (class numbers should start with 0)
(p) 0.211522 0.975037 0    (class numbering should not contain gaps)
(p) 0.653517 0.400613 0
(l) -1
*****************************************************************************

(i) is the parameter initialization:

number_prototypes:          the number of prototype (p) lines that follow
coord-adapt-rate:           usually between 0.0001 and 0.5
lambda-adapt-rate:          usually between 0.0 and about coord-adapt-rate/1000
lambda-weight-decay:        experimental; between 0 and about 1e-6
neighborhood-decay-rate:    controls the shrinking of the neighborhood size per cycle
neighborhood-initial-size:  0 means the number of prototypes, which seems useful
ng2grlvq-fader-start:       initial amount of wrong-prototype pushing (0 => neural gas)
ng2grlvq-fader-end:         final amount of influence (1 => closest wrong = closest correct amount)
random-seed:                if 0, a random seed is generated; otherwise the given number is taken as seed.

(p) prototype lines

(l) lambdas / metric weights (either a single value or one per data dimension)
    -1 means initialize all dimensions equally. Above example: -1 => 0.5 0.5
    After training, this line can be used for analysis of dimension relevances.

Note that, in contrast to the previous program version, there is no line with
block weights any more.
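If You want to create such an initialization file programmatically rather
than with protosfromdata.awk, a starting point could look like the sketch
below. It assumes 2-D data scaled to [0,1] (cf. cnorm.awk), reuses the
parameter values of the example, and writes the parameter block as a single
space-separated line; the exact layout expected by srng should be checked
against the shipped chkbrd_small_init.dat. The program is an illustration
only, not part of the package.

/* Sketch: write an srng initialization file like chkbrd_small_init.dat.
   Assumes 2-D data in [0,1] and the parameter values of the example above;
   compare the output with the shipped chkbrd_small_init.dat before use. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int dim = 2;        /* data dimension (without the class label column) */
    int classes = 2;    /* class labels 0 .. classes-1, no gaps            */
    int per_class = 5;  /* prototypes per class                            */
    int c, p, d;

    srand((unsigned)time(NULL));

    /* parameter block: number_prototypes coord-adapt-rate lambda-adapt-rate
       lambda-weight-decay neighborhood-decay-rate neighborhood-initial-size
       ng2grlvq-fader-start ng2grlvq-fader-end random-seed */
    printf("%d 0.25 1e-6 0.0 0.95 -1 0.5 0.5 0\n", classes * per_class);

    /* prototype lines: dim random coordinates in [0,1], then the class label */
    for (c = 0; c < classes; c++)
        for (p = 0; p < per_class; p++) {
            for (d = 0; d < dim; d++)
                printf("%g ", (double)rand() / RAND_MAX);
            printf("%d\n", c);
        }

    /* lambda line: -1 lets srng initialize all relevance factors equally */
    printf("-1\n");
    return 0;
}

Redirect the output to a file and pass that file as the first argument of
srng, or concatenate it behind the training data as shown above.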
Format of the data files
========================
The first line of a data file contains

#number_of_lines dimension

dimension excludes the class label dimension; hence there are dimension+1
columns. Feature vectors are arranged as space (ascii 32) separated real
numbers. The class label is in the last column. Class labels start with 0,
and the numbering 0,1,2,...,K must not contain gaps.

Example from chkbrd_small.dat:

#199 2
0.0833333 0.458333 0
0.0833333 0.5 0
0.0833333 0.541667 0
[...]
0.916667 0.833333 1
0.916667 0.875 1

some awk - helper files
=======================
augment.awk          generate a training file with equal class distribution
cnorm.awk            normalize each column to the interval [0,1]
discrete2unary.awk   generate a real data vector representation of symbolic data
protosfromdata.awk   obtain prototypes from a training file
shuffle.awk          shuffle the lines of a (data) file
toinput.awk          convert a data table into the correct srng format
trainer.awk          recommended for batch mode processing (X-validation, etc.)
ztrans.awk           calculate a z-transform for the columns of given data
ztrans_inv.awk       calculate an inverse z-transform for the columns of given data
ztrans_apply.awk     z-transform the columns of given data using the result of ztrans.awk

See the file headers.
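Finally, if You prefer C to awk for sanity checking, a minimal reader for the
data format described above might look like the following sketch. It is an
illustration only; the actual reader built into srng.c may differ.

/* Sketch: read an srng data file ("#N dim" header, then N lines of
   dim features plus a trailing integer class label). Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *f = (argc > 1) ? fopen(argv[1], "r") : stdin;
    int n, dim, i, d, label;
    double *vec;

    if (!f || fscanf(f, "#%d %d", &n, &dim) != 2) {
        fprintf(stderr, "bad or missing '#N dim' header\n");
        return 1;
    }
    vec = malloc(dim * sizeof *vec);
    if (!vec)
        return 1;

    for (i = 0; i < n; i++) {
        for (d = 0; d < dim; d++)
            if (fscanf(f, "%lf", &vec[d]) != 1) {
                fprintf(stderr, "record %d: expected %d features\n", i + 1, dim);
                return 1;
            }
        if (fscanf(f, "%d", &label) != 1) {
            fprintf(stderr, "record %d: missing class label\n", i + 1);
            return 1;
        }
        /* a real program would store vec and label here */
    }
    printf("read %d records of dimension %d\n", n, dim);
    return 0;
}

#EOF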