Readme

1) How to compile the CRF-OPT package?

2) How to run the package?

3) How to specify your Observation/State sequence?

4) How to specify your training/testing list?

5) How to specify each feature for your application?

6) How to specify your FEATURE file?

1) How to compile the CRF-OPT package?

CRF-OPT is built on the Toolkit for Advanced Optimization (TAO), developed at Argonne National Laboratory. TAO offers a large number of high-performance algorithms for large-scale optimization problems and is suitable for both single-processor and massively parallel architectures. Because CRF-OPT is built on top of TAO, users can switch to a different optimization algorithm if desired.

Instructions for setting up the TAO package are available at:

http://www-unix.mcs.anl.gov/tao/documentation/installation.html

We summarize the instructions below. They are intended to be run from a Unix command line; Windows users need a Cygwin environment (please refer to the link above for details):

  1. Installing PETSc (Portable, Extensible Toolkit for Scientific Computation)
    • Download the package (version 2.3.2 recommended) from
      http://www.mcs.anl.gov/petsc/petsc-2/download/index.html
    • Unbundle the package:
      gunzip -c petsc-2.3.2-p0.tar.gz | tar -xof -
    • Setup the environment:
      cd petsc-2.3.2-p0; export PETSC_DIR=$PWD   (for sh/bash shell)
      cd petsc-2.3.2-p0; setenv PETSC_DIR $PWD   (for csh/tcsh shell)
    • Configure the package:
      ./config/configure.py --with-mpi=0 --with-clanguage=c++ --download-f-blas-lapack
    • Build the PETSc library:
      make all
  2. Installing TAO
    • Download the package (version 1.8.2 recommended) from
      http://www-unix.mcs.anl.gov/tao/download/index.html
    • Unbundle the package
      gunzip -c tao-1.8.2.tar.gz | tar -xvf -
    • Setup the environment
      cd tao-1.8.2; export TAO_DIR=$PWD   (for sh/bash shell)
      cd tao-1.8.2; setenv TAO_DIR $PWD   (for csh/tcsh shell)
    • Build the TAO library
      make all
  3. Compiling CRF-OPT
    • Replace the makefile in the tao-1.8.2/example/ folder with the one from this package, and put crf-FVFF.c and crf-OSFF.c in the same folder
    • Compile
      make crf-FVFF
      make crf-OSFF
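
Both builds above rely on the PETSC_DIR and TAO_DIR environment variables being set in the current shell. The following Python snippet is a minimal pre-flight check, not part of CRF-OPT; the helper name missing_build_dirs is hypothetical, and it only verifies that each variable points to an existing directory.

```python
import os

def missing_build_dirs(env=None, names=("PETSC_DIR", "TAO_DIR")):
    """Return the names of variables that are unset or do not point to a directory.

    Hypothetical helper for illustration; it does not validate the
    contents of the PETSc or TAO installations.
    """
    env = os.environ if env is None else env
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

problems = missing_build_dirs()
if problems:
    print("unset or invalid:", ", ".join(problems))
else:
    print("PETSC_DIR and TAO_DIR both point to existing directories")
```

If a variable is reported, repeat the corresponding "Setup the environment" step in the same shell before running make.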

The package also includes precompiled executables for both 32-bit and 64-bit machines, so users can run them directly without building from source.

2) How to run the package?

Usage:    ./crf_FVFF(crf_OSFF)    [options]    parameter_file
Options:
-t    --training-list {CRF training list}      //Train CRF on the examples given in the list;
-p    --predict-list {predict list}      //Make predictions for each example given in the list;
-f    --feature-list {feature list}      //Feature file;
-n    --number {number of observations, only applies to crf_FVFF}      //Set the number of observations included for each example (default is 1);
For some applications, it is important to include multiple observations to aid the learning process. We therefore extended our first algorithm, Feature Vector Fast Forward (FVFF), to accommodate this. The "-n" option specifies the number of observation sequences included for each training/testing example. If there is only one observation sequence, the option can be omitted or set to 1.
-r    --regularization factor {regularizer}      //Set the regularization coefficient. (default is 0);
As mentioned in the paper, a regularization factor is usually included in a Conditional Random Fields model to avoid overfitting. The "-r" option specifies this regularization factor.
-e    --exponential transformation {exponential_transformation}      //Whether or not to perform another round of exp transformation;
As mentioned in the paper, we propose an extra round of exponential transformation of the original CRF objective function to avoid the early-termination problem in CRF training and to further improve training accuracy. The "-e" option specifies whether this extra round of optimization is performed.
Basic command for running training:
./crf_FVFF(crf_OSFF) -t train_list.txt -f feature final_parameter.txt
The trained parameters are stored in the file "final_parameter.txt".

Basic command for running testing:
./crf_FVFF(crf_OSFF) -p test_list.txt -f feature final_parameter.txt

The trained parameters stored in "final_parameter.txt" are used to predict new sequences. The prediction results are stored in the file "predict_result".
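
When the two commands above are driven from a script, it can help to assemble the argument list in one place. The sketch below is illustrative only: build_command is a hypothetical helper, it uses just the -t, -p, -f, -n and -r options listed above, and it leaves out -e because the accepted values for that option are not spelled out here.

```python
def build_command(binary, list_file, feature_file, parameter_file,
                  predict=False, n_observations=None, regularization=None):
    """Assemble an argument list from the options in the usage text (hypothetical helper)."""
    cmd = [binary, "-p" if predict else "-t", list_file, "-f", feature_file]
    if n_observations is not None:
        # -n only applies to crf_FVFF according to the usage text
        cmd += ["-n", str(n_observations)]
    if regularization is not None:
        cmd += ["-r", str(regularization)]
    cmd.append(parameter_file)  # the parameter file comes last
    return cmd

train_cmd = build_command("./crf_FVFF", "train_list.txt", "feature",
                          "final_parameter.txt")
predict_cmd = build_command("./crf_FVFF", "test_list.txt", "feature",
                            "final_parameter.txt", predict=True)
print(" ".join(train_cmd))
print(" ".join(predict_cmd))
```

Once the executables are built, each list can be handed to subprocess.run(cmd, check=True) from the Python standard library.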

3) How to specify your Observation/State sequence?

Put each observation sequence and each state sequence in its own file, with the entries of a sequence separated by single spaces, as shown in the files obs_train.txt and state_train.txt. If more than one observation sequence is supplied for an example, put each observation sequence in a separate file, as shown in the files sentence.txt and POS.txt.
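
As a rough illustration of this layout, the Python sketch below writes one observation sequence and one state sequence to separate files. The file names, the dice-style values, and the trailing newline are assumptions made for the example; the sample files shipped with the package remain the reference.

```python
import os
import tempfile

def write_sequence(path, symbols):
    """Write one sequence to its own file, entries separated by a single space."""
    with open(path, "w") as handle:
        handle.write(" ".join(str(s) for s in symbols) + "\n")  # trailing newline is an assumption

def read_sequence(path):
    """Read a sequence back as a list of strings."""
    with open(path) as handle:
        return handle.read().split()

workdir = tempfile.mkdtemp()
obs_path = os.path.join(workdir, "obs_example.txt")      # hypothetical file name
state_path = os.path.join(workdir, "state_example.txt")  # hypothetical file name

write_sequence(obs_path, [1, 2, 4, 6, 6])
write_sequence(state_path, ["EVEN", "EVEN", "EVEN", "FAKE", "FAKE"])

# Assumes one state per observation, as is usual for a linear-chain CRF.
print(read_sequence(obs_path), read_sequence(state_path))
```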

4) How to specify your training/testing list?

Put the file locations for each training/testing sample on one line, with the observation sequence file(s) first, as shown in the files train_list.txt and test_list.txt, or in train_multi_list.txt for the case of multiple observations.
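
The following Python sketch shows one way such a list could be generated. It assumes the paths on a line are separated by a single space and that the state file follows the observation file(s); the helper names and all file paths are hypothetical, so check the result against train_list.txt before relying on it.

```python
def list_line(observation_files, state_file=None, sep=" "):
    """One sample per line: observation file(s) first, then the state file if given."""
    paths = list(observation_files)
    if state_file is not None:
        paths.append(state_file)
    return sep.join(paths)

def render_list(samples, sep=" "):
    """Render (observation_files, state_file) pairs as the text of a list file."""
    return "".join(list_line(obs, state, sep) + "\n" for obs, state in samples)

# Hypothetical paths, one observation sequence per sample.
single = render_list([
    (["data/obs_1.txt"], "data/state_1.txt"),
    (["data/obs_2.txt"], "data/state_2.txt"),
])

# Hypothetical paths, two observation sequences per sample (used with -n 2).
multi = render_list([
    (["data/words_1.txt", "data/caps_1.txt"], "data/tags_1.txt"),
])
print(single + multi, end="")
```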

5) How to specify each feature for your application?

Every feature for CRF-OPT training is stored in a vector. For example:
The transition feature of a genome sequence from NONCODING (NC) to EXON (CD0) with observation "ATG" can be formulated as
f(y_{t-1}=NC, y_t=CD0, x_t=A, x_{t+1}=T, x_{t+2}=G)
and we use a vector F to represent this feature
0 NC CD0 A T G
where
  • F[0] specifies how many time steps before time t the observation factors start (a negative number means they start after time t);
    For the above example, the observation factors are "x_t=A", "x_{t+1}=T" and "x_{t+2}=G". The earliest one, "x_t=A", starts at time t, hence F[0]=0;
    If there's a feature f(y_{t-1} = INTRON, y_t=CD1, x_{t-2}=A, x_{t-1}=G), then F[0] would be 2;
    Similarly, if there's a feature f(y_{t}= FAKE, x_{t+1}=6, x_{t+2}=6), then F[0] would be -1
  • F[1] specifies the previous state;
  • F[2] specifies the current state;
  • F[3] onward specify the observation factors, starting with the leftmost (earliest) one.
  • Each position from F[3] onward corresponds to one time slot.
    Positions not defined in the feature are represented by "*".
    For example,
    A 5th-order feature of a genome sequence, in which the state is EXON (CD1), the previous five bases are AATTC, and the current base is G, can be formulated as
    f(y_t = CD1, x_{t-5}= A, x_{t-4}=A, x_{t-3}=T, x_{t-2} = T, x_{t-1}=C, x_t=G)
    It should be represented in a vector as
    5 * CD1 A A T T C G
    Similarly, a feature capturing the fact that an even die was switched for a fake die at time t, with rolls 1, 2, 4 observed at times t-3, t-2, t-1 and rolls 6, 6 observed at times t+1, t+2, can be represented in a vector as
    3 EVEN FAKE 1 2 4 * 6 6
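
The encoding rules above can be sketched in a few lines of Python. This is an illustration rather than code from CRF-OPT: the function name encode_feature and the idea of passing observations as a mapping from time offset (relative to t) to symbol are assumptions, and an undefined previous state is written as "*" following the CD1 example.

```python
def encode_feature(prev_state, cur_state, observations):
    """Encode a feature as the vector F described above (illustrative sketch).

    observations maps a time offset relative to t (-1 for x_{t-1},
    0 for x_t, 2 for x_{t+2}, ...) to the observed symbol.
    prev_state may be None when the feature has no previous state.
    """
    if not observations:
        raise ValueError("at least one observation factor is required")
    first, last = min(observations), max(observations)
    # One slot per time step between the earliest and latest factor;
    # steps without a factor are written as "*".
    slots = [str(observations.get(k, "*")) for k in range(first, last + 1)]
    start = -first  # F[0]: steps before t (negative means after t)
    prev = "*" if prev_state is None else prev_state
    return " ".join([str(start), prev, cur_state] + slots)

print(encode_feature("NC", "CD0", {0: "A", 1: "T", 2: "G"}))
print(encode_feature(None, "CD1",
                     {-5: "A", -4: "A", -3: "T", -2: "T", -1: "C", 0: "G"}))
print(encode_feature("EVEN", "FAKE", {-3: 1, -2: 2, -1: 4, 1: 6, 2: 6}))
```

The three calls reproduce the vectors "0 NC CD0 A T G", "5 * CD1 A A T T C G" and "3 EVEN FAKE 1 2 4 * 6 6" given in this section.
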
The following examples show how to specify features for applications with multiple observation sequences.
For instance, for a POS tagging problem, a feature
f(y_t = NOUNS, x_t = "Education", x_t = Aa+, x_t = '-tion')
where Aa+ means the word is capitalized and '-tion' is its suffix, can be captured in a single vector
0 * NOUNS Education;Aa+;'-tion'
where each observation is separated by a ';'.
As another example, the feature
f(y_{t-1} = LOC, y_t = ORG, x_{t-2} = "New", x_{t-2} = Aa+, x_{t-2} = *, x_{t-1} = "York", x_{t-1} = Aa+, x_{t-1} = *, x_{t} = "Times", x_{t} = Aa+, x_{t} = '-es' )
could be represented as
2 LOC ORG New;Aa+;* York;Aa+;* Times;Aa+;'-es'
At each time slot, any missing observation must be filled with '*'.
Also, since ';' is reserved as the separator between different observations, it must not appear within any observation value.
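
To make the multi-observation layout concrete, the Python sketch below reads a feature vector back into its parts. It is not the parser used by CRF-OPT; the function name parse_feature and the returned structure are assumptions, and it further assumes that fields are separated by whitespace and that no observation value contains a space.

```python
def parse_feature(line):
    """Split a feature vector into (F[0], previous state, current state, slots).

    slots maps each time offset relative to t to the list of observations
    at that time slot; "*" entries are returned as None.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected F[0], two states and at least one time slot")
    start = int(fields[0])
    prev_state = None if fields[1] == "*" else fields[1]
    cur_state = fields[2]
    slots = {}
    for i, field in enumerate(fields[3:]):
        offset = i - start  # slot i corresponds to x_{t - F[0] + i}
        slots[offset] = [None if obs == "*" else obs for obs in field.split(";")]
    return start, prev_state, cur_state, slots

start, prev_state, cur_state, slots = parse_feature(
    "2 LOC ORG New;Aa+;* York;Aa+;* Times;Aa+;'-es'")
for offset in sorted(slots):
    print(offset, slots[offset])
```

For the "New York Times" vector this yields offsets -2, -1 and 0, each with three observations, matching the factors x_{t-2}, x_{t-1} and x_t of the feature it encodes.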

6) How to specify your FEATURE file?

The feature file has two parts.
  • The first part starts with the header "[STATE]". List every possible state in your application together with its initial weight, one state per line.
  • The second part starts with the header "[FEATURE]". List every feature, specified according to the rules in section 5, one feature per line.
Please refer to the feature files on the example page.
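
A small Python sketch of this two-part layout is given below. It assumes a state and its initial weight are separated by a single space and that each feature line contains only the feature vector; the helper name render_feature_file is hypothetical, and the shipped example feature files remain the authoritative reference for the exact format.

```python
def render_feature_file(state_weights, features):
    """Render the [STATE] and [FEATURE] parts as text (illustrative sketch).

    state_weights maps each state name to its initial weight.
    features is a list of feature vectors already encoded as in section 5.
    """
    lines = ["[STATE]"]
    # Assumption: state name and initial weight separated by one space.
    lines += [f"{state} {weight}" for state, weight in state_weights.items()]
    lines.append("[FEATURE]")
    lines += list(features)
    return "\n".join(lines) + "\n"

text = render_feature_file(
    {"EVEN": 0.0, "FAKE": 0.0},                # initial weights chosen for illustration
    ["3 EVEN FAKE 1 2 4 * 6 6"],
)
print(text, end="")
```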
