Examples

In this first example, we show how to use our package to run training and testing on a casino gaming problem in which there is only one observation for each sample point.

In this toy example, we have two dice: a fair (unbiased) one, which rolls each number from 1 to 6 with equal probability, and a fake (biased) one, which rolls the number 6 with higher probability. Given a sequence of rolled numbers, the task is to determine, at each position, which of the two dice actually produced that number.
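To make the setup concrete, the following minimal Python sketch (not part of the package) shows one way such data could be generated. The fake die's probabilities and the rate at which the dice are swapped are illustrative assumptions; the example above only states that the fake die favors the number 6.

    import random

    # Probabilities are illustrative assumptions.
    FAIR_DIE = [1/6] * 6                        # equal chance for each of 1..6
    FAKE_DIE = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # the number 6 comes up far more often

    def generate(length, switch_prob=0.1):
        """Return an observation sequence (the rolled numbers) and the hidden
        state sequence (which die produced each roll)."""
        observations, states = [], []
        die = "fair"
        for _ in range(length):
            weights = FAIR_DIE if die == "fair" else FAKE_DIE
            observations.append(random.choices(range(1, 7), weights=weights)[0])
            states.append(die)
            if random.random() < switch_prob:   # occasionally swap the dice
                die = "fake" if die == "fair" else "fair"
        return observations, states

    obs, states = generate(20)
    print(obs)     # what the model sees (cf. obs_train.txt / obs_test.txt)
    print(states)  # what the model must recover (cf. state_train.txt / state_test.txt)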

Below we list the training data, the testing data, and the feature file that contains all the features used in the model.

  • Training data:
    Observation sequence (obs_train.txt)
    State sequence (state_train.txt)
  • Testing data:
    Observation sequence (obs_test.txt)
    State sequence (state_test.txt)
  • Feature file:
    feature file (feature)
  • Run training:
    We first train the model on the given training data; the trained parameters are stored in a user-specified file, in our case final_parameters.txt.
    ./crf_FVFF(crf_OSFF) -t train_list.txt -f feature.txt final_parameters.txt
    where the file train_list.txt lists all the training sequences included in the training process (a hypothetical sketch of its contents appears after this list).
  • Run testing:
    The trained parameters stored in final_parameters.txt are used to predict on unseen testing data, and the prediction results are stored in the file predict_result. The program also prints the prediction accuracy to the screen.
    ./crf_FVFF(crf_OSFF) -p test_list.txt -f feature.txt final_parameters.txt
    where the file test_list.txt lists all the testing sequences we want to predict.
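The list files themselves are not shown in this example. As a purely hypothetical sketch, assuming each file is named on its own line with the observation sequence listed before the state sequence (the actual format may differ), train_list.txt might contain:

    obs_train.txt
    state_train.txt

and test_list.txt would name obs_test.txt and state_test.txt in the same way.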


In this second example, we show how to use our package to run training and testing on a text chunking problem in which there are multiple observations for each sample point.

Our example files are based on a text chunking problem. The data is from the CoNLL shared task. Text chunking consists of dividing a text into syntactically correlated groups of words. For example, the sentence

He reckons the current account deficit will narrow to only # 1.8 billion in September .

can be divided as follows:

[ He ][ reckons ][ the current account deficit ][ will narrow ][ to ][ only # 1.8 billion ][ in ] [ September ] .

Most chunk types have two kinds of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for every other word in the chunk. The O chunk tag is used for tokens that are not part of any chunk.
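For instance, in the bracketed sentence above, the one-word chunk [ He ] is tagged He/B-CHUNK, the chunk [ the current account deficit ] is tagged the/B-CHUNK current/I-CHUNK account/I-CHUNK deficit/I-CHUNK, and the final period, which belongs to no chunk, receives the tag O. (In the actual CoNLL data, CHUNK is replaced by the specific chunk type.)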

In this example, each sample point has two observations: the word itself and its POS tag.

  • Training data:
    Observation 1 (sentence.txt)
    Observation 2 (POS.txt)
    State sequence (chunk.txt)
  • Feature file:
    feature file (feature_multi)
  • Run training:
    ./crf_FVFF -t train_multi_list.txt -f feature.txt -n 2 final_parameters.txt
    where the '-n' option specifies the number of observations for each sample point, in this example n = 2;
    and the file train_multi_list.txt lists all the training data in the case of multiple observations. Note that the two observations go before the state sequence in the list (a hypothetical sketch of its contents follows).
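Following that ordering, and assuming one file name per line (the exact format is an assumption, not prescribed here), train_multi_list.txt for the data above might contain:

    sentence.txt
    POS.txt
    chunk.txt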