Sentence chunking is the task of splitting a sentence into its main chunks (short, syntactically related groups of words, such as noun and verb phrases). In this example, we evaluated ILASP on a sentence chunking dataset that had previously been used to evaluate the INSPIRE system (Kazmi et al. 2017).
As sentences are unstructured data, some preprocessing is needed before a logic-based system can be used. In (Kazmi et al. 2017), the dataset was processed with the Stanford CoreNLP library to convert each sentence into a set of Part-of-Speech (POS) tags, as shown in the image below.
For this experiment, we used the same sets of POS tags generated in (Kazmi et al. 2017). The task is then to learn rules, based on these POS tags, that determine where to split each sentence.
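As an illustration, the sketch below shows what this preprocessing might look like. It uses NLTK's POS tagger as a stand-in for the Stanford CoreNLP pipeline used in (Kazmi et al. 2017), and the `pos/2` fact encoding is purely hypothetical: one plausible way to present the tags to a logic-based learner, not necessarily the representation used in the actual experiments.

```python
# A minimal sketch of the preprocessing step, using NLTK's POS tagger as a
# stand-in for the Stanford CoreNLP pipeline used in (Kazmi et al. 2017).
# The pos/2 fact encoding is a hypothetical example of how the tags could
# be presented to a logic-based learner such as ILASP.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

def sentence_to_facts(sentence: str) -> list[str]:
    """POS-tag a sentence and encode each token's tag as a logic fact."""
    tagged = nltk.pos_tag(sentence.split())  # [(word, tag), ...]
    # One fact per token position: pos(position, tag).
    return [f'pos({i}, "{tag}").' for i, (_, tag) in enumerate(tagged, 1)]

print("\n".join(sentence_to_facts("The quick brown fox jumps over the lazy dog")))
# pos(1, "DT").
# pos(2, "JJ").
# ...
```

A learner could then hypothesise rules over these facts, for instance a (purely illustrative) rule such as `split(I) :- pos(I, "VBZ").`, marking a split after each token with a given tag.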
As this is a very noisy dataset, the goal was not to find a set of rules that covered every example, but one that explained the majority of them. We compared ILASP with INSPIRE on two versions of the dataset. The results are shown below: there are 5 categories of sentences and, for each category, two versions of the dataset, one with 100 sentences and the other with 500. Each result is the average F1 score from 11-fold cross validation.
**100 sentences:**

Dataset | INSPIRE F1 score | ILASP F1 score |
---|---|---|
Headlines S1 | 73.1 | 74.2 |
Headlines S2 | 70.7 | 73.0 |
Images S1 | 81.8 | 83.0 |
Images S2 | 73.9 | 75.2 |
Students S1/S2 | 67.0 | 72.5 |
**500 sentences:**

Dataset | INSPIRE F1 score | ILASP F1 score |
---|---|---|
Headlines S1 | 69.7 | 75.3 |
Headlines S2 | 73.4 | 77.2 |
Images S1 | 75.3 | 80.8 |
Images S2 | 71.3 | 78.9 |
Students S1/S2 | 66.3 | 75.6 |
ILASP performed better than INSPIRE in every category. One explanation for this is that ILASP is guaranteed to find an optimal solution to the learning problem, whereas INSPIRE only approximates the optimisation problem. Furthermore, as part of INSPIRE's algorithm runs with a time limit, the approximation can be worse on larger datasets than on smaller ones. This is noticeable in the results: in 4 out of the 5 categories, INSPIRE performs better with 100 examples than with 500. ILASP, on the other hand, performs better with 500 examples in 4 out of the 5 categories, which is closer to what one would expect: given more data, a learner should perform better on average.
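For concreteness, the sketch below shows one way an averaged F1 score of this kind can be computed, treating each prediction as a set of split positions. The fold data here is made up for illustration (and only three folds are shown, rather than the 11 used in the experiments); it is not taken from the actual evaluation.

```python
# A minimal sketch of cross-validated F1 over split positions.
# Each fold yields a (predicted, gold) pair of split-position sets;
# the data below is fabricated purely for illustration.
def f1_score(predicted: set, gold: set) -> float:
    """F1 of predicted vs. gold split positions."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One (predicted, gold) pair per cross-validation fold (3 folds shown).
fold_results = [({2, 5}, {2, 5}), ({1, 4}, {2, 4}), ({3}, {3, 6})]
avg_f1 = sum(f1_score(p, g) for p, g in fold_results) / len(fold_results)
print(f"average F1: {avg_f1:.3f}")  # (1.0 + 0.5 + 0.667) / 3 = 0.722
```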