Sentence Chunking


Sentence chunking is the task of splitting a sentence into short groups of syntactically related words, called chunks. In this example, we evaluated ILASP on a sentence chunking dataset that had previously been used to evaluate the INSPIRE system (Kazmi et al. 2017).

According to the dataset, this sentence should be split into three chunks.
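(For illustration only: a hypothetical sentence such as "The little boy ran down the street", not taken from the dataset, could be chunked as [The little boy] [ran] [down the street].)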

As sentences are unstructured data, some preprocessing is needed before a logic-based system can be used. In (Kazmi et al. 2017), the dataset was processed using the Stanford CoreNLP library to convert each sentence into a set of Part-of-Speech (POS) tags, as shown in the image below.

The POS tagging of the sentence, generated by the Stanford CoreNLP library.

For this experiment, we used the same sets of POS tags generated in (Kazmi et al. 2017). The task is then to learn rules, based on these POS tags, that determine where to split each sentence.
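To make this setup concrete, the sketch below shows what such a preprocessing step could look like in Python. It uses NLTK's POS tagger as a stand-in for the Stanford CoreNLP pipeline used in (Kazmi et al. 2017); the example sentence, the pos/2 encoding and the example rule are illustrative assumptions, not the exact representation used in the experiment.

```python
import nltk

# A minimal sketch of the preprocessing step, using NLTK's POS tagger as a
# stand-in for the Stanford CoreNLP pipeline used in the original work.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The little boy ran down the street"  # hypothetical example sentence

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # e.g. [('The', 'DT'), ('little', 'JJ'), ...]

# One possible encoding of the tagged sentence as facts for a logic-based
# learner: pos(Index, Tag) for each token position.
for index, (token, tag) in enumerate(tagged, start=1):
    print(f'pos({index}, "{tag.lower()}").')

# A learned hypothesis would then consist of rules over these facts, for
# example (written informally):
#   split(X) :- pos(X, "vbd").
# i.e. "split the sentence at any position holding a past-tense verb".
```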

As this is a very noisy dataset, the goal was not to find a set of rules that covers every example, but one that explains the majority of them. We compared ILASP to INSPIRE on both versions of the dataset. The results are shown below: there are 5 categories of sentences and, for each category, two versions of the dataset, one with 100 sentences and the other with 500 sentences. Each result is the average F1 score over 11-fold cross validation.
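For reference, the F1 score used here is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall).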

100 Examples

Dataset          INSPIRE F1 score   ILASP F1 score
Headlines S1     73.1               74.2
Headlines S2     70.7               73.0
Images S1        81.8               83.0
Images S2        73.9               75.2
Students S1/S2   67.0               72.5

500 Examples

Dataset          INSPIRE F1 score   ILASP F1 score
Headlines S1     69.7               75.3
Headlines S2     73.4               77.2
Images S1        75.3               80.8
Images S2        71.3               78.9
Students S1/S2   66.3               75.6

ILASP performed better than INSPIRE in every category. One explanation for this is that ILASP is guaranteed to find an optimal solution to the learning problem, whereas INSPIRE only approximates the optimisation problem. Furthermore, as part of INSPIRE's algorithm runs with a time limit, its approximation can be worse on larger datasets than on smaller ones. This is noticeable in the results: in 4 out of the 5 categories, INSPIRE performs better with 100 examples than with 500. ILASP, on the other hand, performs better with 500 examples in 4 out of the 5 categories, which is closer to what one would expect: a learner given more data should, on average, perform better.