CW 390

Stefan Raeymaekers, Maurice Bruynooghe, Jan Van den Bussche
Learning (k,l)-contextual tree languages for information extraction

Abstract

Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of the regular unranked tree languages, called the ($k$,$l$)-contextual tree languages. We describe the use of this subclass to induce wrappers for information extraction from structured documents, such as web pages. Experiments show that our algorithm can learn from very few examples and compares favorably with similar state-of-the-art approaches.

report.pdf (171K) / mailto: S. Raeymaekers