Object-oriented Data Mining

This page gives a lot more detail about the work I did during my Ph.D. studies.

Simon Rawles. Object-oriented Data Mining. PhD thesis, Department of Computer Science, University of Bristol, July 2007.


Attempts to overcome limitations in the attribute-value representation for machine learning have led to much interest in learning from structured data, concentrated in the research areas of inductive logic programming (ILP) and multi-relational data mining (MDRM). The expressiveness and encapsulation of the object-oriented data model has led to its widespread adoption in software and database design. The considerable congruence between this model and individual-centred models in inductive logic programming presents new opportunities for mining object data specific to its domain.

The thesis investigates the use of object-orientation in knowledge representation for multi-relational data mining. The thesis proposes a language for expressing object model metaknowledge and uses it to extend the reasoning mechanisms of an object-oriented logic. A refinement operator is then defined and used for feature search in a object-oriented propositionalisation-based ILP classifier. An algorithm is proposed for reducing the large number of redundant features typical in propositionalisation. A data mining system based on the refinement operator is implemented and demonstrated on a real-world computational linguistics task and compared with a conventional ILP system.


The core contribution of the thesis is the CoSinus algorithm and its underlying CorLog logical framework. CoSinus is both an algorithm and data mining system for inductive logic programming under the object model, in particular a feature search for propositionalisation using an object-specific refinement operator. CorLog is an extension of an existing object-oriented logical framework which extends the logic programming basis of ILP with a new set of features to guide induction, and define a strong bias, leading to a smaller hypothesis space but still retaining the expressiveness of the object logic. The object model gives rise to a much more sophisticated form of type-safety and type-correctness, involving conjunctions of simple, parametric and abstract classes and constraints on co-existing classes, as well as mechanisms to abstract existing structure in terms of these classes. Relations may take ordered arguments, extending the basis of value restriction in search from classes to constraints involving constants, such as simple integer inequalities. The framework gives rise to redefinition of valid substitution under the new data model, informing a moded, classed refinement operator which has desirable optimality properties, implemented in a three-tiered (feature, rule, theory) search taking advantage of the natural data encapsulation of the object model. Further aspects of the object model are exploited to arrive at search bounds derived from the object model. While there are other algorithms which perform inductive logic programming, and specifically those which propose object-like data mining in class hierarchies, no system known to the author at the time of writing is as comprehensive in its coverage of the object-oriented data model or its use of this data model in constraining induction as closely, and it is in this aspect that much of the innovation lies. Such an approach to induction is immediately useful not only for domains with a strong notion of individual but also where objects naturally belong in a class taxonomy — particularly ones which are defined at multiple levels of granularity — or which adopt complex datatypes modelled independent of their constituents. The thesis argues and demonstrates that for the de-facto forms of domain description in use in ILP today, such data is inadequately modelled but may be exploited for induction.

The second contribution is the Refer algorithm, which overcomes the often prohibitively high dimensionality of the feature sets produced by propositionalisation. This is of clear interest to the thesis, since the core inductive logic programming technique of CoSinus is propositionalisation. Refer is a post-processing feature reduction stage for feature sets containing Boolean examples labelled by one of any number of possible class labels. Refer is therefore general to the large class of learners producing such data and uses the notion of logical coverage to efficiently determine the logical redundancy of a feature for building a classification rule. Similar feature reduction algorithms have been proposed. However, the innovation here lies in the partitioning of examples into a set of disjoint subsets whose examples are mutually close in terms of Hamming distance. Comparisons between each of these subsets lead to a greater reduction than comparing the example set as a whole. Its usefulness to the topic of the thesis is that it supplements the redundancy-reducing mechanisms of CoSinus, which use a metaknowledge-based definition of redundancy, and further filters logically redundant hypotheses before presentation to the learner. A number of variants are presented and analysed, and improvements in predictive performance, quality of induced theories and reductions in induction time are demonstrated.

The third main contribution is the application of machine learning to a learning task from the field of computational linguistics. A detailed linguistic corpus is preprocessed with a sophisticated, domain-specific pre-processor to produce a highly-structured representation of thousands of English-language sentences, including their parse trees (the principal compositional structure), grammatical categories (the principal inheritance structure), parts of speech, etc. A complex domain library is defined which allows querying of these sentences in a highly object-oriented manner encompassing the modelling constructs of CorLog and CoSinus as well as augmenting it with additional, derived structure via additional rules. Two inductive logic programming systems, one which uses the object model and one which does not, are used to solve the problem of labelling the nodes within a sentence with its grammatical function. In particular, the thesis studies the labelling of a node as the subject of a sentence, or otherwise. Similar problems have been approached many times in computational linguistics, but typically by statistical methods, rather than the symbolic, structural approach taken by ILP. Theories expressed in symbolic logic are of interest to linguists, since they directly describe linguistic rules for these important lingustic features. More generally, models for labelling grammatical functions of words are immediately useful for document analysis and retrieval applications. The work tests the thesis' claims in a practical setting by establishing a mapping between the framework of the object model and the conventional logic programming basis and inductive bias of existing ILP learners. Results from applying the learners under various parameter settings and combinations of available background knowledge are compared, showing that by using the object-oriented data model for representation and induction, a higher-quality set of hypotheses is search and more comprehensible models with better predictive accuracy are produced compared to an existing, established ILP learner. Furthermore, it is shown that object-oriented data mining is a viable approach for real-world problems in computational linguistics, and in particular the labelling of grammatical functions.