K.U.Leuven's second submission to PTE2

Warmr and Maccent with original Progol background


Consortium

The Machine Learning group of the Katholieke Universiteit Leuven

Name

Luc Dehaspe

Address

Luc Dehaspe
Department of Computer Science
Celestijnenlaan 200A
B-3001 Heverlee, Belgium

Materials


The system WARMR was used to find frequent substructures. These were then used as binary features in MACCENT. Both programs run on Sun machines equipped with the Solaris operating system. For the experiments described hereafter, we mainly used a SPARC Ultra-2.

Data files (both examples and background) were ftp'ed from the PTE page (Some relevant information, Datasets:Data used in our ILP experiments) and converted to WARMR format.

The data files in WARMR format, together with a settings file and the output produced by WARMR, can be extracted from this file.

Method


The classifier was built in two stages: a feature selection step (with WARMR) and a model selection step (with MACCENT).

(a) Feature selection with WARMR

WARMR is a general-purpose ILP algorithm for finding frequent queries in databases. The task of frequent pattern discovery is best known in its simplest form from association rule mining, with market basket analysis as the prototypical application: find out which products tend to be sold together. WARMR uses a more expressive first-order formalism for representing patterns: every pattern is described by means of a Prolog query, and the task is to find queries that frequently succeed. In the PTE application, this comes down to finding frequent substructures in the 337 compounds. The frequency threshold was set to 10%.
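To make the role of the frequency threshold concrete, here is a minimal Python sketch. It is not WARMR itself: queries are modelled as plain predicates over a toy, hypothetical compound encoding (sets of fact strings), and the names `frequency`, `frequent_queries`, and the example facts are assumptions made for illustration only.

```python
from typing import Callable, Dict, FrozenSet, List

# Toy stand-in for a compound: a set of ground facts (hypothetical encoding).
Compound = FrozenSet[str]
# A "query" is modelled as a predicate that succeeds or fails on a compound.
Query = Callable[[Compound], bool]


def frequency(query: Query, compounds: List[Compound]) -> float:
    """Fraction of compounds on which the query succeeds."""
    return sum(1 for c in compounds if query(c)) / len(compounds)


def frequent_queries(
    queries: Dict[str, Query],
    compounds: List[Compound],
    threshold: float = 0.10,  # the 10% threshold mentioned in the text
) -> Dict[str, float]:
    """Keep only the queries whose frequency reaches the threshold."""
    kept: Dict[str, float] = {}
    for name, query in queries.items():
        freq = frequency(query, compounds)
        if freq >= threshold:
            kept[name] = freq
    return kept


# 20 invented compounds: 5 with interesting facts, 15 plain ones.
compounds: List[Compound] = [
    frozenset({"atom(c)", "atom(cl)"}),
    frozenset({"atom(c)", "atom(n)", "group(nitro)"}),
    frozenset({"atom(c)", "atom(o)"}),
    frozenset({"atom(c)", "atom(cl)", "atom(o)"}),
    frozenset({"atom(c)", "atom(cl)", "atom(o)"}),
] + [frozenset({"atom(c)"})] * 15

queries: Dict[str, Query] = {
    "has_chlorine": lambda c: "atom(cl)" in c,
    "has_nitro": lambda c: "group(nitro)" in c,
    "has_carbon_and_oxygen": lambda c: {"atom(c)", "atom(o)"} <= c,
}

result = frequent_queries(queries, compounds)
```

In this toy data, `has_nitro` succeeds on only 1 of 20 compounds (5%) and is dropped, while the other two queries succeed on 3 of 20 (15%) and are kept.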

Once frequent queries and their frequencies are discovered, probabilistic rules can be produced, much like in the case of association rules (examples are given below). Alternatively, these queries can be used as binary features (they either succeed or fail with respect to an example) with your favourite propositional learner, preferably one that can handle a large set of highly dependent features (see (b) Model selection with MACCENT).
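Both uses of a query can be illustrated with a short Python sketch. The encoding of examples as sets of strings, the two queries, the labels, and the function names are hypothetical; the rule confidence is estimated by simple counting, as in association rule mining.

```python
from typing import Callable, FrozenSet, List, Sequence

Query = Callable[[FrozenSet[str]], bool]


def to_binary_features(example: FrozenSet[str], queries: Sequence[Query]) -> List[int]:
    """One 0/1 feature per query: 1 if the query succeeds on the example."""
    return [1 if q(example) else 0 for q in queries]


def rule_confidence(hits: Sequence[int], labels: Sequence[int]) -> float:
    """Estimate P(label = 1 | query succeeds) by counting covered examples."""
    covered = [label for hit, label in zip(hits, labels) if hit]
    if not covered:
        raise ValueError("the query covers no example")
    return sum(covered) / len(covered)


# Invented examples and class labels (1 = carcinogenic, 0 = non-carcinogenic).
queries: List[Query] = [lambda c: "cl" in c, lambda c: "nitro" in c]
examples = [
    frozenset({"cl"}),
    frozenset({"cl", "nitro"}),
    frozenset({"nitro"}),
    frozenset(),
]
labels = [1, 1, 0, 0]

# Rows of the propositional data set handed to a learner.
feature_rows = [to_binary_features(e, queries) for e in examples]

# Confidence of the rules "IF query_i succeeds THEN carcinogenic".
chlorine_confidence = rule_confidence([row[0] for row in feature_rows], labels)
nitro_confidence = rule_confidence([row[1] for row in feature_rows], labels)
```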

We randomly split the set of 337 compounds into 2/3 for the discovery of frequent substructures and 1/3 for the validation of derived probabilistic rules about carcinogenicity. The language bias used for this experiment included all information except the Ashby alerts. The output file with all 17049 frequent queries is here. These frequent queries were further processed into rules of the type IF substructure(C) THEN (non)carcinogenic(C). With a binomial test, these rules were ranked according to their unusualness (i.e. how much their accuracy deviates from the accuracy of the default rule IF true THEN (non)carcinogenic(C)). The 215 rules that deviate by more than 3*sigma are listed here, with full information on their accuracies on the 2/3 training set, the 1/3 validation set, and the overall set of 337 compounds. This ordered list of 215 rules is the interpretable result of our experiments. These rules can be further grouped and condensed into 5 sets of variants: 3 with explanations for pos (set1, set2, set3) and 2 with explanations for neg (set4, set5).
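The split and the ranking criterion can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the normal approximation to the binomial, it takes as baseline the accuracy of a rule that ignores the substructure, and the function names, the seed, and the numbers 40, 34, 24 and 0.55 are invented rather than taken from our experiments.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_two_thirds(ids: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Random 2/3 vs 1/3 split of compound identifiers."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def unusualness(covered: int, correct: int, baseline_accuracy: float) -> float:
    """Deviation of a rule's accuracy from the baseline, in units of sigma.

    sigma is the standard deviation of the accuracy of `covered` draws
    from a binomial with success probability `baseline_accuracy`.
    """
    sigma = math.sqrt(baseline_accuracy * (1.0 - baseline_accuracy) / covered)
    return (correct / covered - baseline_accuracy) / sigma


train, valid = split_two_thirds(range(337))

# A rule covering 40 training compounds, 34 of them correctly (invented numbers),
# against an assumed baseline accuracy of 0.55.
strong_rule = unusualness(covered=40, correct=34, baseline_accuracy=0.55)
weak_rule = unusualness(covered=40, correct=24, baseline_accuracy=0.55)

# Keep only rules that deviate by more than 3 sigma.
kept = [z for z in (strong_rule, weak_rule) if abs(z) > 3.0]
```

With 337 compounds this split yields 224 training and 113 validation compounds; only the first of the two invented rules passes the 3*sigma filter.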

(b) Model selection with MACCENT

In a second step, we used the 17049 frequent queries produced by WARMR as binary features in MACCENT. MACCENT is a statistical modeling tool based on maximum entropy. In this approach, a maximum-likelihood strategy is followed to fit an exponential model to the training data. Frequencies of features impose constraints on the model. The general idea of maximum entropy modeling is to construct a model that meets these constraints but is otherwise as uniform as possible. The input data, in a standard C4.5 format, and the output of MACCENT can be found here.
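The link between maximum likelihood and feature constraints can be illustrated with a small Python sketch. It is not MACCENT: it fits all features at once by plain gradient ascent, whereas model selection is not shown. For two classes, the exponential model reduces to logistic regression, which is what the code fits. The data, step size, iteration count, and function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def predict_pos(weights: Sequence[float], bias: float, x: Sequence[int]) -> float:
    """p(+ | x) under the two-class exponential model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_maxent(
    X: Sequence[Sequence[int]], y: Sequence[int], steps: int = 5000, lr: float = 0.5
) -> Tuple[List[float], float]:
    """Maximum-likelihood fit by gradient ascent on the mean log-likelihood."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        probs = [predict_pos(weights, bias, x) for x in X]
        # Gradient for feature i: observed count minus count expected by the model.
        grads = [
            sum((yn - pn) * x[i] for x, yn, pn in zip(X, y, probs)) / n
            for i in range(d)
        ]
        bias_grad = sum(yn - pn for yn, pn in zip(y, probs)) / n
        weights = [w + lr * g for w, g in zip(weights, grads)]
        bias += lr * bias_grad
    return weights, bias


# Invented binary feature vectors and labels (1 = carcinogenic).
X = [[1, 0]] * 3 + [[1, 1]] * 3 + [[0, 1]] * 2 + [[0, 0]] * 3
y = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

weights, bias = fit_maxent(X, y)
probs = [predict_pos(weights, bias, x) for x in X]

# At the optimum the constraints hold: for every feature, the expected count
# under the model matches the count observed together with the positive class.
observed = [sum(yn * x[i] for x, yn in zip(X, y)) for i in range(2)]
expected = [sum(pn * x[i] for x, pn in zip(X, probs)) for i in range(2)]
```

After fitting, `observed` and `expected` agree up to numerical tolerance, which is the sense in which feature frequencies constrain the model.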

Results


Feature selection took about 5 hours of CPU time. Model selection (excluding the preparation of the data in C4.5 format) took about 1 minute.

As mentioned above, the output of WARMR, and more specifically the 215 ranked probabilistic rules shown here and, in a more condensed way, in set1, set2, set3, set4 and set5, is meant to contribute to scientific insight. The MACCENT model, on the other hand, is meant only for prediction.

The output of MACCENT is a distribution over both classes, rather than a single class. For this challenge we have selected the most probable class as the predicted one, but the ranking shown below according to the probability of being carcinogenic might also make sense.

Warmr+Maccent ranked PTE-2 predictions (most probable + on top)

| CAS Id | Prolog Id | Compound Name | Prediction | Probability + |
|---|---|---|---|---|
| 8003-22-3 | t10 | D & C YELLOW NO. 11 | + | 0.606 |
| 127-00-4 | t13 | 1-CHLORO-2-PROPANOL | + | 0.588 |
| 126-99-8 | t8 | CHLOROPRENE | + | 0.574 |
| 84-65-1 | t25 | ANTHRAQUINONE | + | 0.573 |
| 7632-00-0 | t28 | SODIUM NITRITE | + | 0.566 |
| 1303-00-0 | t21 | GALLIUM ARSENIDE | + | 0.558 |
| 78-84-2 | t11 | ISOBUTYRALDEHYDE | + | 0.553 |
| 75-52-8 | t4 | NITROMETHANE | + | 0.552 |
| 1313-27-5 | t12 | MOLYBDENUM TRIOXIDE | + | 0.55 |
| 1314-62-1 | t30 | VANADIUM PENTOXIDE | + | 0.5483 |
| 115-11-7 | t22 | ISOBUTENE | + | 0.5482 |
| 104-55-2 | t29 | CINNAMALDEHYDE | + | 0.546 |
| 10026-24-1 | t9 | COBALT SULFATE HEPTAHYDRATE | + | 0.543 |
| 518-82-1 | t26 | EMODIN | + | 0.536 |
| 125-33-7 | t19 | PRIMACLONE | + | 0.534 |
| 5392-40-5 | t27 | CITRAL | + | 0.529 |
| 100-41-4 | t7 | ETHYLBENZENE | + | 0.527 |
| 110-86-1 | t16 | PYRIDINE | + | 0.526 |
| 109-99-9 | t5 | TETRAHYDROFURAN | + | 0.524 |
| 98-00-0 | t18 | FURFURYL ALCOHOL | + | 0.506 |
| 93-15-2 | t23 | METHYLEUGENOL | - | 0.486 |
| 147-47-7 | t3 | 1,2-DIHYDRO-2,2,4-TRIMETHYLQUINOLINE | - | 0.48 |
| 76-57-3 | t2 | CODEINE | - | 0.469 |
| 1300-72-7 | t17 | XYLENESULFONIC ACID | - | 0.461 |
| 1948-33-0 | t6 | T-BUTYLHYDROQUINONE | - | 0.452 |
| 111-76-2 | t20 | ETHYLENE GLYCOL MONOBUTYL ETHER | - | 0.446 |
| 111-42-2 | t14 | DIETHANOLAMINE | - | 0.439 |
| 77-09-8 | t15 | PHENOLPHTHALEIN | - | 0.436 |
| 434-07-1 | t24 | OXYMETHOLONE | - | 0.416 |
| 6533-68-2 | t1 | SCOPOLAMINE HYDROBROMIDE | - | 0.376 |
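The step from a class distribution to a single prediction and to the ranking can be sketched in Python. The four (Prolog Id, probability) pairs are taken from the PTE-2 predictions listed in this section; the function names and the 0.5 cut-off (equivalent to picking the most probable of two classes) are our illustrative assumptions.

```python
from typing import List, Tuple


def predicted_class(prob_pos: float, threshold: float = 0.5) -> str:
    """Most probable of the two classes, given the probability of '+'."""
    return "+" if prob_pos > threshold else "-"


def rank_by_probability(rows: List[Tuple[str, float]]) -> List[Tuple[str, float, str]]:
    """Sort compounds with the most probable '+' on top and attach the prediction."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(cid, p, predicted_class(p)) for cid, p in ordered]


# A few (Prolog Id, probability of '+') pairs from the PTE-2 predictions.
rows = [("t1", 0.376), ("t10", 0.606), ("t23", 0.486), ("t18", 0.506)]
ranked = rank_by_probability(rows)
```

Here `ranked` lists t10 and t18 as "+" followed by t23 and t1 as "-", in decreasing order of probability.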

Comment

Both WARMR and MACCENT are available for academic purposes upon request.

References

Warmr

Maccent

Feature construction with ILP