The system WARMR was used to find frequent substructures. These were then used as binary features in MACCENT. Both programs run on a Sun machine equipped with the Solaris operating system. For the experiments described hereafter, we mainly used a SPARC Ultra-2.
Data files (both examples and background knowledge) were retrieved by FTP from the PTE page (Some relevant information, Datasets: Data used in our ILP experiments) and converted to WARMR format.
The data files in WARMR format, together with a settings file and the output produced by WARMR, can be extracted from this file.
The classifier was built in two stages: a feature selection step (with WARMR) and a model selection step (with MACCENT).
WARMR is a general-purpose ILP algorithm for finding frequent queries in databases. The task of frequent pattern discovery is best known in its simplest form from association rule mining, with a prototypical application in market basket analysis: find out which products tend to be sold together. WARMR uses a more expressive first-order formalism for representing patterns: every pattern is described by means of a Prolog query, and the task is to find queries that frequently succeed. In the PTE application, this comes down to finding frequent substructures in the 337 compounds. The frequency threshold was set to 10%.
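As a rough illustration of the frequency threshold only (not of WARMR's first-order search over Prolog queries), the sketch below treats each candidate pattern as a Python predicate over a toy compound representation and keeps the patterns that succeed on at least 10% of the examples. The compound encoding, the pattern names, and the `frequent_patterns` helper are hypothetical and do not come from WARMR.

```python
from typing import Callable, Dict, List, Set

# Toy stand-in for a compound: the set of element labels it contains.
# (Hypothetical encoding; WARMR itself works on Prolog facts.)
Compound = Set[str]
Pattern = Callable[[Compound], bool]


def frequent_patterns(
    compounds: List[Compound],
    patterns: Dict[str, Pattern],
    min_frequency: float = 0.10,
) -> Dict[str, float]:
    """Return the patterns whose relative frequency reaches min_frequency."""
    total = len(compounds)
    result: Dict[str, float] = {}
    for name, succeeds in patterns.items():
        # A pattern "succeeds" on a compound when its predicate returns True.
        frequency = sum(1 for c in compounds if succeeds(c)) / total
        if frequency >= min_frequency:
            result[name] = frequency
    return result


# Eleven made-up compounds, purely for illustration.
compounds: List[Compound] = [
    {"c", "h", "cl"},
    {"c", "h", "o"},
    {"c", "h", "n", "o"},
    {"c", "h"},
    {"c", "h", "br"},
    {"c", "h", "o"},
    {"c", "h", "n"},
    {"c", "h"},
    {"c", "h", "o"},
    {"c", "h"},
    {"c", "h", "as"},
]

patterns: Dict[str, Pattern] = {
    "has_oxygen": lambda c: "o" in c,
    "has_nitrogen": lambda c: "n" in c,
    "has_chlorine": lambda c: "cl" in c,
    "has_nitrogen_and_oxygen": lambda c: {"n", "o"} <= c,
}

frequent = frequent_patterns(compounds, patterns)
for name, frequency in sorted(frequent.items()):
    print(f"{name}: {frequency:.3f}")
```

In this toy data, patterns that match only one of the eleven compounds fall below the 10% threshold and are discarded, while the more common ones are kept together with their frequencies.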
Once frequent queries and their frequencies are discovered, probabilistic rules can be produced, much like in the case of association rules (examples are given below). Alternatively, these queries can be used as binary features (they either succeed or fail with respect to an example) with your favourite propositional learner, preferably one that can handle a large set of highly dependent features (see (b) model selection with MACCENT).
We randomly split the set of 337 compounds into 2/3 for the discovery of frequent substructures and 1/3 for the validation of derived probabilistic rules about carcinogenicity. The language bias used for this experiment included all information except the Ashby alerts. The output file with all 17049 frequent queries is here. These frequent queries were further processed into rules of the type IF substructure(C) THEN (non)carcinogenic(C). With a binomial test, these rules were ranked according to their unusualness (i.e. how much their accuracy deviates from the accuracy of the default rule IF true THEN (non)carcinogenic(C)). The 215 rules that deviate more than 3*sigma are listed here, with full information on their accuracies on the 2/3 training set, the 1/3 validation set, and the overall set of 337 compounds. This ordered list of 215 rules is the interpretable result of our experiments. These rules can be further grouped and condensed into 5 sets of variants: 3 with explanations for pos (set1, set2, set3) and 2 with explanations for neg (set4, set5).
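The text does not spell out the exact statistic behind the 3*sigma cut-off. One common reading, sketched below, uses the normal approximation to the binomial: if a rule covers n compounds and a baseline accuracy p0 holds (assumed here to be the accuracy of always predicting the same class), the number of correct predictions has mean n*p0 and standard deviation sqrt(n*p0*(1-p0)). The `Rule` type, the rule names, the counts, and the baseline of 0.5 are all made up for illustration.

```python
import math
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str     # hypothetical rule label
    covered: int  # compounds matching the substructure
    correct: int  # covered compounds that have the predicted class


def sigma_deviation(rule: Rule, baseline_accuracy: float) -> float:
    """Deviation of the correct count from its binomial expectation, in sigmas."""
    expected = rule.covered * baseline_accuracy
    sigma = math.sqrt(rule.covered * baseline_accuracy * (1.0 - baseline_accuracy))
    return (rule.correct - expected) / sigma


def rank_unusual(
    rules: List[Rule], baseline_accuracy: float, threshold: float = 3.0
) -> List[Tuple[float, Rule]]:
    """Keep rules deviating by more than `threshold` sigmas, most unusual first.

    Assumption: deviations in either direction count as unusual.
    """
    scored = [(sigma_deviation(r, baseline_accuracy), r) for r in rules]
    kept = [(z, r) for z, r in scored if abs(z) > threshold]
    return sorted(kept, key=lambda item: abs(item[0]), reverse=True)


# Illustrative counts only; not taken from the WARMR output.
rules = [
    Rule("rule_a", covered=36, correct=30),
    Rule("rule_b", covered=100, correct=55),
    Rule("rule_c", covered=64, correct=12),
]

ranked = rank_unusual(rules, baseline_accuracy=0.5)
for z, rule in ranked:
    print(f"{rule.name}: {z:+.1f} sigma")
```

Ranking by the absolute deviation treats a rule that is unusually inaccurate for one class as equally interesting, since it is an unusually accurate rule for the other class.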
Feature selection took about 5 hours of CPU time. Model selection (excluding the preparation of the data in C4.5 format) took about 1 minute.
As mentioned above, the output of WARMR, and more specifically the 215 ranked probabilistic rules shown here and, in a more condensed way, in set1 set2 set3 set4 set5, is meant to contribute to scientific insight. The MACCENT model, on the other hand, is merely meant for prediction.
The output of MACCENT is a distribution over both classes, rather than a single class. For this challenge we have selected the most probable class as the predicted one, but the ranking shown below according to the probability of being carcinogenic might also make sense.
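The step from a class distribution to a single prediction and a ranking can be sketched as follows. The probabilities are four of the values reported for the PTE-2 compounds; the `predict_and_rank` helper and the handling of an exact 0.5 tie are assumptions, since MACCENT's own output format is not described here.

```python
from typing import Dict, List, Tuple


def predict_and_rank(prob_positive: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Label each compound with its most probable class and sort by P(+)."""
    rows = []
    for compound_id, p in prob_positive.items():
        # With two classes, "+" is the most probable class when P(+) > 0.5.
        # Assumption: an exact tie at 0.5 is labelled "+".
        label = "+" if p >= 0.5 else "-"
        rows.append((compound_id, label, p))
    # Most probable carcinogens first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Probability of being carcinogenic, keyed by Prolog Id.
probabilities = {"t23": 0.486, "t10": 0.606, "t1": 0.376, "t18": 0.506}

ranking = predict_and_rank(probabilities)
for compound_id, label, p in ranking:
    print(f"{compound_id} | {label} | {p}")
```

Keeping the probability alongside the label preserves the information that, for example, t18 and t23 sit close to the decision boundary even though they receive different classes.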
Warmr+Maccent ranked PTE-2 predictions (most probable + on top)
CAS Id | Prolog Id | Compound Name | Prediction | Probability +
8003-22-3 | t10 | D & C YELLOW NO. 11 | + | 0.606
127-00-4 | t13 | 1-CHLORO-2-PROPANOL | + | 0.588
126-99-8 | t8 | CHLOROPRENE | + | 0.574
84-65-1 | t25 | ANTHRAQUINONE | + | 0.573
7632-00-0 | t28 | SODIUM NITRITE | + | 0.566
1303-00-0 | t21 | GALLIUM ARSENIDE | + | 0.558
78-84-2 | t11 | ISOBUTYRALDEHYDE | + | 0.553
75-52-8 | t4 | NITROMETHANE | + | 0.552
1313-27-5 | t12 | MOLYBDENUM TRIOXIDE | + | 0.55
1314-62-1 | t30 | VANADIUM PENTOXIDE | + | 0.5483
115-11-7 | t22 | ISOBUTENE | + | 0.5482
104-55-2 | t29 | CINNAMALDEHYDE | + | 0.546
10026-24-1 | t9 | COBALT SULFATE HEPTAHYDRATE | + | 0.543
518-82-1 | t26 | EMODIN | + | 0.536
125-33-7 | t19 | PRIMACLONE | + | 0.534
5392-40-5 | t27 | CITRAL | + | 0.529
100-41-4 | t7 | ETHYLBENZENE | + | 0.527
110-86-1 | t16 | PYRIDINE | + | 0.526
109-99-9 | t5 | TETRAHYDROFURAN | + | 0.524
98-00-0 | t18 | FURFURYL ALCOHOL | + | 0.506
93-15-2 | t23 | METHYLEUGENOL | - | 0.486
147-47-7 | t3 | 1,2-DIHYDRO-2,2,4-TRIMETHYLQUINOLINE | - | 0.48
76-57-3 | t2 | CODEINE | - | 0.469
1300-72-7 | t17 | XYLENESULFONIC ACID | - | 0.461
1948-33-0 | t6 | T-BUTYLHYDROQUINONE | - | 0.452
111-76-2 | t20 | ETHYLENE GLYCOL MONOBUTYL ETHER | - | 0.446
111-42-2 | t14 | DIETHANOLAMINE | - | 0.439
77-09-8 | t15 | PHENOLPHTHALEIN | - | 0.436
434-07-1 | t24 | OXYMETHOLONE | - | 0.416
6533-68-2 | t1 | SCOPOLAMINE HYDROBROMIDE | - | 0.376
Both WARMR and MACCENT are available for academic purposes upon request.