Generating a Condensed Representation for Positive and Negative Association Rules

Given a large collection of transactions over items, a common problem in association rule mining is the huge size of the extracted rule set. Pruning uninteresting and redundant association rules is a promising approach to this problem. In this paper, we propose a Condensed Representation for Positive and Negative Association Rules that captures non-redundant rules, both exact and approximate, based on the sets of frequent generator itemsets, frequent closed itemsets, maximal frequent itemsets, and minimal infrequent itemsets of a database B. Experiments on dense (highly correlated) databases show a significant reduction of the size of the extracted rule set.


Introduction and Motivations
Positive and negative association rule (PNAR) mining has been studied extensively in data mining. Let X and Y be two disjoint itemsets. An association rule X → Y states that a significant proportion of the transactions of database B that contain the items of the premise (or antecedent) X also contain the items of the consequent (or conclusion) Y. Such a rule expresses a positive relation between items and is called a positive association rule (PAR). Rules of the three other forms X → ¬Y, ¬X → Y and ¬X → ¬Y (where ¬X denotes the absence of X), which express negative relations between items, are called negative association rules (NAR).
A basic problem common to association rule mining is the huge number of generated rules, many of which are uninteresting (Definition 2) or redundant (Definition 4). Many approaches [13], [14], [16], based on the traditional confidence measure [1], have been developed to reduce the size of the extracted rule set. However, no method for pruning uninteresting association rules (UAR) has been found in the literature; indeed, the classic confidence measure is not able to prune uninteresting rules. In addition, these approaches are insufficient because they consider only positive association rules, and rely on the weakly selective support-confidence pair [1]. Mining NARs, which can be of interest to several domains [4], [6], [11], [15] such as Artificial Intelligence, Machine Learning, Data Mining, Big Data, Visualization, Marketing, and Web mining, is much less developed than mining PARs, due to the high computational cost and the huge search space induced by NAR candidates.
In this paper, we propose a Condensed Representation of non-redundant positive and negative association rules based on generator itemsets, closed itemsets, maximal itemsets and minimal infrequent itemsets. The main contributions are summarized as follows. 1) We propose the GC2M algorithm for simultaneously mining all frequent generators, all frequent closed itemsets, all maximal frequent itemsets, and all minimal infrequent itemsets. GC2M is an abbreviation of Generator itemsets, Closed itemsets, Maximal itemsets, and Minimal infrequent itemsets. 2) We introduce a formal definition of uninteresting association rules (UAR), then propose an efficient strategy for pruning UARs using the M_GK measure [7]. 3) We propose an efficient strategy for search space pruning. 4) We propose three new efficient bases built on the M_GK measure: the Concise Basis for Positive Approximate Rules (CBA), the Concise Basis for Negative Exact Rules (CBE⁻), and the Concise Basis for Negative Approximate Rules (CBA⁻). We prove that these concise bases are a lossless representation of the non-redundant rules, since all valid rules can be derived from them (cf. Theorems 2, 3, 4 and 5). 5) Based on these formalizations, we develop an efficient algorithm, called CONCISE, to discover non-redundant rules.
This paper is organized as follows. Section 2 discusses the related works. Section 3 gives the basic concepts. A Condensed Representation for PNARs is detailed in Section 4. Section 5 presents the experimental results. Conclusion and future work are given in Section 6.

Related works
The approaches to association rule mining can be roughly divided into two categories: (i) bases of positive association rules, and (ii) bases of negative association rules.
For positive bases, we first mention the Duquenne-Guigues basis [10]. Without going into the details of its computation, this basis is not informative. Bastide's approach [2] adapts the Duquenne-Guigues basis, but inherits the same flaws as Guigues's approach [10]. In [13], the authors define two bases: Exact Min-Max Association Rules and Approximate Min-Max Association Rules. Despite their indisputable interest, these two bases contain UARs and are not complete (i.e. they do not generate negative association rules). In [14], Pasquier defines two bases: the Generic Base for Exact Rules and the Generic Base for Approximate Rules. However, this approach is still incomplete and not optimal: it extracts only positive association rules, many of which are UARs because of the confidence measure. Xu's approach [16] extends Pasquier's approach [13] and defines two bases, the Reliable Approximate Basis and the Reliable Exact Basis, using the Certainty Factor (CF). Like Pasquier's approach [14], Xu's approach [16] is incomplete: it only considers positive rules and ignores negative association rules.
For negative bases, it is important to note that the extraction of negative rules is far less developed than that of positive rules. The bibliographic study conducted so far shows that Feno's approach [7] is the first to have studied the problem of bases for negative rules. It extends Pasquier's approach [13] and defines four bases: the Basis for Exact Positive rules (BPE), the Basis for Approximate Positive rules (BPA), the Basis for Exact Negative rules (BNE) and the Basis for Approximate Negative rules (BNA). However, this approach is not informative, because it selects premises from the positive border [12] (or pseudo-closed itemsets [13]), which intuitively returns maximal elements, contradicting the notion of minimal premise. It is also not very selective, due to the use of a critical value (cf. Equation (4)) when selecting valid rules. In addition, its formulation of negative exact rules is not appropriate and can incur a high memory cost when exploring the search space. Recently, Dong et al. [5] proposed an efficient method for pruning redundant negative and positive rules using confidence and the correlation coefficient. As with Pasquier's approach, no method to prune UARs is provided. In particular, Dong's approach does not rely on a concept of bases for non-redundant rules, so its semantics is not directly comparable to our approach.
From this quick survey, mining informative association rules remains a major challenge, for several reasons. On the one hand, the majority of existing approaches are limited to positive association rules, which is not sufficient to guarantee the interest of the extracted knowledge. On the other hand, these approaches rely on the classic support-confidence pair [1], which produces a high number of association rules whose interest is not always guaranteed.

In the association rule problem, a database (cf. Table 1) is a triplet B = (T, I, R), where T and I are finite sets of transactions and items respectively, and R ⊆ T × I is a binary relation between T and I. A relation iRt denotes that item i occurs in transaction t. Let X ⊆ I; ϕ(X) = {t ∈ T | ∀i ∈ X, iRt} denotes the set of transactions containing all items of X, and dually, for T′ ⊆ T, ψ(T′) = {i ∈ I | ∀t ∈ T′, iRt} denotes the set of items common to all transactions of T′.

Basic concepts
Both functions ϕ and ψ form a Galois connection between P(I) and P(T) [8], where P(O) denotes the power set of O. The composite function γ(X) = ψ∘ϕ(X) is called the Galois closure operator. Let X ⊆ I; the support of X is defined as supp(X) = P(X′) = |ϕ(X)|/|T|, where P is a discrete probability and X′ denotes ϕ(X). The support and confidence [1] of a rule X → Y are supp(X ∪ Y) and supp(X ∪ Y)/supp(X), respectively. We define the set FC of all frequent closed itemsets in database B as FC = {C ⊆ I | C = γ(C), supp(C) ⩾ minsup}. An itemset G is said to be a minimal generator of a closed itemset C iff γ(G) = C and there is no g ⊂ G such that γ(g) = C. We define the set G_γ of all frequent generators as G_γ = {G ⊆ I | supp(G) ⩾ minsup and ∄g ⊂ G such that γ(g) = γ(G)}.
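These operators can be sketched directly in Python. The following is a minimal illustration on a hypothetical four-transaction database (not the paper's Table 1); the names `phi`, `psi`, `gamma` and `supp` simply mirror the definitions above.

```python
# Minimal sketch of the Galois operators phi, psi, the closure gamma = psi o phi,
# and the support, over a toy transaction database (hypothetical data).

# toy database: transaction id -> set of items
B = {
    1: {"a", "b", "e"},
    2: {"b", "c", "d"},
    3: {"a", "b", "c", "e"},
    4: {"b", "c", "e"},
}

def phi(X):
    """phi(X): transactions containing every item of X."""
    return {t for t, items in B.items() if X <= items}

def psi(T):
    """psi(T): items common to every transaction of T."""
    return set.intersection(*(B[t] for t in T)) if T else set.union(*B.values())

def gamma(X):
    """Galois closure operator gamma(X) = psi(phi(X))."""
    return psi(phi(X))

def supp(X):
    """supp(X) = |phi(X)| / |T|."""
    return len(phi(X)) / len(B)

# {b, e} is closed (gamma({b, e}) == {b, e}), while {a} is a generator
# of the closed itemset gamma({a}) = {a, b, e}.
print(gamma({"b", "e"}), supp({"b", "e"}), gamma({"a"}))
```

On this toy database, γ({b, e}) = {b, e}, so {b, e} is closed, whereas {a} is a minimal generator of the closed itemset {a, b, e}.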

Condensed Representations for PNARs
Our approach is divided into two successive steps: (i) it extracts the sets G_γ, FC, MFC, and the set F_MIN of minimal infrequent itemsets in B; (ii) it derives from these sets the non-redundant informative rules. An association rule is informative if its premise (resp. conclusion) is minimal (resp. maximal). For lack of space, some proofs of the properties are omitted.

Generation of G_γ, FC, MFC and F_MIN
Our main motivation lies in the absence of a standalone approach for mining G_γ, FC, MFC and F_MIN. We therefore propose an efficient algorithm, GC2M, that simultaneously collects these four sets in database B. We briefly describe the GC2M algorithm here. It is composed of two algorithms (Algo. 1 and Algo. 2). Its main originality lies in its effective support-counting strategy: let X be a frequent k-itemset (k ⩾ 3); then X is not a generator iff supp(X) = min{supp(X′) | X′ is a (k−1)-subset of X} [2], i.e. no access to context B is needed when X is a non-generator. For search space pruning, it uses the following properties: (i) all subsets of a frequent itemset are frequent; (ii) all supersets of an infrequent itemset are infrequent; (iii) all subsets of a generator are also generators; (iv) all supersets of a non-generator are also non-generators [2]. These results are synthesized in Algorithm 1. Figure 1 shows an example execution of Algorithm 1 on the small context B from Table 1 with minsup fixed to 2/6. From MFC, we can derive the set F_MIN of minimal infrequent itemsets in B.

Definition 1 (Minimal infrequent itemset)
Let MFC be the set of maximal frequent itemsets, and F the set of frequent itemsets in B. The set F_MIN of minimal infrequent itemsets in B is defined as:

F_MIN = {X ⊆ I | X ∉ F and ∀Y ⊂ X, Y ∈ F}. (1)
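Equation (1) can be checked by brute force on a small context. The following sketch (hypothetical data and helper names, not the paper's algorithm, which derives F_MIN from MFC) enumerates candidates and keeps the infrequent itemsets whose every maximal proper subset is frequent.

```python
# Brute-force sketch of F_MIN (Eq. (1)) on toy data: an itemset is a minimal
# infrequent itemset iff it is infrequent while all its (k-1)-subsets are frequent.
from itertools import combinations

items = ["a", "b", "c", "d"]
transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b", "c"}]
minsup = 2 / 4

def supp(X):
    return sum(1 for t in transactions if set(X) <= t) / len(transactions)

def f_min():
    out = []
    for k in range(1, len(items) + 1):
        for X in combinations(items, k):
            if supp(X) < minsup and all(
                supp(Y) >= minsup for Y in combinations(X, k - 1)
            ):
                out.append(frozenset(X))
    return out

print(sorted(map(sorted, f_min())))
```

On this toy context, {d} and {a, c} are the only minimal infrequent itemsets: every other infrequent itemset contains one of them, in accordance with property (ii) above.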

Generating non-redundant PNARs
This subsection is essentially based on 5 components: pruning UARs, modeling significant rules, search space pruning, pruning redundant PNARs, and the CONCISE algorithm. We first formalize the idea of UAR, and then propose a strategy to prune UARs. Note that the classic support-confidence pair [1] is not able to prune UARs; Table 2 illustrates these limits. The information given in Table 2 can be used to evaluate the associations A → B and tea → coffee. For the pair (A, B), we have supp(A ∪ B) = 0.72 and conf(A → B) = 0.9. For the pair (tea, coffee), we have supp(tea ∪ coffee) = 0.2 and conf(tea → coffee) = 0.8. Support and confidence are fairly high for both rules, suggesting that A → B and tea → coffee are interesting. However, P(B′|A′) = P(B′) = 0.9 and conf(tea → coffee) = 0.8 < 0.9 = supp(coffee) imply that A and B are independent (resp. that tea disfavors coffee), i.e. A → B and tea → coffee are UARs.

Definition 2 (Uninteresting Association Rules (UAR))
An association rule X → Y is said to be uninteresting iff X does not favor Y, i.e. P(Y′|X′) ⩽ P(Y′).

We then propose a UAR pruning strategy based on the degree of dependency between X and Y, denoted Δ(X, Y) = P(Y′|X′) − P(Y′). We use the M_GK measure [7], defined as:

M_GK(X → Y) = (P(Y′|X′) − P(Y′)) / (1 − P(Y′)) if X favors Y, and M_GK(X → Y) = (P(Y′|X′) − P(Y′)) / P(Y′) if X disfavors Y. (2)

M_GK quantifies the dependency between the antecedent and the consequent of an association rule. Values in [−1, 0) indicate a negative dependence between X and Y; values in (0, 1] indicate a positive dependence; a value equal to 0 indicates that Y is independent of X. Recall that rules with M_GK equal to 1 are called exact association rules, and rules with M_GK less than 1 are called approximate rules. Theorem 1 below states that the M_GK value of a UAR as defined by Definition 2 is always null or negative.
Returning to the example of Table 2, we have M_GK(A → B) = (0.9 − 0.9)/(1 − 0.9) = 0, which confirms that A and B are independent; hence A → B is a UAR. We also obtain M_GK(tea → coffee) = (0.8 − 0.9)/0.9 = −1/9 < 0, which means that coffee and tea are negatively dependent; in other words, tea → coffee is a UAR. As a result, UARs are systematically pruned using M_GK.
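The two-branch formula of Eq. (2) and the Table 2 examples can be reproduced with a few lines of Python. This is an illustrative sketch: probabilities are passed in directly rather than mined from data, and the function name `m_gk` is ours.

```python
# Sketch of the M_GK measure (Eq. (2)): normalized deviation of P(Y|X)
# from P(Y), with a different normalization for favoring vs disfavoring X.
def m_gk(p_y, p_y_given_x):
    if p_y_given_x >= p_y:                      # X favors Y (or independence)
        return (p_y_given_x - p_y) / (1 - p_y)
    return (p_y_given_x - p_y) / p_y            # X disfavors Y

# A -> B: P(B) = P(B|A) = 0.9, independence, M_GK = 0 -> UAR.
# tea -> coffee: P(coffee) = 0.9, P(coffee|tea) = 0.8, M_GK < 0 -> UAR.
print(m_gk(0.9, 0.9), m_gk(0.9, 0.8))
```

Both rules, despite high support and confidence, get a null or negative M_GK and are therefore pruned as UARs.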

Modeling significant rules using M_GK.
Note that the first component of M_GK (Eq. (2)) is implicative while the second is not; only the first is used in the modeling. We introduce the quantities n = |T|, n_X = |ϕ(X)|, n_Y = |ϕ(Y)|, n_{X∧Y} = |ϕ(X ∪ Y)| and n_{X∧¬Y} = |ϕ(X ∪ ¬Y)|. The quantity N_{X∧Y} denotes the random variable which generates n_{X∧Y}, and N_{X∧¬Y} the one which generates n_{X∧¬Y}. Equation (2) can then be rewritten as:

M_GK(X → Y) = 1 − (n · n_{X∧¬Y}) / (n_X · n_¬Y). (3)

The current versions [7] are based on a critical value γ_α, defined as

γ_α = √(χ²_α / n), (4)

where α is a real in the interval (0, 1) and χ²_α is the Chi-square statistic with a single degree of freedom. A rule X → Y is then valid if M_GK(X → Y) ⩾ γ_α. However, this critical value has some limits. A low α leads to a high critical value which rapidly exceeds the M_GK value, rejecting certain robust rules. Conversely, a large α leads to a very low critical value, accepting certain very weak (i.e. nearly independent) rules.
To overcome these limits, we define a new model based on testing the hypothesis H0 of independence of X and Y against the hypothesis H1 of a positive dependence of the rule X → Y. Under H0, we model the relation between the random variable N_{X∧¬Y} and the observed number of counter-examples n_{X∧¬Y} using the measure M_GK. The sensitivity of M_GK to variations of the observed counter-examples n_{X∧¬Y} reads from the partial derivative given in Equation (5):

∂M_GK(X → Y) / ∂n_{X∧¬Y} = −n / (n_X · n_¬Y). (5)

This shows that M_GK decreases when n_{X∧¬Y} increases, and all the more quickly as the quantity n_X n_¬Y / n is low. In other words, M_GK grows when n_{X∧¬Y} decreases, which is semantically acceptable, but the rate of variation is constant, independent of the rate of decrease of this number and of the variations of n_¬Y. Consider M_GK as the realization of a random variable M_GK, defined as:

M_GK(X → Y) = 1 − (n · N_{X∧¬Y}) / (n_X · n_¬Y). (6)

Up to a constant, it is the opposite of the directed contribution of the cell (X, ¬Y) to the χ² statistic. In practice, it is quite common to observe a few transactions which contain X and not Y, without the general tendency to have Y when X is present being contested. Therefore, n_{X∧¬Y} must be taken into account to decide statistically whether or not to retain the rule X → Y. Suppose we draw at random two subsets U, Z ⊆ I of supports n_X and n_¬Y respectively, i.e. N_{X∧¬Y} = |ϕ(U ∪ Z)|. This variable N_{X∧¬Y} follows a Poisson law with parameter n_X n_¬Y / n [9]. We then measure the smallness of the random variable N_{X∧¬Y} with respect to the observed number n_{X∧¬Y} under the H0 independence hypothesis. A rule X → Y is said to be admissible at threshold α ∈ (0, 1] if the probability that N_{X∧¬Y} is lower than the observed number n_{X∧¬Y} under H0 is relatively low:

P(N_{X∧¬Y} ⩽ n_{X∧¬Y}) ⩽ α. (7)

We then have:

P(N_{X∧¬Y} ⩽ n_{X∧¬Y}) = Σ_{k=0}^{n_{X∧¬Y}} e^{−λ} λ^k / k!, with λ = n_X n_¬Y / n. (8)

Our model of significant association rules is given in the following Definition 3.
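The admissibility test of Eqs. (7)-(8) is easy to sketch with the standard library. The counts below are hypothetical, and `admissible` is our name for the test; the point is that a rule is accepted only when the observed counter-examples are surprisingly few under the Poisson independence model.

```python
# Sketch of the admissibility test: under H0, the number of counter-examples
# N_{X and not-Y} is Poisson with mean n_X * n_notY / n; the rule X -> Y is
# admissible at level alpha when the left tail P(N <= n_obs) is small.
import math

def poisson_cdf(k, lam):
    """P(N <= k) for N ~ Poisson(lam), by direct summation (Eq. (8))."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def admissible(n, n_x, n_not_y, n_counter, alpha=0.05):
    """Accept X -> Y if the observed counter-examples are surprisingly few."""
    lam = n_x * n_not_y / n
    return poisson_cdf(n_counter, lam) <= alpha

# 1000 transactions, |phi(X)| = 300, |phi(notY)| = 400: expected 120
# counter-examples under independence; observing only 40 is admissible,
# observing 120 (exactly the expectation) is not.
print(admissible(1000, 300, 400, 40), admissible(1000, 300, 400, 120))
```

Unlike a fixed critical value γ_α on M_GK, this left-tail probability adapts to the observed number of counter-examples and to the sizes n_X and n_¬Y.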

Search space pruning.
Pasquier's approach [14] is the most popular approach for generating non-redundant rules. However, it uses no method for pruning the search space of significant valid rules, whereas it is possible to restrict the search space by partitioning the 8 possible positive and negative rule forms on (X, Y) into 2 classes. We now explain this restriction. In [2], we demonstrated that if X favors Y (i.e. P(Y′|X′) > P(Y′)), then only the four association rules X → Y, Y → X, ¬X → ¬Y and ¬Y → ¬X need to be studied. If X disfavors Y (i.e. P(Y′|X′) ⩽ P(Y′)), then only the four contrary association rules X → ¬Y, ¬Y → X, ¬X → Y and Y → ¬X need to be studied. We then obtain two classes:

C1 = {X → Y, Y → X, ¬X → ¬Y, ¬Y → ¬X} and C2 = {X → ¬Y, ¬Y → X, ¬X → Y, Y → ¬X}.

We also demonstrated that all rules of C1 can be derived from X → Y, and all rules of C2 from X → ¬Y. Hence we only study the two rules X → Y and X → ¬Y, which gives a search space reduction of 100 × (8 − 2)/8 = 75%.
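The partition can be checked numerically. The sketch below uses a hypothetical joint distribution (our own numbers, not the paper's data) and the two-branch M_GK of Eq. (2); it illustrates that when X favors Y, the C1 representative X → Y and its contrapositive ¬Y → ¬X get the same M_GK, while the C2 form X → ¬Y is negative.

```python
# Numerical sketch of the two-class partition on a toy joint distribution:
# with X favoring Y, M_GK(X -> Y) = M_GK(notY -> notX) > 0 while
# M_GK(X -> notY) < 0, so one representative per class suffices.
def m_gk(p_x, p_y, p_xy):
    p_y_given_x = p_xy / p_x
    if p_y_given_x >= p_y:
        return (p_y_given_x - p_y) / (1 - p_y)
    return (p_y_given_x - p_y) / p_y

# hypothetical joint distribution: P(X)=0.5, P(Y)=0.4, P(X and Y)=0.3
p_x, p_y, p_xy = 0.5, 0.4, 0.3
p_nx, p_ny = 1 - p_x, 1 - p_y
p_x_ny = p_x - p_xy                  # P(X and notY)
p_nx_ny = 1 - p_x - p_y + p_xy       # P(notX and notY)

rules = {
    "X->Y": m_gk(p_x, p_y, p_xy),
    "notY->notX": m_gk(p_ny, p_nx, p_nx_ny),
    "X->notY": m_gk(p_x, p_ny, p_x_ny),
}
print(rules)
```

Only the two representatives X → Y and X → ¬Y need to be evaluated; the other six forms follow, which is the 75% reduction stated above.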

Pruning redundant PNARs.
The most popular method to prune redundant rules is the basis of rules: a reduced-size set of rules that contains no redundant rule. Definition 4 defines a redundant rule.

Definition 4 (Redundant rule)
A rule r1 : X1 → Y1 is said to be redundant with respect to a rule r2 : X2 → Y2 iff supp(r1) = supp(r2), M_GK(r1) = M_GK(r2), X2 ⊆ X1 and Y1 ⊆ Y2.

Corresponding to the three popular approaches [7], [14], [16], we propose three more efficient bases, called Concise Bases, as defined in Definitions 6, 7 and 8. In addition, we define a basis for positive exact rules using M_GK, called CBE (cf. Definition 5). More precisely, the CBE basis is similar to the Basis for Exact Rules defined in [14], because a rule that is exact for confidence is also exact for M_GK (cf. [2]). We prove that these concise bases are a lossless representation of the non-redundant rules, since all valid rules can be derived from them (cf. Theorems 2, 3, 4 and 5).

Definition 5 (CBE Basis)
Let FC be the set of frequent closed itemsets. For each C ∈ FC, let G_C be the set of minimal generators of C. We have:

CBE = {r : G → C\G | C ∈ FC, G ∈ G_C and G ≠ C}.
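The construction of CBE can be sketched end to end on a toy context. This is an illustrative brute-force implementation under assumed data (not the paper's Table 1, not the CONCISE algorithm): it computes the frequent closed itemsets, their minimal generators, and emits the exact rules G → C\G.

```python
# Brute-force sketch of the CBE basis on toy data: exact rules G -> C\G,
# with C a frequent closed itemset and G a minimal generator of C.
from itertools import combinations

transactions = [{"a", "b", "e"}, {"b", "c", "d"}, {"a", "b", "c", "e"}, {"b", "c", "e"}]
items = sorted(set().union(*transactions))
minsup = 2 / len(transactions)

def phi(X):
    return [t for t in transactions if set(X) <= t]

def gamma(X):
    ts = phi(X)
    return frozenset(set.intersection(*ts)) if ts else frozenset(items)

def supp(X):
    return len(phi(X)) / len(transactions)

# frequent closed itemsets and their minimal generators
FC, gens = set(), {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        X = frozenset(X)
        if supp(X) >= minsup:
            C = gamma(X)
            FC.add(C)
            if all(gamma(X - {i}) != C for i in X):   # minimal generator test
                gens.setdefault(C, set()).add(X)

# CBE: exact rules G -> C \ G for each closed C and minimal generator G != C
CBE = [(set(G), set(C - G)) for C in FC for G in gens.get(C, ()) if G != C]
for G, head in sorted(CBE, key=str):
    print(sorted(G), "->", sorted(head))
```

On this toy context the basis contains four exact rules (e.g. {a} → {b, e}); every other exact rule, such as {a, b} → {e}, is redundant with respect to one of them and is recoverable as Theorem 2 describes.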

Theorem 2 (i) All valid positive exact rules and their supports can be derived from the CBE basis. (ii) All rules in CBE are non-redundant exact rules.
Proof 1 (i) Let r1 : X1 → Y1\X1 be an exact positive rule between two frequent itemsets X1 and Y1 such that X1 ⊂ Y1, and let C be a frequent closed itemset in B (i.e. C ∈ FC). Since M_GK(r1) = 1, we have supp(X1) = supp(Y1), from which we derive supp(γ(X1)) = supp(γ(Y1)) ⇒ γ(X1) = γ(Y1) = C. Hence there exists a rule r2 : G → C\G ∈ CBE such that G is a generator of C with G ⊆ X1 and G ⊆ Y1. We show that the rule r1 and its support can be derived from the rule r2 and its support. From γ(X1) = γ(Y1) = C and γ(G) = C, we have supp(r1) = supp(γ(X1)) = supp(γ(Y1)) = supp(C) = supp(r2), and deduce that M_GK(r1) = M_GK(r2). Thus r1 can be derived from r2 and is a redundant rule of r2, so it is pruned from the CBE basis.
(ii) Let r2 : G → C\G ∈ CBE; then G ∈ G_C and C ∈ FC. We demonstrate that there is no other rule r3 : X3 → Y3\X3 ∈ CBE such that supp(r3) = supp(r2), M_GK(r3) = M_GK(r2), X3 ⊆ G and C ⊆ Y3. If X3 ⊆ G, then γ(X3) ⊆ γ(G) = C; we deduce that X3 ∉ G_C and conclude that r3 ∉ CBE. In other words, r2 is non-redundant. This proves that CBE is a non-redundant basis.

Definition 6 (CBA Basis) Let FC be the set of frequent closed itemsets. For each C ∈ FC, let G_C be the set of minimal generators of C. Consider 0 < α ⩽ 1; we have:

CBA = {r : G → C′\G | C, C′ ∈ FC, G ∈ G_C, γ(G) ⊊ C′ and r is admissible at threshold α}.

Theorem 3 (i) All valid positive approximate association rules, their supports and M_GK values, can be derived from the rules of CBA. (ii) All association rules in the CBA basis are non-redundant approximate association rules.
(ii) Let r2 : G → C\G ∈ CBA; then C ∈ FC and G ∈ G_C. We demonstrate that there is no other rule r3 ∈ CBA with the same support and M_GK whose premise is contained in G and whose conclusion contains C.

Obviously, there exists r2 : G → y\G ∈ CBE⁻ such that G ∈ G_M, with G ⊆ X1 and G ⊆ Y1, and thus G ⊆ y (by Definition 7). We show that the rule r1 can be derived from r2. Since r2 : G → y\G ∈ CBE⁻, we have supp(G ∪ y) = supp(G). From supp(G ∪ y) = supp(G), we have supp(γ(G ∪ y)) = supp(γ(G)) = supp(γ(y)) ⇒ γ(G ∪ y) = γ(G) = γ(y) = M (a′). From relations (a) and (a′), we have γ(G ∪ y) = γ(X1 ∪ Y1) ⇔ supp(r1) = supp(r2).
These results explain that r1 can be derived from r2 and is a redundant rule w.r.t. r2.
(ii) Let r2 : G → y\G ∈ CBE⁻, i.e. G ∈ G_M and y ∈ F_MIN. We demonstrate that there is no other rule r3 : X3 → Y3\X3 ∈ CBE⁻ such that supp(r3) = supp(r2), M_GK(r3) = M_GK(r2), X3 ⊆ G and y ⊆ Y3. If X3 ⊆ G, then γ(X3) ⊆ γ(G) ⊂ γ(y) = M; we deduce that X3 ∉ G_M and conclude that r3 ∉ CBE⁻. If y ⊆ Y3, then γ(G) ⊂ γ(y) ⊆ γ(Y3) = M; we deduce that G ∉ G_{Y3} and conclude that r3 ∉ CBE⁻. This implies that r2 is a non-redundant rule, and proves that CBE⁻ is a non-redundant basis.

Definition 8 (CBA − Basis)
Let FC be the set of frequent closed itemsets. For each C ∈ FC, let G_C be the set of minimal generators of C. Consider 0 < α ⩽ 1 (cf. Definition 8). We show that r1 can be derived from r2; this explains that r1 can be derived from r2 and is a redundant rule w.r.t. r2. (ii) Let r2 : G → g\G ∈ CBA⁻, i.e. G ∈ G_C and g ∈ G_{C′} such that γ(G) ⊊ γ(g) (i.e. C ⊊ C′). We demonstrate that there is no other rule r3 with the same support and M_GK, a premise contained in G and a conclusion containing g. This means that r2 is a non-redundant rule, and proves that CBA⁻ is a non-redundant basis.

Experimental results
We evaluate CONCISE against two comparable baselines: Pasquier's approach [14] and Feno's approach [7]. All algorithms are implemented in R. All experiments are run on a PC with a Core i3-2350M (4 CPUs) and 4 GB of memory under Windows 7. We compare the number of valid rules and the computational costs on different databases (cf. Table 3): T10I4D100K 1, T20I6D100K (cf. footnote 1), C20D10K 2 and MUSHROOMS (cf. footnote 2).

Table 3. Characteristics of the databases (transactions, items, average transaction size):
T10I4D100K    100 000    1 000    10
T20I6D100K    100 000    1 000    20
C20D10K        10 000      386    20
MUSHROOMS       8 416      128    23

CONCISE and Feno's approach use the same constraint α = 5%; for Pasquier's approach, we set the minimal confidence minconf = 80%. The number of rules extracted by the three algorithms when varying minsup is shown in Table 4, where E (resp. A) denotes the positive exact (resp. approximate) rules and E⁻ (resp. A⁻) denotes the negative exact (resp. approximate) rules. We denote by "-" a subset that could not be generated. We observe that no negative association rules are generated by Pasquier's approach. On dense databases (C20D10K and MUSHROOMS), the CONCISE algorithm achieves significantly lower average runtimes than Pasquier's approach for all minsup values (cf. Figures 2c and 2d). The main reason is its technique for pruning the search space of valid positive/negative association rules.
Thanks to the different optimizations defined in Subsection 4.2.3, the CONCISE algorithm considerably reduces the execution time for every minimum support threshold minsup; this is not the case for Pasquier's approach, which performs worst. This is mainly due to its lack of techniques for pruning the search space of valid association rules, which obviously affects its execution time. However, Pasquier's approach matches the execution times of CONCISE for E when minsup ranges from 20% to 30%.

Conclusion
In this paper, we presented and evaluated a condensed representation for association rules: an efficient method for representing non-redundant positive and negative rules. We theoretically proved and experimentally confirmed that our approach eliminates a considerable amount of redundant and uninteresting rules. Compared to Pasquier's and Feno's approaches, our approach provides not only a concise but also a lossless extraction of positive and negative association rules, from which all informative association rules can be deduced. Future work will extend this proposal to graph and classification problems.

Conflicts of Interest
The authors declare that they have no conflicts of interest.