Title:

Kind
Code:

A1

Abstract:

A set of subscriptions are provided, where one or more subscriptions each comprises a tree pattern, and a tree pattern comprises one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information. The set of subscriptions is used to select information for dissemination to users. Generally, the one or more subscriptions having the tree pattern describe information the users are interested in receiving. Techniques are presented for determining an aggregation from the subscriptions, where the aggregation comprises a set of aggregate patterns. The set of subscriptions may comprise a number of tree patterns, and the aggregate patterns generally also comprise tree patterns comprising one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information. The set of aggregation patterns is smaller than the set of subscriptions and can be made to fit any space constraint.

Inventors:

Chan, Chee-yong (Singapore, SG)

Fan, Wenfei (Somerset, NJ, US)

Felber, Pascal Amedee (Chateauneuf de Grasse, FR)

Garofalakis, Minos N. (Morristown, NJ, US)

Rastogi, Rajeev (New Providence, NJ, US)

Fan, Wenfei (Somerset, NJ, US)

Felber, Pascal Amedee (Chateauneuf de Grasse, FR)

Garofalakis, Minos N. (Morristown, NJ, US)

Rastogi, Rajeev (New Providence, NJ, US)

Application Number:

10/600996

Publication Date:

12/23/2004

Filing Date:

06/20/2003

Export Citation:

Assignee:

CHAN CHEE-YONG

FAN WENFEI

FELBER PASCAL AMEDEE

GAROFALAKIS MINOS N.

RASTOGI RAJEEV

FAN WENFEI

FELBER PASCAL AMEDEE

GAROFALAKIS MINOS N.

RASTOGI RAJEEV

Primary Class:

Other Classes:

707/999.003, 707/E17.012

International Classes:

View Patent Images:

Related US Applications:

20090281990 | System and Method for Dissemination of Relevant Knowledge | November, 2009 | Greenspan et al. |

20050289192 | Dynamic navigation systems and methods | December, 2005 | Tromczynski et al. |

20090164531 | REMOTE COPY SYSTEM, REMOTE ENVIRONMENT SETTING METHOD, AND DATA RESTORE METHOD | June, 2009 | Tanaka et al. |

20090182765 | FOLDER STORAGE DEVICE | July, 2009 | Kawasaki |

20060259504 | Portable electronic device and list display method | November, 2006 | Kusu |

20050198008 | Index exploitation for spatial data | September, 2005 | Adler |

20080027969 | HIERARCHICAL CONDITIONAL RANDOM FIELDS FOR WEB EXTRACTION | January, 2008 | Wen et al. |

20070220004 | Security view-based, external enforcement of business application security rules | September, 2007 | Fifield et al. |

20020143762 | Envelope printing feature for photo filing system | October, 2002 | Boyd et al. |

20050228797 | Suggesting and/or providing targeting criteria for advertisements | October, 2005 | Koningstein et al. |

20080133590 | Application loader for support of version management | June, 2008 | Blanch et al. |

Primary Examiner:

GMAHL, NAVNEET K

Attorney, Agent or Firm:

Ryan, Mason & Lewis, LLP (Fairfield, CT, US)

Claims:

1. In a communication system, a method for information dissemination, the method comprising the steps of: providing a set of subscriptions, at least one of the set of subscriptions comprising a tree pattern, wherein the tree pattern comprises one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information; and using the set of subscriptions to select information for dissemination to one or more users.

2. The method of claim 1, wherein the at least one subscription describes information the one or more users are interested in receiving.

3. The method of claim 1, further comprising the step of determining an aggregation from the set of subscriptions, the aggregation comprising a set of aggregate patterns, wherein the set of aggregate patterns is smaller than the set of subscriptions, and wherein the step of using the set of subscriptions to select information for dissemination further comprises using the set of aggregate patterns to select the information for dissemination to the one or more users.

4. The method of claim 1, wherein the information comprises one or more documents defined using extensible markup language (XML).

5. The method of claim 3, wherein at least one of the aggregate patterns and the tree pattern each is defined using extensible markup language (XML).

6. The method of claim 3, wherein each aggregate pattern and each subscription comprises a tree pattern having one or more interconnected nodes having a hierarchy, and wherein the set of aggregate patterns is smaller than the set of subscriptions in that a number of aggregate patterns in the set of aggregate patterns is smaller than a number of tree patterns in the set of subscriptions and that a number of nodes in the set of aggregate patterns is smaller than a number of nodes in the set of subscriptions.

7. The method of claim 3, wherein the step of determining an aggregation further comprises the step of determining the aggregation from the set of subscriptions by using at least a space constraint.

8. The method of claim 7, wherein the space constraint comprises a predetermined number of bytes.

9. The method of claim 3, wherein the set of subscriptions comprises a plurality of tree patterns, each of the tree patterns comprising one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information, and wherein the step of determining an aggregation further comprises the step of determining a least upper bound pattern for two of the plurality of tree patterns in the set of subscriptions, the least upper bound pattern chosen as an aggregate pattern.

10. The method of claim 9, wherein the two tree patterns are a first tree pattern and a second tree pattern, and wherein the step of determining a least upper bound pattern further comprises the steps of: if the first tree pattern is contained in the second tree pattern, setting the least upper bound pattern to be the first tree pattern; if the second tree pattern is contained in the first tree pattern, setting the least upper bound pattern to be the second tree pattern; traversing the first and second tree patterns and computing a tightest container pattern by: computing a position-preserving tightest container pattern by finding common sub-patterns; computing an off-position tightest container pattern by finding common sub-patterns; and constructing the tightest container pattern by taking a union of the position-preserving tightest container pattern and the off-position tightest container pattern, wherein the tightest container pattern is used as the least upper bound pattern.

11. The method of claim 9, wherein the step of determining a least upper bound pattern for two of the plurality of tree patterns further comprises the steps of determining a tightest container pattern for the two tree patterns and minimizing the tightest container pattern to create a minimal pattern, wherein the minimal pattern is used as the least upper bound pattern.

12. The method of claim 3, wherein the set of subscriptions comprises a plurality of tree patterns, wherein each tree pattern in the set of subscriptions comprises one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information, and wherein the step of determining an aggregation further comprises the steps of: designating a candidate set of tree patterns to be the plurality of tree patterns; performing the following steps: identifying a set of candidate aggregate patterns from the plurality of tree patterns and similar tree patterns determined from the candidate set of tree patterns; pruning each candidate aggregate pattern by deleting or merging nodes; selecting a chosen tree pattern from the candidate aggregate patterns having a predetermined marginal gain; and replacing all tree patterns, in the candidate set of tree patterns, that are contained in the chosen tree pattern by the chosen tree pattern.

13. The method of claim 12, wherein the marginal gain is determined by a benefit value of a tree pattern.

14. The method of claim 13, wherein the candidate set of tree patterns occupies a space and wherein the benefit value is determined from a ratio of savings in the space for a corresponding tree pattern to a loss in selectivity for the corresponding tree pattern.

15. The method of claim 14, wherein the selectivity is determined by sampling matching of information to candidate patterns.

16. In a communication system, an apparatus for providing information dissemination, the apparatus comprising: a memory; and at least one processor, coupled to the memory; the apparatus operative: to provide a set of subscriptions, at least one of the set of subscriptions comprising a tree pattern, wherein the tree pattern comprises one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information; and to use the set of subscriptions to select information for dissemination to one or more users.

17. An article of manufacture for providing information dissemination, the article of manufacture comprising: a machine readable medium containing one or more programs which when executed implement the steps of: providing a set of subscriptions, at least one of the set of subscriptions comprising a tree pattern, wherein the tree pattern comprises one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information; and using the set of subscriptions to select information for dissemination to one or more users.

Description:

[0001] The present invention relates generally to communication over networks, and, more particularly, to communication of electronic information over networks.

[0002] Large amounts of document transfer occur over networks every day, and standards have been implemented to make the document transfer easier. On the Internet, for instance, extensible markup language (XML) has become a dominant standard for encoding and exchange of documents, including electronic business transactions in both Business-to-Business (B2B) and Business-to-Consumer (B2C) applications. Given the rapid growth of document traffic on the Internet, the effective and efficient delivery of documents such as XML documents has become an important issue. Consequently, there is growing interest in the area of content-based filtering and routing, which addresses the problem of effectively directing high volumes of document traffic to interested users based on document contents. In conventional routing, packets are routed over a network based on a limited, fixed set of attributes, such as source/destination Internet protocol (IP) addresses and port numbers. By contrast, content-based document routing is based on information in document contents, and is therefore more flexible and demanding.

[0003] In a system that provides filtering and routing for document dissemination, users typically specify their subscriptions. Subscriptions indicate the type of content that users are interested in, and generally use some pattern specification language. For each incoming document, a content-based document router matches the document contents against a set of subscriptions to identify a set of interested users, and then routes the document to any interested users. Thus, in content-based routing, the “destination” of a document is generally unknown to the data producer and is computed dynamically based on the document contents and a set of subscriptions. Effective support for scalable, content-based routing is crucial to enabling efficient and timely delivery of relevant documents to a large, dynamic group of users.

[0004] Unfortunately, there are problems with current document dissemination systems that limit scalability. One problem is space requirements, as user subscriptions can become quite large, potentially having gigabytes of information. A competing problem is the speed at which a determination can be made as to whether a document should be disseminated to users. Ideally, as network streaming speed increases, the speed at which document comparison takes place also should increase. Both speed and space requirements are impacted by increased numbers of subscriptions and therefore affect scalability, as more subscriptions place burdens on both speed and space.

[0005] Consequently, a need exists for information dissemination techniques for networks that allow a high number of subscriptions yet also provide high speed document dissemination.

[0006] The present invention provides techniques that provide information dissemination through, among other things, subscriptions in the form of tree patterns and tree pattern aggregation.

[0007] In an aspect of the invention, a set of subscriptions are provided, where one or more subscriptions comprise a tree pattern. A tree pattern illustratively comprises one or more interconnected nodes having a hierarchy and are adapted to specify content and structure of information. The set of subscriptions is used to select information for dissemination to users. Generally, the one or more subscriptions having the tree pattern describe information the users are interested in receiving. Illustratively, subscriptions that use tree patterns are more expressive and practical than conventional subscriptions.

[0008] In another aspect of the invention, techniques are presented for determining an aggregation from the subscriptions. An aggregation may be determined from the set of subscriptions, and the aggregation comprises a set of aggregate patterns. The set of subscriptions may comprise a number of tree patterns, and the aggregate patterns generally also comprise tree patterns comprising one or more interconnected nodes having a hierarchy and adapted to specify content and structure of information.

[0009] Illustratively, the set of aggregate patterns is smaller than the set of subscriptions in that the number of aggregate pattern is less than the number of tree patterns in the subscriptions and the number of nodes in the set of aggregate patterns is smaller than the number of nodes in the set of subscriptions. Broadly, the aggregate patterns “compress” the subscriptions and therefore provide smaller memory requirements and generally faster comparisons between information and the aggregation. There may be some loss of precision due to the “compression,” but the loss of precision is generally kept low through techniques described below.

[0010] In a further aspect of the invention, the aggregation techniques can be applied using a space constraint. The space constraint can be imposed, for example, by system configuration. The space constraint may be used to limit the size of memory available for storing an aggregation. The space constraint is generally expressed in bytes and can be measured with respect to the number of nodes in the set of aggregate patterns of the aggregation.

[0011] In another aspect of the invention, a systematic study of least upper bound patterns is described. The least upper bound of a set of tree patterns can be considered a most precise aggregation of the set. A theoretical foundation for the existence of the most precise aggregation is described, as is a complexity of the computation for the least upper bound, techniques for computing a least upper bound, and techniques for minimizing a least upper bound.

[0012] In yet another aspect of the invention, when the least upper bound of a set of subscriptions is larger than the given space constraint, techniques are presented for computing an approximation of the least upper bound in order to meet the space constraint. The least upper bound of a set of subscriptions may be considered to be the most precise aggregation for the set. The approximation of the least upper bound is an aggregation that satisfies the space constraint and minimizes loss of precision as much as possible. The approximation may be determined by setting a candidate set of tree patterns to be the tree patterns in the subscriptions. The following steps may be performed and iterated: a set of candidate aggregate patterns may be identified from the plurality of tree patterns and similar tree patterns determined from the candidate set of tree patterns; each candidate aggregate pattern may be pruned by deleting or merging nodes; a chosen tree pattern may be selected from the candidate aggregate patterns having a predetermined marginal gain; and all tree patterns, in the candidate set of tree patterns, that are contained in the chosen tree pattern may be replaced by the chosen tree pattern.

[0013] Additionally, the pruning process may be directed by using selectivity of information, in that only nodes with low selectivity, i.e., low frequency of document matching, can be removed. Thus, loss of preciseness is reduced. The frequency of matching is determined by sampling information and thereby determining selectivity of the information.

[0014]

[0015]

[0016]

[0017]

[0018]

[0019]

[0020]

[0021]

[0022] For ease of reference, the present disclosure is divided into the following sections: Introduction; Problem Formulation; Computing Precise Aggregates; and Selectivity-Based Aggregation Methods.

[0023] 1. Introduction

[0024] Turning now to

[0025] Broadly, XML documents

[0026] The present invention solves these problems by, among other things, providing subscriptions

[0027] The present invention also provides aggregation of subscriptions that are tree patterns. Broadly, given a large volume of potential users, system scalability and efficiency mandate the ability to judiciously aggregate the set of subscriptions

[0028] The present disclosure describes, among other things, a subscription aggregation problem where subscriptions

[0029] Thus, the document router

[0030] It should be noted that the memory

[0031] The filter module

[0032] The following example is illustrative of problems associated with tree patterns _{a }_{b}_{a }_{b }_{a }_{b}_{a}_{b}_{a}_{b}_{c }_{d}_{a }_{b }_{c }_{d}_{c }_{d }_{c }_{d }_{a}_{b}_{c }_{a }_{b}_{c }_{d }_{d }_{c}

[0033] The present disclosure describes efficient methods for deciding tree pattern containment, minimizing a tree pattern, and computing the most precise aggregate (i.e., the “least upper bound”) for a set of patterns. Additionally, an efficient method is proposed that exploits coarse statistics on the underlying distribution of XML documents to compute a “precise” set of aggregate patterns within the allotted space budget. Specifically, disclosed techniques employ document statistics to estimate the selectivity of a tree pattern, which is also used as a measure of the preciseness of the pattern. Thus, an aggregation problem can be reduced to finding a compact set of aggregate patterns with minimal loss in selectivity, for which a greedy heuristic is presented herein.

[0034] The usefulness of the present invention on tree patterns and their aggregation is not limited to content-based routing, but also extends to other application domains such as the optimization of XML queries involving tree patterns and the processing and dissemination of subscription queries in a multicast environment (e.g., where aggregation can be used to reduce server load and network traffic). Further, the present invention is complementary to recent work on efficient indexing structures for XPath expressions. The focus of earlier research was to speed up document filtering with a given set of XPath subscriptions using appropriate indexing schemes. In contrast, the present invention focuses on effectively reducing the volume of subscriptions that need to be matched in order to ensure scalability given bounded storage resources for routing.

[0035] 2. Problem Formulation

[0036] 2.1 Definitions

[0037] A tree pattern is an unordered node-labeled tree that specifies content and structure conditions on an XML document. More specifically, a tree pattern p has a set of nodes, denoted by Nodes(p), where each node v in Nodes(p) has a label, denoted by label(v), which can either be a tag name, a “*” (wildcard that matches any tag), or a “//” (the descendant operator). In particular, the root node has a special label “/.”. The terminology Subtree (v, p) is used to denote the subtree of p rooted at v, referred to as a sub-pattern of p. Some examples of tree patterns are depicted in

[0038] To define the semantics of a tree pattern p, the semantics are first given of a sub-pattern Subtree (v, p), where v is not the root node of p. Recall that XML documents are typically represented as node-labeled trees, referred to as XML trees. Let T be an XML tree and t be a node in T. It is said that T satisfies Subtree (v, p) at node t, denoted by (T, t)

[0039] The semantics of tree patterns are now defined. Let T be an XML tree with root t_{root}_{root}_{root}_{root }_{root}_{root}_{root }_{root}_{root }_{root}_{root }_{i }_{i}

[0040] Consider the tree pattern p_{a }_{a }_{a }_{a }_{a}

[0041] A tree pattern p is said to be consistent if and only if there exists an XML document that satisfies p. Generally, only consistent tree patterns are considered herein. Further, the tree patterns defined above can be naturally generalized to accommodate simple conditions and predicates (e.g., issue=“GE” and price<1000). To simplify the discussion, such extensions are not considered herein.

[0042] It is worth mentioning that a tree pattern can be easily converted to an equivalent XPath expression in which each sub-pattern is expressed as a condition or qualifier. Thus, tree patterns herein are graph representations of a class of XPath expressions. It is tempting to consider using a larger fragment of Xpath to express subscription patterns. However, it turns out that even a mild generalization of the tree patterns used herein (e.g., with the addition of union/disjunction operators) leads to a much higher complexity (e.g., coNP-hard or beyond) for basic operations such as containment computation.

[0043] A tree pattern q is said to be contained in another tree pattern p, denoted by q

[0044] The size of a tree pattern p, denoted by |p|, is simply the cardinality of its node set. For example, referring to _{a}_{b}

[0045] 2.2 Problem Statement

[0046] The tree pattern aggregation problem that we investigate in this paper can now be stated as follows. Given a set of tree patterns S and a space constraint k on the total size of the aggregated subscriptions, compute a set of aggregated patterns S′ that satisfies all of the following three conditions:

[0047] (C1) S

[0048] (C2) Σ_{p′εS′}

[0049] (C3) S′ is as “precise” as possible, in the sense that there does not exist another set of tree patterns S″ that satisfies the first two conditions and S″

[0050] Clearly, the tree pattern aggregation problem may not necessarily have a unique solution since it is possible to have two sets S′ and S″ that satisfy the first two conditions but S′

[0051] With respect to conciseness, the present disclosure considers minimal tree patterns that do not contain any “redundant” nodes. More precisely, it is said that a tree pattern p is minimized if for any tree pattern p′ such that p′≡p, it is the case that |p|≦|p′|. With respect to preciseness, it can be shown that the containment relationship

[0052] An upper bound of two tree patterns p and q is a tree pattern u such that p

[0053] Clearly, if p is an aggregate tree pattern for a set of tree patterns S (i.e., S

[0054] Consider again the tree patterns in _{b}_{c}_{b}_{c}_{b }_{b}_{a}_{c }_{a }_{c}_{a }_{c }_{a}_{d }_{c}_{d}_{a }_{c}_{d}_{a}_{c }_{e}_{a }_{c }_{e}_{d}_{e}_{a}_{c }_{e}_{a}_{c}_{h}_{c}_{f }_{h}_{c}_{f}_{d }_{a}_{b}_{c}_{e}_{f}_{g}_{h}

[0055] This section is concluded by presenting some additional notation used herein. For a node v in a tree pattern p, the set of child nodes of v in p is denoted by Child(v,p). A partial ordering

[0056] For example, MaxLabel (a,b)=* and MaxLabel (*,//)=//. For notational convenience, anode v in a tree pattern is referred to as an l-node if label(v)=l, and v is referred to as a tag-node if label(v)∉{/.,*,//}.

[0057] 3. Computing Precise Aggregates

[0058] In this section, a special case of our tree pattern aggregation problem is considered. Namely, when the aggregate set S′ consists of a single tree pattern and there is no space constraint. For this case, methods are described to compute the most precise aggregate tree pattern (i.e., LUB) for a set of tree patterns. Some of the methods given in this section are also key components of a solution for the general problem, which is presented in the next section.

[0059] Given two input tree patterns p and q, Method LUB in

[0060] (1) R consists of container sub-patterns of p′ and q′, i.e., for any XML document T and any element t in T, if (T,t)

[0061] (2) R is tightest in the sense that for any other set of container sub-patterns R′ of p′ and q′ that satisfies condition (1), any XML document T and any element t in T, if (T,t)

[0062] Intuitively, R is a collection of conditions imposed by both p′ and q′ such that if T satisfies p′ or q′ at t, then T also satisfies the conjunction of these conditions at t. It is now shown how the LUB for p and q can be computed from the tightest container sub-patterns. Let v_{root }_{root }_{root}_{root}_{root}

[0063] The main subroutine in the LUB computation (Method LUB_SUB, shown in _{c }_{f }_{c }_{f}

[0064] By computing the tightest container sub-patterns recursively, the method computes the LUB of the input tree patterns p and q. By induction on the structures of p and q, the following result can be shown: Given two tree patterns p and q, Method LUB (p,q) computes p␣q.

[0065] Consider the following example. Given p_{c }_{f }_{h }_{c}_{f}_{h}_{n }^{th }_{h}_{1 }_{3 }

[0066] Method LUB_SUB (invoked by Method LUB) first extracts the “position reserving” tightest container sub-patterns for Subtree (a_{1}_{c}_{f}_{1}_{1}_{h}_{h}_{f}_{2}_{c}_{f}_{1}_{h}_{2}_{c}_{f}_{1}_{c}_{f}_{f}_{2}_{c}_{3}_{h}_{1}_{c}_{f}

[0067] It should be mentioned that both Subtree (//_{1}_{h}_{2}_{h}_{2}_{c}_{f}_{f}_{h}

[0068] It is straightforward to show that the LUB operator “␣”, considered as a binary operator, is commutative and associative, i.e., p_{1}_{2}_{2}_{1 }_{1}_{2}_{3}_{1}_{2}_{3}

[0069] Method LUB needs to check the containment of tree patterns, which is implemented by Method CONTAINS in _{root}_{root}_{root }_{root }

[0070] The main subroutine in our containment method is Method CONTAINS_SUB (see _{d }_{f }_{f}_{d}_{d }_{f }_{f }_{d }_{d }_{d }_{g }_{g}_{d }_{d }_{g }_{g}_{d}_{g}_{d}

[0071] By induction on the structures of p and q, the following result can be shown: Given two tree patterns p and q, Method CONTAINS (p,q) determines if q

[0072] The quadratic time complexity of our tree-pattern containment method is due to, among other things, the fact that each pair of sub-patterns in p and q is checked at most once, because of the use of the Status array. To simplify the discussion, subtle details have omitted from Method CONTAINS. These details involve tree patterns with chains of //- and *-nodes. Such cases require some additional pre-processing to convert the tree pattern to some canonical form, but this does not increase our method's time complexity.

[0073] To ensure that tree patterns are concise, identification and elimination of “redundant” nodes are performed. Given a tree pattern p, a minimized tree pattern p′ equivalent to p can be computed using a recursive method MINIMIZE. Starting with the root of p, our minimization method performs the following two steps to minimize the sub-pattern Subtree(v,p) rooted at node v in p: (1) For any v′, v″εChild (v, p), if Subtree(v′, p)

[0074] It can be shown that Method MINIMIZE minimizes any tree pattern p in O(|p|^{2}

[0075] Given the low computational complexities of CONTAINS and MINIMIZE, one might expect that this would also be the case for Method LUB. Unfortunately, in the worst case, the size of the (minimized) LUB of two tree patterns can be exponentially large. Implementation results, however, demonstrate that the LUB method exhibits reasonably low average case complexity in practice.

[0076] 4. Selectivity-Based Aggregation Methods

[0077] While the LUB method presented in the previous section can be used to compute a single, most precise aggregate tree pattern for a given set S of patterns, the size of the LUB may be too large and, therefore, may violate the specified space constraint k on the total size of the aggregated subscriptions (Section 2.2). Thus, in order to fit aggregates within the allotted space budget, the requirement of a single precise aggregate is relaxed by permitting a solution to be a set S′={p_{1}_{2}_{m}_{i}

[0078] A simple measure of the preciseness of S′ is its selectivity, which is essentially the fraction of filtered XML documents that satisfy some pattern in S′. Thus, an objective is to compute a set S′ of aggregate patterns whose selectivity is very close to that of S. Clearly, the selectivity of tree patterns is highly dependent on the distribution of the underlying collection of XML documents (denoted by D). It is, however, generally infeasible to maintain the detailed distribution D of streaming XML documents for our aggregation—the space requirements would be enormous! Instead, an approach herein is based on building a concise synopsis of D on-line (i.e., as documents are streaming by), and using that synopsis to estimate tree-pattern selectivities. At a high level, an illustrative aggregation method iteratively computes a set S′ that is both selective and satisfies the space constraint, starting with S′=S (i.e., the original set S of patterns), and performing the following sequence of steps in each iteration:

[0079] (1) Generate a candidate set of aggregate tree patterns C consisting of patterns in S′ and LUBs of similar pattern pairs in S′.

[0080] (2) Prune each pattern p in C by deleting/merging nodes in p in order to reduce its size.

[0081] (3) Choose a candidate pattern pεC to replace all patterns in S′ that are contained in p. The candidate-selection strategy is based on marginal gains: The selected candidate p is the one that results in the minimum loss in selectivity per unit reduction in the size of S′ (due to the replacement of patterns in S′ by p).

[0082] Note that the pruning step (step

[0083] In the following subsections, an exemplary method for computing S′ is described in detail. First, an approach is presented for estimating the selectivity of tree patterns over the underlying document distribution, which is critical to choosing a good replacement candidate in step 3 above.

[0084] 4.1 Selectivity Estimation for Tree Patterns

[0085] The document tree synopsis is now described. As mentioned above, it is simply impossible to maintain the accurate document distribution D (i.e., the full set of streaming documents) in order to obtain accurate selectivity estimates for our tree patterns. Instead, an exemplary approach is to approximate D by a concise synopsis structure, which is referred to herein as the document tree. A document tree synopsis for D, denoted by DT, captures path statistics for documents in D, and is built on-line as XML documents stream by. The document tree essentially has the same structure as an XML tree, except for two differences. First, the root node of DT has the special label “/.”. Second, each non-root node t in DT has a frequency associated with it, denoted by freq(t). Intuitively, if l_{1}_{2}_{n }_{1}_{2}_{n }_{8 }_{8}_{8 }_{8 }

[0086] Next, T_{8 }_{8}_{1}_{2}_{n}_{1}_{2}_{3 }

[0087] It should be noted that a selectivity estimation problem for tree patterns differs from the work of Aboulnaga in two important respects. First, in Aboulnaga, the authors consider the problem of estimating selectivity for only simple paths that consist of a //-node followed by tag nodes. In contrast, here selectivities are estimated of general tree patterns with branches, and *- or //-nodes arbitrarily distributed in the tree. Second, selectivity at the granularity of documents is important herein, so a goal is to estimate the number of XML documents that match a tree pattern; instead, Aboulnaga addresses the selectivity problem at the granularity of individual document elements that are discovered by a path. It can be seen that these are two very different estimation problems.

[0088] A selectivity estimation procedure is now described. Recall that the selectivity of a tree pattern p is the fraction of documents T in D that satisfy p. By construction, a DT synopsis gives accurate selectivity estimates for tree patterns comprising a single chain of tag-nodes (i.e., with no * or //). However, obtaining accurate selectivity estimates for arbitrary tree patterns with branches, *, and // is, in general, not possible with DT summaries. This is because, while DT captures the number of documents containing a single path, it does not store document identities. As a result, for a pair of arbitrary paths in a tree pattern, it is generally hard to determine the exact number of documents that contain both paths or documents that contain one path, but not the other.

[0089] An exemplary estimation procedure solves this problem, by making the following simplifying assumption: The distribution of each path in a tree pattern is independent of other paths. Thus, selectivity is estimated of a tree pattern containing no // or * labels, simply as the product of the selectivities of each root to leaf path in the pattern. For patterns containing // or *, all possible instantiations are considered for // and * with element tags, and then chosen as a pattern selectivity the maximum selectivity value over all instantiations. Selectivity estimation methodology is illustrated in the following example.

[0090] Consider the problem of estimating the selectivities of the tree patterns shown in _{1 }_{1 }_{2 }_{3 }_{1}_{2}_{2 }_{2 }_{2 }_{3 }_{2}_{2 }_{2}_{2 }

[0091] Finally, the selectivity of p_{3 }

[0092] Method SEL (depicted in _{root }_{root }_{3 }_{3}

[0093] A goal is to compute SelSubPat[v_{root}_{root}_{c }_{c}_{c }_{c}_{c}_{c}_{c}_{t}_{c}_{εChild(t,DT) }_{c}_{c}_{c }_{c}_{c }_{c }_{t}_{c}_{εChild(t,DT)}

[0094] 4.2 Tree Pattern Aggregation Method

[0095] A “greedy” heuristic method is now presented for the tree pattern aggregation problem defined in Section 2.2 (which is, in general, an NP-hard clustering problem). As described earlier, to aggregate an input set of tree patterns S into a space-efficient and precise set, the method (Method AGGREGATE in

[0096] Candidate generation is now described. An exemplary process is described for generating the candidate set C in steps

[0097] Candidate selection is now described. Once the set of candidate aggregate patterns has been generated, some criterion is beneficial for selecting the “best” candidate to insert into S′. For this purpose, associate a benefit value with each candidate aggregate pattern xεC, denoted by Benefit(x), based on its marginal gain; that is, define Benefit(x) as the ratio of the savings in space to the loss in selectivity of using x over {p|p_{x}_{root}_{root }_{p}_{root }

[0098] Note that the selectivity loss is computed by comparing the selectivity of the candidate aggregate pattern x with that of the least selective pattern contained in it. This gives a good approximation of the selectivity loss in cases when the patterns p,qεS′ used to generate x are similar and overlap in the document tree DT. The candidate aggregate pattern with the highest benefit value is chosen to replace the patterns contained in it in S′ (steps

[0099] It is to be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. For example, the subscriptions could contain both tree patterns and non-tree patterns. The various assumptions made herein are for the purposes of simplicity and clarity of illustration, and should not be construed as requirements of the present invention.