|
 |

1. |
Y. Lai, R. Orlandic, W.G. Yee and S. Kulkarni, "Scalable Clustering for Large High-Dimensional Data Based on Data Summarization," IEEE Symposium on Computational Intelligence and Data Mining CIDM'2007,
Honolulu, Hawaii, 456-461, 2007. |
| Abstract: |
Clustering large data sets with high dimensionality is a challenging data-mining task. This paper presents a framework to perform such a task efficiently. It is based on the notion of data space reduction, which finds high density areas, or dense cells, in the given feature space. The dense cells store summarized information of the data. A designated partitioning or hierarchical clustering algorithm can be used as the second step to find clusters based on the data summaries. Using Kmeans as an example, this paper presents GARDEN-Kmeans, which performs data space reduction using Gamma Region DEN-sity partition, and utilizes Kmeans to cluster the summarized information. The experimental study shows that GARDEN-Kmeans executes several orders of magnitude faster than basic Kmeans and the recursive bisection Kmeans algorithm of CLUTO, while producing comparable clustering quality. |
2. |
B. Yu, T. Bailey and R. Orlandic, “Estimating the Performance of Multidimensional Access Methods based on Non-Overlapping Regions,” International Journal of Intelligent Systems, 22(3):259-277, 2007. |
| Abstract: |
Most cost models for the prediction of the performance of multidimensional access methods are based on Minkowski operations. However, this approach does not correspond to the space-partitioning strategy of multidimensional access methods that produce non-overlapping index regions, e.g. the KDB-tree and its variants. This paper proposes a new cost model for the prediction of the selection (search) performance of the related access methods, including its adaptation for non-uniform data. The results of an extensive set of experiments conducted on both simulated and real data show that the cost model and its extension provide a basis for a highly accurate analysis of these structures in both low- and high-dimensional situations. |
3. |
S. Kulkarni and R. Orlandic, “High-Dimensional Similarity Search Using Data-Sensitive Space Partitioning,”
Proc. 17th Intern’l Conf. on Database and Expert Systems Applications DEXA’2006,
Krakow, Poland, 738-750, 2006. |
| Abstract: |
Nearest neighbor search has a wide variety of applications. Unfortunately, the majority of search methods do not scale well with dimensionality. Recent efforts have been focused on finding better approximate solutions that improve the locality of data using dimensionality reduction. However, it is possible to preserve the locality of data and find exact nearest neighbors in high dimensions without dimensionality reduction. This paper introduces a novel high-performance technique to find exact k-nearest neighbors in both low and high dimensional spaces. It relies on a new method for data-sensitive space partitioning based on explicit data clustering, which is introduced in the paper for the first time. This organization supports effective reduction of the search space before accessing secondary storage. Costly Euclidean distance calculations are reduced through efficient processing of a lightweight memory-based filter. The algorithm outperforms sequential scan and the VA-File in high-dimensional situations. Moreover, the results with dynamic loading of data show that the technique works well on dynamic datasets as well. |
4. |
B. Yu and R. Orlandic, “Indexing Regional Objects in High-Dimensional Spaces,”
in: Advanced Topics in Database Research – Vol. 5, ed. K. Siau, Chapter 17,
Idea Group Publishing, 356–380, 2006. |
| Abstract: |
Many spatial access methods, such as the R-tree, have been designed to support spatial search operators (e.g., overlap, containment, and enclosure) over both points and regional objects in multi-dimensional spaces. Unfortunately, contemporary spatial access methods are limited by many problems that significantly degrade the query performance in high-dimensional spaces. This chapter reviews the problems of contemporary spatial access methods in spaces with many dimensions and presents an efficient approach to building advanced spatial access methods that effectively attack these problems. It also discusses the importance of high-dimensional spatial access methods for the emerging database applications, such as location-based services. |
5. |
R. Orlandic and S. Kulkarni, “The Architecture of a New Storage and Retrieval System for Scientific Studies,” Proc. Intern’l Conf. on Information and Knowledge Engineering IKE'2006,
Las Vegas, Nevada, 35-41, 2006.
|
| Abstract: |
The proliferation of large data repositories inspires continuing interest in storage systems and organization. This paper describes basic principles and architecture of a new prototype system for data storage and retrieval, called D*, which is designed to support advanced scientific studies in Data Grid environments. Unlike typical storage systems, the D* system enables a tighter integration of data-storage and data-mining technologies. Through the application of innovative retrieval and clustering techniques for high-dimensional data, D* can support high-performance data access and provide scientific applications useful insights into the data that can facilitate subsequent processes of data preparation and data mining. The system scales well with the increasing dimensionality of scientific data and works well on incremental loads of data. |
6. |
R. Orlandic, Y. Lai and W.G. Yee, “Clustering High-Dimensional Data Using an Efficient and Effective Data Space Reduction,” Proc. ACM 14th Conf. on Information and Knowledge Management CIKM'2005,
Bremen, Germany, 201–208, 2005. |
| Abstract: |
This paper introduces a new algorithm for clustering data in high-dimensional feature spaces, called GARDENHD. The algorithm is organized around the notion of data space reduction, i.e. the process of detecting dense areas (dense cells) in the space. It performs effective and efficient elimination of empty areas that characterize typical high-dimensional spaces and an efficient adjacency-connected agglomeration of dense cells into larger clusters. It produces a compact representation that can effectively capture the essence of data. GARDENHD is a hybrid of cell-based and density-based clustering. However, unlike typical clustering methods in its class, it applies a recursive partition of sparse regions in the space using a new space-partitioning strategy. The properties of this partitioning strategy greatly facilitate data space reduction. The experiments on synthetic and real data sets reveal that GARDENHD and its data space reduction are effective, efficient, and scalable. |
7. |
R. Orlandic and Y. Lai, "Clustering Technology of a Data Engine for Analytical Computing,"
Proc. IEEE 4th Intern'l Conf. on Intelligent Systems Design and Applications ISDA'04,
Budapest, Hungary, 699–704, 2004. |
| Abstract: |
Contemporary scientific studies frequently rely on data-intensive analytical computing. While the main goal of this emerging form of computing is to facilitate hypothesis formulation or to test the validity of a postulated model, its primary method is usually that of data clustering. Since typical analytical tasks operate on very large volumes of potentially highdimensional data, scientific studies also face enormous problems of scale. This paper describes the clustering technology of a new engine for data-intense analytical computing. The technology is designed to operate in highdimensional feature spaces without requiring dimensionality reduction. This enables the data engine to achieve high degrees of scalability and high interoperability between the analytical tasks. Most processes supported by the engine operate on a shared aggregate representation of data in the original feature space. |
8. |
J. Lukaszuk and R. Orlandic, "Efficient High-Dimensional Indexing by Superimposing Space-Partitioning Schemes," Intern'l Database Engineering and Applications Symposium
IDEAS'04, Coimbra, Portugal, 257–264, 2004. |
| Abstract: |
The problem of accessing data in high-dimensional spaces has attracted considerable scientific interest. Multidimensional access methods typically employ a spacepartitioning strategy whose goal is to direct the search toward relevant parts of the space. Unfortunately, monolithic
partitioning, whether it is static, dynamic or some hybrid scheme, has serious limitations in high-dimensional situations. This paper presents a new family of indexing techniques that superimposes static partitioning over a dynamic partitioning scheme in order to improve the search performance in high-dimensional spaces. A comprehensive set of experiments shows that, in realistic scenarios, the proposed solutions are usually much more efficient and more scalable than the multi-dimensional access methods based on monolithic partitioning. |
9. |
J. Lukaszuk and R. Orlandic, "On Accessing Data in High-Dimensional Spaces: A Comparative Study of Three Space Partitioning Strategies," Journal of Systems and Software 73(1):147–157, 2004. |
| Abstract: |
While experience shows that contemporary multi-dimensional access methods perform poorly in high-dimensional spaces, little is known about the underlying causes of this important problem. One of the factors that has a profound effect on the performance of a multi-dimensional structure in high-dimensional situations is its space partitioning strategy. This paper investigates the partitioning strategies of KDB-trees, the Pyramid Technique, and a new point access method called the Θs Technique. The paper reveals important dimensionality problems associated with these strategies and shows how each strategy affects the retrieval performance across a range of spaces with varying dimensionalities. The Pyramid Technique, which is frequently regarded as the state-of-the-art access method for high-dimensional data, suffers from numerous problems that become particularly severe with highly skewed data in heavily sparse spaces. Although the partitioning strategy of KDB-trees incurs several problems in high-dimensional spaces, it exhibits a remarkable adaptability to the changing data distributions. However, the experimental evidence gathered on both simulated and real data sets shows that the Θs Technique generally outperforms the other two schemes in high-dimensional spaces, usually by a significant margin. |
1. |
R. Orlandic and S. Kulkarni, “D*: A Data Storage and Retrieval System for Scientific Studies,”
Technical Document, 2006. |
| Abstract: |
D* is a novel system for data storage and retrieval appropriate for advanced scientific studies, as in high-energy physics, environmental sciences, and astronomy. The design of the D* system is based on certain principles of organizing and accessing multi-dimensional data on storage, whose pursuit requires that the storage system acquire a greater knowledge about the data. This provides a basis for a tighter integration of data storage and data mining technologies. Through the application of innovative retrieval and clustering techniques for high-dimensional data, D* can support high-performance data access and provide data mining applications useful insights into the data that can facilitate subsequent processes of data preparation and data mining. The basic processes of the D* system include data clustering, space partitioning, data loading, and data retrieval based on region queries and similarity searching. D* scales well with increasing data dimensionality and works well on incremental load of data. |
2. |
R. Orlandic, J. Lukaszuk, S. Kulkarni, and W.G. Yee,
“On Two Principles of Designing Multi-Dimensional Access Methods,” Technical Document, 2006. |
| Abstract: |
This paper addresses the problem of supporting region queries over a dynamically growing set of data in feature spaces with many dimensions. It formulates two principles of indexing data in high-dimensional feature spaces and presents an approach to high-dimensional indexing with these principles in mind. The combined goal of the two principles is to increase the density of index regions, which minimizes the number of accesses to persistent storage. In order to impose a desirable behavior on traditional multi-dimensional access methods that normally do not exhibit this behavior in high-dimensional spaces, the proposed approach applies a memory-resident structure with a special kind of space partitioning. The two principles are used not only as design objectives, but also as means of establishing the plausibility of the approach and experimental results. The results provide important evidence suggesting that a greater adherence to these principles leads to an improved retrieval performance. |
3. |
J. Lukaszuk, R. Orlandic and B. Yu, "Indexing Multi-Dimensional Data with Missing Values",
Technical Document, 2005. |
| Abstract: |
Advanced analytical studies are usually conducted on data with many dimensions. However, the large number of attributes associated with each data object naturally leads to situations where not all values are available. This paper presents a novel solution to the problem of retrieving multi-dimensional data with missing values based on region queries. The key aspect of the solution is that it effectively makes use of available information. In addition, its worst-case behavior is not worse than that of the underling indexing technique, and it can support effective prediction of query performance over incomplete data. While the solution is studied on a new indexing technique, it is a general solution that can be adopted by a broad range of multi-dimensional access methods. A comprehensive experimental study was conducted on real and synthetic data to monitor the efficiency and effectiveness of the technique. The study reveals that, when the incomplete data has sizable amount of available information, the technique outperforms sequential scan by a significant margin. Moreover, the indicators of both efficiency and effectiveness are comparable to those achieved on complete data. |
4. |
R. Orlandic, Y. Lai and W.G. Yee, "Clustering High-Dimensional Data Using an Efficient and Effective Data Space Reduction,", Technical Document, 2005. |
| Abstract: |
Clustering multi-dimensional data on storage in a way that maximizes the densities of storage clusters can greatly facilitate efficient and scalable retrieval of data. Clustering algorithms appropriate for this application must support efficient reduction of the data space that enables effective reduction of the search space. This paper introduces a new algorithm for clustering data in multi-dimensional feature spaces, called GARDENHD. Designed to support intelligent data storage, this algorithm employs an efficient data space reduction technique that scales both in terms of the number and dimensionality of data objects. It also produces a compact representation that can effectively capture the essence of the data. The algorithm is a hybrid of cell-based and density-based clustering. However, unlike typical clustering methods in its class, it applies a recursive partition of sparse regions in the space using a new space-partitioning strategy. The properties of this partitioning strategy greatly facilitate data space reduction. The experiments on synthetic and real data sets reveal that GARDENHD and its data space reduction are effective, efficient, and scalable. |
Adobe® Reader is required to view these document. |
 |
|
|
|
|
|