UIS for Data-Intensive Analytical Systems
 
Data
 
 

Data to Download:

  • Indexing Data with Missing Values - data sets that were used to test the GammaEXT and GammaAPX techniques for indexing data with missing values.
  • Clustering Data - data sets that were used to test the GARDENHD clustering technique in multidimensional spaces.

The data sets below were used to test the GARDENHD clustering technique for multidimensional data. The experiments are described in the paper "Clustering High-Dimensional Data Using an Efficient and Effective Data Space Reduction". The data sets are normalized and formatted according to the GARDENHD input format (see the GARDENHD User Manual).

DOWNLOADS:

  • Download the real data set: covtypeNorm.zip
  • Download the synthetic data set: animalsNorm.zip
  • Download the synthetic data generator: DataGenerator.zip
    (Unzip this archive and double-click DataGenertor.bat to generate the synthetic data sets.)

 

More detailed information about the data sets:

  • Real Data Set
    "covtype"

The real data set “covtype” is obtained from the UCI Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html). It has 581,012 points with 54 dimensions (the 55th dimension records the class information of objects). There are 7 classes in this data set, each of which represents one type of tree.

  • Synthetic Data Set
    "animals"

The synthetic data set “animals” with 500,000 points is produced by the “animals.c” program obtained from the UCI Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html). This data set has 72 dimensions (the 73rd dimension records the class information). There are 4 classes in this data set, each of which represents one type of animal.
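Both descriptions above place the class label in the final dimension (the 55th for "covtype", the 73rd for "animals"). The following is a minimal Python sketch of separating the feature values from the class labels. It assumes, purely for illustration, that each point is stored as one line of whitespace-separated numeric values; the actual GARDENHD input format is defined in the user manual and may differ. The function name `split_features_and_labels` and the sample values are hypothetical.

```python
def split_features_and_labels(lines, num_features):
    """Split rows into feature vectors and class labels.

    Assumption (not taken from the GARDENHD manual): each non-empty line
    holds num_features values followed by one class label, separated by
    whitespace.
    """
    features, labels = [], []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != num_features + 1:
            raise ValueError(
                f"line {line_no}: expected {num_features + 1} fields, "
                f"got {len(fields)}"
            )
        features.append([float(v) for v in fields[:num_features]])
        # The last field is the class label.
        labels.append(int(float(fields[num_features])))
    return features, labels


# Tiny made-up sample with 3 feature dimensions plus a class column.
sample = ["0.10 0.20 0.30 1", "0.90 0.80 0.70 2"]
X, y = split_features_and_labels(sample, num_features=3)
```

Under the same assumption, "covtype" would be read with `num_features=54` and "animals" with `num_features=72`.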

  • Synthetic Data Sets
    "Center-Corners"

These data sets form two groups. The first group contains 10 synthetic data sets with 100,000 points each and dimensionality varying from 10 to 100 in steps of 10. The second group contains 5 synthetic 10-dimensional data sets with the number of points varying from 100,000 to 500,000. All data sets in both groups have a “Center-Corners” distribution, in which one generated hyper-rectangle is placed in the center of the space and the others in 10 different corners (the origin, the far corner, and 8 randomly selected corners). All generated hyper-rectangles have the same density. Moreover, each hyper-rectangle has a uniform internal distribution and represents a different class of data, so each point is assigned to one of 11 classes. These synthetic data sets were produced by the DataGenertor.bat program provided above.
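The “Center-Corners” layout described above can be sketched in a few lines of Python. This is not the code behind DataGenertor.bat, only an illustration under stated assumptions: the space is taken to be the unit hypercube, every cluster is a hypercube with the same edge length `side`, and equal density is approximated by giving the equal-volume clusters equal point counts. The function name `center_corners` and all of its parameters are hypothetical.

```python
import random


def center_corners(num_points, dims, side=0.2, num_random_corners=8, seed=0):
    """Generate labeled points with a Center-Corners layout in [0, 1]^dims.

    Assumptions (not from the original generator): unit hypercube space,
    hypercube clusters with edge length `side`, and (almost) equal point
    counts per cluster so that all clusters have the same density.
    """
    if not 0 < side < 1 / 3:
        raise ValueError("side must be in (0, 1/3) so clusters do not overlap")
    if 2 ** dims < num_random_corners + 2:
        raise ValueError("not enough distinct corners in this dimensionality")
    rng = random.Random(seed)

    # Corners are encoded as 0/1 tuples: origin, far corner, then random ones.
    corners = [(0,) * dims, (1,) * dims]
    while len(corners) < num_random_corners + 2:
        candidate = tuple(rng.randint(0, 1) for _ in range(dims))
        if candidate not in corners:
            corners.append(candidate)

    # Lower bound of each cluster along every dimension; class 0 is the center.
    lows = [[0.5 - side / 2] * dims]
    lows += [[0.0 if bit == 0 else 1.0 - side for bit in c] for c in corners]

    points = []
    num_clusters = len(lows)
    for label, low in enumerate(lows):
        # Spread any remainder over the first clusters so counts sum exactly.
        count = num_points // num_clusters + (label < num_points % num_clusters)
        for _ in range(count):
            # Uniform internal distribution inside the cluster's hypercube.
            coords = [lo + rng.random() * side for lo in low]
            points.append((coords, label))
    return points


data = center_corners(num_points=1100, dims=10)
```

Each returned item is a `(coordinates, class_label)` pair, with label 0 for the center cluster and labels 1 to 10 for the corner clusters, which gives the 11 classes mentioned above.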

List of Data Sets:

  • covtypeNorm

Normalized and cleaned set of real data "covtype" (100 MB uncompressed).

  • animalsNorm

Normalized and cleaned set of synthetic data "animals" (223 MB uncompressed).

  • points10dCCNorm

Normalized synthetic set of 100,000 points in 10-dimensional space with "Center-Corners" distribution (9 MB uncompressed).

  • points20dCCNorm

Normalized synthetic set of 100,000 points in 20-dimensional space with "Center-Corners" distribution (18 MB uncompressed).

  • points30dCCNorm

Normalized synthetic set of 100,000 points in 30-dimensional space with "Center-Corners" distribution (28 MB uncompressed).

  • points40dCCNorm

Normalized synthetic set of 100,000 points in 40-dimensional space with "Center-Corners" distribution (37 MB uncompressed).

  • points50dCCNorm

Normalized synthetic set of 100,000 points in 50-dimensional space with "Center-Corners" distribution (46 MB uncompressed).

  • points60dCCNorm

Normalized synthetic set of 100,000 points in 60-dimensional space with "Center-Corners" distribution (55 MB uncompressed).

  • points70dCCNorm

Normalized synthetic set of 100,000 points in 70-dimensional space with "Center-Corners" distribution (64 MB uncompressed).

  • points80dCCNorm

Normalized synthetic set of 100,000 points in 80-dimensional space with "Center-Corners" distribution (73 MB uncompressed).

  • points90dCCNorm

Normalized synthetic set of 100,000 points in 90-dimensional space with "Center-Corners" distribution (82 MB uncompressed).

  • points100dCCNorm

Normalized synthetic set of 100,000 points in 100-dimensional space with "Center-Corners" distribution (91 MB uncompressed).

  • points10dCC100kNorm

Normalized synthetic set of 100,000 points in 10-dimensional space with "Center-Corners" distribution (9 MB uncompressed).

  • points10dCC200kNorm

Normalized synthetic set of 200,000 points in 10-dimensional space with "Center-Corners" distribution (19 MB uncompressed).

  • points10dCC300kNorm

Normalized synthetic set of 300,000 points in 10-dimensional space with "Center-Corners" distribution (28 MB uncompressed).

  • points10dCC400kNorm

Normalized synthetic set of 400,000 points in 10-dimensional space with "Center-Corners" distribution (38 MB uncompressed).

  • points10dCC500kNorm

Normalized synthetic set of 500,000 points in 10-dimensional space with "Center-Corners" distribution (47 MB uncompressed).


©2007 Dstar