UIS for Data-Intensive Analytical Systems
 
Data

Data to Download:

  • Indexing Data with Missing Values - data sets that were used to test the GammaEXT and GammaAPX techniques for indexing data with missing values.
  • Clustering Data - data sets that were used to test the GARDENHD clustering technique in multidimensional spaces.

The data sets provided below, each with 1,000,000 points, have been used to test the GammaEXT and GammaAPX techniques for indexing data with missing values. The experiments are described in the paper "Indexing Multi-Dimensional Data with Missing Values".

Notation:

The notation used to identify data sets is "t.d.p.v", where

  • t indicates data distribution;
  • d indicates data dimensionality;
  • p indicates the percentage of incomplete points;
  • v indicates the maximum number of missing values in a point.

For example, u.25.50.3 is a set of uniform 25-dimensional data, in which 50% of all points are incomplete and have up to 3 missing values.
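
As an illustration of the notation, a data set name can be split into its four fields. This is a minimal sketch: the `DataSetName` class, the `parse_name` function, and the long distribution labels are our own names for illustration and are not part of the downloadable files.

```python
from dataclasses import dataclass

# Distribution codes used on this page; the long labels are illustrative.
DISTRIBUTIONS = {"r": "real", "s": "skewed", "u": "uniform"}


@dataclass(frozen=True)
class DataSetName:
    distribution: str    # t: "real", "skewed" or "uniform"
    dimensions: int      # d: data dimensionality
    incomplete_pct: int  # p: percentage of incomplete points
    max_missing: int     # v: maximum number of missing values in a point


def parse_name(name: str) -> DataSetName:
    """Split a "t.d.p.v" identifier into its four fields."""
    t, d, p, v = name.split(".")
    if t not in DISTRIBUTIONS:
        raise ValueError(f"unknown distribution code: {t!r}")
    return DataSetName(DISTRIBUTIONS[t], int(d), int(p), int(v))
```

For instance, `parse_name("u.25.50.3")` yields a uniform, 25-dimensional set with 50% incomplete points and at most 3 missing values per point.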

Data Range:

All values (coordinates of points) are in the [0,1] range, with the exception of missing values, which are represented as -1. The values of a point are separated by commas.
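
A reader for this format might look like the sketch below. It assumes that each point occupies one line of the file, which the description above does not state explicitly; `parse_point` and `MISSING` are hypothetical names.

```python
from typing import List, Optional

MISSING = -1.0  # marker used in the files for a missing value


def parse_point(line: str) -> List[Optional[float]]:
    """Parse one comma-separated point; missing values become None.

    Assumes one point per line (an assumption, not stated on this page).
    """
    values: List[Optional[float]] = []
    for field in line.strip().split(","):
        if not field.strip():
            continue  # tolerate a trailing comma
        x = float(field)
        if x == MISSING:
            values.append(None)
        elif 0.0 <= x <= 1.0:
            values.append(x)
        else:
            raise ValueError(f"value outside [0,1]: {x}")
    return values
```

Mapping -1 to `None` keeps the missing-value marker from being mistaken for a coordinate in later distance or range computations.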

Base Data Sets:

All data sets used in the experiments were derived from the following three sets, each with 1,000,000 25-dimensional points:

  1. r.25.29.4 - Real data set with missing values, extrapolated from a database of a local company. Exactly 289,087 records (approximately 29%) have missing values in up to 4 dimensions, with about 2.7 missing values per incomplete record on average. Of these, 48,217 records have 1 missing value, 19,530 have 2, 195,750 have 3, and 25,590 have 4. The remaining 710,913 records are complete. All missing values are confined to 5 of the 25 dimensions.

  2. s.25.0.0 - Synthetic set of heavily skewed data with 10 clusters, each with a very skewed internal distribution. Taking the first five dimensions of the [0,1]^{25} space, we randomly selected 10 different corners of this 5-dimensional sub-space. In each of these "corners", we generated a cluster with side length 0.25 in the first 5 dimensions and side length 1 in the remaining 20 dimensions. For each dimension i <= 25 of every cluster, we randomly selected a real variable peak_i between 0.5 and 1 and an integer variable slope_i between 3 and 7. Coordinate i of a point was then generated by multiplying peak_i by the average of slope_i random values between 0 and 1, and the result was scaled to fit the corresponding range of the given cluster. As a result of this construction, the values in each cluster along each dimension follow an approximately normal distribution with a different slope and a randomly shifted peak. The points are interleaved: each point in a sequence of 10 consecutive points in the data set comes from a different cluster.

  3. u.25.0.0 - Synthetic set of uniform data without missing values.
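
The construction of the skewed set s.25.0.0 can be sketched as follows. This is not the downloadable generator, only an illustration of the description above under stated assumptions: "scaled to fit the corresponding range" is read as a linear mapping onto the cluster's interval in that dimension, and the points are emitted round-robin over the clusters. All function names are hypothetical.

```python
import random

DIMS, SUB_DIMS, CLUSTERS = 25, 5, 10
SIDE = 0.25  # cluster side length in each of the first 5 dimensions


def make_clusters(rng):
    """Pick 10 distinct corners of the 5-d sub-space and per-dimension parameters."""
    corners = rng.sample(range(2 ** SUB_DIMS), CLUSTERS)  # each int encodes a corner
    clusters = []
    for corner in corners:
        dims = []
        for i in range(DIMS):
            if i < SUB_DIMS:
                high_side = (corner >> i) & 1  # bit i: low or high end of dimension i
                low = (1.0 - SIDE) if high_side else 0.0
                length = SIDE
            else:
                low, length = 0.0, 1.0
            peak = rng.uniform(0.5, 1.0)  # real peak_i in [0.5, 1]
            slope = rng.randint(3, 7)     # integer slope_i in 3..7
            dims.append((low, length, peak, slope))
        clusters.append(dims)
    return clusters


def make_point(cluster, rng):
    """Coordinate i = peak_i * mean of slope_i uniform values, scaled into the cluster."""
    point = []
    for low, length, peak, slope in cluster:
        mean = sum(rng.random() for _ in range(slope)) / slope
        point.append(low + length * peak * mean)  # assumed linear scaling
    return point


def generate(n, seed=0):
    rng = random.Random(seed)
    clusters = make_clusters(rng)
    # Round-robin: every run of 10 consecutive points covers all 10 clusters.
    return [make_point(clusters[k % CLUSTERS], rng) for k in range(n)]
```

Averaging slope_i uniform values gives a bell-shaped distribution that narrows as slope_i grows, while peak_i shifts where the bell is centered; this is the source of the skew within each cluster.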

Download Sets of Real Data:

Set of real data with missing values described above (130MB-uncompressed).

Real data set without missing values (135MB-uncompressed). Derived from r.25.29.4 by replacing each missing value with a random value among at most 10 different known values (including the minimum and maximum) in the corresponding dimension.
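
One possible reading of this replacement step is sketched below. How the (at most) 10 known values were chosen is not described here, so the sketch assumes the minimum, the maximum, and up to 8 other distinct known values picked at random; `candidates` and `fill_missing` are hypothetical names, and every dimension is assumed to contain at least one known value.

```python
import random

MISSING = -1.0  # marker for a missing value


def candidates(column, rng, limit=10):
    """Up to `limit` distinct known values of one dimension, always including min and max."""
    known = sorted({x for x in column if x != MISSING})
    if len(known) <= limit:
        return known
    inner = rng.sample(known[1:-1], limit - 2)  # assumed: random choice of the rest
    return [known[0], *inner, known[-1]]


def fill_missing(points, seed=0):
    """Replace each missing value with a random candidate from its dimension."""
    rng = random.Random(seed)
    dims = len(points[0])
    pools = [candidates([p[i] for p in points], rng) for i in range(dims)]
    return [
        [rng.choice(pools[i]) if x == MISSING else x for i, x in enumerate(p)]
        for p in points
    ]
```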

Download the Synthetic Data Generator:

The synthetic sets were generated using a data generator, which you can download below. Unzip all files into a local folder and run Run.bat (for example, by double-clicking it).

List of Synthetic Data Sets Produced by the Generator:

  • s.25.0.0

Synthetic set of skewed data without missing values (194MB-uncompressed).

  • s.20.0.0

Synthetic set of skewed data without missing values (155MB-uncompressed).
Derived from s.25.0.0 by extracting the first 20 dimensions.

  • s.15.0.0

Synthetic set of skewed data without missing values (117MB-uncompressed).
Derived from s.25.0.0 by extracting the first 15 dimensions.

  • s.10.0.0

Synthetic set of skewed data without missing values (78MB-uncompressed).
Derived from s.25.0.0 by extracting the first 10 dimensions.

  • s.5.0.0

Synthetic set of skewed data without missing values (39MB-uncompressed).
Derived from s.25.0.0 by extracting the first 5 dimensions.

  • s.25.50.3

Synthetic set of skewed data with missing values (188MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.20.50.3

Synthetic set of skewed data with missing values (150MB-uncompressed).
Derived from s.20.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.15.50.3

Synthetic set of skewed data with missing values (117MB-uncompressed).
Derived from s.15.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.10.50.3

Synthetic set of skewed data with missing values (73MB-uncompressed).
Derived from s.10.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.5.50.3

Synthetic set of skewed data with missing values (35MB-uncompressed).
Derived from s.5.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.25.15.3

Synthetic set of skewed data with missing values (192MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 15% of points.

  • s.25.30.3

Synthetic set of skewed data with missing values (191MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 30% of points.

  • s.25.45.3

Synthetic set of skewed data with missing values (189MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 45% of points.

  • s.25.60.3

Synthetic set of skewed data with missing values (188MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 60% of points.

  • s.25.75.3

Synthetic set of skewed data with missing values (186MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 75% of points.

  • s.25.50.5

Synthetic set of skewed data with missing values (186MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 5) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.25.50.10

Synthetic set of skewed data with missing values (180MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 10) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • s.25.50.15

Synthetic set of skewed data with missing values (174MB-uncompressed).
Derived from s.25.0.0 by marking a random number (between 1 and 15) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.25.0.0

Synthetic set of uniform data without missing values (194MB-uncompressed).

  • u.20.0.0

Synthetic set of uniform data without missing values (155MB-uncompressed).
Derived from u.25.0.0 by extracting the first 20 dimensions.

  • u.15.0.0

Synthetic set of uniform data without missing values (117MB-uncompressed).
Derived from u.25.0.0 by extracting the first 15 dimensions.

  • u.10.0.0

Synthetic set of uniform data without missing values (78MB-uncompressed).
Derived from u.25.0.0 by extracting the first 10 dimensions.

  • u.5.0.0

Synthetic set of uniform data without missing values (39MB-uncompressed).
Derived from u.25.0.0 by extracting the first 5 dimensions.

  • u.25.50.3

Synthetic set of uniform data with missing values (188MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.20.50.3

Synthetic set of uniform data with missing values (150MB-uncompressed).
Derived from u.20.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.15.50.3

Synthetic set of uniform data with missing values (112MB-uncompressed).
Derived from u.15.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.10.50.3

Synthetic set of uniform data with missing values (73MB-uncompressed).
Derived from u.10.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.5.50.3

Synthetic set of uniform data with missing values (35MB-uncompressed).
Derived from u.5.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.25.15.3

Synthetic set of uniform data with missing values (192MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 15% of points.

  • u.25.30.3

Synthetic set of uniform data with missing values (191MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 30% of points.

  • u.25.45.3

Synthetic set of uniform data with missing values (189MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 45% of points.

  • u.25.60.3

Synthetic set of uniform data with missing values (188MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 60% of points.

  • u.25.75.3

Synthetic set of uniform data with missing values (186MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 3) of values as missing, in randomly selected dimensions, in each of the first 75% of points.

  • u.25.50.5

Synthetic set of uniform data with missing values (186MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 5) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.25.50.10

Synthetic set of uniform data with missing values (180MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 10) of values as missing, in randomly selected dimensions, in each of the first 50% of points.

  • u.25.50.15

Synthetic set of uniform data with missing values (174MB-uncompressed).
Derived from u.25.0.0 by marking a random number (between 1 and 15) of values as missing, in randomly selected dimensions, in each of the first 50% of points.
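
The two derivation steps used throughout the list above (keeping the first d dimensions, and introducing missing values into the first p% of points) can be sketched as follows. This is an illustration only, not the downloadable generator: `first_dims` and `inject_missing` are hypothetical names, and the sketch assumes that each incomplete point receives its own random count of missing values between 1 and v.

```python
import random

MISSING = -1.0  # marker for a missing value


def first_dims(points, d):
    """Keep only the first d dimensions of every point (e.g. s.25.0.0 -> s.20.0.0)."""
    return [p[:d] for p in points]


def inject_missing(points, pct, max_missing, seed=0):
    """Mark 1..max_missing randomly chosen dimensions as missing
    in each of the first pct% of points; later points stay complete."""
    rng = random.Random(seed)
    n_incomplete = len(points) * pct // 100
    out = []
    for k, p in enumerate(points):
        p = list(p)  # copy so the base data set is left unchanged
        if k < n_incomplete:
            count = rng.randint(1, max_missing)  # inclusive on both ends
            for i in rng.sample(range(len(p)), count):
                p[i] = MISSING
        out.append(p)
    return out
```

For example, a set in the style of s.20.50.3 would correspond to `inject_missing(first_dims(base, 20), pct=50, max_missing=3)`, where `base` holds the points of s.25.0.0.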


©2007 Dstar