
Quest Synthetic Data Generation Code

(0) Downloading and Compiling Tips:

  • Use "Save Link As..." (right mouse button with Netscape) to download the file. You can also click on the file name, and then use "File/Save...", but then the browser will try to display the file.

  • Use "uncompress" to uncompress the tar'd file, then use "tar -xvf" to extract the source code. This should work on any Unix system.

  • If you're using AIX (IBM's version of Unix), just run "make". If not, replace "xlC" with your favorite C++ compiler in the Makefile. Make sure the compiler treats these as C++ and not C files -- xlC for example requires the -+ flag to treat files with .c or .h extensions as C++ files.

  • While we tried to stick to standard C++, there are still subtle differences between systems. If you still experience problems compiling and can't figure out how to fix it, send an email to srikant@almaden.ibm.com.

  • If you are able to compile but find bugs in the program, send an email to srikant@almaden.ibm.com. If I can replicate the bug here, I can debug the program. However, in most cases the bug will not appear here, since the programs have been used extensively. In that case, you're on your own.

  • Good luck!

(1) Associations and Sequential Patterns:


assoc.gen.tar.Z (26,286 bytes)

See (0) Downloading and Compiling Tips.


Usage:

   gen lit|tax|seq [options]
   gen lit|tax|seq -help     For more detailed list of options

   lit: large (frequent) itemsets without taxonomies
   tax: large (frequent) itemsets with taxonomies
   seq: sequential patterns

Output Format:

There are two possible output formats for the data file, depending on whether or not the "-ascii" option is specified.

Without "-ascii" (binary): each transaction consists of <CustID, TransID, NumItems, List-Of-Items>. Each of these values is a 4-byte integer.

With "-ascii": each line contains a CustID, TransID, and Item. Each of these takes up 10 bytes, for a total of 33 bytes per line.
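
The two layouts above can be read back with a short script. The following is a minimal Python sketch, not part of the Quest distribution. The function names are hypothetical, and the byte order of the 4-byte integers is an assumption (little-endian by default here), since it depends on the machine that ran the generator; pass ">" for data written on a big-endian system.

```python
import struct
from typing import BinaryIO, Iterator, List, Tuple


def read_binary_transactions(
    stream: BinaryIO, byte_order: str = "<"
) -> Iterator[Tuple[int, int, List[int]]]:
    """Yield (cust_id, trans_id, items) from the binary data file.

    Assumption: integers are signed 32-bit values; byte_order is "<"
    (little-endian) or ">" (big-endian) and must match the generator's host.
    """
    header = struct.Struct(byte_order + "3i")  # CustID, TransID, NumItems
    while True:
        raw = stream.read(header.size)
        if not raw:
            return  # clean end of file
        if len(raw) < header.size:
            raise ValueError("truncated record header")
        cust_id, trans_id, num_items = header.unpack(raw)
        if num_items < 0:
            raise ValueError("negative NumItems; wrong byte order?")
        body = stream.read(4 * num_items)
        if len(body) < 4 * num_items:
            raise ValueError("truncated item list")
        items = list(struct.unpack(f"{byte_order}{num_items}i", body))
        yield cust_id, trans_id, items


def parse_ascii_line(line: str) -> Tuple[int, int, int]:
    """Parse one "-ascii" line into (cust_id, trans_id, item).

    Splitting on whitespace avoids depending on the exact column padding.
    """
    cust_id, trans_id, item = (int(field) for field in line.split())
    return cust_id, trans_id, item
```

Note that in the ASCII format one transaction spans several lines (one per item), so lines sharing the same CustID and TransID need to be grouped to rebuild the item list.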

Apart from the data file, this program also generates a pattern file. The pattern file has three parts:
  • A description of the data.

  • A list of items with high weights. (Recall that the weight corresponds to the probability that item will appear in an itemset.) Each line has the item number, followed by the weight.

  • A list of the itemsets/sequential patterns with high weight. (Recall that the weight corresponds to the probability that the itemset will appear in a transaction.) Each line has the weight, the expected confidence for rules generated from this itemset, and the itemset.

(2) Classification:


classification.gen.tar.Z (24,646 bytes)

See (0) Downloading and Compiling Tips.


The synthetic data is for a person database in which each person has the nine attributes described below.

     Attribute    Value
     ~~~~~~~~~    ~~~~~
     Salary       uniformly distributed from 20000 to 150000
     Commission   if Salary >= 75000, Commission = 0
                  else uniformly distributed from 10000 to 75000
     Age          uniformly distributed from 20 to 80
     Education    uniformly chosen from 0 to 4
     Car          make of the car, uniformly chosen from 1 to 20
     ZipCode      uniformly chosen from 9 available zipcodes
     HouseValue   uniformly distributed from 0.5*k*100000 to
                  1.5*k*100000, where 0 <= k <= 9 and depends on
                  the ZipCode
     YearsOwned   uniformly distributed from 1 to 30
     Loan         uniformly distributed from 0 to 500000
Attributes Education, Car, and ZipCode are categorical, and the rest are numeric. The attribute values are randomly generated. There is also a derived attribute, called Equity, defined as follows:
          if YearsOwned < 20
                Equity = 0
          else
                Equity = 0.1 * ( YearsOwned - 20 )
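
As an illustration of the attribute table and the Equity rule, here is a minimal Python sketch that draws one person record. It is not the code of the "pred" program. The names Person, generate_person, and equity are hypothetical, and two details are assumptions: numeric attributes are drawn as real numbers, and the zipcode-to-k mapping (k = zipcode index + 1) is invented for the example, because the text only says that k depends on the ZipCode.

```python
import random
from dataclasses import dataclass


@dataclass
class Person:
    salary: float
    commission: float
    age: float
    education: int
    car: int
    zipcode: int       # index 0..8 into the 9 available zipcodes
    house_value: float
    years_owned: float
    loan: float


def equity(years_owned: float) -> float:
    """Derived attribute, exactly as defined in the text above."""
    if years_owned < 20:
        return 0.0
    return 0.1 * (years_owned - 20)


def generate_person(rng: random.Random) -> Person:
    salary = rng.uniform(20000, 150000)
    # Commission is zero for high earners, otherwise uniform.
    commission = 0.0 if salary >= 75000 else rng.uniform(10000, 75000)
    zipcode = rng.randrange(9)
    k = zipcode + 1  # ASSUMPTION: hypothetical mapping from zipcode to k
    return Person(
        salary=salary,
        commission=commission,
        age=rng.uniform(20, 80),
        education=rng.randint(0, 4),   # categorical, 0..4 inclusive
        car=rng.randint(1, 20),        # categorical, 1..20 inclusive
        zipcode=zipcode,
        house_value=rng.uniform(0.5 * k * 100000, 1.5 * k * 100000),
        years_owned=rng.uniform(1, 30),
        loan=rng.uniform(0, 500000),
    )
```

Passing an explicit random.Random instance keeps runs reproducible, for example generate_person(random.Random(42)).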
We developed a series of classification functions of increasing complexity that used the above attributes to classify people into different groups. Tuples in the training set were assigned the group label by first generating the tuple and then applying the classification function on the tuple to determine the group to which the tuple belongs.

It is rarely the case that the boundaries between the groups are very sharp. To model fuzzy boundaries, the data generation program takes a perturbation factor p as an additional argument. After determining the values of the different attributes of a tuple and assigning it a group label, the values of the non-categorical attributes are perturbed. If the value of an attribute A for a tuple t is v and the range of values of A is a, then the value of A for t after perturbation becomes v + r*p*a, where r is a uniform random variable between -0.5 and +0.5.
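
The perturbation rule v + r*p*a can be sketched in a few lines of Python. This is an illustration of the formula only, not the generator's source; the function name perturb and its parameters are hypothetical, and the range a is taken here as the difference between the attribute's upper and lower bounds.

```python
import random


def perturb(value: float, low: float, high: float, p: float,
            rng: random.Random) -> float:
    """Return value + r*p*a, with r uniform in [-0.5, +0.5].

    low and high are the bounds of the attribute; a = high - low is its range.
    """
    a = high - low
    r = rng.uniform(-0.5, 0.5)
    return value + r * p * a
```

For example, Salary ranges from 20000 to 150000, so a = 130000. With p = 0.05, a salary is shifted by at most 0.5 * 0.05 * 130000 = 3250 in either direction. Because the group label is assigned before perturbation, tuples near a class boundary may end up on the other side of it, which is what blurs the boundary.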


Usage:

   pred [options]
   pred -help     For more detailed list of options

Output Format:

Each line contains a record of 54 bytes with 9 attributes.

If you have any questions, comments, or suggestions, please send mail to the QUEST group.