Quest Synthetic Data Generation Code
- Use "Save Link As..." (right mouse button with Netscape) to download
the file. You can also click on the file name, and then use
"File/Save...", but then the browser will try to display the file.
- Use "uncompress" to uncompress the tar'd file, then use "tar -xvf"
to extract the source code. This should work on any Unix system.
- If you're using AIX (IBM's version of Unix), just run "make".
If not, replace "xlC" with your favorite C++ compiler in the
Makefile. Make sure the compiler treats these as C++ and not C
files -- xlC for example requires the -+ flag to treat files
with .c or .h extensions as C++ files.
- While we tried to stick to standard C++, there are still subtle
differences between systems. If you still experience problems
compiling and can't figure out how to fix it, send an email to
- If you are able to compile but find bugs in the program, send
an email to
If I can replicate the bug here, I can debug the program.
However, in most cases, the bug may not appear here (since the
programs have been used extensively). In that case, you're on
your own.
- Good luck!
assoc.gen.tar.Z (26,286 bytes)
Downloading and Compiling Tips
gen lit|tax|seq [options]
gen lit|tax|seq -help For more detailed list of options
lit: large (frequent) itemsets without taxonomies
tax: large (frequent) itemsets with taxonomies
seq: sequential patterns
There are two possible output formats for the data file, depending on whether the
"-ascii" option is specified.
- Binary (without "-ascii"): the file consists of records of the form
<CustID, TransID, NumItems, List-Of-Items>. Each of these is a 4-byte integer.
- ASCII (with "-ascii"): each line contains a CustID, TransID, and Item. Each of these takes
up 10 bytes, for a total of 33 bytes per line.
Apart from the data file, this program also generates a pattern file.
The pattern file has three parts:
- A description of the data.
- A list of items with high weights. (Recall that the weight
corresponds to the probability that the item will appear in an itemset.)
Each line has the item number, followed by the weight.
- A list of the itemsets/sequential patterns with high weight. (Recall
that the weight corresponds to the probability that the itemset will
appear in a transaction.) Each line has the weight, the expected
confidence for rules generated from this itemset, and the itemset.
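The two data file layouts described above can be read back with a short script. The following is only a sketch in Python, not part of the distributed code. It assumes the 4-byte integers are little-endian (the byte order is not stated here and may differ on your machine, so it is a parameter), and that the three 10-byte ASCII fields are separated by whitespace. The function names read_binary and read_ascii are hypothetical.

```python
import struct
from typing import Iterable, Iterator, List, Tuple

Transaction = Tuple[int, int, List[int]]  # (CustID, TransID, items)


def read_binary(data: bytes, byteorder: str = "<") -> Iterator[Transaction]:
    """Yield transactions from records <CustID, TransID, NumItems, items...>.

    Every field is a 4-byte integer. The byte order is an assumption:
    pass ">" for big-endian data.
    """
    pos = 0
    while pos < len(data):
        cust_id, trans_id, num_items = struct.unpack_from(byteorder + "3i", data, pos)
        pos += 12  # three 4-byte header fields
        items = list(struct.unpack_from(f"{byteorder}{num_items}i", data, pos))
        pos += 4 * num_items
        yield cust_id, trans_id, items


def read_ascii(lines: Iterable[str]) -> Iterator[Transaction]:
    """Yield transactions from lines holding one CustID, TransID, Item each.

    Consecutive lines with the same (CustID, TransID) are grouped into
    one transaction.
    """
    current = None
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        cust_id, trans_id, item = (int(f) for f in fields)
        if current is not None and current[:2] == (cust_id, trans_id):
            current[2].append(item)
        else:
            if current is not None:
                yield current
            current = (cust_id, trans_id, [item])
    if current is not None:
        yield current
```

To use it, pass the whole contents of a binary file (opened in "rb" mode) to read_binary, or an open text file to read_ascii. Splitting on whitespace rather than slicing fixed columns keeps the ASCII reader independent of the exact padding inside each 33-byte line.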
- The synthetic data is for a person database in which each person has the nine
attributes described below.
Attribute    Description
Salary       uniformly distributed from 20000 to 150000
Commission   if Salary >= 75000, Commission = 0;
             else uniformly distributed from 10000 to 75000
Age          uniformly distributed from 20 to 80
Education    uniformly chosen from 0 to 4
Car          make of the car, uniformly chosen from 1 to 20
ZipCode      uniformly chosen from 9 available zipcodes
HouseValue   uniformly distributed from 0.5*k*100000 to
             1.5*k*100000, where 0 <= k <= 9 and depends on
YearsOwned   uniformly distributed from 1 to 30
Loan         uniformly distributed from 0 to 500000
Attributes educationLevel, car, and zipCode
are categorical, and the rest are numeric. The attribute values are randomly
generated. There is also a derived attribute, called Equity, defined as
    if YearsOwned < 20
        Equity = 0
    else
        Equity = 0.1 * ( YearsOwned - 20 )
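As an illustration of the attribute definitions above, here is a minimal Python sketch of how one person record might be drawn. It is not the distributed generator. Several points are assumptions: "uniformly distributed" is treated as a continuous draw and "uniformly chosen" as an integer draw; zipcodes are represented by the indices 0 to 8; and, because the text does not say what k depends on, the sketch takes a hypothetical k_for_zip mapping that derives k from the zipcode index. The function name generate_person is also hypothetical.

```python
import random
from typing import Callable, Dict


def generate_person(
    rng: random.Random,
    k_for_zip: Callable[[int], int] = lambda zipcode: zipcode,  # ASSUMPTION
) -> Dict[str, float]:
    """Draw one person record following the attribute table."""
    salary = rng.uniform(20000, 150000)
    # Commission is zero for high salaries, otherwise drawn uniformly.
    commission = 0.0 if salary >= 75000 else rng.uniform(10000, 75000)
    age = rng.uniform(20, 80)
    education = rng.randint(0, 4)   # categorical
    car = rng.randint(1, 20)        # categorical
    zipcode = rng.randrange(9)      # categorical; index of one of 9 zipcodes
    k = k_for_zip(zipcode)          # ASSUMPTION: k is derived from the zipcode
    house_value = rng.uniform(0.5 * k * 100000, 1.5 * k * 100000)
    years_owned = rng.uniform(1, 30)
    loan = rng.uniform(0, 500000)
    # Derived attribute, exactly as defined in the text above.
    equity = 0.0 if years_owned < 20 else 0.1 * (years_owned - 20)
    return {
        "salary": salary, "commission": commission, "age": age,
        "education": education, "car": car, "zipcode": zipcode,
        "house_value": house_value, "years_owned": years_owned,
        "loan": loan, "equity": equity,
    }
```

Passing a seeded random.Random instance, for example generate_person(random.Random(42)), makes the draws reproducible.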
We developed a series of classification functions of increasing
complexity that used the above attributes to classify people
into different groups.
Tuples in the training set were assigned the group label by first
generating the tuple and then applying the classification function on
the tuple to determine the group to which the tuple belongs.
It is rarely the case that the boundaries between the groups are
very sharp. To model fuzzy boundaries, the data generation
program takes a perturbation factor p as an additional argument.
After determining the values of different attributes of a tuple and
assigning it a group label, the values for non-categorical
attributes are perturbed. If the value of an attribute A for a tuple t is
v and the range of values of A is a, then the value of A for t
after perturbation becomes v + r*p*a, where r is a uniform
random variable between -0.5 and +0.5.
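The perturbation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the distributed code: the "range of values" a is taken to be the width of the attribute's interval (maximum minus minimum), only three numeric attributes are listed in the hypothetical NUMERIC_RANGES table, and classify stands in for whichever classification function is being used. Note that the group label is computed from the unperturbed values, as described above.

```python
import random
from typing import Callable, Dict, Tuple

# ASSUMPTION: "range" means max - min of the attribute's interval.
NUMERIC_RANGES = {
    "salary": 150000 - 20000,
    "age": 80 - 20,
    "loan": 500000 - 0,
}


def perturb(value: float, value_range: float, p: float, rng: random.Random) -> float:
    """Return v + r*p*a, with r uniform between -0.5 and +0.5."""
    r = rng.uniform(-0.5, 0.5)
    return value + r * p * value_range


def label_and_perturb(
    person: Dict[str, float],
    classify: Callable[[Dict[str, float]], str],  # hypothetical classifier
    p: float,
    rng: random.Random,
) -> Tuple[Dict[str, float], str]:
    """Assign the group label first, then perturb the numeric attributes."""
    group = classify(person)   # label uses the original values
    noisy = dict(person)       # leave the input record untouched
    for name, width in NUMERIC_RANGES.items():
        if name in noisy:
            noisy[name] = perturb(noisy[name], width, p, rng)
    return noisy, group
```

For example, with p = 0.05 an Age of 40 (range width 60) can move by at most 0.5 * 0.05 * 60 = 1.5, so the perturbed value lies between 38.5 and 41.5, while categorical attributes such as Car are left unchanged.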
pred -help For more detailed list of options
- Each line contains a record of 54 bytes with 9 attributes.