- Published 2017-10-18 in Hebei
KDD CUP 2004_Data Mining_Research Dataset.pdf
KDD CUP 2004
Keywords:
KDD CUP 2004, performance criteria, bioinformatics, quantum physics
Data format:
TEXT
Data description:
This year's competition focuses on data mining for a variety of
performance criteria such as Accuracy, Squared Error, Cross Entropy, and
ROC Area. As described on this website, there are two main tasks
based on two datasets from the areas of bioinformatics and quantum
physics.
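The four criteria named above can be illustrated with a minimal Python sketch. It uses common textbook definitions for binary labels and predicted probabilities; the exact formulas used for scoring are not given in this text, so the 0.5 threshold, the use of the mean for Squared Error and Cross Entropy, and all function names are assumptions made for illustration only.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases whose thresholded prediction matches the label.
    The 0.5 threshold is an assumption, not taken from the task description."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def mean_squared_error(y_true, y_prob):
    """Mean of (prediction - label)^2 over all cases."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predictions.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def roc_area(y_true, y_prob):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half.
    This pairwise loop is quadratic and only meant for small inputs."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC area needs at least one positive and one negative case")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Accuracy depends on a decision threshold, Squared Error and Cross Entropy depend on the predicted values themselves, and ROC Area depends only on how the cases are ranked, which is why a model tuned for one criterion is not necessarily best on another.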
The file you downloaded is a TAR archive that is compressed with GZIP.
Most decompression programs (e.g. WinZip) can decompress these
formats. If you run into problems, send us an email. The archive should
contain four files:
phy_train.dat: Training data for the quantum physics task (50,000 training cases)
phy_test.dat: Test data for the quantum physics task (100,000 test cases)
bio_train.dat: Training data for the protein homology task (145,751 lines)
bio_test.dat: Test data for the protein homology task (139,658 lines)
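As a rough sketch, the archive can also be unpacked and checked from Python with the standard `tarfile` module. The archive path passed in is hypothetical (the archive's file name is not given here), and the function names are illustrative only. Each member is written under its base name, so any directory prefix stored in the archive is ignored.

```python
import shutil
import tarfile
from pathlib import Path

# The four file names listed in the description above.
EXPECTED_FILES = {"phy_train.dat", "phy_test.dat", "bio_train.dat", "bio_test.dat"}

def extract_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed TAR archive into dest_dir.
    Returns the set of base names of the regular files that were written."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            name = Path(member.name).name  # drop any directory prefix
            source = tar.extractfile(member)
            with source, open(dest / name, "wb") as out:
                shutil.copyfileobj(source, out)
            written.add(name)
    return written

def missing_files(written):
    """Expected data files that were not found in the archive."""
    return EXPECTED_FILES - set(written)
```

An empty result from `missing_files(extract_archive(path, "data"))` suggests the download is complete, where `path` stands for wherever the archive was saved.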
The file formats for the two tasks are as follows.
Format of the Quantum Physics Dataset
Each line in the training and the test file describes one example. The
structure of each line is as follows:
The first element of each line is an EXAMPLE ID that uniquely describes
the example. You will need this EXAMPLE ID when you submit results.
The second element is the class of the example. Positive examples are
denoted by 1, negative examples by 0. Test examples have a ? in this
position. This is a balanced problem, so the target values are roughly half
0s and half 1s.
All following elements are feature values. There are 78 feature values in
each line.
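The line structure described above can be read with a short parser along the following lines. This is a sketch that assumes the elements are separated by whitespace, which the description does not state explicitly; the names `PhysicsExample` and `parse_physics_line` are hypothetical.

```python
from typing import List, NamedTuple, Optional

N_FEATURES = 78  # feature values per line, as stated in the description

class PhysicsExample(NamedTuple):
    example_id: str        # EXAMPLE ID, needed when submitting results
    label: Optional[int]   # 1 = positive, 0 = negative, None for "?" (test data)
    features: List[float]

def parse_physics_line(line: str) -> PhysicsExample:
    """Split one line into EXAMPLE ID, class, and feature values.
    Assumes whitespace-separated elements (an assumption, see lead-in)."""
    fields = line.split()
    if len(fields) != 2 + N_FEATURES:
        raise ValueError(
            f"expected {2 + N_FEATURES} elements, found {len(fields)}"
        )
    example_id, raw_label = fields[0], fields[1]
    if raw_label == "?":
        label = None
    elif raw_label in ("0", "1"):
        label = int(raw_label)
    else:
        raise ValueError(f"unexpected class value: {raw_label!r}")
    return PhysicsExample(example_id, label, [float(v) for v in fields[2:]])
```

Keeping the EXAMPLE ID as a string avoids altering it by accident before it is written back into a submission.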
Missing values: columns 22, 23, 24 and 46, 47, 48 use a value of 999 to
denote "not available", and columns 31 and 57 use 9999 to denote "not
available". These are the column numbers in the data tables, starting with
1 for the first column (the case ID numbers). If you remove the first two
columns (the case ID numbers and the ta
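The missing-value codes listed above can be masked before modelling. The sketch below works on a full row that still contains the case ID and class in its first two positions, so the 1-based column numbers from the text map directly to list index `column - 1`. The name `mask_missing` and the choice of `None` as the replacement are assumptions made for illustration.

```python
# 1-based column number -> code that denotes "not available" in that column,
# taken from the missing-value note above.
MISSING_CODES = {
    22: 999.0, 23: 999.0, 24: 999.0,
    46: 999.0, 47: 999.0, 48: 999.0,
    31: 9999.0, 57: 9999.0,
}

def mask_missing(row):
    """Return a copy of a full data row with missing-value codes replaced by None.
    `row` must include the case ID (column 1) and the class (column 2)."""
    cleaned = list(row)
    for column, code in MISSING_CODES.items():
        index = column - 1  # convert 1-based column number to 0-based index
        if float(cleaned[index]) == code:
            cleaned[index] = None
    return cleaned
```

Only the listed columns are checked, so a value of 999 in any other column is treated as a real measurement.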