Version 1.0.2 (October 4, 2016)
The WikiTableQuestions dataset is for the task of question answering on semi-structured HTML tables as presented in the paper:
Panupong Pasupat, Percy Liang.
Compositional Semantic Parsing on Semi-Structured Tables
Association for Computational Linguistics (ACL), 2015.
More details about the project: https://nlp.stanford.edu/software/sempre/wikitable/
Many files in this dataset are stored as tab-separated values (TSV) with the following special constructs:
- List items are separated by a pipe character (|).
- The following characters are escaped: newline (escaped as \n),
  backslash (as \\), and pipe (as \p).
- Note that pipes become \p so that doing x.split('|') will work.
- Consecutive whitespaces (except newlines) are collapsed into a single space.
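Based on the escaping rules above, a list-valued TSV field could be decoded with a small helper like the following (the function name is illustrative, not part of the dataset tools):

```python
def parse_tsv_field(field):
    """Split a list-valued field on '|' and undo the TSV escapes."""
    # Splitting on '|' is safe because literal pipes are escaped as \p.
    items = field.split('|')
    decoded = []
    for item in items:
        # Single left-to-right pass so that backslashes are handled once.
        out, i = [], 0
        while i < len(item):
            if item[i] == '\\' and i + 1 < len(item):
                nxt = item[i + 1]
                out.append({'n': '\n', 'p': '|', '\\': '\\'}.get(nxt, '\\' + nxt))
                i += 2
            else:
                out.append(item[i])
                i += 1
        decoded.append(''.join(out))
    return decoded
```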
The data/ directory contains the questions, answers, and the IDs of the
tables that the questions are asking about.
Each portion of the dataset is stored as a TSV file where each line contains one example.
Dataset Splits: We split the 22033 examples into multiple sets:
- Training data (14152 examples)
- Test data (4344 examples) -- the tables are not seen in the training data
- Additional data (3537 examples) -- the tables are seen in the training data.
  (Initially intended to be used as development data, this portion of the
  dataset has not been used in any experiment in the paper.)
For development, we split training.tsv into random 80-20 splits.
Within each split, the tables in the training portion and the test portion
(random-split-seed-*-test) are disjoint.
- The first 300 training examples.
- The first 300 training examples annotated with gold logical forms.
For our ACL 2015 paper:
- In development set experiments, we trained on the training portion and
  tested on the test portion of each random 80-20 split.
- In test set experiments, we trained on the full training data and tested
  on the test data.
*.examples files: The LispTree format of the dataset is used internally in
our SEMPRE code base. The *.examples files contain the same information as
the TSV files.
The csv/ directory contains the extracted tables, while a separate directory
contains the raw HTML data of the whole web page.
- Comma-separated table (the first row is treated as the column header).
  The escaped characters include: double quote (escaped as \") and
  backslash (as \\). Newlines are represented as quoted line breaks.
- Tab-separated table. The TSV escapes explained at the beginning are used.
- Human-readable column-aligned table. Some information was lost during
  data conversion, so this format should not be used as an input.
- Formatted HTML of just the table.
- Raw HTML of the whole web page.
- Metadata including the URL, the page title, and the index of the chosen
  table. (Only tables with the wikitable class are considered.)
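As a sketch, the comma-separated format above could be read with Python's csv module by declaring backslash as the escape character (the sample table below is made up for illustration, not taken from the dataset):

```python
import csv
import io

# Illustrative table using the escapes described above:
# backslash-escaped quotes (\") and quoted line breaks.
sample = '''Name,Motto
Alice,"She said \\"hi\\""
Bob,"Line one
Line two"
'''

with io.StringIO(sample) as f:
    # doublequote=False + escapechar='\\' makes the reader interpret \"
    # as an escaped quote instead of the default doubled-quote style.
    reader = csv.reader(f, escapechar='\\', doublequote=False)
    rows = list(reader)

# The first row is the column header; the rest are data rows.
header, body = rows[0], rows[1:]
```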
The conversion from HTML to CSV and TSV was done using
Its dependency is in the
Questions and tables are tagged using CoreNLP 3.5.2. The annotation is not perfect (e.g., it cannot detect the date "13-12-1989"), but it is usually good enough.
Tagged questions. Each line contains one example.
Tab-separated file containing the CoreNLP annotation of each table cell.
Each line represents one table cell.
Some of the fields are optional; header cells do not have these optional
fields.
evaluator.py is the official evaluator.
Usage: evaluator.py <tagged_dataset_path> <prediction_path>
tagged_dataset_path should be a dataset .tagged file containing the
target answers.
prediction_path should contain predictions from the model. Each line
should contain
  ex_id  item1  item2  ...
If the model does not produce a prediction for an example, just output
the ex_id alone.
Note that the resulting scores will differ from what SEMPRE produces:
SEMPRE also requires the prediction to have the same type as the target
value, while the official evaluator is more lenient.
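For illustration, a prediction file in the expected format could be produced as follows (the example IDs and answers are invented, and tab separators are assumed, following the TSV conventions above):

```python
# Hypothetical model output: example ID -> list of predicted answer items.
predictions = {
    'nt-0': ['1983'],
    'nt-1': ['Greece', 'Italy'],  # multi-item answers are listed in order
    'nt-2': [],                   # no prediction: output the ex_id alone
}

# One line per example: the ex_id followed by the predicted items.
lines = ['\t'.join([ex_id] + items) for ex_id, items in predictions.items()]
output = '\n'.join(lines) + '\n'

with open('predictions.tsv', 'w', encoding='utf-8') as f:
    f.write(output)
```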
1.0 - Fixed various bugs in datasets (encoding issues, number normalization issues)
0.5 - Added evaluator
0.4 - Added annotated logical forms of the first 300 examples /
      Renamed the CoreNLP tagged data to "tagged" to avoid confusion
0.3 - Repaired table headers / Added raw HTML tables / Added CoreNLP tagged data
0.2 - Initial release
For questions and comments, please contact Ice Pasupat at email@example.com