The ALASKA Benchmark is an end-to-end benchmark for Big Data Integration tasks. ALASKA Datasets are organized into product category verticals. For each vertical we provide:
The ALASKA Benchmark supports the following integration tasks:
The ALASKA Benchmark datasets consist of HTML web pages, collected from different web sources, and JSON extracted product specifications. You can download our datasets from the Downloads section.
Example of product specification
{
"<page title>": "Samsung Smart WB50F Digital Camera White Price in India with Offers & Full Specifications | PriceDekho.com",
"brand": "Samsung",
"dimension": "101 x 68 x 27.1 mm",
"display": "LCD 3 Inches",
"pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP",
"battery": "Li-Ion"
}
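As a minimal sketch of how such a specification could be read, the snippet below parses the example above with Python's standard `json` module and separates the `<page title>` entry from the remaining attribute/value pairs. The helper name `split_spec` is hypothetical and not part of the benchmark.

```python
import json
from typing import Dict, Tuple

# The example specification from the text, as a JSON document.
raw_spec = r"""
{
  "<page title>": "Samsung Smart WB50F Digital Camera White Price in India with Offers & Full Specifications | PriceDekho.com",
  "brand": "Samsung",
  "dimension": "101 x 68 x 27.1 mm",
  "display": "LCD 3 Inches",
  "pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP",
  "battery": "Li-Ion"
}
"""

PAGE_TITLE_KEY = "<page title>"


def split_spec(spec: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: return (page title, remaining attributes)."""
    attributes = {k: v for k, v in spec.items() if k != PAGE_TITLE_KEY}
    return spec.get(PAGE_TITLE_KEY, ""), attributes


spec = json.loads(raw_spec)
title, attributes = split_spec(spec)
print(title)
print(sorted(attributes))
```

Note that attribute values are free text: the `pixels` value here contains an embedded newline and a label, so downstream code should not assume values are already normalized.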
Product specifications we have collected have the following properties:
In the following sections we will present our datasets, providing a pre-integration profiling.
The CAMERA dataset contains 29,787 specifications collected from 24 web sources. The specifications contain ~4.6k distinct attribute names.
The CAMERA dataset contains 2 head sources (i.e., with a high number of specifications) and 22 tail sources (i.e., with a medium/low number of specifications).
The MONITOR dataset contains 16,662 specifications collected from 26 web sources. The specifications contain ~3.7k distinct attribute names.
The MONITOR dataset contains 2 head sources (i.e., with a high number of specifications) and 24 tail sources (i.e., with a medium/low number of specifications).
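The kind of profiling reported above (specifications per source, distinct attribute names) could be computed along these lines. This is only a sketch on invented toy data: the function name `profile` and the source names are hypothetical, and excluding `<page title>` from the attribute count is an assumption, since the text does not say how the reported figures treat it.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def profile(specs: Iterable[Tuple[str, Dict[str, str]]]) -> Tuple[Counter, Set[str]]:
    """Count specifications per source and collect distinct attribute names.

    `specs` yields (source name, specification dict) pairs.
    Assumption: the "<page title>" entry is not counted as an attribute.
    """
    per_source: Counter = Counter()
    attribute_names: Set[str] = set()
    for source, spec in specs:
        per_source[source] += 1
        attribute_names.update(name for name in spec if name != "<page title>")
    return per_source, attribute_names


# Invented toy data, for illustration only.
toy_specs = [
    ("source-a.example", {"<page title>": "Camera 1", "brand": "Samsung", "display": "LCD"}),
    ("source-a.example", {"<page title>": "Camera 2", "brand": "Canon"}),
    ("source-b.example", {"<page title>": "Camera 3", "manufacturer": "Canon"}),
]

per_source, attribute_names = profile(toy_specs)
print(per_source.most_common())   # sources ordered by number of specifications
print(len(attribute_names))       # number of distinct attribute names
```

Sorting sources by specification count, as `most_common()` does, is one simple way to see which sources fall in the head and which in the tail; the text does not state a numeric threshold between the two.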
We manually curated an extensive collection of labelled representative samples for each task in the ALASKA benchmark.
For the Entity Resolution task, we needed to identify which specifications represent the same real-world entity (e.g., Canon EOS D50).
The methodology we used to create the Entity Resolution ground truth is the following:
We provide as labelled data random subsets of the above ground truth of different sizes, dubbed SMALL, MEDIUM, LARGE and X-LARGE.
Each dataset is provided in CSV format with three columns: "left_spec_id", "right_spec_id" and "label".
The "spec_id" is a global identifier for a specification. It is the concatenation of the source name and the number of the specification's JSON file, separated by the special character sequence "//" (e.g., "www.ebay.com//1000" is the global identifier for the 1000.json file inside the www.ebay.com directory).
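The identifier scheme can be sketched as a pair of small helpers that build a "spec_id" and map it back to the relative path of its JSON file. The function names are hypothetical, and the sketch assumes source names never contain "//" themselves.

```python
from pathlib import PurePosixPath

SEPARATOR = "//"  # separator between source name and file number


def make_spec_id(source: str, number: int) -> str:
    """Build a global identifier such as 'www.ebay.com//1000'."""
    return f"{source}{SEPARATOR}{number}"


def spec_id_to_path(spec_id: str) -> PurePosixPath:
    """Map a spec_id to '<source>/<number>.json', relative to the dataset root."""
    source, separator, number = spec_id.partition(SEPARATOR)
    if not separator or not source or not number:
        raise ValueError(f"not a valid spec_id: {spec_id!r}")
    return PurePosixPath(source) / f"{number}.json"


print(make_spec_id("www.ebay.com", 1000))
print(spec_id_to_path("www.ebay.com//1000"))
```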
Each row of the CSV file represents a pair of specifications. label=1 means the row is a matching pair, whereas label=0 means the row is a non-matching pair.
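A labelled set in this format could be read with Python's standard `csv` module, as in the sketch below. The rows are invented for illustration (only "www.ebay.com//1000" appears in the text), and `read_pairs` is a hypothetical helper name.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Invented sample in the three-column format described above.
sample_csv = """left_spec_id,right_spec_id,label
www.ebay.com//1000,source-a.example//12,1
www.ebay.com//1000,source-b.example//7,0
source-a.example//12,source-b.example//7,0
"""


def read_pairs(handle: TextIO) -> List[Tuple[str, str, int]]:
    """Read (left_spec_id, right_spec_id, label) rows from an open CSV file."""
    pairs = []
    for row in csv.DictReader(handle):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        pairs.append((row["left_spec_id"], row["right_spec_id"], label))
    return pairs


pairs = read_pairs(io.StringIO(sample_csv))
matches = [(left, right) for left, right, label in pairs if label == 1]
non_matches = [(left, right) for left, right, label in pairs if label == 0]
print(len(matches), len(non_matches))
```

For a real file, the same function can be called on `open(path, newline="")`, which is the form the `csv` module documentation recommends for file objects.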
The methodology to create such labelled sets is the following:
N.B. Every larger labelled set includes the smaller ones (e.g. the LARGE labelled set includes the SMALL and the MEDIUM).
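The inclusion property can be checked with a short sketch like the following, which verifies that every pair of a smaller labelled set appears in a larger one with the same label. Treating pairs as unordered (so that left/right order does not matter) is an assumption of this sketch, and the function names and toy rows are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, int]  # (left_spec_id, right_spec_id, label)


def index_pairs(rows: Iterable[Row]) -> Dict[Tuple[str, str], int]:
    """Map each pair to its label; pairs are treated as unordered (assumption)."""
    return {tuple(sorted((left, right))): label for left, right, label in rows}


def includes(larger: Iterable[Row], smaller: Iterable[Row]) -> bool:
    """True if every pair of `smaller` occurs in `larger` with the same label."""
    big = index_pairs(larger)
    return all(big.get(pair) == label for pair, label in index_pairs(smaller).items())


# Invented toy rows, for illustration only.
small = [("a//1", "b//2", 1)]
medium = [("a//1", "b//2", 1), ("a//1", "c//3", 0)]
large = medium + [("b//2", "c//3", 0)]

print(includes(medium, small), includes(large, medium), includes(small, medium))
```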