URI design for RDF conversion of CSV-based data

From Data-gov Wiki



The sharing of comma-separated files is very common, and a common set of problems arises when receiving and processing data in this format. The RDF vocabulary described in this document provides interpretation instructions to a tool that converts tabular data into RDF appropriate for linked data publishing. This work is discussed in the paper Triplify challenge 2010 - lebo and williams, and a shorter introduction (also linked from http://www.data.gov/semantic/index) can be found at A Proposal for Governmental Data URIs. The page csv2rdf4lod describes a Java-based tool that follows the interpretation vocabulary described here and provides a set of shell script tools to facilitate automatic conversion.


Colleague in Distress: An example scenario

A colleague asks you to help merge some datasets. He brings his computer and shows you his initial attempts. You ask for the data and receive six attachments via email with an extensive narrative of where the datasets came from, what they describe, and what your colleague is trying to determine from a merged version of the data. Immediately, there is a concern about the level of trust you (and your colleague) can place in these six files. Where did they come from? Who created them, how did they create them, and what has happened to them before they appeared in your inbox? If we set these concerns aside and focus on the merge, another set of concerns arises. Are entities mentioned in one data file identical to entities mentioned in another? How can we merge these so that we can answer interesting questions that can only be addressed using aspects from multiple datasets? What other datasets, beyond these six, could also provide interesting insights?

Revisiting the data source concern, you reply to your colleague:

Colleague,

Do you have any pointers to the original sources of this data? URLs of the official sites would be best.
Thanks!
-Me

You realize that even if you have the URLs for the datafiles, you'll need a bit more information to "get your bearings" for where the datasets are coming from.

Colleague,

Following up on my data file URL request email, could you also provide the following for each of the six files you sent:
* Name of organization that created/released the dataset (with URL to their homepage)
* The organization's name for the dataset (with a URL to its homepage)
* The organization's name for the version of the dataset (a release date, or something like "Quarter 1, 2010", etc. -- whatever they use to denote it)

Thanks,
-Me

You sit down together, and your colleague starts mentioning domain terms while citing which columns in which datasets correspond or relate.

e.g. 
column "UNITID" in dataset "epa-thermal-emissions" corresponds to column "GENCODE" in dataset "eia".

You frantically try to collect this stream of associations.

Overview of URI patterns created

URI examples are listed from most global to most local.

4 Raw parameters:

Base:    http://logd.tw.rpi.edu
Source:  data-gov    
Dataset: 1530
Version: 2009-10-08

Dataset Version URI: {base uri}/source/{source organization}/dataset/{dataset identifier}/version/{version identifier}

Dataset URI:				http://logd.tw.rpi.edu/source/data-gov/dataset/1530

# Version-independent vocabulary
Local Class:				http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/Person
Raw         Property:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/raw/requester_name
Enhancement Property:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/enhancement/1/requester_name
Enhancement Property:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/enhancement/2/requester_name
Enhancement Property:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/vocab/enhancement/3/requester_name

# Version-independent value space
Typed           Resource:		http://logd.tw.rpi.edu/source/data-gov/dataset/1530/type/person/Connolly_Ward
Property-scoped Resource:		http://logd.tw.rpi.edu/source/data-gov/dataset/1530/value/requester_name/Connolly_Ward
Crutched        Resource:		http://logd.tw.rpi.edu/source/data-gov/dataset/1530/value/county/NY-Rensselaer	(raw:county = "Rensselaer", raw:state = "NY") (Property-scoped)

Dataset Version URI:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08

# Version-dependent value space (for rows)
Default     Row Instance:		http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08/thing_1
Primary Key Row Instance:		http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08/request/07-F-0001 (raw:request = "07-F-0001")

# Version-dependent value space (for values)
Untyped Implicit Bundle Resource:	http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08/location/thing_1	(conv:property_name = "location")
Typed   Implicit Bundle Resource:	http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08/location/point_1	(conv:property_name = "location", conv:type_name = "point")

Dereferencing

Dataset URI:				http://logd.tw.rpi.edu/source/data-gov/dataset/1530                           (returns conv:hasVersion)

Dataset Version URI:			http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08        (returns conv:numTriples, conv:usesProperty, conv:hasDumpFile, conv:inSPARQLEndpoint)
Dataset Version Dump File:		http://logd.tw.rpi.edu/source/data-gov/file/1530/data-gov-1530-2009-10-08     (content negotiate with this)

Typed Resource:				http://logd.tw.rpi.edu/source/data-gov/dataset/1530/type/person/Connolly_Ward (redirects)
Typed Resource Dump File:		http://logd.tw.rpi.edu/source/data-gov/file/1530/type/person/Connolly_Ward    (content negotiate with this)
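
As a hedged illustration of the "content negotiate with this" notes above, the Python sketch below builds (but does not send) an HTTP request that asks for an RDF serialization. The helper name rdf_request and the choice of text/turtle as the media type are assumptions made for illustration; which serializations the server actually offers is not stated on this page.

```python
from urllib.request import Request


def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) a GET request that asks for an RDF serialization.

    The media type is an assumption; the server may offer others.
    """
    return Request(uri, headers={"Accept": media_type})


req = rdf_request(
    "http://logd.tw.rpi.edu/source/data-gov/dataset/1530/version/2009-10-08"
)
# urllib.request.urlopen(req) would send the request and follow any redirects.
print(req.full_url, req.get_header("Accept"))
```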

Example CSV input files

The example CSV files listed here exhibit characteristics that the conversion process should handle. Each conversion parameter will be described by showing its effect when processing one or more of these example inputs. A real-life example from the Data.gov datasets will also be used to help justify the requirement.

Example Input 1

Name,Birth date,Death year,Wife,Number of children,Height (inches)
George Washington,02-22-1732,1799,Mary Ball Washington,4,72

Example Input 1.2

This example represents a new "version" of #Example Input 1. "Versions" have the same source and dataset identifier, but have different "underlying data". The structure across versions should be almost identical. A version that is too different should become a dataset in its own right by receiving a new dataset identifier.

Changes since the last version (#Example Input 1):

  • Washington's height changed.
  • A new entry was added for Alexander Hamilton.
  • Alexander Hamilton's Death year is missing.
  • A table title was added, pushing the header to line 2 instead of line 1.
  • Line 4 is empty, and no row instance should result from it.

Table 2.4 - Information about Presidents of the United States,,,,,
Name,Birth date,Death year,Wife, Number of children,Height (inches)
George Washington,02-22-1732,1799,Mary Ball Washington,4,75.5

Alexander Hamilton,01-11-1755,,Elizabeth Schuyler Hamilton,8,67
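
To make the characteristics of this input concrete, here is a minimal Python sketch of reading it. It is not the csv2rdf4lod implementation: the function read_data_rows and its header_row argument are hypothetical names, and trimming whitespace around header cells is an assumption. The sketch skips the table title, skips the empty line, and leaves out cells that have no value.

```python
import csv
import io

EXAMPLE_INPUT_1_2 = """\
Table 2.4 - Information about Presidents of the United States,,,,,
Name,Birth date,Death year,Wife, Number of children,Height (inches)
George Washington,02-22-1732,1799,Mary Ball Washington,4,75.5

Alexander Hamilton,01-11-1755,,Elizabeth Schuyler Hamilton,8,67
"""


def read_data_rows(text, header_row=2):
    """Yield (csv_row_number, record) for each non-empty data row.

    header_row is 1-based. Rows before it are treated as top matter,
    empty rows are skipped, and empty cells are omitted from the record.
    """
    header = None
    for row_number, cells in enumerate(csv.reader(io.StringIO(text)), start=1):
        if row_number < header_row:
            continue  # top matter, e.g. a table title
        if row_number == header_row:
            header = [cell.strip() for cell in cells]  # assumption: trim labels
            continue
        if not any(cell.strip() for cell in cells):
            continue  # empty line: nothing should result from it
        yield row_number, {h: c for h, c in zip(header, cells) if c.strip()}


rows = list(read_data_rows(EXAMPLE_INPUT_1_2))
for row_number, record in rows:
    print(row_number, record)
```

The two records keep their original CSV row numbers (3 and 5), and the record for Alexander Hamilton has no "Death year" entry.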

Example Input 2

Husband,Wife,Wedding Date
George Washington,Mary Ball Washington,01-06-1890

URI construction

URIs should follow http://tools.ietf.org/html/rfc3986

Datasets are named using URIs created by the concatenation of the following four components:

  • Base URI (e.g. "http://logd.tw.rpi.edu")
  • Source identifier (e.g. "data-gov", "census-gov")
  • Source's dataset identifier (with optional slash-separated sub-levels) (e.g. "1627", "1612/Army", "1612/Air_Force", "csp-asec/gp.series")
  • Dataset version identifier (by convention, a publication date, last-modified date, or counting number) (e.g. "1", "2010-Apr-09")
    • If a counting number is used, be sure that the last-modified date and sha1 are associated to address ambiguity issues.
    • Avoid using a "date obtained" for the version identifier
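
One way to gather the last-modified date and sha1 mentioned above is sketched below in Python. The helper name file_fingerprint is hypothetical, and how the two values are then attached to the dataset version is left open here.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_fingerprint(path):
    """Return (sha1 hex digest, last-modified time in UTC, ISO 8601) for a file."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return sha1.hexdigest(), modified.isoformat()
```

Recording both values lets two retrievals that share a counting-number version be told apart later.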

Putting all of the components together, the URI for the dataset becomes:

http://logd.tw.rpi.edu
data-gov
1627
2010-Apr-09

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/
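
The concatenation of the four components can be sketched in Python as follows. This is an illustration only, not the converter's code; the function names are hypothetical, and the /source/, /dataset/, and /version/ path segments are taken from the URI patterns listed in the overview.

```python
def dataset_uri(base_uri, source_id, dataset_id):
    """Version-independent dataset URI (hypothetical helper)."""
    return f"{base_uri.rstrip('/')}/source/{source_id}/dataset/{dataset_id}"


def dataset_version_uri(base_uri, source_id, dataset_id, version_id):
    """Dataset version URI; dataset_id may contain slash-separated sub-levels."""
    return f"{dataset_uri(base_uri, source_id, dataset_id)}/version/{version_id}"


print(dataset_version_uri("http://logd.tw.rpi.edu", "data-gov", "1627", "2010-Apr-09"))
print(dataset_version_uri("http://logd.tw.rpi.edu", "data-gov", "1612/Army", "1"))
```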

The construction of a dataset's URI is illustrated in the following:

The components used to name the dataset are described in the following RDF. A blank node is used because the dataset is not yet named. Its name is created using the components provided.

@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix :           <http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/params/raw/> .

:dataset a void:Dataset;
   conversion:base_uri           "http://logd.tw.rpi.edu"^^xsd:anyURI;
   conversion:source_identifier  "data-gov";
   conversion:dataset_identifier "1627";
   conversion:dataset_version    "2010-Apr-09";
.

The URI for the dataset also becomes a namespace for the vocabulary used and the instances mentioned in the dataset.

Vocabulary namespace

The vocabulary is within the following namespace. It does not incorporate the version because the structure should be the same when the underlying data changes. This allows applications to access newer versions of the data without modifications to their queries. This is tolerant of adding and removing columns, but will not be tolerant if the meaning of a column changes without changing its name.

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/

Several vocabularies are created for a single dataset, corresponding to the original "raw" conversion from CSV to RDF and subsequent (optional) enhancement steps. The namespaces for the enhancement vocabularies are created using the enhancement identifier, which is provided as an enhancement parameter (see #enhancement identifier). The property namespace for the raw conversion and the three property namespaces for enhancements 1, 2, and 3 are listed below.

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/raw/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/1/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/2/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/3/

The local name of a property URI is created from the label in the CSV header, subject to tolower(s/\W/_/g). If a column header is not present, column_N is used, where N is the column number (starting with 1). If a column header duplicates a previous column's header, the same property local name results with _N appended, where N is the occurrence number of the label (e.g., name for the first occurrence and name_2 for the second). The string used to create the local name of a property may also be specified with an enhancement parameter (see #pref_property_label). The properties originating from a CSV column labeled "Received Date" are listed below. The first is from the raw conversion from CSV to RDF, and the subsequent properties are created and used in subsequent enhancements.

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/raw/received_date
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/1/received_date
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/2/received_date
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/3/received_date
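
The naming rule for property local names can be sketched in Python as below. The function name property_local_names is hypothetical. The sketch assumes that \W is meant in its ASCII sense and that header labels are not trimmed before substitution; neither point is settled by the text above.

```python
import re


def property_local_names(headers):
    """Derive property local names from a list of CSV header cells.

    Non-word characters become underscores and the result is lowercased.
    An empty header becomes column_N (N is the 1-based column number).
    A repeated label gets _N appended, where N is its occurrence number.
    """
    seen = {}
    names = []
    for index, header in enumerate(headers, start=1):
        label = header if header else f"column_{index}"
        name = re.sub(r"\W", "_", label, flags=re.ASCII).lower()
        seen[name] = seen.get(name, 0) + 1
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return names


print(property_local_names(["Received Date", "Name", "", "Name"]))
```

Note that every non-word character is replaced individually, so a label such as "Height (inches)" yields consecutive and trailing underscores.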

When a second version of the dataset is converted from CSV to RDF, the namespace for the vocabulary will still be:

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/raw/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/raw/received_date

When either the first or second version of the dataset is enhanced by enhancement identifier 1, the namespace for the vocabulary will become the following.

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/1/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/1/received_date

A note on providing this as linked data: the raw:name and e3:name values will be most interesting to consumers, so only these two should be provided for casual requests. However, links should lead to traces that contain all values: raw:name, e1:name, e2:name, and e3:name.

Instances namespace

The dataset URI is used as the namespace for the URIs of the instances created for this dataset. The following URIs name the dataset and two instances within the dataset. A minimal method for naming instances is to use a counter, but if enhancement parameters are provided, the names can become more recognizable (see #Primary_key_column_parameter). The counter used counts the number of valid rows processed, NOT the row number of the CSV. Since this information is of interest and value, the row number is automatically associated with the instances using the ov:csvRow property (see #csvRow).

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/thing_1
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/George_Washington
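
The difference between the instance counter and the CSV row number can be illustrated with a short Python sketch. The function name name_row_instances is hypothetical, and only the default counter-based naming (thing_N) is shown; naming by primary key is not covered.

```python
VERSION_URI = "http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09"


def name_row_instances(version_uri, rows, first_row_number):
    """Return (instance URI, CSV row number) for each non-empty row.

    rows are the CSV rows that follow the header, and first_row_number
    is the 1-based CSV row number of rows[0]. The counter in thing_N
    advances only for valid rows, so it can fall behind the row number.
    """
    named = []
    valid_count = 0
    for offset, cells in enumerate(rows):
        if not any(cell.strip() for cell in cells):
            continue  # empty row: no instance, and the counter does not advance
        valid_count += 1
        named.append((f"{version_uri}/thing_{valid_count}", first_row_number + offset))
    return named


# Rows 3 to 5 of Example Input 1.2 (row 4 is empty).
example_rows = [["George Washington", "02-22-1732"], [], ["Alexander Hamilton", "01-11-1755"]]
for uri, csv_row in name_row_instances(VERSION_URI, example_rows, 3):
    print(uri, csv_row)
```

Here thing_2 is paired with CSV row 5, which is the kind of information the ov:csvRow annotation preserves.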

The URIs for instances created from cell values are placed within the following namespace. This is done to avoid collision between subjects and objects.

http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/values/
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/values/George_Washington
http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/values/Mary_Ball_Washington
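
A minimal Python sketch of minting such value URIs follows. The function name value_uri is hypothetical, and the rule for turning a cell value into a local name (spaces become underscores, remaining unsafe characters are percent-encoded) is an assumption inferred from the examples above rather than a documented rule.

```python
from urllib.parse import quote

VERSION_URI = "http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09"


def value_uri(version_uri, cell_value):
    """Mint a URI for a cell value inside the values/ namespace.

    Assumption: spaces become underscores and any other character that
    is not unreserved in RFC 3986 is percent-encoded.
    """
    local_name = quote(cell_value.strip().replace(" ", "_"), safe="")
    return f"{version_uri}/values/{local_name}"


subject = f"{VERSION_URI}/George_Washington"          # instance created for a row
obj = value_uri(VERSION_URI, "George Washington")     # instance created for a cell value
print(subject)
print(obj)
```

Because of the values/ segment, the row instance and the cell-value instance get distinct URIs even when they share a local name.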

(TODO: value/discriminator/ and value/additional/, from conversion:subjectDiscriminator and conversion:AdditionalDescription, respectively)

DC descriptions

From Data collection's base URI

#Data collection's base URI

From Data source identifier

#Data source identifier


The value of conversion:source_identifier is used to create a URI (e.g., <http://logd.tw.rpi.edu/source/data-gov>).

...
dcterms:contributor <http://logd.tw.rpi.edu/source/data-gov> .
<http://logd.tw.rpi.edu/source/data-gov> dcterms:identifier "data-gov" .

@DEPRECATED dcterms:source used to be asserted, but has been replaced by dcterms:contributor. This is because we were mis-using the meaning of dcterms:source.


From Data source's identifier for dataset

#Data source's identifier for dataset

From Dataset version

#Dataset version

VoID descriptions

http://rdfs.org/ns/void-guide

The following diagram illustrates the different void:subset hierarchies that are created when a dataset has a single CSV (e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1008) vs. multiple CSVs (e.g. http://logd.tw.rpi.edu/source/data-gov/dataset/1033). The void:dataDump is always asserted at the conversion level, but the conversion:num_triples is always asserted at the lowest level. In the single-CSV case these are the same level, while in the multi-CSV case they are not.

The following diagram illustrates the three levels of a VoID hierarchy when Dataset 1008 is converted with raw and enhancement parameters. The root void:Dataset is the dataset that data.gov named as "1008". The second level contains all versions of the dataset as well as the void:LinkSet. The only version dataset shown (2010-Jul-21) was converted as raw and was enhanced, creating two datasets at the third level. Each of these cites a void:exampleResource that was created during the conversion.


The following diagram illustrates the four levels of a VoID hierarchy when Dataset 1033 is converted with raw and enhancement parameters. The fourth level is created because the dataset comprises multiple CSVs. The first three levels are analogous to the one-CSV example just discussed for Dataset 1008, and the fourth level holds the datasets created from each CSV that Dataset 1033 provides. The blue-highlighted portions of the dataset URIs are the values of the conversion:subjectDiscriminator parameter and are placed immediately after the dataset identifier (the automation uses the file name of the CSV file, but any value for conversion:subjectDiscriminator can be used). Note that the void:dataDump descriptions remain on the third level -- dumps are not provided at the per-CSV granularity.

Annotations on rows and columns

Instances created for rows and columns are annotated with the properties ov:csvRow and ov:csvCol. These properties do not get promoted during enhancement. (ov is currently being used, but conv could be used.)

@prefix ov:   <http://open.vocab.org/terms/> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .

<http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/3/received_date>
   ov:csvCol 1;
.

<http://logd.tw.rpi.edu/source/data-gov/dataset/1627/version/2010-Apr-09/thing_1>
   ov:csvRow 2;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1627/vocab/enhancement/3/received_date> "2007-12-19"^^xsd:date 
.

Mappings between versions

Dataset 8 is updated daily.

Dataset 32 is updated hourly, Dataset 33 is updated daily, Dataset 34 is updated weekly.

URIs for rows need to be different across these versions.

e.g., Dataset 1491 version 2009-May-18 and 2010-Jan-20

Background

D2RQ has a URI construction scheme.

Provenance of datafile conversions contains an early discussion that led to the material on this page. It starts to address the need for separating the names for different enhanced versions of the datasets. The solutions proposed here are more elegant with respect to maintaining identity among the subjects and values across the enhancements.

Terminology

  • Organization (i.e. foaf:Organization)
    • Source Organization - Organization providing the dataset
    • Converting Organization - Organization converting the Source Organization's dataset.
  • Dataset - A set of similarly-structured records describing some aspect of the world.
  • Version - Corresponding to a "Release" or (traditional) holistic "Publication" of a Dataset. Different Versions of a dataset may have slightly different columnar structure and may have different underlying data (rows or values). For example, the Supreme Court Database publishes quarterly Versions.
  • Conversion - Different Conversions of the same Dataset Version will have identical columnar structure and will have identical underlying data. A Conversion is one of potentially many RDF graphs resulting from a single Version of a Dataset. The two primary conversions are "raw" and "e1", but many more enhancement conversions may be produced. A Conversion is related to a specific Version. The only URIs that change in a conversion are the predicate namespaces.
    • Raw conversion - The first Conversion of a Dataset Version. All RDF triples contain untyped literal values.
    • Enhanced conversion - The secondary Conversions of a Dataset Version. RDF triples may contain typed and resource values.
  • Underlying data - can change between Dataset Versions, but not between Conversions of a Dataset Version.
  • Top Matter - Content in a CSV that comes before the Header and Data Rows. Usually contains a title or a caption.
  • Header (a.k.a. column headers, header row) - Usually the first row in a CSV file that labels the data columns.
  • Data Row - A CSV row that contains data. Constitutes the majority of rows in a CSV file.
  • Bottom Matter - Content in a CSV that comes after the Data Rows. Usually contains footnotes or legends.
  • First row (i.e. row 1) - The very first row of the CSV file. Counting is one-based for the benefit of the data curator.
  • internal dataset namespace - the namespace created for this dataset. Since conversion is intended to be self-contained and linked later, predicates, classes, and instances are created within the internal dataset namespace.
  • Property - Binary directed relation. aka Predicate.
  • Property local name - The local name of the property created for the column. If conversion:label is not specified, the local name is created from the csv header. If the csv header is not present, it is named "column_n", where n is the index of the column.

Assumed prefix definitions

Throughout this document, the following namespaces are used:

@prefix rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:       <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:        <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms:    <http://purl.org/dc/terms/> .
@prefix scovo:      <http://purl.org/NET/scovo#> .
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix ov:         <http://open.vocab.org/terms/> .

For brevity, these namespaces are used in examples without re-definition.

Style conventions

  • red font is used to indicate that a particular conversion parameter or resulting triple/value is new or important for the current example.
  • blue font is used to indicate that a particular resource is identical before and after enhancement.
  • gray font is used for enhancement rdf:type triples that can be inferred from other enhancement parameters (usually those in red font).

Implementations

  • Java version (most mature) - see Csv2rdf4lod for errors that this reports.
  • Perl version

Related work

Tools:

Blog discussions:

Parameters

see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Enhancement-Parameters-Reference

Facts about URI design for RDF conversion of CSV-based data
dcterms:creator: Tim Lebo and Gregory Todd Williams
dcterms:modified: 2011-1-21