Generating RDF from data.gov

From Data-gov Wiki

Infobox (Tech Report)
  • name: Generating RDF from data.gov

  • description: technical details for constructing RDF data from data.gov datasets.
  • creator(s): Dominic DiFranzo, Li Ding
  • created: June 27, 2009
  • modified: 2010-6-25


Overview

Many of the datasets in data.gov are available as tables (spreadsheets). This makes it easy to translate the datasets into RDF by generating a triple for each table cell, where the row id is the subject, the column name is the predicate, and the cell content is the object. Our work adopted four principles: keep the translation minimal, let the translation meet the Web, make the translation extensible, and preserve knowledge provenance.

Under the first principle, we minimize the translation by (i) preserving the functional structure of the original tables and (ii) not attempting any further interpretation of the cell content.

Under the second principle, we keep the translated RDF friendly to Web users. We chose RDF/XML because it makes the translated RDF readable by RDF parsers as well as XML parsers, and queryable with SPARQL as well as XQuery. To let users browse the translated RDF with web browsers and Semantic Web browsers (such as Tabulator), we store it as a collection of small, linked partition files (<1MB each, to avoid out-of-memory problems).

We approach the third principle by using a semantic wiki to host user-contributed extensions. For example, every property created from a table column name can be dereferenced to the RDF version of a Semantic MediaWiki (SMW) page, and users can alter the definition of the property directly through the SMW editing interface.

Under the fourth principle, we preserve the knowledge provenance of the converted RDF documents by embedding metadata about their sources, creators, and creation times using the well-known Dublin Core and FOAF vocabularies.

For more details, see http://data-gov.tw.rpi.edu/wiki/Generating_RDF_from_data.gov.

The Problem

Datasets at data.gov are organized in the following structure:

  • Dataset: datasets are listed in a catalog; there are about 400 of them
  • Document: each dataset may be published in several documents via XYZ_access_point links. Each document typically contains a table.
  • Table: a table contains a table header and a collection of table rows
    • Properties: the table header (column names), typically in the first row
    • DataEntry: a table row

Here is an example. The data in the document is in CSV format; below is an excerpt of the raw data.

URL,Title,Agency,Category,Date Released
http://www.data.gov/details/2,Patent Application Bibliographic Data (2009),US Patent and Trademark Office,Business Enterprise,15-Mar-2001
http://www.data.gov/details/3,Patent Grant Bibliographic Data (2009),US Patent and Trademark Office,Business Enterprise,January 1976
http://www.data.gov/details/4,Next Generation Radar (NEXRAD) Locations,National Oceanic and Atmospheric Administration,Geography and Environment,1991
  • Properties include:
 URL,Title,Agency,Category,Date Released
  • A DataEntry is a row, and it can be further parsed into pairs of (property, value):
URL=http://www.data.gov/details/2,
Title=Patent Application Bibliographic Data (2009),
Agency=US Patent and Trademark Office,
Category=Business Enterprise,
Date Released=15-Mar-2001
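
The split into Properties and DataEntries can be sketched in a few lines of Python. This is only an illustration built on the standard csv module, not the parser actually used for the conversion; the function name parse_table is hypothetical.

```python
import csv
import io

# Two rows from the excerpt above, plus the header row.
RAW = """URL,Title,Agency,Category,Date Released
http://www.data.gov/details/2,Patent Application Bibliographic Data (2009),US Patent and Trademark Office,Business Enterprise,15-Mar-2001
http://www.data.gov/details/3,Patent Grant Bibliographic Data (2009),US Patent and Trademark Office,Business Enterprise,January 1976
"""


def parse_table(text):
    """Return (properties, entries).

    properties: the column names from the first row.
    entries: one list of (property, value) pairs per data row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    properties = rows[0]
    entries = [list(zip(properties, row)) for row in rows[1:]]
    return properties, entries


properties, entries = parse_table(RAW)
for prop, value in entries[0]:
    print(f"{prop}={value}")
```

Cell values are kept as plain strings here, in line with the idea of not interpreting cell content.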

Translation Design

Principle 1: Keep the translation minimal

  • each table entry is identified by a URI, and its cells are mapped to RDF triples describing that URI.
  • the properties are created directly from the column names (mapping to existing semantic properties could take non-trivial time)
  • all non-header cells are mapped to literals; we don’t create new URIs, except for cells that contain URLs or hyperlinks, which are converted to RDF resources

Principle 2: Let the translation meet the Web

  • we keep the translated file in RDF/XML format and ensure it can be consumed by RDF parsers as well as XML parsers, and queried with SPARQL as well as XQuery.
  • a very big table (e.g. a 7 GB CSV file) should be either partitioned into multiple small files or published through a scalable triple store, to avoid out-of-memory exceptions in Web tools
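
To illustrate the "XML parser" side of this principle, the sketch below reads a small RDF/XML fragment with Python's standard xml.etree.ElementTree, without any RDF library. The fragment is a shortened, hand-written sample in the style of the translated data files, and the helper name entry_titles is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOCAB_NS = "http://data-gov.tw.rpi.edu/vocab/p/92/"

# Shortened sample; a real data file carries many more properties per entry.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://data-gov.tw.rpi.edu/vocab/p/92/">
  <rdf:Description rdf:about="#entry00002">
    <title>Patent Application Bibliographic Data (2009)</title>
    <agency>US Patent and Trademark Office</agency>
  </rdf:Description>
</rdf:RDF>
"""


def entry_titles(root):
    """Map each rdf:about value to the text of its <title> child, if present."""
    titles = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        title = desc.find(f"{{{VOCAB_NS}}}title")
        if title is not None:
            titles[about] = title.text
    return titles


print(entry_titles(ET.fromstring(SAMPLE)))
```

This works only because the files use a flat, regular RDF/XML layout (one rdf:Description per entry); arbitrary RDF/XML would need a real RDF parser.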

Principle 3: Make the translation extensible

  • each property is mapped to a Semantic MediaWiki (SMW) page under the "Property" namespace, so users can contribute definitions of the property, e.g. set dgp:Title to be a subPropertyOf rdfs:label
  • each instance is mapped to an SMW page, and users can link it to DBpedia.org definitions.
  • the dataset has an rdfs:seeAlso link to its homepage on SMW, so users can register additional derived datasets.
  • to link to other linked datasets, we can add assertions, e.g. owl:equivalentProperty for properties and owl:sameAs for instances.
  • self-contained: users are not required to load an external ontology to understand the generated RDF data files.


Principle 4: Preserve knowledge provenance

  • each dataset in RDF is extracted from a spreadsheet file
  • a dataset may be stored in many documents
  • a document is part of a dataset, generated at certain time.
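
As a rough illustration, the per-document provenance (type, creation time, creator) could be emitted as N-Triples lines like this. The function name document_metadata is hypothetical, and the plain-literal timestamp is an assumption modelled on the dc:date values in the example files.

```python
from datetime import datetime, timezone

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"
CREATOR = "http://tw.rpi.edu/"


def document_metadata(document_url, created=None):
    """Return N-Triples lines describing one generated RDF document."""
    created = created or datetime.now(timezone.utc)
    # Normalize to UTC and format as e.g. 2009-06-29T14:22:57Z.
    stamp = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        f"<{document_url}> <{RDF_TYPE}> <{FOAF_DOCUMENT}> .",
        f'<{document_url}> <{DC}date> "{stamp}" .',
        f"<{document_url}> <{DC}creator> <{CREATOR}> .",
    ]


for line in document_metadata("http://data-gov.tw.rpi.edu/raw/92/index.rdf"):
    print(line)
```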

Technical Details

parse raw data

parser for spreadsheet file

  • we use a simple CSV parser

organization of RDF files and notions

  • given a dataset X,
    • it has an id <DATASET-ID> and an alphanumeric name <DATASET-NAME>
    • we use a base directory for all RDF files
 <DIR-DATASET-BASE>::= "http://data-gov.tw.rpi.edu/raw/<DATASET-ID>"
  • there is always an <index.rdf> file, which is the entry point
 <URL-INDEX>::= <DIR-DATASET-BASE>/index.rdf
  • we maintain a single RDF file (in RDF/XML format) for the translated dataset
 <URL-COMPLETE-RDF>::= <DIR-DATASET-BASE>/data.rdf
  • we maintain a single RDF file (in N-Triples format) for the translated dataset
 <URL-COMPLETE-NT>::= <DIR-DATASET-BASE>/data.nt
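
The naming scheme above is simple string composition. A minimal sketch follows; the function name dataset_urls and the dictionary keys are illustrative only.

```python
RAW_BASE = "http://data-gov.tw.rpi.edu/raw"


def dataset_urls(dataset_id):
    """Build the fixed file URLs for one dataset from its id."""
    base = f"{RAW_BASE}/{dataset_id}"       # <DIR-DATASET-BASE>
    return {
        "index": f"{base}/index.rdf",        # <URL-INDEX>
        "complete_rdf": f"{base}/data.rdf",  # <URL-COMPLETE-RDF>
        "complete_nt": f"{base}/data.nt",    # <URL-COMPLETE-NT>
    }


print(dataset_urls(92)["index"])
```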

partition files

  • we only build partitions for big RDF data. The partition files for the translated dataset are stored as a collection of small (<1MB) RDF files (in RDF/XML format).
  • data files each store a selection of DataEntries. Their size is controlled (<1MB).
  <URL-DATA-XYZ>::= <DIR-DATASET-BASE>/data_<0-PADDED-NUMBER>.rdf
  • link files store (i) links from a Dataset to data files; or (ii) links from a Dataset to link files. They are partitioned as well.
  <URL-LINK-XYZ>::= <DIR-DATASET-BASE>/link_<0-PADDED-NUMBER>.rdf
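
One possible way to keep data files under the size limit is a greedy split over already-serialized entries, sketched below. This is an assumption about how such a partitioner could work, not a description of the actual tool; it ignores the fixed overhead of the RDF/XML header and footer, and the name partition_entries is hypothetical.

```python
def partition_entries(serialized_entries, max_bytes=1_000_000):
    """Greedily group serialized entries so each group stays within max_bytes.

    An entry that is larger than max_bytes on its own still gets a
    partition of its own rather than being split.
    """
    partitions, current, current_size = [], [], 0
    for entry in serialized_entries:
        size = len(entry.encode("utf-8"))
        if current and current_size + size > max_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        partitions.append(current)
    return partitions


# Five 40-byte entries with a 100-byte limit are grouped as 2 + 2 + 1.
groups = partition_entries(["x" * 40] * 5, max_bytes=100)
print([len(g) for g in groups])
```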

linking files

  • the index file links to the two single RDF files
  • the index file links to the <link files> or <data files> among the partition files

DataEntry

  • we assign each DataEntry a URI
 <URI-ENTRY-XYZ> ::= <URL-DATA-XYZ>#row_<0-PADDED-ROW-NUMBER>   #row number in the table
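
A small sketch of composing a partition data file URL and an entry URI from the grammar above. The padding width of 5 digits is an assumption (the grammar does not state a width), and both function names are hypothetical.

```python
def data_file_url(dataset_id, partition_number, width=5):
    """<URL-DATA-XYZ>: a partition data file under the dataset base directory."""
    return (
        f"http://data-gov.tw.rpi.edu/raw/{dataset_id}"
        f"/data_{partition_number:0{width}d}.rdf"
    )


def entry_uri(data_url, row_number, width=5):
    """<URI-ENTRY-XYZ>: the data file URL plus a zero-padded row fragment."""
    return f"{data_url}#row_{row_number:0{width}d}"


print(entry_uri(data_file_url(92, 1), 2))
```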


RDF translation

translate properties to RDF

  • we create a namespace for each dataset table
  <PROPERTY-NS> ::= "http://data-gov.tw.rpi.edu/vocab/p/<DATASET-ID>/<FILEID>/<TABLEID>"
  • we normalize each column name to form the property's local name
    • replace every sequence of non-alphanumeric characters with "_" (ensures a valid IRI)
    • convert to lower case (for better integration)
    • detect purely numeric column names and add alphabetic characters so they form valid URIs
    • capitalize the first character (required by SMW)
    • trim "_" from both ends of the string (required by SMW) (done in the above line)
  • definition of properties will be stored in #index file
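
The normalization steps can be sketched as follows. Several details are assumptions rather than documented behaviour: the "col_" prefix for purely numeric names, treating only ASCII letters and digits as alphanumeric, and trimming "_" before capitalizing. The function name normalize_property is hypothetical.

```python
import re


def normalize_property(column_name, numeric_prefix="col_", capitalize=True):
    """Turn a raw column header into a property local name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", column_name)
    name = name.lower()
    # Trim underscores left over from leading/trailing punctuation or spaces.
    name = name.strip("_")
    # A purely numeric name gets an alphabetic prefix (prefix is an assumption).
    if name.isdigit():
        name = numeric_prefix + name
    # SMW page titles start with an upper-case character.
    if capitalize and name:
        name = name[0].upper() + name[1:]
    return name


print(normalize_property("Date Released"))
print(normalize_property("Date Released", capitalize=False))
```

With capitalize=False the result has the all-lower-case form seen in the example data file (e.g. date_released).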


translate cells of DataEntry to RDF node

  • detect URLs and hyperlinks, and convert them into RDF resources
 hyperlink: preg_match("/^\s*<a +href=\"\s*([^\"]+)\s*\">\s*(ht|f)tps?[^<]+<\/a>\s*$/"
 url: preg_match('/\s*(h?[t|f]tps?:[^\s<>"\']+)/'
  • all remaining cells are converted directly to RDF literals without parsing (UTF-8 encoding and XML escaping may be applied)
  • translated cells are stored in #data file
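
The cell classification can be approximated in Python as below. The patterns are simplified adaptations, not the original PHP expressions, and the sketch assumes a cell becomes a resource only when the whole cell is a URL or a single hyperlink, so that cells mixing text and links stay literals. The name cell_to_node is hypothetical.

```python
import re

# A whole-cell hyperlink whose link text is itself a URL; group 1 is the href.
HYPERLINK = re.compile(r'\s*<a +href="\s*([^"]+?)\s*">\s*(?:ht|f)tps?[^<]+</a>\s*')
# A whole-cell bare URL (http, https, ftp, ftps); group 1 is the URL.
URL = re.compile(r'\s*((?:ht|f)tps?:[^\s<>"\']+)\s*')


def cell_to_node(cell):
    """Return ("resource", uri) or ("literal", original_text) for one cell."""
    match = HYPERLINK.fullmatch(cell) or URL.fullmatch(cell)
    if match:
        return ("resource", match.group(1))
    return ("literal", cell)


print(cell_to_node("http://www.data.gov/details/2"))
print(cell_to_node("15-Mar-2001"))
```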

Data and Additional Metadata

  • for all documents, add the following triples
<DOCUMENT-URL>  <rdf:type>  <foaf:Document>
<DOCUMENT-URL>  <dc:date>   "...."  # current time in xsd:dateTime format...
<DOCUMENT-URL>  <dc:creator> <http://tw.rpi.edu/>
  • for the index file <URL-INDEX>, add the following triples
// provenance
<URL-INDEX>#me  <rdf:type>  <dgtwc:Dataset>
<URL-INDEX>#me  <dc:source>  <ORIGINAL-DATAFILE-URL>      # where we got the dataset, there could be multiple files used
<URL-INDEX>#me  <rdfs:isDefinedBy>    <PROPERTY-NS>URLSHA1/<SHA1-OF-ORIGINAL-DATAFILE-URL>  # resource (a wiki page (RDF/XML) corresponding to where we got the dataset)
<URL-INDEX>#me  <rdfs:seeAlso>    dg:URLSHA1/<SHA1-OF-ORIGINAL-DATAFILE-URL>  # resource (a wiki page (HTML) corresponding to where we got the dataset)
<URL-INDEX>#me  <rdfs:isDefinedBy>    <...>  # (optional: experimental dataset)
<URL-INDEX>#me  <dc:date>  "..T..Z"      # (optional, the last modified time of the original dataset)

<URL-INDEX>#me  <rdfs:seeAlso>    dg:Dataset_<DATASET-ID>  #link to an RDF/XML page corresponding to the dataset
<URL-INDEX>#me  <rdfs:isDefinedBy>    wiki:Dataset_<DATASET-ID>  # link to a wiki page corresponding to where we can add annotation to the dataset

// some statistics
<URL-INDEX>#me  <dgtwc:number_of_entries>    "11"   # number of rows
<URL-INDEX>#me  <dgtwc:number_of_properties>    "21"  # number of columns
<URL-INDEX>#me  <dgtwc:number_of_triples>    "31"   # number of triples generated from table cell (excluding metadata)
 ...
// links
<URL-INDEX>#me  <dgtwc:complete_data>  <URL-COMPLETE-NT>  
<URL-INDEX>#me  <dgtwc:link_data>  <URL-LINK-ROOT>   # the root link file
//[DATA]: property definitions
<PROPERTY-NS><PROPERTY_NAME>  <rdf:type>   <rdf:Property>
<PROPERTY-NS><PROPERTY_NAME>  <rdfs:label>   "un-normalized column header"
...
  • for each link file <URL-LINK-XYZ>, add the following triples
<URL-INDEX>#me  <dgtwc:partial_data>  <URL-DATA-XYZ>   # linking to the partial data file
...
<URL-LINK-ROOT>  <dgtwc:imports>  <URL-LINK-XYZ>   # importing a link file
...
  • for each data file <URL-DATA-XYZ>, add the following triples
<URL-DATA-XYZ>  <dgtwc:partOf>  <URL-INDEX>#me
  • for each table entry in a data file <URL-DATA-XYZ>, add the following triples
<URI-ENTRY-XYZ> <rdf:type>  <dgtwc:DataEntry>
//[DATA]: properties from table row
<URI-ENTRY-XYZ> <PROPERTY-NS><PROPERTY_NAME>   "...."
... 
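
For illustration, the per-entry triples could be serialized as N-Triples lines like this. It is a sketch, not the actual generator: the function names are hypothetical, the literal escaping covers only backslash, double quote, newline, and carriage return, and each cell is assumed to arrive already classified as resource or literal.

```python
DGTWC = "http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def nt_literal(text):
    """Quote a string as a plain N-Triples literal (basic escapes only)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def entry_triples(entry_uri, property_ns, cells):
    """cells: iterable of (local_name, value, is_resource) for one table row."""
    lines = [f"<{entry_uri}> <{RDF_TYPE}> <{DGTWC}DataEntry> ."]
    for local_name, value, is_resource in cells:
        obj = f"<{value}>" if is_resource else nt_literal(value)
        lines.append(f"<{entry_uri}> <{property_ns}{local_name}> {obj} .")
    return lines


row = [
    ("url", "http://www.data.gov/details/2", True),
    ("title", "Patent Application Bibliographic Data (2009)", False),
]
for line in entry_triples(
    "http://data-gov.tw.rpi.edu/raw/92/data.rdf#entry00002",
    "http://data-gov.tw.rpi.edu/vocab/p/92/",
    row,
):
    print(line)
```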

add namespace prefix mapping

  • dgtwc - "http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#"
  • dc - "http://purl.org/dc/elements/1.1/"
  • foaf - "http://xmlns.com/foaf/0.1/"

Example Results

index file

http://data-gov.tw.rpi.edu/raw/92/index.rdf

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dgtwc="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#" > 
  <rdf:Description rdf:about="http://data-gov.tw.rpi.edu/raw/92/index.rdf">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <dc:date>2009-06-29T14:22:57Z</dc:date>
    <dc:creator rdf:resource="http://tw.rpi.edu/"/>
    
  </rdf:Description>
  <rdf:Description rdf:about="http://data-gov.tw.rpi.edu/raw/92/index.rdf#me">
    <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#Dataset"/>
    <dc:source rdf:resource="http://www.data.gov/data_gov_catalog.csv"/>
    <dc:date>2009-07-07T14:22:57Z</dc:date>
    <rdfs:isDefinedBy rdf:resource="http://data-gov.tw.rpi.edu/vocab/URLSHA1/365F337894D54C8945506E45F0B697A561DA76BF"/>
    <rdfs:seeAlso rdf:resource="http://data-gov.tw.rpi.edu/wiki/URLSHA1/365F337894D54C8945506E45F0B697A561DA76BF"/>
    
    <dgtwc:number_of_triples rdf:datatype="http://www.w3.org/2001/XMLSchema#long">18071</dgtwc:number_of_triples>
    <dgtwc:number_of_properties rdf:datatype="http://www.w3.org/2001/XMLSchema#long">52</dgtwc:number_of_properties>
    <dgtwc:number_of_entries rdf:datatype="http://www.w3.org/2001/XMLSchema#long">378</dgtwc:number_of_entries>

    <dgtwc:complete_data rdf:resource="http://data-gov.tw.rpi.edu/raw/92/data.rdf"/>
    <dgtwc:complete_data rdf:resource="http://data-gov.tw.rpi.edu/raw/92/data.nt"/>
    <dgtwc:link_data rdf:resource="http://data-gov.tw.rpi.edu/raw/92/link_root.rdf"/>
  </rdf:Description>


 <rdf:Description rdf:about="http://data-gov.tw.rpi.edu/raw/92/index.rdf#url">
  <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  <rdfs:label>URL</rdfs:label>
 </rdf:Description>
 <rdf:Description rdf:about="http://data-gov.tw.rpi.edu/raw/92/index.rdf#title">
  <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
  <rdfs:label>Title</rdfs:label>
 </rdf:Description>

</rdf:RDF>

link file

http://data-gov.tw.rpi.edu/raw/92/link_root.rdf

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dgtwc="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#" 
    xml:base = "http://data-gov.tw.rpi.edu/raw/92/link_root.rdf"  > 
  <rdf:Description rdf:about="">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <dc:date>2009-06-29T14:22:57Z</dc:date>
    <dc:creator rdf:resource="http://tw.rpi.edu/"/>
<!--
    <dgtwc:imports rdf:resource="http://data-gov.tw.rpi.edu/raw/92/link_00001.rdf"/>
-->
  </rdf:Description>
  <rdf:Description rdf:about="http://data-gov.tw.rpi.edu/raw/92/index.rdf#me">
    <dgtwc:partial_data rdf:resource="http://data-gov.tw.rpi.edu/raw/92/data.rdf"/>
  </rdf:Description>  
</rdf:RDF>

data file

http://data-gov.tw.rpi.edu/raw/92/data.rdf

<?xml version="1.0"?>
<rdf:RDF
    xmlns:dgtwc="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#" 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://data-gov.tw.rpi.edu/vocab/p/92/" 
    xml:base="http://data-gov.tw.rpi.edu/raw/92/data.rdf"> 
  <rdf:Description rdf:about="">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
    <dc:date>2009-06-29T14:22:57Z</dc:date>
    <dc:creator rdf:resource="http://tw.rpi.edu/"/>
  </rdf:Description>
  
 <rdf:Description rdf:about="#entry00002">
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
  <url rdf:resource="http://www.data.gov/details/2"/>
  <title>Patent Application Bibliographic Data (2009)</title>
  <agency>US Patent and Trademark Office</agency>
  <category>Business Enterprise</category>
  <date_released>15-Mar-2001</date_released>
  <date_updated>Thursdays</date_updated>
  <time_period>Single date of official pre-grant publication (each Thursday)</time_period>
  <frequency>weekly</frequency>
  <description>Contains the bibliographic text (i.e., front page) of each patent application (non-provisional utility and plant) published weekly (Thursdays) organized by Calendar Year (January through December). Excludes images/drawings. This file is a subset of the Patent Application Data/XML Version 4.2 ICE (Text Only). The file format is eXtensible Markup Language (XML) in accordance with the Patent Application Version 4.2 International Common Element (ICE) Document Type Definition (DTD).  Available on publication day (Thursdays). Approximately 2.7 MB per week (compressed). These product files are available at no charge from: &lt;a href="https://eipweb.uspto.gov/2009/PatentApplBibICEXML"&gt;https://eipweb.uspto.gov/2009/PatentApplBibICEXML&lt;/a&gt;  No order/order form is required. This product includes an ipabyyyymmdd_wknn.zip file for each week [where "yyyymmdd" is a Thursday publication date and "nn" is a two-digit, fixed-length number (with leading zero) representing the sequentially-numbered week of the year]. Within each weekly zip file are (3) files:  ipabyyyymmdd.xml (Bibliographic information in XML ICE); ipabyyyymmddlst.txt (List of published patent application numbers in ascending order); and ipabyyyymmddrpt.html (Statistical/summary report). For more information about these files:  &lt;a href="http://www.uspto.gov/web/menu/patdata.html"&gt;http://www.uspto.gov/web/menu/patdata.html&lt;/a&gt;</description>
  <data_gov_data_category_type>Raw Data Catalog</data_gov_data_category_type>
  <specialized_data_category_designation>Administrative</specialized_data_category_designation>
  <keywords>Patent, Intellectual Property, Inventor, Innovation, patent application, free product files, bibliographic, federal data download, national, international, federal datasets, United States Patent and Trademark Office, US Patent and Trademark Office, IP, search, file, register, appeal, Department of Commerce, invention, rights</keywords>
  <citation>USPTO Patent Application Bibliographic Data &lt;a href="https://eipweb.uspto.gov/2009/PatentApplBibICEXML/"&gt;https://eipweb.uspto.gov/2009/PatentApplBibICEXML/&lt;/a&gt;</citation>
  <agency_program_page>Patent Application Bibliographic Data &lt;a href="http://www.uspto.gov/main/patents.htm"&gt;http://www.uspto.gov/main/patents.htm&lt;/a&gt;</agency_program_page>
  <agency_data_series_page rdf:resource="http://www.uspto.gov/web/menu/patdata.html"/>
  <unit_of_analysis>Patent Application</unit_of_analysis>
  <granularity>address-level</granularity>
  <geographic_coverage>National and international</geographic_coverage>
  <collection_mode>person/paper and person/computer</collection_mode>
  <data_collection_instrument rdf:resource="http://www.uspto.gov/web/patents/howtopat.htm"/>
  <data_dictionary_variable_list rdf:resource="http://www.uspto.gov/web/offices/ac/ido/oeip/sgml/st32/redbook/rb2004/rb2004.html"/>
  <applicable_agency_information_quality_guideline_designation>Department of Commerce/United States Patent and Trademark Office</applicable_agency_information_quality_guideline_designation>
  <data_quality_certification>Yes</data_quality_certification>
  <privacy_and_confidentiality>Yes</privacy_and_confidentiality>
  <technical_documentation rdf:resource="http://www.uspto.gov/web/offices/ac/ido/oeip/sgml/st32/redbook/rb2004/rb2004.html"/>
  <xml_access_point rdf:resource="https://eipweb.uspto.gov/2009/PatentApplBibICEXML/"/>
  <xml_file_size>2.7MB</xml_file_size>
 </rdf:Description>
</rdf:RDF>

Related work

csv2rdf4lod is an alternative approach to promoting data.gov tabular data to RDF.
