How to generate data-gov data

From Data-gov Wiki

Infobox (How-To)
  • name: How to generate data-gov data

  • description: a step-by-step tutorial for converting structured government data into data-gov RDF data
  • creator(s): Dominic DiFranzo, Li Ding
  • created: 2009/08/21
  • modified: 2010-5-17


Overview

This tutorial teaches how to convert data from CSV form to RDF form and how to dump the RDF data to a triple store, using the tools provided by the data-gov group. The generated RDF data can then be rendered with a Google Visualization.

Note:

  • This tutorial was originally designed for data-gov project insiders who convert CSV data into RDF. It has since been extended to help users who don't have access to the data-gov project server perform a similar conversion.
  • The process described here produces a minimal RDF conversion. Please stay tuned for our advances on conversion tools.

CSV file format requirement

  • Every CSV file must have a header row.
  • Every row (including the header row) must have the same number of columns.
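The two requirements above can be checked before running a conversion. The following is a minimal sketch, not part of the data-gov tools; the function name check_csv_shape is hypothetical. It can only confirm that a first row exists and that every row has the same width; it cannot tell whether the first row really is a header.

```python
import csv
import io


def check_csv_shape(text):
    """Return a list of problems found in CSV text; an empty list means it passes."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty, so there is no header row"]
    width = len(rows[0])  # the first row is treated as the header
    problems = []
    # Row numbers are 1-based and count the header as row 1.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(
                f"row {number} has {len(row)} columns, expected {width}"
            )
    return problems


good = "state,year,amount\nNY,2009,10\nCA,2009,20\n"
bad = "state,year,amount\nNY,2009\n"
print(check_csv_shape(good))
print(check_csv_shape(bad))
```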

General Translation Strategy

  • Each element in the header row is translated to an RDF property in the output file.
  • Every element in a non-header row yields a triple. The rdf:subject is a URI based on an automatically generated unique row id, the rdf:predicate is the RDF property generated from the header of the same column, and the rdf:object is the cell value itself.
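The strategy can be illustrated with a short sketch that emits one N-Triples line per data cell, using the property derived from the cell's column header as the predicate. This is not the data-gov converter: the namespaces, the "#entry" row-id pattern, and the function name csv_to_ntriples are assumptions made only for this example.

```python
import csv
import io

# Hypothetical namespaces, chosen only for this illustration.
BASE = "http://example.org/dataset/403"
PROP_NS = "http://example.org/vocab/403/"


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(text, base=BASE, prop_ns=PROP_NS):
    """Turn CSV text into a list of N-Triples lines, one per data cell."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # One property URI per header cell; spaces are replaced to keep the URI valid.
    properties = [prop_ns + name.strip().replace(" ", "_") for name in header]
    triples = []
    for row_id, row in enumerate(data, start=1):
        # The subject is built from an automatically generated row id.
        subject = f"<{base}#entry{row_id}>"
        for prop, cell in zip(properties, row):
            triples.append(f'{subject} <{prop}> "{escape_literal(cell)}" .')
    return triples


for line in csv_to_ntriples("state,amount\nNY,10\nCA,20\n"):
    print(line)
```

A table with two columns and two data rows yields four triples, two for each row.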

Web based CSV conversion

Data-gov offers an online RESTful web service for converting CSV data into RDF. It handles only some CSV files and limits the input file size. Please use it with caution and only on small CSV files (less than 1M). For larger CSV files, please use #JAVA based CSV conversion.

To use the service, go to http://data-gov.tw.rpi.edu/ws/csv2rdf.html and use the form on that page to run the dataset conversion.

You may also call the service directly by forming a RESTful URI with the following parameters.

parameter   meaning
url         the URL of the CSV file
xmlbase     the xmlbase URL for the converted RDF file; affects the namespace of generated URIs
propns      the namespace URL of the converted properties; affects the namespace of generated properties
output      the output format: rdf/xml (default) or nt
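As a sketch of how such a URI could be assembled, the snippet below URL-encodes the four parameters from the table. The service endpoint is a placeholder: this page names only the form page, not the address that accepts the parameters, so SERVICE and the function name build_request_url are assumptions to be replaced with the real values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; replace it with the actual service address.
SERVICE = "http://example.org/ws/csv2rdf"


def build_request_url(csv_url, xmlbase=None, propns=None, output=None,
                      service=SERVICE):
    """Build a request URI from the parameters listed in the table above."""
    params = {"url": csv_url}
    # Optional parameters are sent only when a value is supplied.
    for name, value in (("xmlbase", xmlbase), ("propns", propns),
                        ("output", output)):
        if value is not None:
            params[name] = value
    return service + "?" + urlencode(params)


request = build_request_url(
    "http://www.whitehouse.gov/omb/budget/fy2010/assets/receipts.csv",
    output="nt",
)
print(request)
# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(request).query))
```

Encoding the values matters because the url parameter is itself a URL and contains characters such as ":" and "/".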


JAVA based CSV conversion

Prerequisites

  • We assume you have installed JDK 1.6 on your machine and have JAVA_HOME correctly configured.
  • We assume you have an SVN client for checking out the code locally.
  • We use Linux as the default environment.
  • The conversion code is published at http://code.google.com/p/data-gov-wiki/
  • The default root directory of data-gov is referred to as <DATA-GOV-ROOT-DIR>/

For data-gov project insiders

  • You can bypass installation and log on to the data-gov machine directly to run the conversion program.
 client:  ssh, putty ..
 data-gov machine name: sam.tw.rpi.edu  -- :) Troy, NY is the hometown of Uncle Sam...
 

Install Conversion code

Create a directory and move into it:

 mkdir <DATA-GOV-HOME>
 cd <DATA-GOV-HOME>
 

Check out the source code from Google Code, following the instructions given there:

 svn checkout http://data-gov-wiki.googlecode.com/svn/trunk/ data-gov-wiki
 

Putting the CSV File in the Right Place

To use the scripts and code we provide, you must put your CSV file in a specific directory where they can find it.

CSV Input Directory:

<DATA-GOV-ROOT-DIR>/data/input/data-gov/<DATASET-ID> (for data-gov data)
<DATA-GOV-ROOT-DIR>/data/input/web/<DATASET-ID> (for non-data.gov data)

For the <DATASET-ID>, use the ID assigned by data.gov if the dataset comes from data.gov. For a non-data.gov dataset, please go to Other_Dataset_Catalog to add the dataset and get an ID for it.
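The directory layout above can be expressed as a small helper. This is only a sketch: the function names input_dir and stage_csv are hypothetical, and the directory names are copied from the two paths listed above.

```python
import shutil
import tempfile
from pathlib import Path


def input_dir(root, dataset_id, from_data_gov=True):
    """Return the input directory for a dataset, following the layout above."""
    source = "data-gov" if from_data_gov else "web"
    return Path(root) / "data" / "input" / source / str(dataset_id)


def stage_csv(csv_path, root, dataset_id, from_data_gov=True):
    """Copy a CSV file into its input directory, creating the directory if needed."""
    target_dir = input_dir(root, dataset_id, from_data_gov)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(csv_path, target_dir))


# Demonstration in a temporary directory standing in for <DATA-GOV-ROOT-DIR>.
with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "receipts.csv"
    source.write_text("state,amount\nNY,10\n")
    staged = stage_csv(source, root, 403)
    print(staged.relative_to(root).as_posix())
```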

Converting Process

The conversion code is written in Java. It is published at the project site listed in the prerequisites, http://code.google.com/p/data-gov-wiki/

After putting the CSV file in the input directory, change directory to <DATA-GOV-ROOT-DIR>/java. First, make sure the configuration file (convert.conf) has the following content:

dir_home=<DATA-GOV-ROOT-DIR>
jobfile=data-gov-wiki/java/job_convert_local.csv

After that, update job_convert_local.csv to include the file you want to convert. If the file already exists, you may want to comment out the other entries so that you do not convert every file listed; otherwise, create job_convert_local.csv.

The job_convert_local.csv file must contain the following information:

Each entry has nine columns, in the following order:

  • input csv dir: use a path relative to home-dir. Example: data/input/data.gov/403
  • dataset_acronym: you may skip it. Example: OMB_PBD_Rec
  • dataset_id: use your dataset id. Example: 403
  • source url: supply it if you know where to get the CSV files. Example: http://www.whitehouse.gov/omb/budget/fy2010/assets/receipts.csv
  • property namespace id: the same as the dataset id by default, but you may reuse an id that was used by similar datasets. Example: 401
  • is large file: whether the CSV files are larger than 1M. Example: false
  • output RDF dir: use a path relative to home-dir; choose one of the three possible output directories. Example: data/output/data.gov
  • output namespace: specify the namespace of the output data (explained later). Example: http://data-gov.tw.rpi.edu/raw
  • genre: a tag for the group the dataset belongs to; use a short, all-capitalized string without white space. Example: GOV-BUDGET
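A job entry can be assembled and sanity-checked with a few lines of code. The sketch below rests on assumptions that this page does not confirm: that the job file is comma-separated and that the fields appear in the order given above. The function name make_job_line is hypothetical.

```python
import csv
import io

# Column order as described above.
JOB_COLUMNS = [
    "input csv dir", "dataset_acronym", "dataset_id", "source url",
    "property namespace id", "is large file", "output RDF dir",
    "output namespace", "genre",
]


def make_job_line(values):
    """Return one comma-separated job entry, or raise ValueError if it looks wrong."""
    missing = [name for name in JOB_COLUMNS if name not in values]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    genre = values["genre"]
    # The genre tag must be all-capitalized and contain no white space.
    if genre != genre.upper() or any(ch.isspace() for ch in genre):
        raise ValueError("genre must be upper-case with no white space")
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(
        [values[name] for name in JOB_COLUMNS]
    )
    return buffer.getvalue().rstrip("\n")


example = {
    "input csv dir": "data/input/data.gov/403",
    "dataset_acronym": "OMB_PBD_Rec",
    "dataset_id": "403",
    "source url": "http://www.whitehouse.gov/omb/budget/fy2010/assets/receipts.csv",
    "property namespace id": "401",
    "is large file": "false",
    "output RDF dir": "data/output/data.gov",
    "output namespace": "http://data-gov.tw.rpi.edu/raw",
    "genre": "GOV-BUDGET",
}
print(make_job_line(example))
```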

Finally, you can run our script to convert the file by running this command:

ant convert -l output-local 

This command will produce the output file in the output directory you configured above.

<DATA-GOV-ROOT-DIR>/data/output (data - the output of CSV-RDF translation)
  data.gov (raw)
  data.gov-huge (raw-huge > 1G data will be generated)
  web  

Your data is now converted.

Beyond Conversion

Minimal Enhanced conversion

In the above example, you should find the output file in one of the following directories:

<DATA-GOV-ROOT-DIR>/data/output/data.gov/v1/<dataset-id>/  -- the classic minimal translation results
<DATA-GOV-ROOT-DIR>/data/output/data.gov/v2/<dataset-id>/  -- the modern translation (with typed numeric literals)
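The difference between the two outputs can be pictured as follows: a plain literal keeps every cell as a string, while a typed literal attaches a datatype to numeric cells. This sketch is an assumption about what "typed numeric literals" looks like in N-Triples syntax; the exact datatypes and detection rules used by the data-gov converter are not stated on this page, and the function names are hypothetical.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"


def plain_literal(cell):
    """Render a cell as an untyped literal."""
    escaped = cell.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def typed_literal(cell):
    """Render numeric cells with an XSD datatype; other cells stay plain."""
    text = cell.strip()
    if re.fullmatch(r"[+-]?\d+", text):
        return f'"{text}"^^<{XSD}integer>'
    if re.fullmatch(r"[+-]?\d+\.\d+", text):
        return f'"{text}"^^<{XSD}decimal>'
    return plain_literal(cell)


for cell in ["10", "3.5", "NY"]:
    print(plain_literal(cell), "->", typed_literal(cell))
```

Typed literals let a SPARQL query compare or sum the values as numbers instead of strings.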

Release data

The data release is controlled by the administrator, so you do not need to worry much about it. Currently:

 data in <DATA-GOV-ROOT-DIR>/data/output/data.gov/v1/ is copied to   <DATA-GOV-ROOT-DIR>/data/release1.0/raw/
 data in <DATA-GOV-ROOT-DIR>/data/output/web/v1/ is copied to   <DATA-GOV-ROOT-DIR>/data/release1.0/raw/
 
 http://data-gov.tw.rpi.edu/raw/  is backed by  <DATA-GOV-ROOT-DIR>/data/release1.0/raw/
 http://data-gov.tw.rpi.edu/raw2/  is backed by  <DATA-GOV-ROOT-DIR>/data/release1.0/raw2/

Dump Data to TDB Triple Store

  • Please ensure that PHP is installed.
  • These instructions are intended only for data-gov insiders.

The process of dumping data to the triple store is simple:

1. Ensure you have the dataset in the <DATA-GOV-ROOT-DIR>/data/output directory.

2. Go to sam.tw.rpi.edu:/work/data-gov-wiki/dev.

3. Run the PHP script. The following command prints a list of commands that you can run to load a selected dataset into the triple store:

php tdbloader-prepare.php

The output should contain lines such as php tdbloader-dataset 34, which, for example, loads the dataset with id 34 into the triple store.

4. Run the command listed for your dataset, for example:

  php tdbloader-dataset  34   

5. Restart Tomcat:

/sbin/service tomcat6 restart
