Generating RSS feeds for data.gov

From Data-gov Wiki

Infobox (Tech Report)
  • name: Generating RSS feeds for data.gov

  • description: technical details for generating RSS for data.gov datasets.
  • creator(s): Li Ding, Dominic DiFranzo
  • created: June 27, 2009
  • modified: 2009-11-18



overview

In order to follow the datasets published on data.gov, two questions need to be answered:

  • Q1: What datasets are available at data.gov, and in what formats?
  • Q2: Which datasets have recently been added, deleted, or updated?

The second question is hard to follow manually; for example, this Twitter page has not been updated since June 17, 2009. Since data.gov maintains the catalog of its datasets in a machine-processable format (http://www.data.gov/details/92), we can translate the catalog into RDF and then answer both questions via automatically maintained RSS feeds.

The process of generating these RSS feeds also serves as a showcase of semantic web technologies:

  1. The catalog file (in CSV format) was converted into RDF, serialized as RDF/XML.
  2. A SPARQL query was run through a SPARQL web service to construct an RSS feed.
  3. An RDF Diff Summary web service was used to semantically summarize recently updated datasets.

Do you want more RSS feeds? If you have any other interesting questions about the datasets in data.gov, please let us know.

To find more details, please go to http://data-gov.tw.rpi.edu/wiki/Generating_RSS_feeds_for_data.gov.

workflow design

Here is the general workflow for creating RSS feeds for data.gov:

  • 1. generate RDF dataset for data.gov catalog

[ data_gov_catalog@data.gov ] ==TRANSLATION==> [ data.rdf (RDF) ]

  • 2a. derive today's RSS feed

[ data.rdf (RDF) ] ==DERIVATION==> [ #today.rss (RSS) ]

  • 2b. derive today's RSS feed with additional ping info

[ data.rdf (RDF) ] ==PING==> [ #today-raw-ping.rss (RSS) ]

  • 3. derive update RSS feed to show recent changes in data.gov datasets

[ #today-raw-ping.rss (this version) ] + [ #today-raw-ping.rss (last version) ] ==DIFF==> [ #diff-today-raw-ping.rss (RSS) ]

  • add
  • del
  • update


data format

We chose RSS 1.0, which is the latest RDF-based syntax for RSS. See also http://en.wikipedia.org/wiki/RSS_(file_format)


today.rss

location:

format:

   channel
      link
      title
      description
      dc:date 

   item <DATASET-HOMEPAGE>
      link
      title
      description
      dc:date (optional [1] last modified)
      dbp:XYZ_access_point <DATA-FILE-URL> 
  • note [1]: depends on how the date was parsed: the date is normalized if the original dataset specified it as y-m-d, and the original value is kept otherwise.
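The rule in note [1] can be sketched as follows. This is a minimal illustration, not the converter's actual code: the function name `normalize_date` and the list of accepted input formats are assumptions, since the page does not say how dates appear in the original catalog.

```python
from datetime import datetime

# Assumed input formats for a full year-month-day date; the real catalog
# may use others. Extend this tuple as needed.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")


def normalize_date(raw):
    """Return YYYY-MM-DD when raw is a full year-month-day date, else raw unchanged."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    # No full date recognized: keep the original value as note [1] describes.
    return raw
```

Values that do not carry a full date (for example a bare year or a free-text update frequency) pass through untouched, which matches the "keep the original otherwise" half of the note.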

example: http://data-gov.tw.rpi.edu/raw/92/today.rss

<rdf:RDF 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:dgp92="http://data-gov.tw.rpi.edu/vocab/p/92/" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:dgtwc="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#" 
    xmlns:rss="http://purl.org/rss/1.0/" 
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
    xmlns:dg="http://data-gov.tw.rpi.edu/vocab/" 
    xmlns="http://purl.org/rss/1.0/">
 <channel rdf:about="http://data-gov.tw.rpi.edu/raw/92/today.rss">
  <dc:date>2009-07-11T02:00:06Z</dc:date>
  <title>Datasets @ Data.gov (Today)</title>
  <description>Today's RSS feeds for datasets published at data.gov</description>
  <link rdf:resource="http://data-gov.tw.rpi.edu/raw/92/today.rss"/>
 </channel>

 <item rdf:about="http://www.data.gov/details/285">
  <link>http://www.data.gov/details/285</link>
  <dgp92:csv_txt_access_point rdf:resource="http://www.epa.gov/tri/tridata/tri07/data/statedata07/NJ_2007_v07.exe"/>
  <description>The Toxics Release Inventory (TRI) is a publicly available EPA database that contains information on toxic chemical releases and waste management activities reported annually by certain industries as well as federal facilities.</description>
  <title>2007 Toxics Release Inventory data for the state of New Jersey</title>
  <dc:date>2009-06-10</dc:date>
 </item>
  ...
</rdf:RDF>
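A consumer can read a feed of this shape with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed copy of the example above; the helper name `read_items` and the rule "any property whose local name ends in `_access_point` is a data file link" are assumptions drawn from the format outline, not a documented API.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Trimmed copy of the example feed, used only for illustration.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dgp92="http://data-gov.tw.rpi.edu/vocab/p/92/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
 <item rdf:about="http://www.data.gov/details/285">
  <link>http://www.data.gov/details/285</link>
  <dgp92:csv_txt_access_point rdf:resource="http://www.epa.gov/tri/tridata/tri07/data/statedata07/NJ_2007_v07.exe"/>
  <title>2007 Toxics Release Inventory data for the state of New Jersey</title>
  <dc:date>2009-06-10</dc:date>
 </item>
</rdf:RDF>"""


def read_items(feed_xml):
    """Return one dict per rss:item with homepage, title, date and access points."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("rss:item", NS):
        access_points = []
        for child in item:
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name.endswith("_access_point"):
                access_points.append((local_name, child.get(RDF_RESOURCE)))
        items.append({
            "homepage": item.get(RDF_ABOUT),
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "date": item.findtext("dc:date", default=None, namespaces=NS),
            "access_points": access_points,
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE):
        print(entry["date"], entry["title"])
```

Because `dc:date` is optional in the format, the sketch returns `None` for items that lack it instead of failing.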

today-raw-ping.rss

location:

format:

   channel
      link
      title
      description
      dc:date

   item <DATASET_HOMEPAGE>
      link
      title
      description
      dc:date (optional [1] last-modified)
      dgp:XYZ_access_point <DATA_FILE_URL>

   rdf:Description <DATA_FILE_URL>
      rdf:type  foaf:Document
      dc:date (optional [2] last-modified)
      dc:relation (required [3] ping-state)
      dgtwc:number_of_bytes 
  • note [1]: (i) use the latest modification date among the dataset's member data files; (ii) otherwise fall back to the dbp:Date_updated value (see today.rss note [1]).
  • note [2]: depends on the HTTP request; the value is filled in only when a Last-Modified value was retrieved from the HTTP response header.
  • note [3]: records the ping state, using one of the string values {'alive', 'offline'}.
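The ping step described in notes [2] and [3] might look like the sketch below. It is an illustration under stated assumptions, not the service's actual implementation: the page does not say which HTTP method was used (a HEAD request is assumed here), and taking `dgtwc:number_of_bytes` from the `Content-Length` header is also an assumption. The `opener` parameter is a hypothetical hook so the function can be exercised without network access.

```python
import urllib.request


def ping(url, opener=urllib.request.urlopen, timeout=10):
    """Return a ping record for one data file URL.

    state           -> 'alive' or 'offline' (note [3])
    last_modified   -> Last-Modified header value, or None if absent (note [2])
    number_of_bytes -> Content-Length as an int, or None (assumed source)
    """
    record = {
        "url": url,
        "state": "offline",
        "last_modified": None,
        "number_of_bytes": None,
    }
    request = urllib.request.Request(url, method="HEAD")  # assumed method
    try:
        with opener(request, timeout=timeout) as response:
            record["state"] = "alive"
            record["last_modified"] = response.headers.get("Last-Modified")
            length = response.headers.get("Content-Length")
            if length is not None and length.isdigit():
                record["number_of_bytes"] = int(length)
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs. Any failure leaves the record marked 'offline'.
        pass
    return record
```

Some servers reject HEAD requests, so a real crawler may need to fall back to GET before declaring a file offline.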

diff-today-raw-ping.rss

location:


format:

   channel
      link
      title
      description
      dc:date

   item <DATASET-HOMEPAGE>
      link
      title
      description  ( [1] diff description )
      dc:relation  ( [2] diff state )
  • note [1]: describes the diff state in detail.
  • note [2]: records the diff state and may have multiple values from {new, add, modify, delete}. The last three states refer to properties of <DATASET-HOMEPAGE> and its <DATA-FILE-URL>.
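The diff states in note [2] can be illustrated with a plain dictionary comparison. This is a simplified stand-in for the RDF Diff Summary web service, not a description of how it works: each snapshot is assumed to be a mapping from dataset homepage to its properties, and reporting a dataset that disappeared entirely as `delete` is an assumption, since the note defines `delete` only at the property level.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots shaped as {homepage: {property: value}}.

    Returns {homepage: (sorted list of diff states, list of notes)}.
    """
    changes = {}
    for homepage, props in current.items():
        if homepage not in previous:
            changes[homepage] = (["new"], ["dataset first seen in this version"])
            continue
        old = previous[homepage]
        states, notes = set(), []
        for key in sorted(set(old) | set(props)):
            if key not in old:
                states.add("add")
                notes.append("added %s" % key)
            elif key not in props:
                states.add("delete")
                notes.append("deleted %s" % key)
            elif old[key] != props[key]:
                states.add("modify")
                notes.append("modified %s" % key)
        if states:
            changes[homepage] = (sorted(states), notes)
    # Assumption: a dataset missing from the current snapshot is reported as "delete".
    for homepage in sorted(set(previous) - set(current)):
        changes[homepage] = (["delete"], ["dataset no longer listed"])
    return changes
```

The list of states would populate `dc:relation` and the notes would feed the item `description`; unchanged datasets are omitted from the result.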

processes

translate csv to RDF

Convert the data.gov catalog CSV into RDF using an online converter:

http://tw.rpi.edu/ws/csv2rdf.php?url=http://www.data.gov/data_gov_catalog.csv&xmlbase=http://data-gov.tw.rpi.edu/raw/92/data-929.rdf&propns=http://data-gov.tw.rpi.edu/vocab/p/92/
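The converter call above is an ordinary GET request with three query parameters. The sketch below rebuilds such a URL with percent-encoded values. The parameter names `url`, `xmlbase` and `propns` come from the URL above; the helper name `converter_url` and the reading of `propns` as the namespace for generated properties are assumptions, and the service may no longer be online.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CONVERTER = "http://tw.rpi.edu/ws/csv2rdf.php"


def converter_url(csv_url, xml_base, property_ns):
    """Build a csv2rdf request URL with properly encoded query values."""
    query = urlencode({
        "url": csv_url,         # location of the source CSV file
        "xmlbase": xml_base,    # base URI for the generated RDF/XML
        "propns": property_ns,  # assumed: namespace for generated properties
    })
    return CONVERTER + "?" + query


if __name__ == "__main__":
    print(converter_url(
        "http://www.data.gov/data_gov_catalog.csv",
        "http://data-gov.tw.rpi.edu/raw/92/data-929.rdf",
        "http://data-gov.tw.rpi.edu/vocab/p/92/",
    ))
```

Encoding the nested URLs keeps characters such as `&` or `#` inside a parameter value from being misread as part of the outer query string.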

examples

generate today.rss

Create the initial RSS feed from the converted RDF data using a SPARQL CONSTRUCT query via an online SPARQL service. Here is a live demo.

generate today-raw-ping.rss

Build an enhanced RSS feed with the results of pinging the published data files of the raw datasets.

generate diff.rss

If the catalog has changed, compute a diff RSS feed that lists the recently added, deleted, or updated datasets.

Compare today.rss with its last version to find out what has changed.
