  What's in data.gov

  Statistical survey of the data.gov datasets
  Li Ding, Dominic DiFranzo
  June 26, 2009
  2010-5-19



This article presents the current status of the datasets indexed and converted at http://data-gov.tw.rpi.edu. The datasets can be roughly partitioned as follows:

Original Data.gov Datasets

Information about Data.gov datasets comes from the data.gov catalog, a special Data.gov dataset that contains the catalog metadata about data.gov datasets, together with some conversion results. You may browse its RDF version using Tabulator by following this link.

Live and Deleted Datasets

Historically, there were 6468 datasets in Category:Data.gov Dataset; however, some (7, see Category:Data.gov Deleted Dataset) were deleted after being published, and the rest (6458, see Category:Data.gov Live Dataset) are still live. In what follows, we count only live data.gov datasets.

Format of datasets' access points

Data.gov lists 4573 data files as the access points of the datasets:

Tag cloud of keywords

A tag cloud can be generated from all keywords of the data.gov datasets with a single SPARQL query; here is the query result.

Snapshot Tag Cloud as of April 7, 2010
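
As a rough illustration of what happens after the SPARQL query returns, the sketch below counts keyword occurrences and maps each count to a font size on a logarithmic scale. It is not the code used on this site: the function name `tag_cloud`, its parameters, and the sample keyword lists are all hypothetical, and the input is assumed to be one list of keywords per dataset.

```python
from collections import Counter
import math


def tag_cloud(keyword_lists, min_size=10, max_size=36, top_n=50):
    """Return [(keyword, count, font_size)] for the most frequent keywords.

    keyword_lists: iterable of keyword lists, one list per dataset
    (assumed input shape; the real query result may look different).
    """
    # Normalize keywords and count how many times each one appears.
    counts = Counter(
        kw.strip().lower()
        for keywords in keyword_lists
        for kw in keywords
        if kw.strip()
    )
    top = counts.most_common(top_n)
    if not top:
        return []

    # Scale font sizes on a log scale so very frequent tags do not dominate.
    lo = math.log(min(c for _, c in top))
    hi = math.log(max(c for _, c in top))
    span = hi - lo

    cloud = []
    for kw, c in top:
        weight = (math.log(c) - lo) / span if span else 1.0
        size = round(min_size + weight * (max_size - min_size))
        cloud.append((kw, c, size))
    return cloud


# Hypothetical keyword lists, one per dataset (not real data.gov values).
sample = [
    ["toxics", "air"],
    ["toxics", "water"],
    ["toxics", "air", "health"],
]
cloud = tag_cloud(sample)
for kw, c, size in cloud:
    print(f"{kw}: count={c}, size={size}")
```

The logarithmic scale is one common design choice for tag clouds; a linear scale would work as well when keyword counts are close together.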

Sources of datasets

The datasets are contributed by 66 US government agencies. The following list shows the top 10 agencies by number of datasets contributed.

  1. Environmental Protection Agency (1,530)
  2. National Center for Health Statistics (631)
  3. Department of Defense (413)
  4. US Fish and Wildlife Service (411)
  5. White House (392)
  6. National Transportation Safety Board (382)
  7. United States Senate (377)
  8. U.S. - Canada Border Environment Cooperation Commission (369)
  9. National Center for Education Statistics (367)
  10. Pension Benefit Guaranty Corporation (353)
Agency Dataset Contribution (snapshot as of June 2009)

RDFized Datasets

There are currently 311 datasets in Category:RDFized Dataset, including:

category                  | no. of RDF-ized datasets | no. of triples (Billion) | no. of instances (Million) | notes
Data.gov Raw Data Catalog | 252                      | 3.104                    | 253.84                     | covering the content of 538 out of 3117, see data
Data.gov Tool Catalog     | 1                        | 0.018                    | 1.52                       | covering the content of 1 out of 609, see data
Data.gov Deleted Data     | 2                        | 0.001                    | 0.15                       | covering the content of 2 out of 7, see data
Total Data.gov Data       | 257                      | 3.129                    | 256.06                     | covering the content of 544 out of 3736
Other Government Data     | 23                       | 0.053                    | 2.22                       | see data
Non Government Data       | 0                        | n/a                      | n/a                        |
User Generated Data       | 6                        | 0                        | 0                          | see data
Total RDF-ized Data       | 286                      | 3.182                    | 258.29                     | see data
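
The "covering the content of X out of Y" notes in the table can be turned into coverage percentages with a few lines of Python. This is only a sketch: the numbers are copied from the notes column, while the helper name `coverage_percent` and the reading of each pair as "items covered out of items in the catalog" are assumptions.

```python
# (category, items covered by the RDF conversion, items in the catalog),
# copied from the notes column of the table.
coverage = [
    ("Data.gov Raw Data Catalog", 538, 3117),
    ("Data.gov Tool Catalog", 1, 609),
    ("Data.gov Deleted Data", 2, 7),
    ("Total Data.gov Data", 544, 3736),
]


def coverage_percent(covered, total):
    """Return the covered share as a percentage; 0.0 when the total is 0."""
    return 100.0 * covered / total if total else 0.0


for name, covered, total in coverage:
    pct = coverage_percent(covered, total)
    print(f"{name}: {covered}/{total} = {pct:.1f}%")
```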

