Current Issues in data.gov

From Data-gov Wiki

Infobox (Essay)

  • description: Not all datasets in data.gov labeled as CSV/TXT are friendly to machine consumption. Here are our findings.
  • creator(s): Sarah Magidson, Li Ding, Dominic DiFranzo
  • created: 2009/07/15
  • modified: 18 November 2009 07:32:57

Overview

While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:

  • Duplicated Datasets - Some datasets are part of another dataset, e.g. Dataset 140 is a subset of Dataset 191.
  • Formatting Issues - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV data, and parsing table data from the other formats requires non-trivial effort. Example: Dataset 37. Some access points, meanwhile, have no data at all: Dataset 335, for example, tells you how to order data from the government.
  • Access Point Issues - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where only partial data can be returned by a web-based query. Examples: Dataset 330 and Dataset 96.

For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .

Details

Duplicated Datasets

  • The EPA publishes both a nationwide dataset and state-level datasets on the Toxics Release Inventory. The national data files covering all US states and territories, Dataset 191 (2005), Dataset 249 (2006), and Dataset 307 (2007), are supersets of the 171 (3*57) state- and territory-specific datasets.
  • Dataset 59 is a subset of Dataset 10, containing only columns with general information and data specific to energy consumption. Dataset 10 contains all the data from the Residential Energy Consumption Survey, conducted by the Energy Information Administration.
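A subset relationship of this kind can be checked mechanically before translating both files. The following is a minimal sketch, assuming both datasets are available as CSV text with a header line and that the smaller file uses the same column names as the larger one; the function name `is_row_subset` and the sample values are hypothetical and are not taken from the actual datasets.

```python
import csv
import io


def is_row_subset(small_csv: str, large_csv: str) -> bool:
    """Return True if every row of small_csv also appears in large_csv.

    Both inputs are CSV text whose first line holds the field names.
    Only the columns present in the smaller file are compared, so a file
    that keeps a subset of the columns still counts as a subset.
    """
    small_rows = list(csv.DictReader(io.StringIO(small_csv)))
    large_reader = csv.DictReader(io.StringIO(large_csv))
    large_rows = list(large_reader)
    if not small_rows:
        return True
    columns = list(small_rows[0].keys())
    # The larger file must contain every column of the smaller one.
    if not set(columns) <= set(large_reader.fieldnames or []):
        return False
    seen = {tuple(row[c] for c in columns) for row in large_rows}
    return all(tuple(row[c] for c in columns) in seen for row in small_rows)


# Hypothetical sample data, for illustration only.
national = "state,facility,year\nNY,A,2005\nCA,B,2005\n"
state_only = "state,facility\nNY,A\n"
print(is_row_subset(state_only, national))
```

Loading the larger file into a set keeps the check simple; for very large files a streaming comparison would be more appropriate.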

Formatting Issues

According to data.gov, CSV/TXT files are there to "Use... for easy access to the data. [They] could be opened by most desktop spreadsheet applications". However, we found many CSV/TXT links on data.gov that do not meet this criterion. What follows is a detailed categorization of the data formats we have encountered:

Missing Headers

Symptoms

  • Datasets in this category consist of CSV file(s) that contain only data. The files are missing the first line, which should supply the name of each field, so the field names have to be hunted down in some other location.
  • Otherwise, the CSV is formatted properly.

Examples

  • All the Medicare datasets are like this.
  • Look at Dataset 745 (Skilled Nursing Facility Medicare Cost Report Data, 1996).

Comments

  • It's easiest for parsing if the field names are right in the data files. If they aren't, their location should be made obvious.
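Once the field names have been located, a headerless file can still be read by supplying them separately. This is a minimal sketch, assuming the names have been copied by hand from the accompanying documentation; the function `read_headerless_csv`, the field names, and the sample rows are all hypothetical.

```python
import csv
import io


def read_headerless_csv(text, field_names):
    """Parse CSV text that has no header line, using externally supplied names."""
    rows = []
    for line_no, values in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not values:
            continue  # skip blank lines
        if len(values) != len(field_names):
            # A mismatch usually means the wrong field list was found.
            raise ValueError(
                f"line {line_no}: expected {len(field_names)} fields, got {len(values)}"
            )
        rows.append(dict(zip(field_names, values)))
    return rows


# Hypothetical field names and values, for illustration only.
names = ["provider_id", "fiscal_year", "bed_count"]
records = read_headerless_csv("015009,1996,120\n015010,1996,85\n", names)
print(records[0]["provider_id"])
```

Failing loudly on a column-count mismatch is a deliberate choice here: silently padding or truncating rows would hide the case where the field list does not belong to the file.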

Text with Fixed-Width Columns

Symptoms

  • Instead of being delimited by commas, each field consists of the characters between a fixed start column and a fixed end column.
  • The field names themselves do not appear in the data file, but are usually in some other related file, along with the field lengths and start columns.

Examples

Comments

  • It is possible to parse these files and convert them into RDF, but the process is more difficult because some manual work is needed to find the field names and determine the start column and length of each field.
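The manual step amounts to transcribing the layout (field name, start column, length) from the related file; after that, slicing each line is straightforward. The sketch below assumes 1-based start columns, as layout descriptions commonly use; `FieldSpec`, `parse_fixed_width_line`, and the sample layout are hypothetical.

```python
from typing import NamedTuple


class FieldSpec(NamedTuple):
    name: str
    start: int   # 1-based start column (assumed convention of the layout file)
    length: int  # number of characters in the field


def parse_fixed_width_line(line, layout):
    """Slice one fixed-width line into a dict of stripped field values."""
    record = {}
    for spec in layout:
        begin = spec.start - 1  # convert to a 0-based index
        record[spec.name] = line[begin:begin + spec.length].strip()
    return record


# Hypothetical layout and data line, for illustration only.
layout = [
    FieldSpec("state", 1, 2),
    FieldSpec("county", 3, 5),
    FieldSpec("amount", 8, 6),
]
print(parse_fixed_width_line("NYKINGS   120", layout))
```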

Generally Hard-to-Parse Text

Symptoms

  • Some pages link to pages that have their data in a text file, but not in a format that would be easy to parse (usually they are designed for human readability).

Examples

  • Dataset 37 - Note that there is another chart of data above this one.


Bad CSV

Symptoms

  • Some datasets are technically CSV, but opening them up reveals them to be more like XLS files, and not in the proper format needed for RDF translation.

Example

  • Dataset 58: The top line does not contain field names, there is an extended header, etc.


Comments

  • This should probably only be labeled as XLS, not CSV, to make it clear that it is not ready for machine consumption as-is.
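When such a file has to be processed anyway, one workaround is to skip the extended header and locate the real row of field names heuristically. The sketch below is only a heuristic under stated assumptions: it treats the first row that is as wide as the widest row and has no empty cells as the header. The function `strip_preamble` and the sample text are hypothetical and do not reproduce Dataset 58.

```python
import csv
import io


def strip_preamble(text):
    """Return (header, data_rows), skipping title and note lines above the table.

    Heuristic: the header is the first row whose width equals the widest
    row in the file and whose cells are all non-empty.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(row) for row in rows), default=0)
    for index, row in enumerate(rows):
        if width > 0 and len(row) == width and all(cell.strip() for cell in row):
            # Drop fully empty rows that spreadsheet exports often leave behind.
            data = [r for r in rows[index + 1:] if any(c.strip() for c in r)]
            return row, data
    raise ValueError("no candidate header row found")


# Hypothetical spreadsheet-style export, for illustration only.
sample = (
    "Table 1. Example survey,,\n"
    "Source: hypothetical,,\n"
    ",,\n"
    "region,year,value\n"
    "Northeast,2005,10\n"
    "South,2005,12\n"
)
header, data = strip_preamble(sample)
print(header, len(data))
```

A heuristic like this can misfire, for instance when a data row precedes the header or the header itself has blank cells, so the result should be checked by hand for each dataset.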

Unparsable Data

Symptoms

  • Sometimes a BLS dataset actually sends you to a webpage where you have to further pick out which format or what particular data you want. This data is usually in an unhelpful format such as an HTML table or a PDF.
  • Others lead to an interactive webpage, and there is no clear text source for the data. More on this in the interactive webpages section.

Examples

Comments

  • It seems that it would make more sense to put a direct link to each dataset as a separate page on data.gov.

No Data

Symptoms

  • A couple of "datasets" have little connection to data at all.

Examples

  • Dataset 336 takes you to the homepage of the Occupational Outlook Handbook.
  • Dataset 335 tells you where you can order data from the government.

Comments

  • A strong hint that neither of these should have their own page on data.gov is that neither is under the "Databases & Tables" tab of the BLS website: The former is under "Publications", and the latter is under "Subject Areas".

Access Point Issues

Besides the format of the data, the URLs given as the CSV access points sometimes introduce additional complications. What follows is a detailed categorization of the access point issues we have encountered:

Interactive Websites

Websites Which Produce CSV or CSV-compatible Data by Query

Symptoms

  • Partial data can be obtained as a query result from an interactive website reached through the access point of the dataset. The result data are either in CSV or in some other easy-to-parse format. The websites can produce a parsable piece of text, but only after fiddling with the CGI arguments in the URL.

Examples

  • Dataset 95 (National Water Quality Assessment Data Warehouse):
    • The CSV link leads to the homepage of the organization.
    • A couple of clicks on "retrieve data" followed by a click on a selected topic will bring up the interactive webpage.
    • For the first topic, Animal Tissue, it is possible to get a CSV file from the URL http://infotrek.er.usgs.gov/nawqa_queries/tismaster/exportCSV? followed by the various query arguments, as shown here.
    • It is not possible to get all the data at once (by leaving all possible arguments blank) because the website will not export CSV files larger than 50,000 rows.
  • Dataset 96 (USGS Water Data):
    • This CSV link also leads to an organization's homepage.
    • There is a long page describing how to conduct automated retrieval of data off the site.
    • Getting real-time data using the URL is fairly easy and is described here.
    • Getting a general dataset that is not aimed at a specific subset of the data, however, seems difficult, if not impossible.

Comments

  • This method seems capable of retrieving only some of the data, not the entire set as we would like.
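One way to work within a per-export row limit is to split a large request into several smaller queries, one per value of some argument, and fetch each result separately. The sketch below only builds the URLs and makes no network calls. The argument names (`topic`, `state`) and the example.org address are hypothetical placeholders, not the actual parameters of any of the services above, and whether the full dataset can be reassembled from such partial queries is an assumption that would need to be verified per site.

```python
from urllib.parse import urlencode


def build_query_urls(base_url, fixed_args, split_arg, split_values):
    """Build one export URL per value of split_arg.

    fixed_args holds the arguments shared by every query; split_arg is the
    argument used to partition the data so each result stays under the limit.
    """
    urls = []
    for value in split_values:
        args = dict(fixed_args)
        args[split_arg] = value
        # Tolerate a base URL that already ends with "?".
        urls.append(base_url.rstrip("?") + "?" + urlencode(args))
    return urls


# Hypothetical endpoint and argument names, for illustration only.
for url in build_query_urls(
    "http://example.org/exportCSV?",
    {"topic": "animal tissue"},
    "state",
    ["NY", "CA"],
):
    print(url)
```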

Websites with Text Data on an Alternate Page

Symptoms

  • The link given as the CSV access-point is an interactive webpage really meant for humans, not machines.
  • From looking under "additional metadata", though, one discovers that all the data from the webpage is actually in CSV form in a directory elsewhere. As long as it's possible to find these directories, the data will still be easy to parse.

Examples

  • Much of the data from the Bureau of Labor Statistics (for example, Dataset 330 and Dataset 317) is like this.


Comments

  • Ideally, data.gov would link straight to these text files instead of to interactive webpages.

Cannot Find Data at Access Point

Symptoms

  • Some datasets also lead to an interactive webpage, but they do not provide a way of accessing all the data in an easily-parsable format (like CSV).
  • A bad access point also goes hand-in-hand with improper formatting, no data, etc.

Examples

  • The regional offices of the Bureau of Labor Statistics (datasets 339 through 346).

Comments

  • There does not seem to be any other location where all the data is held, as is the case with many other BLS datasets.
