Current Issues in data.gov
From Data-gov Wiki
We maintain a list of issue reports that reflect our latest findings in data.gov datasets:
- Extra headers and footers in CSV files (Dcterms:created 25 June 2010, Sarah Magidson)
- CSV access point leads to exe (Tim Lebo)
- CSV access point leads to website (Dcterms:created 24 June 2010, Sarah Magidson)
- CSV files are actually in LRECL format (Dcterms:created 24 June 2010, Sarah Magidson)
- CSV data files missing headers (Dcterms:created 23 June 2010, Sarah Magidson)
- CSV files use delimiters other than commas (Dcterms:created 25 June 2010, Sarah Magidson)
- Data files reported as CSV are not CSV (Tim Lebo)
- Hiding data files behind a user interface (Tim Lebo)
- Access points to broken links (Dcterms:created 25 June 2010, Sarah Magidson)
- CSV access points to non-CSV data files (Dcterms:created 25 June 2010, Sarah Magidson)
- Web confirmation required for download (Dcterms:created 25 June 2010, Sarah Magidson)
- CSV files contain variable number of values (Tim Lebo)
- Double quotes should be escaped in data.gov catalog csv (Dcterms:created 18 August 2009, Li Ding)
(Note: The related User Experience Issues page covers how users are reacting to the current presentation of the data catalog.)
While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:
- Duplicated Datasets - Some datasets are part of another dataset, e.g. Dataset 140 () is a subset of Dataset 191 (2005 Toxics Release Inventory National data file of all US States and Territories (Environmental Protection Agency)).
- Formatting Issues - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV format data, and parsing table data from them requires non-trivial effort. Example: Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (Department of the Interior)). Some access pages, meanwhile, offer no data at all: Dataset 335 (National Longitudinal Surveys (Department of Labor)), for example, only explains how to order data from the government.
- Access Point Issues - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where a web-based query returns only partial data. Examples: Dataset 330 (Local Area Unemployment Statistics (Department of Labor)) and Dataset 96 (National Water Information System (NWIS) (Department of the Interior)).
For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .
Issues with Data.gov Raw Data
- The EPA publishes both a nationwide dataset and state-level datasets on the Toxics Release Inventory. The national data files of all US States and Territories (Dataset 191 for 2005, Dataset 249 for 2006, and Dataset 307 for 2007) are supersets of the 171 (3*57) state-and-territory-specific datasets.
- Dataset 59 is a subset of Dataset 10, containing only columns with general information and data specific to energy consumption. Dataset 10 contains all the data from the Residential Energy Consumption Survey, conducted by the Energy Information Administration.
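A column-level subset relationship like the one between Dataset 59 and Dataset 10 can be checked roughly by comparing header rows. The following is a minimal sketch in Python, assuming both files are comma-delimited and start with a header row. The function names and the sample column names are hypothetical and are not taken from the actual datasets.

```python
import csv
import io


def header_of(text, delimiter=","):
    """Return the first row of a delimited text as a list of column names."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter), [])


def is_column_subset(small_text, large_text, delimiter=","):
    """True if every column name in small_text also appears in large_text.

    Names are compared case-insensitively after trimming whitespace.
    An empty header is never treated as a subset.
    """
    small = {name.strip().lower() for name in header_of(small_text, delimiter)}
    large = {name.strip().lower() for name in header_of(large_text, delimiter)}
    return bool(small) and small <= large


# Hypothetical headers, for illustration only.
full = "ID,Region,Fuel,KWH\n1,NE,gas,900\n"
part = "id,kwh\n1,900\n"
print(is_column_subset(part, full))
```

This compares column names only. Confirming that the rows themselves match would require joining the two files on a shared key, which is not attempted here.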
According to data.gov, CSV/TXT files are there to "Use... for easy access to the data. [They] could be opened by most desktop spreadsheet applications". However, we found many CSV/TXT links on data.gov that do not meet this criterion. The data formats we encountered fall into the following categories:
- Category:Bad csv (total: 0)
- Category:Confirmed csv (total: 12)
- Category:Csv as query output-convertible files (total: 1)
- Category:Csv missing header (total: 54)
- Category:Fixed-width-column txt (total: 32)
- Category:Interactive webpage (total: 7)
- Category:No csv (total: 6)
- Category:Tab-delimited csv (total: 287)
Datasets labeled as "confirmed CSV" are those that really are in CSV format and are easy to parse. "Tab-delimited CSV" means that the data files follow the conventions of CSV files but use tabs instead of commas. These are also easy to parse, but we note them separately because our RDF translation program expects commas as delimiters.
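The distinction between comma-delimited and tab-delimited files, and the related problem of rows with a variable number of values, can be detected before translation. The sketch below is one possible approach in Python and is not the RDF translation program mentioned above. The function names and the candidate delimiter list are assumptions made for illustration.

```python
import csv
import io


def guess_delimiter(text, candidates=(",", "\t", ";", "|")):
    """Pick the candidate that occurs most often on the first non-empty line.

    Returns None when no candidate occurs at all. Quoted fields are not
    taken into account, so this is only a rough first guess.
    """
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    best = max(candidates, key=first.count)
    return best if first.count(best) > 0 else None


def ragged_rows(text, delimiter):
    """Return 1-based row numbers whose field count differs from row 1."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows, start=1) if len(row) != expected]


# Hypothetical sample: tab-delimited, with a short third row.
sample = "a\tb\tc\n1\t2\t3\n4\t5\n"
delimiter = guess_delimiter(sample)
print(repr(delimiter), ragged_rows(sample, delimiter))
```

Once the delimiter is known, a tab-delimited file can be passed to a comma-expecting tool by re-writing it with `csv.writer`, which also takes care of quoting any field that itself contains a comma.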
Access Point Issues
Besides the format of data, the URLs given as the CSV access points sometimes introduced additional complications. What follows is the detailed categorization of the data access point issues we have encountered:
- Category:Alternative access point (total: 18)
- Category:Bad access point (total: 15)
- Category:Good access point with set of files (total: 277)
- Category:Good access point with single file (total: 59)
Datasets labeled as "Good access point" generally have no issues of this kind, but have some other formatting issue, as discussed above. "Good access point with single file" and "with set of files" refer to datasets which have good access points as well as good formatting (with the exception of Dataset 59).
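Several of the reported access point problems (a CSV link that returns a web page or an executable) can be spotted by inspecting the first bytes of whatever the link returns. The following Python sketch is a rough heuristic under stated assumptions, not the procedure used to build the categories above. The function name and the label strings are hypothetical; the byte signatures are the well-known "MZ" prefix of DOS/Windows executables and the "PK\x03\x04" prefix of ZIP archives.

```python
def classify_payload(head: bytes) -> str:
    """Roughly classify a download from its leading bytes.

    `head` is assumed to be the first few hundred bytes of the response.
    A text file that happens to begin with "MZ" or "PK" would be
    misclassified, so the result is a hint, not a verdict.
    """
    if head[:2] == b"MZ":
        return "executable"
    if head[:4] == b"PK\x03\x04":
        return "zip"
    lowered = head[:512].lstrip().lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    if b"\x00" in head:
        return "binary"
    return "text"


# Hypothetical leading bytes, for illustration only.
for sample in (b"MZ\x90\x00", b"<!DOCTYPE html><html>", b"a,b,c\n1,2,3\n"):
    print(classify_payload(sample))
```

A result of "text" only says the payload is not obviously something else; whether it is well-formed CSV still has to be checked separately.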
- 10 flaws with the data on Data.gov: "So, can you trust the data provided on Data.gov? A cursory examination of the newly released high-value datasets revealed 10 types of quality deficiencies." (http://gcn.com/articles/2010/03/15/reality-check-10-data-gov-shortcomings.aspx)
|Dcterms:created||15 July 2009|
|Dcterms:creator||Sarah Magidson, Li Ding, and Dominic DiFranzo|
|Dcterms:description||Not all datasets in data.gov labeled as CSV/TXT are friendly to machine consumption. Here are our findings.|
|Foaf:name||Current Issues in data.gov|
|Skos:altLabel||Current Issues in data.gov, current issues in data.gov, and CURRENT ISSUES IN DATA.GOV|