Current Issues in data.gov
From Data-gov Wiki
Overview
While translating data.gov data into RDF, we have discovered some issues with the published datasets. These issues can be roughly categorized as follows:
- Duplicated Datasets - Some datasets are part of another dataset, e.g. Dataset 140 is a subset of Dataset 191.
- Formatting Issues - The format of some datasets is not friendly to machine processing. Not all datasets offer CSV data, and parsing tabular data from the others requires non-trivial effort. Example: Dataset 37. Some access points, meanwhile, have no data at all: Dataset 335, for example, only tells you how to order data from the government.
- Access Point Issues - The access points for some datasets do not point to pages friendly to machine access. Instead of pointing to a downloadable file covering the entire dataset, some lead to an interactive website where a web-based query returns only partial data. Examples: Dataset 330 and Dataset 96.
For more details, please visit http://data-gov.tw.rpi.edu/wiki/Current_Issues_in_data.gov .
Details
Duplicated Datasets
- The EPA publishes both nationwide and state-level datasets on the Toxics Release Inventory. The national data files covering all US states and territories (Dataset 191 for 2005, Dataset 249 for 2006, and Dataset 307 for 2007) are supersets of the 171 (3*57) state- and territory-specific datasets.
- Dataset 59 is a subset of Dataset 10, containing only columns with general information and data specific to energy consumption. Dataset 10 contains all the data from the Residential Energy Consumption Survey, conducted by the Energy Information Administration.
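A duplication of this kind, where one dataset holds only some columns of another, can be checked mechanically. The following is a minimal sketch, assuming both datasets are already available as CSV text with header rows. The column names and values are invented for illustration and do not come from Dataset 10 or Dataset 59.

```python
import csv
import io


def read_rows(text, delimiter=","):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def is_projection(candidate_rows, full_rows):
    """Return True if every candidate row matches some full row on the candidate's columns."""
    if not candidate_rows:
        return True
    columns = list(candidate_rows[0].keys())
    # The candidate cannot be a subset if it has columns the full dataset lacks.
    if full_rows and not set(columns) <= set(full_rows[0].keys()):
        return False
    projected = {tuple(row[c] for c in columns) for row in full_rows}
    return all(tuple(row[c] for c in columns) in projected for row in candidate_rows)


# Hypothetical sample data (not taken from data.gov).
full = read_rows("id,region,kwh,rooms\n1,NE,900,5\n2,SW,1200,7\n")
subset = read_rows("id,kwh\n1,900\n2,1200\n")
print(is_projection(subset, full))
print(is_projection(full, subset))
```

The check compares values as strings, so differently formatted numbers (for example `900` and `900.0`) would need normalizing first.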
Formatting Issues
According to data.gov, CSV/TXT files are there to "Use... for easy access to the data. [They] could be opened by most desktop spreadsheet applications". However, we found many CSV/TXT links on data.gov that do not meet this criterion. What follows is a detailed categorization of the data formats we encountered:
- Category:Bad csv (total:1)
- Category:Confirmed csv (total:14)
- Category:Csv as query output (total:1)
- Category:Csv as query output-convertible files (total:1)
- Category:Csv missing header (total:62)
- Category:Fixed-width-column txt (total:35)
- Category:Interactive webpage (total:8)
- Category:No csv (total:6)
- Category:Tab-delimited csv (total:311)
Datasets labeled as "confirmed CSV" are those that really are in CSV format and are easily parsable. "Tab-delimited CSV" means that the data file(s) follow the conventions of CSV files but use tabs instead of commas. These are also easy to parse, but we note them separately because our RDF translation program expects commas as delimiters.
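The difference between comma-delimited and tab-delimited files is small enough to absorb in a reader. The sketch below is not our RDF translation program. It is an illustrative helper that guesses the delimiter from the first line using a deliberately simple heuristic (more tabs than commas means tab-delimited) and then parses with Python's standard `csv` module.

```python
import csv
import io


def guess_delimiter(first_line):
    """Pick tab if the first line has more tabs than commas, else comma."""
    return "\t" if first_line.count("\t") > first_line.count(",") else ","


def read_table(text):
    """Parse delimited text into a list of rows (each row a list of strings)."""
    first_line = text.splitlines()[0] if text else ""
    return list(csv.reader(io.StringIO(text), delimiter=guess_delimiter(first_line)))


# Hypothetical sample data in both flavors.
tab_text = "state\tyear\tvalue\nNY\t2007\t12\n"
comma_text = "state,year,value\nNY,2007,12\n"
print(read_table(tab_text) == read_table(comma_text))
```

The heuristic would misfire on a tab-delimited file whose first line contains many commas inside field values, so a real converter would probably want the delimiter recorded per dataset instead.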
Missing Headers
Symptoms
- Datasets in this category consist of CSV file(s) that contain only data. The files are missing the first line, which should supply a name for each field, so the field names have to be hunted down in some other location.
- Otherwise, the CSV is formatted properly.
Examples
- All the Medicare datasets are like this.
- Look at Dataset 745 (Skilled Nursing Facility Medicare Cost Report Data, 1996).
Comments
- It's easiest for parsing if the field names are right in the data files. If they aren't, their location should be made obvious.
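Once the field names have been located, attaching them to a headerless file is straightforward. The following is a minimal sketch that assumes the names have been copied by hand from the dataset's documentation. The three names used here are hypothetical placeholders, not the actual Medicare cost report fields.

```python
import csv
import io

# Hypothetical field names; the real ones must come from the dataset's documentation.
FIELD_NAMES = ["provider_id", "fiscal_year", "total_cost"]


def read_headerless(text, field_names):
    """Attach externally supplied field names to CSV text that has no header line."""
    rows = []
    for number, values in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not values:
            continue  # skip blank lines
        if len(values) != len(field_names):
            raise ValueError(
                f"line {number}: expected {len(field_names)} fields, got {len(values)}"
            )
        rows.append(dict(zip(field_names, values)))
    return rows


sample = "12345,1996,1000\n67890,1996,2500\n"
for row in read_headerless(sample, FIELD_NAMES):
    print(row["provider_id"], row["total_cost"])
```

Failing loudly on a field-count mismatch is a design choice: a silently shifted column would produce wrong RDF that is hard to notice later.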
Text with Fixed-Width Columns
Symptoms
- Instead of being delimited by commas, each field occupies a fixed range of character positions, from a set starting column to a set ending column.
- The field names themselves do not appear in the data file, but are usually in some other related file, along with the field lengths and start columns.
Examples
- This appears to be a commonly used format for surveys. It is used, for example, by the library datasets (Dataset 353, Dataset 354, Dataset 398, Dataset 399, and Dataset 400).
- Dataset 449 is another example. The data is in puout07.txt, but the description of the fields and how to parse the file is provided separately at http://harvester.census.gov/imls/pubs/Publications/pupld07a.pdf .
- It is possible to parse and convert these into RDF, but it would be a more difficult process because some manual work would be involved in getting the field names and determining lengths of fields.
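The manual work is mostly a matter of transcribing a layout table. Once that exists, the parsing itself is a few slices. The sketch below assumes a layout given as (name, start column, length) triples with 1-based columns. The layout and the sample record are invented for illustration and are not taken from the documentation of any dataset named above.

```python
# Hypothetical layout: (field name, 1-based start column, field length).
LAYOUT = [
    ("state", 1, 2),
    ("library_id", 3, 6),
    ("name", 9, 20),
]


def parse_fixed_width(line, layout):
    """Slice one record into named fields and strip the padding around each value."""
    record = {}
    for name, start, length in layout:
        record[name] = line[start - 1:start - 1 + length].strip()
    return record


# Build a sample record that matches the hypothetical layout.
line = "NY" + "AB1234" + "Troy Public Library".ljust(20)
print(parse_fixed_width(line, LAYOUT))
```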
Generally Hard-to-Parse Text
Symptoms
- Some pages link to pages that have their data in a text file, but not in a format that would be easy to parse (usually they are designed for human readability).
Examples
- Dataset 37 is a good example of this.
- Some of the data on the BLS website (ftp://ftp.bls.gov/pub/ , especially in special.requests) is like this.
Dataset 37 - Note that there is another chart of data above this one
Bad CSV
Symptoms
- Some datasets are technically CSV, but opening them up reveals them to be more like XLS files exported as text, and not in the format needed for RDF translation.
Example
- Dataset 58: The top line does not contain field names, there is an extended header, etc.
- This should probably only be labeled as XLS, not CSV, to make it clear that it is not ready for machine consumption as-is.
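For files like this, one possible workaround is to separate the spreadsheet-style preamble from the actual table before translation. The following is a heuristic sketch only. It assumes that the real table rows share the most common count of non-empty cells and that title or note rows have fewer. The sample text is invented and is not the content of Dataset 58.

```python
import csv
import io
from collections import Counter


def split_preamble(text):
    """Separate leading title/notes rows from the tabular rows of a spreadsheet-style CSV."""
    rows = [row for row in csv.reader(io.StringIO(text)) if any(cell.strip() for cell in row)]
    if not rows:
        return [], []
    # Assumption: the real table uses the most common number of non-empty cells per row.
    widths = [sum(1 for cell in row if cell.strip()) for row in rows]
    table_width = Counter(widths).most_common(1)[0][0]
    start = widths.index(table_width)
    return rows[:start], rows[start:]


# Hypothetical spreadsheet export with a two-line extended header.
sample = (
    "Table 1. Energy use by region,,\n"
    "Source: hypothetical agency,,\n"
    "region,year,kwh\n"
    "NE,2005,900\n"
    "SW,2005,1200\n"
)
preamble, table = split_preamble(sample)
print(len(preamble), table[0])
```

A heuristic like this can guess wrong (for instance, on tables with many empty cells), so its output would still need a manual check per dataset.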
Unparsable Data
Symptoms
- Sometimes a BLS dataset actually sends you to a webpage where you have to further pick out which format or what particular data you want. This data is usually in an unhelpful format such as an HTML table or a PDF.
- Others lead to an interactive webpage, and there is no clear text source for the data. More on this in the interactive webpages section.
Examples
- This is the case with Dataset 349 and Dataset 324.
- Another BLS dataset (Dataset 348) leads to the Standard Occupational Classification site, which is a tree of links to brief job descriptions.
Comments
- It seems that it would make more sense to put a direct link to each dataset as a separate page on data.gov.
No Data
Symptoms
- A couple of "datasets" have little connection to data at all.
Examples
- Dataset 336 takes you to the homepage of the Occupational Outlook Handbook.
- Dataset 335 tells you where you can order data from the government.
Comments
- A strong hint that neither of these should have their own page on data.gov is that neither is under the "Databases & Tables" tab of the BLS website: The former is under "Publications", and the latter is under "Subject Areas".
Access Point Issues
Besides the format of data, the URLs given as the CSV access points sometimes introduced additional complications. What follows is the detailed categorization of the data access point issues we have encountered:
- Category:Alternative access point (total:22)
- Category:Bad access point (total:16)
- Category:Good access point (total:34)
- Category:Good access point with set of files (total:300)
- Category:Good access point with single file (total:67)
Datasets labeled as "Good access point" generally have no issues of this kind, but have some other formatting issue, as discussed above. "Good access point with single file" and "with set of files" refer to datasets which have good access points as well as good formatting (with the exception of Dataset 59).
Interactive Websites
Websites Which Produce CSV or CSV-compatible Data by Query
Symptoms
- Partial data can be obtained as a query result from the interactive website reached through the dataset's access point. The results are either in CSV or in some other easy-to-parse format, but the website produces this parsable text only after one fiddles with CGI arguments in the URL.
Examples
- Dataset 95 (National Water Quality Assessment Data Warehouse):
- The CSV link leads to the homepage of the organization.
- A couple of clicks on "retrieve data" followed by a click on a selected topic will bring up the interactive webpage.
- For the first topic, Animal Tissue, it is possible to get a CSV file from the URL:
http://infotrek.er.usgs.gov/nawqa_queries/tismaster/exportCSV? followed by all the various arguments.
- It is not possible to get all the data at once (by leaving all possible arguments blank) because the website will not export CSV files larger than 50,000 rows.
- Dataset 96 (USGS Water Data):
- This CSV link also leads to an organization's homepage.
- There is a long page describing how to conduct automated retrieval of data off the site.
- Getting real-time data using the URL is fairly easy and is documented.
- Getting a general dataset that is not aimed at a specific subset of the data, however, seems difficult, if not impossible.
Comments
- This method seems capable of retrieving only part of the data, not the entire set as we would like.
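In principle, a row limit such as the 50,000-row cap described above could be worked around by issuing several narrower queries and stitching the results together. The sketch below only illustrates that idea. The base URL is the one quoted in this section, but the argument name `state` is a hypothetical placeholder, since the site's real arguments are not listed here, and no request is actually sent.

```python
import csv
import io
from urllib.parse import urlencode

# Export URL quoted in the text above; the query arguments are NOT known here.
BASE_URL = "http://infotrek.er.usgs.gov/nawqa_queries/tismaster/exportCSV"


def build_query_urls(partition_values, param_name="state"):
    """Build one export URL per partition value ('state' is a hypothetical argument name)."""
    return [f"{BASE_URL}?{urlencode({param_name: value})}" for value in partition_values]


def merge_csv_chunks(chunks):
    """Concatenate CSV chunks that share a header row, keeping the header only once."""
    merged = []
    for text in chunks:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue
        if not merged:
            merged.append(rows[0])
        elif rows[0] != merged[0]:
            raise ValueError("chunks have different headers")
        merged.extend(rows[1:])
    return merged


urls = build_query_urls(["NY", "CA"])
# Stand-ins for downloaded responses; real chunks would be fetched from the URLs.
merged = merge_csv_chunks(["id,value\n1,a\n", "id,value\n2,b\n"])
print(urls[0])
print(merged)
```

Whether the partitions actually cover the whole dataset without overlap depends on the site's query semantics, which would have to be verified separately.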
Websites with Text Data on an Alternate Page
Symptoms
- The link given as the CSV access-point is an interactive webpage really meant for humans, not machines.
- From looking under "additional metadata", though, one discovers that all the data from the webpage is actually in CSV form in a directory elsewhere. As long as it's possible to find these directories, the data will still be easy to parse.
Examples
- Much of the data from the Bureau of Labor Statistics (for example, Dataset 330 and Dataset 317) is like this.
- Ideally, data.gov would link straight to these text files instead of to interactive webpages.
Cannot Find Data at Access Point
Symptoms
- Some datasets also lead to an interactive webpage, but they do not provide a way of accessing all the data in an easily-parsable format (like CSV).
- A bad access point also goes hand-in-hand with improper formatting, no data, etc.
Examples
Comments
- Unlike many other BLS datasets, these do not seem to have any other location where all the data is held.
Related Tags
- Category:Bad access point
- Category:Interactive webpage (formatting)
| Dc:creator | Sarah Magidson, Li Ding, and Dominic DiFranzo |
| Dc:description | Not all datasets in data.gov labeled as CSV/TXT are friendly to machine consumption. Here are our findings. |
| Dcterms:created | 2009/07/15 |
| Foaf:name | Current Issues in data.gov |

