Hiding data files behind a user interface

From Data-gov Wiki

Infobox (Issue Report)
  • name: Hiding data files behind a user interface

  • creator(s): Tim Lebo
  • modified: 2010-6-25

Current Issues in data.gov



Datasets reported as CSV/TXT

Description

For some datasets, the CSV/TXT access point leads to an interactive web application that produces data on request rather than to a data file. In some cases, after filling in fields or checking boxes, you can get the data in an easily parsable format. In others, the application outputs the data in a form meant for human readers, such as a table or a graph. Some applications of this second kind also keep the underlying data in a more machine-friendly form at an accessible location. As long as those files can be found, RDF conversion is still possible.

In general, data.gov datasets that are listed as having "raw" data, for example ones with CSV/TXT access points, should point to machine-readable data and not web applications.
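
One rough way to check whether an access point serves raw data is to look at the Content-Type the server reports. The sketch below is only illustrative: the function name classify_access_point and the list of types treated as machine-readable are assumptions, not anything data.gov defines, and a server may mislabel its responses.

```shell
#!/bin/sh
# Hypothetical helper: classify an access point by its Content-Type value.
# The set of "machine-readable" types below is an assumption for illustration.
classify_access_point() {
  case "$1" in
    text/html*|application/xhtml+xml*)
      echo "web page" ;;
    text/csv*|text/plain*|application/json*|application/xml*|text/xml*)
      echo "machine-readable" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage sketch (needs network access; "$url" stands for any access point URL):
# ctype=$(curl -sIL "$url" | tr -d '\r' |
#   awk -F': ' 'tolower($1) == "content-type" {v = $2} END {print v}')
# classify_access_point "$ctype"
```

A "web page" result for a dataset listed as CSV/TXT would suggest the kind of problem described here, though it would still need a manual look to confirm.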

Examples

  • Dataset 95 will take you to a data-retrieval site which requires you to follow several links, fill in some fields, and submit the request before returning a CSV file.
  • Many of the Department of Labor's datasets are like dataset 320 (Consumer Price Index), which takes you to an interactive site, which in turn produces a pretty table of data that you can download as XLS. The actual data files for this dataset are located on an FTP server.
  • While many Department of Labor datasets have data files in a location that is not too difficult to find, datasets 339 through 346 do not seem to have any. Without them, we can't do anything interesting with the data beyond what the BLS website itself offers.

Related Tags

SMWiki: Category:Bad access point, Category:Alternative access point, Category:Csv as query output-convertible files, Category:Csv as query output, Category:Interactive webpage, Category:No csv
Google Spreadsheet: web application, web form (under direct format) - Category:Direct format web application, Category:Direct format web form


Other

The data.gov page for their "Dataset 990" is at http://www.data.gov/tools/990/

This links to http://federalstudentaid.ed.gov/datacenter/index.html, which in turn links to http://federalstudentaid.ed.gov/datacenter/programmatic.html

This page, three clicks and some visual searching deep, finally offers the files. We can select a file from one of the four drop-down menus and click GO. The browser will then download that one file.

It is not apparent, but the "programmatic" page offers 49 files. At an estimated 15 seconds of UI interaction per file, retrieving all of them would take more than 12 minutes (49 × 15 s = 735 s). Moreover, a person could easily miss a few of the files in the monotony of the task, and the estimate does not include creating new folders to mirror the organization conveyed by the web page. Instead, the file links can be extracted from the saved page source and turned into a script of download commands:

# Pull the relative file paths out of the saved page and emit one curl command per file.
grep datacenter programmatic.html | sed -e 's/^.*\.\.\///' -e 's/">.*$//' |
awk '{printf("curl -O %s%s\n","http://federalstudentaid.ed.gov/",$0)}' > programmatic.html.sh
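
To make the effect of each filter concrete, the sketch below runs the same steps on a single line. The markup of programmatic.html is not reproduced in this report, so the sample line (and the file name Example.xls) is an invented stand-in with the shape the pipeline appears to expect; the function name to_curl_command is likewise hypothetical.

```shell
#!/bin/sh
# Hypothetical input: an invented <option> line containing a
# "../datacenter/..." path that ends at the characters '">'.
sample='<option value="../datacenter/library/Example.xls">Example</option>'

# Same filters as the pipeline above, reading HTML lines on stdin:
#   grep  keeps lines that mention "datacenter"
#   sed 1 drops everything up to and including the last "../"
#   sed 2 drops the closing quote, the bracket, and the rest of the line
#   awk   prefixes the site root and wraps the path in a curl command
to_curl_command() {
  grep datacenter |
    sed 's/^.*\.\.\///' |
    sed 's/">.*$//' |
    awk '{printf("curl -O %s%s\n","http://federalstudentaid.ed.gov/",$0)}'
}

printf '%s\n' "$sample" | to_curl_command
# prints: curl -O http://federalstudentaid.ed.gov/datacenter/library/Example.xls
```

Lines that mention "datacenter" but do not follow this shape would pass through the filters partly or wholly unchanged, which is presumably why the generated script still needs manual editing.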

After a couple of minutes of manually editing the generated shell script, we can finally run it:

source programmatic.html.sh
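
Since the concern with the manual route is missing files, it may be worth confirming, before or after running the script, that it lists as many downloads as the page offers. A minimal sketch, assuming each download is a single line beginning with "curl -O " as the awk step emits; the function name count_downloads is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the download commands in a generated script.
# Assumes one "curl -O <url>" command per line.
count_downloads() {
  grep -c '^curl -O ' "$1"
}

# Usage sketch: compare against the 49 files the page is said to offer.
# [ "$(count_downloads programmatic.html.sh)" -eq 49 ] ||
#   echo "unexpected number of downloads" >&2
```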

A few more minutes go to creating the directory structure, based on filename patterns identified by downloading a sample file through the browser again.
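
The folder-creation step could also be scripted once a filename pattern is known. The actual grouping rule used for these files is not stated here, so the sketch below assumes a purely hypothetical one: each file goes into a folder named after the part of its name before the first underscore. The function name sort_by_prefix and the example filenames are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: move each file in directory "$1" whose name contains
# an underscore into a subfolder named after the text before the first "_".
# The prefix rule is an assumption, not the rule used for Dataset 990.
sort_by_prefix() {
  for f in "$1"/*_*; do
    [ -f "$f" ] || continue          # skip directories and unmatched globs
    name=$(basename "$f")
    prefix=${name%%_*}               # text before the first underscore
    [ -n "$prefix" ] || continue     # skip names that start with "_"
    mkdir -p "$1/$prefix"
    mv "$f" "$1/$prefix/"
  done
}

# Usage sketch: sort_by_prefix .
```

With files such as loans_2009.xls and grants_2009.xls (invented names), this would create loans/ and grants/ and leave files without an underscore where they are.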

The URLs for all 49 programmatic files are listed on TWC's Dataset 990 page.
