Difficult-to-parse text data files

From Data-gov Wiki


Current Issues in data.gov



Description

Some datasets are distributed as files that are technically CSV (comma-delimited), but their contents are not laid out as a single table that is easy to parse. Some contain multiple tables in one file, either stacked one on top of another or placed side by side. Some have little data and lots of prose. Some list parameters along both the first row and the first column, while others spread the parameter descriptions over multiple rows. All of these layouts make it difficult for us to do a clean conversion to RDF.

Ideally, the first row of a CSV file should list all the parameters, and every following row should contain only the data corresponding to those parameters.
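
As a rough illustration of that ideal layout, the following Python sketch checks whether a file's text looks cleanly tabular. The function name is_cleanly_tabular and the sample strings are hypothetical, not part of any data.gov tooling, and the check assumes that "cleanly tabular" means a header row of unique, non-empty names followed by rows of the same width.

```python
import csv
import io


def is_cleanly_tabular(text):
    """Return True if the first row names every column and all later rows match its width.

    Hypothetical helper: it only checks layout, not whether the values make sense.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return False
    header = [name.strip() for name in rows[0]]
    # Every column needs a non-empty, unique name.
    if any(not name for name in header) or len(set(header)) != len(header):
        return False
    # Every data row must have exactly one field per header column.
    # A blank line parses as an empty row, so stacked tables fail here.
    return all(len(row) == len(header) for row in rows[1:])


# Made-up sample inputs for illustration only.
clean = "year,amount\n2008,10\n2009,12\n"
duplicate_names = "YEAR,YEAR\n2008,2008\n"
stacked = "a,b\n1,2\n\nc,d\n3,4\n"
```

A file that fails this check is not necessarily unusable, but it probably needs manual restructuring before an automated conversion.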

Examples

  • Dataset 1471 has metadata along the first column and along various rows. Multiple tables are stacked one atop another, and much of the file is taken up with long descriptions.
  • Dataset 1521 contains multiple tables one below another.
  • Dataset 1961 uses multiple rows to list parameters. For example, the row just above the data has two columns labeled "YEAR", and one must look at the row above to see that one is for a table of values given in millions of dollars, and the other for a table giving values in percent of GDP. These could have been combined into one, much more easily parsable table.
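
Two of the problems in these examples, stacked tables and parameters spread over two header rows, can sometimes be worked around before conversion. The Python sketch below is one possible approach, not the method used for these datasets. It assumes that stacked tables are separated by a fully empty row and that a label in the upper header row applies to every column until the next label appears. The function names and the sample header values are hypothetical and do not reproduce the actual contents of any dataset.

```python
import csv
import io


def split_stacked_tables(text):
    """Split CSV text into separate tables wherever a fully empty row occurs."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            # An empty row closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def flatten_header(group_row, name_row):
    """Merge a two-row header into one row of column names."""
    names, group = [], ""
    for i, name in enumerate(name_row):
        cell = group_row[i].strip() if i < len(group_row) else ""
        if cell:
            group = cell  # a group label applies until the next label appears
        names.append(f"{group} {name.strip()}".strip())
    return names


# Made-up header rows: two "YEAR" columns told apart only by the row above.
group_row = ["Millions of dollars", "", "Percent of GDP", ""]
name_row = ["YEAR", "VALUE", "YEAR", "VALUE"]
flat = flatten_header(group_row, name_row)
```

With the made-up rows above, flat holds four distinct names such as "Millions of dollars YEAR" and "Percent of GDP YEAR", so each column can be mapped to its own property.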

Related Tags

SMWiki: Category:Bad csv
Google Spreadsheet: machine-unfriendly (under layout) - Category:Datafile layout machine-unfriendly

Facts about Difficult-to-parse text data files

  • Dcterms:created: 25 June 2010
  • Dcterms:creator: Sarah Magidson
  • Dcterms:modified: 2010-6-29
  • Foaf:name: Difficult-to-parse text data files
  • Related tag: Bad csv, Datafile layout machine-unfriendly
  • Skos:altLabel: Difficult-to-parse text data files, difficult-to-parse text data files, DIFFICULT-TO-PARSE TEXT DATA FILES