TWC Data-gov Vocabulary Proposal

From Data-gov Wiki

  • name: TWC Data-gov Vocabulary Proposal

  • description: This document was created to help data.gov developers understand and use a small common vocabulary that describes government datasets.
  • creator(s): Li Ding
  • created: April 22, 2010
  • modified: April 23, 2010


Overview

This proposal aims to help build vocabularies for data.gov efforts, with a special focus on dataset publishing. It covers the following:

  • principles for vocabulary design
  • a small set of vocabularies used in data.gov dataset publishing with examples
  • demos for consuming dataset metadata and lessons learned


Vocabulary Design Principles

modular vocabulary with minimal core
  • keep the core vocabulary small and stable; include only a small set of frequently used (or required) terms
  • allow free extensions: anyone can contribute more vocabulary, which should be connected to the core ontology. An extension can be promoted to core status later.
choice of term
  • make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
  • make the expected range of the term clear, e.g. should they use "New York" or "dbpedia:New_York" for spatial coverage? Does it require a controlled vocabulary?
  • make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search, or facet browsing?
  • try to reuse terms from existing popular vocabularies
  • identify the required, recommended, and optional terms
best practices for actual usage
  • we certainly want the metadata to be part of linked data, but that is not the end goal. We would like to see the linked data actually being used by users who don't know much about the semantic web.
  • we should make the vocabulary available in different formats, e.g. RDFa, Microformats, Atom, JSON, XML Schema, OData
  • we should build use cases, tools, and demos that exhibit the use of the vocabulary to promote adoption
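
The "expected range" question under choice of term can be made concrete with a small check. The following is a minimal sketch, assuming a hypothetical rule that distinguishes free-text literals from absolute http(s) URIs; the function name `classify_value` and the DBpedia URI are illustrative and not part of the proposal.

```python
from urllib.parse import urlparse


def classify_value(value: str) -> str:
    """Return "uri" if the value looks like an absolute http(s) URI, else "literal"."""
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "uri"
    return "literal"


# Two ways a curator might state the same spatial coverage.
print(classify_value("New York"))                              # literal
print(classify_value("http://dbpedia.org/resource/New_York"))  # uri
```

A tool that consumes the metadata could use a check like this to decide whether a value can be followed as a link or only displayed as text.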

Data-gov Vocabulary

Our current vocabulary proposal is limited to RDF and RDFa. Although we are already using some vocabularies, we are open to other vocabularies (e.g. DCAT, VoID, SCOVO) and non-RDF approaches (microformats, XML, UML).

GOV Dataset Metadata

Core Metadata


some more potential terms

LOD Metadata

More Metadata

see Appendix A. Mapping from data.gov catalog terms to RDF terms

Examples

USGS earthquake dataset

GOV Demo Metadata

Core Metadata

suggested

More Metadata

see Appendix A. Mapping from data.gov catalog terms to RDF terms

Examples

USGS earthquake dataset

Additional Thinking

RDFa publishing

We are checking how major search engines can consume our metadata.

Example RDF directly converted from RDF data:

Issues

  • URI naming: we can get the same (or very similar) RDF from two different URLs (one from the RDF/XML dump and one from the HTML page embedding RDFa)
  • support for RDFa from major Web players is quite limited.
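
The URI naming issue can be illustrated with a short sketch. It assumes, hypothetically, that both documents describe the dataset with a document-relative subject (an empty `about`), which resolves to whichever URL was retrieved. All URLs below are placeholders under example.org, and `extract` stands in for a real RDF/XML or RDFa parser.

```python
from urllib.parse import urljoin

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical retrieval URLs for the same dataset description.
RDFXML_URL = "http://example.org/dataset/earthquake.rdf"
RDFA_URL = "http://example.org/dataset/earthquake.html"
# Hypothetical stable identifier shared by both documents.
DATASET_URI = "http://example.org/id/dataset/earthquake"


def extract(document_url, about):
    """Stand-in for a parser: resolve the subject against the document URL."""
    subject = urljoin(document_url, about)
    return {(subject, FOAF_NAME, "USGS earthquake dataset")}


# Relative subject: each document names a different resource.
same_with_relative = extract(RDFXML_URL, "") == extract(RDFA_URL, "")
# Absolute subject: both documents yield identical triples.
same_with_absolute = extract(RDFXML_URL, DATASET_URI) == extract(RDFA_URL, DATASET_URI)

print(same_with_relative)  # False
print(same_with_absolute)  # True
```

Under these assumptions, naming the dataset with one absolute URI in both serializations is one way to keep the two sources consistent.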

Linked Government Data Publishing

More Government Metadata Data Proposals

Change of dataset

Browse and Search Datasets

Lessons Learned

Appendix A. Mapping from data.gov catalog terms to RDF terms

source: http://www.data.gov/glossary

term | source | definition | local property | mapped property | core concept
title | DATA-GOV | title of dataset | property:92/title | property:foaf:name | *
url | DATA-GOV | url of dataset | property:92/url | property:foaf:homepage | *
description | DATA-GOV | description of dataset | property:92/description | property:dcterms:description | *
agency | DATA-GOV | agency that publishes the dataset | property:92/agency | property:dcterms:publisher (better use property:dcterms:creator?) | *
subagency | DATA-GOV | subagency (typically bureau) that publishes the dataset | property:92/subagency | property:dcterms:publisher (better use property:dcterms:creator?) | *
Date Released | Data.gov | The date that the dataset was originated. | property:92/date released | property:dcterms:created | *
Date Updated | Data.gov | The date that the dataset was last modified. | property:92/date updated | property:dcterms:modified | *
Time Period | Data.gov | Date or time interval(s) for which the dataset provides data. | property:92/time period | |
Frequency | Data.gov | Frequency of data collection (one-time, annual, hourly, etc.). | property:92/frequency | |
Data.gov Data Category Type | Data.gov | The category designation for the entry as either an instantly downloadable raw data file or tool (i.e., data extraction and mining or widget). | | property:rdf:type | *
Specialized Data Category Designation | Data.gov | The type of dataset (e.g., administrative, geospatial, research, or statistical). Some types of data have additional metadata requirements. | property:92/specialized data category designation | property:dcterms:type | *
Keywords | Dublin Core | Used to describe the content of the resource. The element may use controlled vocabularies or words or phrases that describe the subject or content of the resource. | property:92/keywords | property:dcterms:subject |
Unique ID | Data.gov | An unambiguous reference to the resource within a given context. Dublin Core defines best practice for this field as identifying the resource by a unique number (e.g., ISBN, ISSN, URL/URI, etc.). The Unique ID is intended for Data.gov internal reference only. | property:92/unique id | property:dcterms:identifier property:dgtwc id ? | *
Citation | FGDC-STD-001-1998 | The recommended reference citation to be used to cite the dataset. | property:92/citation | |
Agency Program Page | Data.gov | The URL link (and name, if applicable) to the home page of the agency or program that is the dataset owner. | property:92/agency program page | |
Agency Data Series Page | Data.gov | The URL link (and name, if applicable) to the agency web page where the link to the dataset is located. This is different from the URL for the actual dataset. | property:92/agency data series page | property:dcterms:source |
Unit of Analysis | Data.gov | The level of granularity or aggregation which is represented by a single record or observation in a dataset (e.g. person, household, production workers, establishment, city, country). | property:92/unit of analysis | |
Granularity | Dublin Core | The level of detail at which an information object or resource is viewed or described. | property:92/granularity | |
Geographic Coverage | Dublin Core | Used to designate the extent or scope of the content of the resource and typically includes spatial location (a place name or geographic co-ordinates). | property:92/geographic coverage | property:dcterms:spatial | *
Collection Mode | Data.gov | Identifies the modality of the instrument used to gather data for the dataset (e.g., phone/paper, phone/computer, person/paper, person/computer, web, fax, other). | property:92/collection mode | |
Data Collection Instrument | Data.gov | Identifies the specific instrument or tool (e.g., form, survey questionnaire) used to collect the data in the dataset corresponding to the collection mode. | property:92/data collection instrument | |
Data Dictionary Variable List * | Federal Enterprise Architecture: Data Reference Model | A database used for data that refers to the use and structure of other data; that is, a database for the storage of metadata [ANSI X3.172-1990]. | property:92/data dictionary variable list | |
Data Quality | OMB Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies, 67 FR 8452 | "Quality" is an encompassing term comprising objectivity, utility, and integrity. Sometimes these terms are referred to collectively as "quality." Any agency contributing a dataset to Data.gov must certify that the dataset conforms to the agency's information quality guidelines. | property:92/data quality | |
Privacy and Confidentiality | 44 U.S.C. 3542 | Preserving authorized restrictions on information access and disclosure, including means for protecting personally identifiable and proprietary information. Any agency contributing a dataset to Data.gov must certify that dissemination of the data is consistent with the agency's responsibilities under the Privacy Act and, if applicable, the Confidential Information Protection and Statistical Efficiency Act of 2002. | property:92/privacy and confidentiality | |
Technical Documentation | Data.gov | Additional documentation that describes a dataset and its intended use. | property:92/technical documentation | property:rdfs:isDefinedBy |
Additional Metadata | Data.gov | Additional metadata that may be available for a dataset. Such metadata may conform to an existing standard (e.g., FGDC Metadata Standard). | property:92/additional metadata | |
Statistical Methodology | Data.gov | A description of the overall approach used for statistical design, sampling, data collection, statistical analysis, and estimation. | property:92/statistical methodology | |
Sampling | Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978 | The procedure used to define the total number of statistical observations (i.e., samples) from an overall population size. | property:92/sampling | |
Estimation | Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978 | The approach used to compute statistical quantities based on the observations (e.g., mean, mode, standard deviation). | property:92/estimation | |
Weighting | Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978 | An approach for applying a scaling factor to observations from one or more combined data series in order to normalize or otherwise adjust the observations. | property:92/weighting | |
Disclosure avoidance | Federal Committee on Statistical Methodology | Techniques (e.g., aggregation) that are applied to statistical data to ensure published data cannot be used to attribute a specific value to an individual. | property:92/disclosure avoidance | |
Questionnaire design | Data.gov | A structured approach used to develop a questionnaire or survey that describes the structure and content of the survey instrument and the approach intended to be used for analyzing the survey results. | property:92/questionnaire design | |
Series breaks | Data.gov | A discrete event or changes to the sample, the population, their environment, or the survey instrument occurring within a data collection that may affect statistical estimates or inferences. | property:92/series breaks | |
Non-response adjustment | Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978 | The approach for adjusting observations to account for missing or incomplete data within a series. | | |
Seasonal adjustment | Wikipedia | A statistical method for removing the effects of seasonal variation of a time series that is used when analyzing non-seasonal trends. | property:92/seasonal adjustment | |
Statistical Characteristics (CV, CI, variance, etc.) * | Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978 | Summary of statistical characteristics that reflect the overall accuracy and correlation of a statistical data sample relative to the overall population including coefficients of variation, confidence intervals, and variance. | | |
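
The mapping in this appendix could be applied to a catalog record roughly as follows. This is a minimal sketch rather than the wiki's actual conversion code: it covers only a subset of the rows, treats the "*" in the core concept column as "expected to be present" (an assumption), and uses a placeholder dataset URI and placeholder record values. The `dcterms` and `foaf` namespace URIs are the standard ones.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Catalog term -> (mapped property URI, marked "*" as core concept).
# Only a subset of the appendix rows is shown.
MAPPING = {
    "title": (FOAF + "name", True),
    "url": (FOAF + "homepage", True),
    "description": (DCTERMS + "description", True),
    "agency": (DCTERMS + "publisher", True),
    "Date Released": (DCTERMS + "created", True),
    "Date Updated": (DCTERMS + "modified", True),
    "Keywords": (DCTERMS + "subject", False),
    "Geographic Coverage": (DCTERMS + "spatial", True),
}


def to_triples(dataset_uri, record):
    """Emit (subject, predicate, object) tuples for terms with a mapped property."""
    return [
        (dataset_uri, MAPPING[term][0], value)
        for term, value in record.items()
        if term in MAPPING
    ]


def missing_core(record):
    """List core terms from MAPPING that the record does not provide."""
    return sorted(
        term for term, (_, core) in MAPPING.items() if core and term not in record
    )


# Hypothetical record; "Citation" has no mapped property and is skipped.
record = {
    "title": "USGS earthquake dataset",
    "Keywords": "earthquake",
    "Citation": "(citation text)",
}
dataset_uri = "http://example.org/id/dataset/earthquake"  # placeholder URI

for triple in to_triples(dataset_uri, record):
    print(triple)
print(missing_core(record))
```

Terms without a mapped property, such as Citation, would still need either a local property or a new mapping before they appear in the RDF output.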

