TWC Data-gov Vocabulary Proposal
From Data-gov Wiki
Overview
This proposal aims to help build vocabularies for data.gov efforts, with a special focus on dataset publishing. It covers the following:
- principles for vocabulary design
- a small set of vocabularies used in data.gov dataset publishing with examples
- demos for consuming dataset metadata and lessons learned
Vocabulary Design Principles
- modular vocabulary with minimal core
- keep the core vocabulary small and stable; include only a small set of frequently used (or required) terms
- allow free extensions: anyone can contribute additional vocabulary, provided it is connected to the core ontology. An extension can be promoted to core status later.
- choice of term
- make it easy for curators to produce metadata using the term, e.g. do they need to specify data quality?
- make the expected range of the term clear, e.g. should curators use "New York" or "dbpedia:New_York" for spatial coverage? Does the term require a controlled vocabulary?
- make the expected use of the term clear, e.g. can it be displayed in a rich snippet? Can it be used in SPARQL queries, search, or faceted browsing?
- try to reuse a term from an existing popular vocabulary
- identify the required, recommended, and optional terms
- best practices for actual usage
- we certainly want the metadata to be part of linked data, but that is not the end goal. We would like to see the linked data actually used by people who don't know much about the semantic web.
- we should make the vocabulary available in different formats, e.g. RDFa, Microformat, ATOM, JSON, XML Schema, OData
- we should build use cases, tools and demos that exhibit the use of the vocabulary to promote adoption
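The "choice of term" principles above ask for each term to be marked as required, recommended, or optional. A minimal Python sketch of how a curator tool might check a record against such levels follows. The level assigned to each term here is a hypothetical illustration, not a decision made by this proposal, and `check_metadata` and `ex:quality` are made-up names.

```python
# Hypothetical requirement levels; the proposal does not fix this assignment.
TERM_LEVELS = {
    "foaf:name": "required",
    "foaf:homepage": "required",
    "dcterms:description": "recommended",
    "dcterms:subject": "recommended",
    "dcterms:spatial": "optional",
}


def check_metadata(record):
    """Return (missing_required, missing_recommended, unknown_terms) for one record."""
    # A term counts as present only if it has a non-empty value.
    present = {term for term, value in record.items() if value}
    missing_required = sorted(
        t for t, level in TERM_LEVELS.items()
        if level == "required" and t not in present
    )
    missing_recommended = sorted(
        t for t, level in TERM_LEVELS.items()
        if level == "recommended" and t not in present
    )
    # Terms outside the core are allowed (free extensions) but reported.
    unknown = sorted(t for t in present if t not in TERM_LEVELS)
    return missing_required, missing_recommended, unknown


record = {
    "foaf:name": "Example dataset",
    "dcterms:subject": "earthquake",
    "ex:quality": "unverified",  # hypothetical extension term
}
missing_required, missing_recommended, unknown = check_metadata(record)
```

Unknown terms are reported rather than rejected, which matches the principle that extensions outside the core are allowed.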
Data-gov Vocabulary
Our current vocabulary proposal is limited to RDF and RDFa. Although we already use some vocabularies, we are open to other vocabularies (e.g. DCAT, VOID, SCOVO) and non-RDF approaches (microformat, XML, UML).
GOV Dataset Metadata
Core Metadata
- title: property:foaf:name
- url: property:foaf:homepage
- description: property:dcterms:description
- keyword: property:dcterms:subject
- agency: property:dcterms:publisher -> property:dcterms:creator
- identifier: property:dcterms:identifier
- type: Category:dgtwc:Dataset --> void:Dataset
some more potential terms
- geographic coverage: property:dcterms:spatial, state, country
- temporal coverage: starting year, end year
- date released: property:dcterms:issued
- metadata explaining the dataset
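To make the core dataset terms concrete, here is a minimal Python sketch that serializes one dataset description as Turtle using only string formatting. It assumes the type is `void:Dataset` and the agency is attached with `dcterms:publisher` as a URI; the proposal leaves both choices open. The function names and all `example.org` URIs are hypothetical.

```python
# Well-known namespace URIs for the vocabularies reused by the core terms.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "void": "http://rdfs.org/ns/void#",
}


def turtle_literal(text):
    """Quote a plain string as a Turtle literal, escaping \\, \" and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def dataset_to_turtle(uri, title, homepage, description, keywords,
                      agency_uri, identifier):
    """Build a Turtle description of one dataset from the core metadata terms.

    URIs are assumed to be valid IRIs already; no IRI escaping is done here.
    """
    lines = ["@prefix %s: <%s> ." % (p, ns) for p, ns in PREFIXES.items()]
    lines.append("")
    statements = [
        "a void:Dataset",                                  # type
        "foaf:name " + turtle_literal(title),              # title
        "foaf:homepage <%s>" % homepage,                   # url
        "dcterms:description " + turtle_literal(description),
    ]
    statements += ["dcterms:subject " + turtle_literal(k) for k in keywords]
    statements.append("dcterms:publisher <%s>" % agency_uri)  # agency
    statements.append("dcterms:identifier " + turtle_literal(identifier))
    lines.append("<%s>" % uri)
    lines.append("    " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


ttl = dataset_to_turtle(
    uri="http://example.org/dataset/34",            # hypothetical URI
    title="Example earthquake dataset",
    homepage="http://example.org/earthquakes",      # hypothetical URL
    description='Recent "significant" earthquakes.',
    keywords=["earthquake", "geology"],
    agency_uri="http://example.org/agency/usgs",    # hypothetical URI
    identifier="34",
)
print(ttl)
```

Keywords are emitted as plain literals here. Whether `dcterms:subject` should instead point to a controlled vocabulary is one of the open range questions raised in the design principles.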
LOD Metadata
- number of triples: property:dgtwc:number_of_triples
- RDF Raw Data: property:dgtwc:complete_data --> raw_data
- RDF Enhancement Data: property:dgtwc:more_data --> enhancement_data
- SPARQL Endpoint: property:dgtwc:sparql_endpoint
- LOD entry point: property:dgtwc:linked_data_entry_point
More Metadata
see #Appendix A. Mapping from data.gov catalog terms to RDF terms
Examples
- USGS earthquake dataset
- wiki page Dataset 34 (embedding RDFa, see parsed rdf )
- RDF/XML: http://data-gov.tw.rpi.edu/vocab/Dataset_34
GOV Demo Metadata
Core Metadata
- title: property:foaf:name
- url: property:foaf:homepage
- image: property:foaf:depiction
- description: property:dcterms:description
- keyword: property:dcterms:subject
- creator: property:dcterms:creator
- created date: property:dcterms:created
- type: Category:Demo
- suggested
- technology used
- Live demo
- video demo
- sample sparql query
- modified -> property:dcterms:issued property:dcterms:modified
More Metadata
see #Appendix A. Mapping from data.gov catalog terms to RDF terms
Examples
- USGS earthquake dataset
- wiki page Demo:_Linking_Wildland_Fire_and_Government_Budget (embedding RDFa, here is parsed RDF/XML)
- RDF/XML: http://data-gov.tw.rpi.edu/vocab/Demo:_Linking_Wildland_Fire_and_Government_Budget
Additional Thinking
RDFa publishing
We are checking how major search engines can consume our metadata.
- Yahoo SearchMonkey: we started with "document" http://developer.search.yahoo.com/help/objects/documents and ended up with "news" http://developer.search.yahoo.com/help/objects/news as the properties fit better.
Example RDF directly converted from RDF data:
- wiki page Dataset 34 (embedding RDFa, see parsed rdf )
- RDF/XML: http://data-gov.tw.rpi.edu/vocab/Dataset_34
Issues
- URI naming - we can get the same (or very similar) RDF from two different URLs (one from the RDF/XML dump and one from the HTML page embedding RDFa)
- support for RDFa from major Web players is quite limited.
Linked Government Data Publishing
- URI design for RDF conversion of CSV-based data - a best practice proposal for naming URIs in RDF converted from CSV data. It also discusses publishing Linked Government Data.
More Government Metadata Data Proposals
- Sunlight Lab vocabulary proposal http://wiki.sunlightlabs.com/Government_Data_Catalog_Guidelines
- Leigh Dodds: http://www.ldodds.com/blog/2010/04/rdf-dataset-notifications/
- DERI's proposal: http://vocab.deri.ie/dcat-overview
Change of dataset
- Demo:_Tracking_Changes_of_data.gov_Catalog_via_RSS
- PML OWL ontology: http://inference-web.org/2.0/pml-owl.owl
- also see change set vocabulary: http://vocab.org/changeset/schema.html
Browse and Search Datasets
Lessons Learned
Appendix A. Mapping from data.gov catalog terms to RDF terms
source: http://www.data.gov/glossary
| term | source | definition | local property | mapped property | core concept |
|---|---|---|---|---|---|
| title | (source: DATA-GOV) | title of dataset | property:92/title | property:foaf:name | * |
| url | (source: DATA-GOV) | url of dataset | property:92/url | property:foaf:homepage | * |
| description | (source: DATA-GOV) | description of dataset | property:92/description | property:dcterms:description | * |
| agency | (source: DATA-GOV) | agency who publishes the dataset | property:92/agency | property:dcterms:publisher (better use property:dcterms:creator?) | * |
| subagency | (source: DATA-GOV) | subagency (typically bureau) who publishes the dataset | property:92/subagency | property:dcterms:publisher (better use property:dcterms:creator?) | * |
| Date Released | (source: Data.gov) | The date that the dataset was originated. | property:92/date released | property:dcterms:created | * |
| Date Updated | (source: Data.gov) | The date that the dataset was last modified. | property:92/date updated | property:dcterms:modified | * |
| Time Period | (source: Data.gov) | Date or time interval(s) for which the dataset provides data. | property:92/time period | | |
| Frequency | (source: Data.gov) | Frequency of data collection (one-time, annual, hourly, etc.). | property:92/frequency | | |
| Data.gov Data Category Type | (source: Data.gov) | The category designation for the entry as either an instantly downloadable raw data file or tool (i.e., data extraction and mining or widget). | | property:rdf:type | * |
| Specialized Data Category Designation | (source: Data.gov) | The type of dataset (e.g., administrative, geospatial, research, or statistical). Some types of data have additional metadata requirements. | property:92/specialized data category designation | property:dcterms:type | * |
| Keywords | (source: Dublin Core) | Used to describe the content of the resource. The element may use controlled vocabularies or words or phrases that describe the subject or content of the resource. | property:92/keywords | property:dcterms:subject | |
| Unique ID | (source: Data.gov) | An unambiguous reference to the resource within a given context. Dublin Core defines best practice for this field as identifying the resource by a unique number (e.g., ISBN, ISSN, URL/URI, etc.). The Unique ID is intended for Data.gov internal reference only. | property:92/unique id | property:dcterms:identifier property:dgtwc id ? | * |
| Citation | (source: FGDC-STD-001-1998) | The recommended reference citation to be used to cite the dataset. | property:92/citation | | |
| Agency Program Page | (source: Data.gov) | The URL link (and name, if applicable) to the home page of the agency or program that is the dataset owner. | property:92/agency program page | | |
| Agency Data Series Page | (source: Data.gov) | The URL link (and name, if applicable) to the agency web page where the link to the dataset is located. This is different from the URL for the actual dataset. | property:92/agency data series page | property:dcterms:source | |
| Unit of Analysis | (source: Data.gov) | The level of granularity or aggregation which is represented by a single record or observation in a dataset (e.g. person, household, production workers, establishment, city, country). | property:92/unit of analysis | | |
| Granularity | (source: Dublin Core) | The level of detail at which an information object or resource is viewed or described. | property:92/granularity | | |
| Geographic Coverage | (source: Dublin Core) | Used to designate the extent or scope of the content of the resource and typically includes spatial location (a place name or geographic co-ordinates). | property:92/geographic coverage | property:dcterms:spatial | * |
| Collection Mode | (source: Data.gov) | Identifies the modality of the instrument used to gather data for the dataset (e.g., phone/paper, phone/computer, person/paper, person/computer, web, fax, other). | property:92/collection mode | | |
| Data Collection Instrument | (source: Data.gov) | Identifies the specific instrument or tool (e.g., form, survey questionnaire) used to collect the data in the dataset corresponding to the collection mode. | property:92/data collection instrument | | |
| Data Dictionary Variable List | (source: Federal Enterprise Architecture: Data Reference Model) | A database used for data that refers to the use and structure of other data; that is, a database for the storage of metadata [ANSI X3.172-1990]. | property:92/data dictionary variable list | | |
| Data Quality | (source: OMB Guidelines for Ensuring and Maximizing the Quality, Objectivity, Utility, and Integrity of Information Disseminated by Federal Agencies, 67 FR 8452) | "Quality" is an encompassing term comprising objectivity, utility, and integrity. Sometimes these terms are referred to collectively as "quality." Any agency contributing a dataset to Data.gov must certify that the dataset conforms to the agency's information quality guidelines. | property:92/data quality | | |
| Privacy and Confidentiality | (source: 44 U.S.C. 3542) | Preserving authorized restrictions on information access and disclosure, including means for protecting personally identifiable and proprietary information. Any agency contributing a dataset to Data.gov must certify that dissemination of the data is consistent with the agency's responsibilities under the Privacy Act and, if applicable, the Confidential Information Protection and Statistical Efficiency Act of 2002. | property:92/privacy and confidentiality | | |
| Technical Documentation | (source: Data.gov) | Additional documentation that describes a dataset and its intended use. | property:92/technical documentation | property:rdfs:isDefinedBy | |
| Additional Metadata | (source: Data.gov) | Additional metadata that may be available for a dataset. Such metadata may conform to an existing standard (e.g., FGDC Metadata Standard). | property:92/additional metadata | | |
| Statistical Methodology | (source: Data.gov) | A description of the overall approach used for statistical design, sampling, data collection, statistical analysis, and estimation. | property:92/statistical methodology | | |
| Sampling | (source: Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978) | The procedure used to define the total number of statistical observations (i.e., samples) from an overall population size. | property:92/sampling | | |
| Estimation | (source: Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978) | The approach used to compute statistical quantities based on the observations (e.g., mean, mode, standard deviation). | property:92/estimation | | |
| Weighting | (source: Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978) | An approach for applying a scaling factor to observations from one or more combined data series in order to normalize or otherwise adjust the observations. | property:92/weighting | | |
| Disclosure avoidance | (source: Federal Committee on Statistical Methodology) | Techniques (e.g., aggregation) that are applied to statistical data to ensure published data cannot be used to attribute a specific value to an individual. | property:92/disclosure avoidance | | |
| Questionnaire design | (source: Data.gov) | A structured approach used to develop a questionnaire or survey that describes the structure and content of the survey instrument and the approach intended to be used for analyzing the survey results. | property:92/questionnaire design | | |
| Series breaks | (source: Data.gov) | A discrete event or changes to the sample, the population, their environment, or the survey instrument occurring within a data collection that may affect statistical estimates or inferences. | property:92/series breaks | | |
| Non-response adjustment | (source: Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978) | The approach for adjusting observations to account for missing or incomplete data within a series. | | | |
| Seasonal adjustment | (source: Wikipedia) | A statistical method for removing the effects of seasonal variation of a time series that is used when analyzing non-seasonal trends. | property:92/seasonal adjustment | | |
| Statistical Characteristics (CV, CI, variance, etc.) | (source: Box, Hunter, and Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1978) | Summary of statistical characteristics that reflect the overall accuracy and correlation of a statistical data sample relative to the overall population including coefficients of variation, confidence intervals, and variance. | | | |
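The mapping in this appendix can be applied mechanically to a catalog record. The Python sketch below covers only some of the rows that have a mapped property and falls back to the local `92/...` naming pattern seen in the table for everything else. The wiki's `property:` prefix is dropped for brevity, and `map_record` and `local_property` are hypothetical helper names.

```python
# Subset of the appendix: data.gov catalog term -> mapped RDF property.
CATALOG_TO_RDF = {
    "title": "foaf:name",
    "url": "foaf:homepage",
    "description": "dcterms:description",
    "agency": "dcterms:publisher",
    "subagency": "dcterms:publisher",
    "Date Released": "dcterms:created",
    "Date Updated": "dcterms:modified",
    "Keywords": "dcterms:subject",
    "Geographic Coverage": "dcterms:spatial",
}


def local_property(term):
    """Derive the local property name following the table's '92/<lowercased term>' pattern."""
    return "92/" + term.lower()


def map_record(record):
    """Translate {catalog term: value} into a list of (property, value) pairs."""
    pairs = []
    for term, value in record.items():
        # Use the mapped property when the appendix has one, else the local one.
        prop = CATALOG_TO_RDF.get(term, local_property(term))
        pairs.append((prop, value))
    return pairs


pairs = map_record({
    "title": "Example dataset",
    "Date Updated": "2010-04-23",
    "Time Period": "2009",  # no mapped property in the appendix
})
```

Keeping the local property as a fallback means no catalog field is lost, even before a shared term has been agreed for it.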
Facts about TWC Data-gov Vocabulary Proposal
| Dcterms:created | 22 April 2010 |
| Dcterms:creator | Li Ding |
| Dcterms:description | This document is created to help data.gov developers understand and use a small common vocabulary that describes government datasets. |
| Dcterms:modified | 2010-4-23 |
| Foaf:name | TWC Data-gov Vocabulary Proposal |
| Skos:altLabel | TWC Data-gov Vocabulary Proposal, twc data-gov vocabulary proposal, and TWC DATA-GOV VOCABULARY PROPOSAL |

