How to Build Data-gov Semantic Search to Consume RDFa

From Data-gov Wiki

Infobox (How-To)
  • name: How to Build Data-gov Semantic Search to Consume RDFa

  • description: A tutorial on how to build a specialized search for government data and use RDFa annotations to better render the search results.
  • creator(s): Jin Guang Zheng, Li Ding
  • created: 2010/05/06
  • modified: 2010/05/16



Overview

This tutorial provides a step-by-step solution for searching the content of the Data-gov website and rendering the results using the embedded RDFa annotations:

  • The semantic annotations in web pages are encoded in RDFa. Note that the RDFa was embedded in wiki pages using the MediaWiki RDFa extension.
  • The search results are retrieved using the web search API provided by Yahoo! Search BOSS.
  • We parse the RDFa annotations using ARC2 and wrote PHP code to enhance the rendering of search results.
  • The search response is accelerated by a linked data cache powered by ARC2.

Step 1. Get Yahoo! Search BOSS API Key

Yahoo! Search BOSS allows users to access Yahoo! search through a programmable API. You need a key to use the API. To get one, you need a Yahoo ID and must follow the instructions below:

  • Select New Web/Client Apps when asked what kind of application you are building.
  • Choose Generic, No user authentication required as the authentication method.
  • Fill in the required fields in the form.

Step 2. Compose Yahoo! Search BOSS search request

Since Yahoo! Search BOSS is a RESTful API, a search request is essentially a URI of the following form:

http://boss.yahooapis.com/ysearch/web/v1/QUERY_STRING?appid=APP_KEY&format=RESULT_FORMAT&count=RESULT_LIMIT

The BLUE_CAPS are where you put values of interest.

BLUE_CAP        Description
QUERY_STRING    Put a query string here, e.g., "oil spill 2004" (you may substitute white space with '%20').
APP_KEY         Put your application key here.
RESULT_FORMAT   The format of the results; put either xml or json here.
RESULT_LIMIT    The number of results to return; put a number (e.g., 10) here.

For this tutorial, feel free to use the application key:

6MFMdTrV34Eu3oQlyqlSUMfqp0GZKTRQyT47AYi0XSvVyxMjoNPdZW41wqKm

In the example below, we are searching for the term "Budget" on the site data-gov.tw.rpi.edu and are asking for ten results in XML:

http://boss.yahooapis.com/ysearch/web/v1/Budget%20site:data-gov.tw.rpi.edu?appid=6MFMdTrV34Eu3oQlyqlSUMfqp0GZKTRQyT47AYi0XSvVyxMjoNPdZW41wqKm&format=xml&count=10
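
A request URI of this form can also be assembled in code. The following is only a minimal sketch, not the tutorial's own implementation: the constant BOSS_WEB_SEARCH_URL, the function build_boss_url and its parameter names are made up for illustration, and rawurlencode is assumed as the way to turn white space into %20.

```php
<?php
// Base of the BOSS web search endpoint, copied from the URI form above.
define('BOSS_WEB_SEARCH_URL', 'http://boss.yahooapis.com/ysearch/web/v1/');

// Hypothetical helper: compose a search request URI from its four parts.
function build_boss_url($query, $app_key, $format = 'xml', $count = 10) {
    $params = array(
        'appid'  => $app_key,
        'format' => $format,
        'count'  => $count,
    );
    // rawurlencode() encodes spaces as %20 (urlencode() would emit '+').
    return BOSS_WEB_SEARCH_URL . rawurlencode($query) . '?'
        . http_build_query($params, '', '&');
}

// APP_KEY is a placeholder; substitute a real application key.
echo build_boss_url('Budget site:data-gov.tw.rpi.edu', 'APP_KEY'), "\n";
```

Note that rawurlencode also percent-encodes the colon in "site:" as %3A; a server would normally decode this back to the same query string.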

Step 3. Consume Yahoo! Search BOSS search results

The query constructed above returns 10 results in XML format. The following is a fragment of the response, including (i) the summary of the search results and (ii) the first returned result:

<ysearchresponse xmlns="http://www.inktomi.com/" responsecode="200">
  <nextpage>/ysearch/web/v1/Budget%20site:data-gov.tw.rpi.edu?format=xml&count=10&appid=6MFMdTrV34Eu3oQlyqlSUMfqp0GZKTRQyT47AYi0XSvVyxMjoNPdZW41wqKm&start=10</nextpage>
  <resultset_web count="10" start="0" totalhits="101" deephits="102">
      <result>
         <abstract>description: Timeline for an agency's aggregated <b>budget</b> from 1976 to 2014. <b>...</b> This visualization shows <b>budget</b> authority and outlays for the selected agency, <b>...</b></abstract>
         <clickurl>http://lrd.yahooapis.com/_ylc=X3oDMTU0ZjR2NTF2BF9TAzIwMjMxNTI3MDIEYXBwaWQDNk1GTWRUclYzNEV1M29RbHlxbFNVTWZxcDBHWktUUlF5VDQ3QVlpMFhTdlZ5eE1qb05QZFpXNDF3cUttBGNsaWVudANib3NzBHNlcnZpY2UDQk9TUwRzbGsDdGl0bGUEc3JjcHZpZANUQVh2aldLSWNycGpfbmsxYkN4MTZnVlVRX2oyOVV2a0JDVUFDS0RO/SIG=122gv6r9i/**http%3A//data-gov.tw.rpi.edu/wiki/Demo%3A_Agency_Budget_Summary</clickurl>
         <date>2010/04/27</date>
         <dispurl><b>data-gov.tw.rpi.edu</b>/wiki/<wbr>Demo:_Agency_<b>Budget</b>_Summary</dispurl>
         <size>33198</size>
         <title>Demo: Agency <b>Budget</b> Summary - Data-gov Wiki</title>
         <url>http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary</url>
      </result>

The XML attributes of resultset_web that summarize the search results:

  • totalhits: the number of all possible results matching this query
  • start: the index of the first returned result among all results
  • count: the number of returned results
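
These three values are enough to work out where the next page of results would begin, which matches the start=10 seen in the nextpage URL of the fragment above. The helper below is a hypothetical sketch (the name next_start is not part of the tutorial's code), and it assumes totalhits can be treated as an upper bound on the result index.

```php
<?php
// Hypothetical helper: offset of the next page, or null when no results remain.
function next_start($start, $count, $totalhits) {
    $next = $start + $count;
    return ($next < $totalhits) ? $next : null;
}

// Values taken from the sample response: start=0, count=10, totalhits=101.
var_dump(next_start(0, 10, 101)); // int(10)
```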

The XML elements used in the query response:

  • prevpage: URL you can use to get the previous page of search results
  • nextpage: URL you can use to get the next page of search results
  • resultset_web: the set of results returned
  • abstract: abstract of the page, as returned by Yahoo!
  • url: URL of the result
  • title: title of the result page

For more information, see http://developer.yahoo.com/search/boss/boss_guide/.
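
As an illustration of how such a response can be read with PHP's SimpleXML extension, here is a minimal sketch. The embedded XML is a trimmed, hand-written sample shaped like the fragment above (without the highlighting markup), and the function name summarize_boss_response is an assumption made for this example.

```php
<?php
// Trimmed sample shaped like the BOSS response fragment (not a live response).
$sample = <<<'XML'
<ysearchresponse xmlns="http://www.inktomi.com/" responsecode="200">
  <resultset_web count="1" start="0" totalhits="101" deephits="102">
    <result>
      <title>Demo: Agency Budget Summary - Data-gov Wiki</title>
      <url>http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary</url>
    </result>
  </resultset_web>
</ysearchresponse>
XML;

// Hypothetical helper: pull the summary attributes and per-result fields.
function summarize_boss_response($xml_string) {
    libxml_use_internal_errors(true);          // collect parse errors quietly
    $xml = simplexml_load_string($xml_string);
    if ($xml === false) {
        return false;                          // malformed XML
    }
    $set = $xml->resultset_web[0];
    $summary = array(
        'totalhits' => (int) $set['totalhits'],
        'start'     => (int) $set['start'],
        'count'     => (int) $set['count'],
        'results'   => array(),
    );
    foreach ($set->result as $result) {
        $summary['results'][] = array(
            'title' => (string) $result->title,
            'url'   => (string) $result->url,
        );
    }
    return $summary;
}

$summary = summarize_boss_response($sample);
echo $summary['totalhits'], " hits; first URL: ", $summary['results'][0]['url'], "\n";
```

Casting each SimpleXML node to a string or integer keeps the returned array free of SimpleXMLElement objects, which makes it easier to pass around or serialize.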

Step 4. Parse and Cache RDFa Annotation using ARC2

We built a simple linked data cache (lodcx) web service using ARC2 and PHP. It offers both a RESTful web service and a web interface that let users parse the RDFa embedded in a web page, cache the parsed RDF data as a named graph in a local triple store, and query the cached RDF data via a SPARQL endpoint.

To use the linked data cache's web service, compose a service request as a URI of the following form:

http://data-gov.tw.rpi.edu/ws/lodcx.php?url=URL&operation=OPERATION

The BLUE_CAPS are where you put values of interest.

BLUE_CAP    Description
URL         Put a URL-encoded URL here.
OPERATION   Put the operation parameter here. Possible values are:
  • test - only check whether RDF data can be obtained from the URL; the triple store is not affected
  • fetch - only fetch the RDF data of the named graph associated with the URL in the linked data cache
  • refresh - only refresh the RDF data of the named graph associated with the URL in the linked data cache
  • refreshfetch - refresh and then fetch the RDF data of the named graph associated with the URL in the linked data cache

In the example below, we fetch the cached RDF data (in RDF/XML) parsed from the web page http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary:

http://data-gov.tw.rpi.edu/ws/lodcx.php?url=http%3A%2F%2Fdata-gov.tw.rpi.edu%2Fwiki%2FDemo%3A_Agency_Budget_Summary&operation=fetch
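
A service request like the one above can be composed with a few lines of PHP. This is a sketch under assumptions: the constant LODCX_URL and the function build_lodcx_url are hypothetical names chosen for this example, and rejecting unknown operations is a design choice of the sketch, not documented behavior of the service.

```php
<?php
// Service endpoint, copied from the URI form above.
define('LODCX_URL', 'http://data-gov.tw.rpi.edu/ws/lodcx.php');

// Hypothetical helper: compose a linked data cache request for one page.
function build_lodcx_url($page_url, $operation = 'fetch') {
    // The four operations listed in the table above.
    $allowed = array('test', 'fetch', 'refresh', 'refreshfetch');
    if (!in_array($operation, $allowed, true)) {
        throw new InvalidArgumentException("Unknown operation: $operation");
    }
    // urlencode() produces the %3A and %2F escapes seen in the example URI.
    return LODCX_URL . '?url=' . urlencode($page_url) . '&operation=' . $operation;
}

echo build_lodcx_url('http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary'), "\n";
```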



Step 5. Build RDFa Semantic Search

Now we put the previously mentioned tools together and build the RDFa semantic search. The program is written in PHP and leverages PHP's SimpleXML extension (which requires PHP 5). The complete code can be found at http://code.google.com/p/data-gov-wiki/source/browse/trunk/www/ws/lodcs.php; the key parts are shown below:


1. Get input parameters

The semantic search takes the following input parameters:

$input_params =array();
$input_params[INPUT_QUERY]= get_param(INPUT_QUERY);
$input_params[INPUT_REFRESH]= get_param(INPUT_REFRESH, false);


2. Run web search

We pass the query specified in the input parameters to Yahoo! and get the corresponding search results:

	// compose Yahoo! Search BOSS query
	$params =array();
	$params["appid"]= YAHOO_BOSS_APPID;
	$params["start"]= $input_params[INPUT_START];
	$params["count"]= YAHOO_BOSS_COUNT_DEFAULT;
	$params["format"]="xml";

	$url_xml = YAHOO_BOSS_URL. urlencode(strtolower($query)). YAHOO_BOSS_CONSTRAINT. "?".  http_build_query($params);

	//retrieve and parse search results
	$xml=simplexml_load_file($url_xml); 


3. Use linked data cache to parse and cache RDFa annotations per page

For each returned page, when a refresh is requested, we ask the linked data cache to parse the page and load its RDFa data into the triple store:

//traverse all search results
foreach ($xml->resultset_web[0]->result as $result)
{
    //get page URL
    $url = $result->url;

    //refresh linked data cache content on demand
    if ($input_params[INPUT_REFRESH]){
        //refresh data
        $url_lodcx = LODCX_SERVICE_URL."?operation=refresh&url=".urlencode($url);
        file_get_contents ($url_lodcx);
    }

    //render result
    $output = print_result($url, $result, $debug);

    //display results
    echo $output;
}

The following is a sample of the RDFa embedded in http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary:

<!-- RDFa section begin -->
<div id="RDFa" 
 typeof="dcmitype:Text sioc:Post" 
 resource="http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary" 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns:property="http://data-gov.tw.rpi.edu/vocab/p/"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
...
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">

<span property="rdfs:label dc:title" content="Demo: Agency Budget Summary"  />
<a rel="rdf:type" href="http://data-gov.tw.rpi.edu/wiki/Category:Demo" ></a>
<a rel="property:Data_source" href="http://data-gov.tw.rpi.edu/wiki/Dataset_401" ></a>
<a rel="property:Data_source" href="http://data-gov.tw.rpi.edu/wiki/Dataset_402" ></a>
<span property="foaf:name" content="Demo: Agency Budget Summary"  />
...

</div>
<!-- RDFa section end -->
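
To make the content of these annotations more concrete, the sketch below lists the property/value pairs carried by a lightly normalized excerpt of the markup above. It is deliberately not a conforming RDFa parser (the tutorial relies on ARC2 for that): it uses PHP's DOM extension instead, ignores namespaces, typeof and nesting, and returns prefixed names rather than expanded URIs. The function name extract_annotations is hypothetical.

```php
<?php
// Excerpt of the RDFa section above, with explicit closing tags.
$html = <<<'HTML'
<div id="RDFa" resource="http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary">
<span property="rdfs:label dc:title" content="Demo: Agency Budget Summary"></span>
<a rel="property:Data_source" href="http://data-gov.tw.rpi.edu/wiki/Dataset_401"></a>
<a rel="property:Data_source" href="http://data-gov.tw.rpi.edu/wiki/Dataset_402"></a>
<span property="foaf:name" content="Demo: Agency Budget Summary"></span>
</div>
HTML;

// Hypothetical helper: collect (property, value) pairs in document order.
function extract_annotations($html) {
    libxml_use_internal_errors(true);   // keep HTML parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $pairs = array();
    foreach ($doc->getElementsByTagName('*') as $node) {
        // Literal values: property="..." content="..."
        if ($node->hasAttribute('property') && $node->hasAttribute('content')) {
            foreach (preg_split('/\s+/', trim($node->getAttribute('property'))) as $p) {
                $pairs[] = array($p, $node->getAttribute('content'));
            }
        }
        // Links to other resources: rel="..." href="..."
        if ($node->hasAttribute('rel') && $node->hasAttribute('href')) {
            foreach (preg_split('/\s+/', trim($node->getAttribute('rel'))) as $p) {
                $pairs[] = array($p, $node->getAttribute('href'));
            }
        }
    }
    return $pairs;
}

$pairs = extract_annotations($html);
foreach ($pairs as $pair) {
    echo $pair[0], " => ", $pair[1], "\n";
}
```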


4. Query cached RDF data in triple store and render results

Given the URL of a result page, we can compose a SPARQL query against the triple store of the linked data cache. The following code leverages the ARC2 API to load the value(s) of a property of a result page:

function query_property_value($url, $property, $store, $debug =false){
	$q = "SELECT ?o WHERE { GRAPH <$url> { <$url> <$property> ?o }}";      
	$rs = $store->query($q);
	if (!$store->getErrors()) {
		$ret = array();
		foreach($rs['result']['rows'] as $row)
		{
			$ret[]=$row['o'];
		}
		return $ret;
	}
	return false;
}
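
The function above interpolates $url and $property directly into the query text. One possible refinement, shown here only as a sketch with a hypothetical function name (build_property_query), is to build the query string separately and reject values containing characters that cannot appear inside a SPARQL <...> IRI reference, so that an unusual URL cannot change the shape of the query.

```php
<?php
// Hypothetical helper: build the same SELECT query, refusing unsafe IRIs.
function build_property_query($url, $property) {
    foreach (array($url, $property) as $iri) {
        // <, >, ", {, }, |, \, ^, ` and white space are not allowed in an IRI reference.
        if (preg_match('/[<>"{}|\\\\^`\s]/', $iri)) {
            throw new InvalidArgumentException("Not a safe IRI: $iri");
        }
    }
    return "SELECT ?o WHERE { GRAPH <$url> { <$url> <$property> ?o }}";
}

echo build_property_query(
    "http://data-gov.tw.rpi.edu/wiki/Demo:_Agency_Budget_Summary",
    "http://xmlns.com/foaf/0.1/name"
), "\n";
```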

We can then use the above function to customize the rendering of each search result:

function print_result($url, $result , $debug=false ){
        //add semantic results
        $output_s = "";
        //get native access to linked data cache's triple store
        $store = arc2_get_store(get_arc2_config());

        //add "name" annotation of the page
        $temp ="";
        $values = query_property_value($url, "http://xmlns.com/foaf/0.1/name", $store, $debug);
        //query_property_value() returns false on error, so guard the loop
        if ($values !== false){
                foreach ($values as $value){
                        if (!empty($temp))
                                $temp .=" , ";
                        $temp.= sprintf("%s ", $value);
                }
        }
        if (!empty($temp))
                $output_s.= "\n<div><b>name</b>: $temp\n</div>\n";
       ...

The HTML fragment produced by the above code looks like the following:

<div><b>name</b>: Demo: Agency Budget Summary 
</div>
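
Because the annotation values come from web pages, it may be worth escaping them before writing them into HTML. The sketch below is a variant of the rendering step, not the code used by the search service: the function name render_annotation is hypothetical, and its output is formatted slightly differently from the fragment above (values are joined with ", " and the div is emitted on one line).

```php
<?php
// Hypothetical helper: render one labelled annotation as an HTML fragment.
function render_annotation($label, $values) {
    // Covers both an empty array and the false returned on query errors.
    if (empty($values)) {
        return "";
    }
    $escaped = array();
    foreach ($values as $value) {
        $escaped[] = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    }
    return "<div><b>" . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . "</b>: "
        . implode(", ", $escaped) . "</div>\n";
}

echo render_annotation("name", array("Demo: Agency Budget Summary"));
```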

The following page renders the results of a search for pages related to "Budget". The Yahoo! search results (URL and matched text) are displayed by default, and the added RDFa annotations are highlighted with a light purple background.

The result page for the "Budget" query: http://data-gov.tw.rpi.edu/ws/lodcs.php?query=Budget
