A Sieve module for automatically learning data fusion policies
Volha Bryl
Christian Bizer

Contents

  1. About
  2. Input Specification
  3. Output
  4. Quick Start and Examples
  5. Source Code and Development
  6. Support and Feedback
  7. References
  8. Acknowledgments

1. About

The Sieve data quality assessment and data fusion framework provides a library of quality assessment scores and fusion functions, together with an XML-based input specification language for combining them. However, manually defining a fusion policy (that is, which fusion function to use for which data attribute, and with what parameters) requires domain knowledge and an understanding of the input data, is time-consuming, and does not guarantee an optimal result.

Therefore, we introduce the Fusion Policy Learner (FPL), an extension of Sieve that automatically selects an optimal fusion function based on a gold standard. The user is still involved in the process, as the list of candidate fusion functions for each property needs to be specified manually.

The learning algorithm implemented in the Fusion Policy Learner selects the fusion function that minimizes the error with respect to a gold standard. The list of fusion functions and the respective quality assessment scores for each property are defined in the input specification, which is written using the extended version of the XML-based Sieve specification language.

The learning algorithm first detects, based on the gold standard, whether the values to fuse are numeric or nominal; examples of nominal values are strings and URIs.
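The type detection step can be illustrated with a minimal Python sketch. The function name and the rule it applies (a property counts as numeric only if every gold standard value parses as a number) are assumptions made for illustration; the exact rule used by the FPL is not described here.

```python
def detect_property_type(gold_values):
    """Return "NUMERIC" if every gold standard value parses as a number, else "NOMINAL".

    Hypothetical helper: the rule is an assumption of this sketch, not FPL code.
    """
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    if gold_values and all(is_number(value) for value in gold_values):
        return "NUMERIC"
    return "NOMINAL"


print(detect_property_type(["5764", "15071", "15941"]))
print(detect_property_type(["http://dbpedia.org/resource/Buitenpost", "Harenkarspel"]))
```

Under this rule a single non-numeric gold value makes the whole property nominal, which is a deliberately strict choice for the sketch.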

Then, for numeric properties, one of two learning strategies is applied: (1) the fusion function that minimizes the mean absolute error with respect to the gold standard is selected, or (2) given a maximum error threshold (e.g. 5%), the function that maximizes the number of values deviating from the gold standard by no more than the threshold is selected. For nominal values, the fusion function that produces the maximum number of exact matches with the gold standard is selected.
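The three selection rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the FPL implementation: all function names are hypothetical, the error is taken relative to the gold value (an assumption based on the percentage threshold), and the toy data is invented to show that the two numeric strategies can disagree.

```python
def relative_error(fused, gold):
    """Deviation of a fused value from the gold value, as a fraction of the gold value."""
    return abs(fused - gold) / abs(gold)


def _errors(values, gold):
    # Only subjects present in both the fused output and the gold standard are compared;
    # gold values of 0 are skipped to avoid division by zero (an assumption of this sketch).
    return [relative_error(values[s], g) for s, g in gold.items() if s in values and g != 0]


def select_min_abs_error(candidates, gold):
    """Strategy 1: pick the function with the lowest mean error."""
    def mean_error(name):
        errors = _errors(candidates[name], gold)
        return sum(errors) / len(errors) if errors else float("inf")
    return min(candidates, key=mean_error)


def select_max_correct_values(candidates, gold, threshold=0.05):
    """Strategy 2: pick the function with the most values within the threshold."""
    def correct(name):
        return sum(e <= threshold for e in _errors(candidates[name], gold))
    return max(candidates, key=correct)


def select_nominal(candidates, gold):
    """Nominal values: pick the function with the most exact matches."""
    def matches(name):
        return sum(candidates[name].get(s) == g for s, g in gold.items())
    return max(candidates, key=matches)


# Invented toy data: {function name: {subject: fused value}} and {subject: gold value}.
gold = {"a": 100, "b": 100, "c": 100}
candidates = {
    "Voting": {"a": 100, "b": 100, "c": 130},
    "Average": {"a": 106, "b": 106, "c": 106},
}
print(select_min_abs_error(candidates, gold))       # Average
print(select_max_correct_values(candidates, gold))  # Voting
```

In the toy data, "Voting" is exact twice but far off once, while "Average" is always slightly outside the 5% threshold, so the first strategy prefers "Average" and the second prefers "Voting".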

2. Input Specification

Below is an example input specification for the Fusion Policy Learner:

1 <SieveFPL>
2	<Parameters>
3		<!--SelectionMethod name="MinAbsError"/-->
4		<SelectionMethod name="MaxCorrectValues" error="0.05"/>
5	</Parameters>
6	<Input>
7		<GoldStandard>gold\cities1000-Netherlands.gold.nt</GoldStandard>
8		<dumpLocation>dumps-nl</dumpLocation>	
9		<SieveExec>c:\ldif-0.5.2\bin\ldif.bat</SieveExec>
10	</Input>
11	<Output>
12		<SieveSpec>sieve-optimal\sieve_optimal.xml</SieveSpec>		
13		<FPLReport>FPL_report.txt</FPLReport>
14		<!--FPLReport valmatrix = "true">FPL_report.txt</FPLReport-->
15	</Output>
16	<Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
17		<Prefixes> 
18			<Prefix id="dbpedia-owl" namespace="http://dbpedia.org/ontology/"/>
19			<Prefix id="ldif" namespace="http://www4.wiwiss.fu-berlin.de/ldif/"/>
20			<Prefix id="sieve" namespace="http://sieve.wbsg.de/vocab/"/>
21			<Prefix id="dbpedia-meta" namespace="http://dbpedia.org/metadata/"/>
22		</Prefixes>
23		<QualityAssessment>
24			<AssessmentMetric id="sieve:authactivity">
25				<ScoringFunction class="NormalizedCount">
26					<Param name="maxCount" value="4250000"/>
27					<Input path="?GRAPH/dbpedia-meta:autheditcnt"/>
28				</ScoringFunction></AssessmentMetric>
29			<AssessmentMetric id="sieve:recency">
30				<ScoringFunction class="TimeCloseness">
31					<Param name="timeSpan" value="500"/>
32					<Input path="?GRAPH/dbpedia-meta:lastedit"/>
33				</ScoringFunction>
34			</AssessmentMetric>
35		</QualityAssessment>
36		<Fusion>
37			<Class name="dbpedia-owl:PopulatedPlace">
38				<Property name="dbpedia-owl:areaTotal">
39					<FusionFunction class="KeepFirst" metric="sieve:recency"/>
40					<FusionFunction class="KeepFirst" metric="sieve:authactivity"/>
41					<FusionFunction class="Voting"/>
42					<FusionFunction class="Average"/>
43				</Property>
44				<Property name="dbpedia-owl:populationTotal">
45					<FusionFunction class="KeepFirst" metric="sieve:recency"/>
46					<FusionFunction class="Average"/>
47					<FusionFunction class="Maximum"/>
48				</Property>
49			</Class>
50		</Fusion>
51	</Sieve>
52 </SieveFPL>

The root tag is <SieveFPL>, under which four elements have to be specified: parameters of the learning algorithm (lines 2-5), input (lines 6-10) and output (lines 11-15) paths, and the extended Sieve specification (lines 16-51).

The <SelectionMethod> element (lines 3-4) specifies the parameters of the learning algorithm for numeric data values. The MinAbsError value of the name attribute corresponds to the first learning strategy described above, and MaxCorrectValues to the second, with the error attribute defining the maximum error threshold.

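In the <Input> section the following parameters are specified:

  • <GoldStandard> (line 7): the path to the gold standard file;
  • <dumpLocation> (line 8): the directory containing the data and provenance metadata dumps to be fused;
  • <SieveExec> (line 9): the path to the LDIF executable used to run Sieve.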

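The <Output> section contains the following two paths to the output files generated by the tool:

  • <SieveSpec> (line 12): the final Sieve specification, containing the selected fusion function for each property;
  • <FPLReport> (line 13): the report on the learning process.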

The optional valmatrix attribute of <FPLReport> (uncomment line 14 to replace line 13) includes the value matrix in the report: for each property and each subject URI, the values produced by the fusion functions (one per function) are listed. For the sample specification above, the value matrix for dbpedia-owl:populationTotal (lines 45-47) is as follows. The columns give the most recent value, the average, the maximum and the gold standard value, respectively; the data source is given in brackets (a language code in the case of Wikipedia namespaces).
http://dbpedia.org/resource/Buitenpost		5834 (en)	5777 (average)		5834 (en)	5764 (gold)
http://dbpedia.org/resource/Noordwijkerhout	15541 (it)	15460 (average)		15541 (ru)	15071 (gold)
http://dbpedia.org/resource/Harenkarspel	15922 (ru)	15973 (average)		16076 (it)	15941 (gold)
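
As a worked example, the Python sketch below computes, for the three rows shown, how far each fused value deviates from the gold standard value. The helper name and function labels are hypothetical, and the deviation is taken relative to the gold value, which is an assumption about how the FPL measures error.

```python
# Rows copied from the value matrix above: (most recent, average, maximum, gold).
ROWS = {
    "Buitenpost": (5834, 5777, 5834, 5764),
    "Noordwijkerhout": (15541, 15460, 15541, 15071),
    "Harenkarspel": (15922, 15973, 16076, 15941),
}
FUNCTIONS = ("KeepFirst/recency", "Average", "Maximum")


def deviations(rows):
    """Return {function label: [relative deviation from gold, one per subject]}."""
    result = {name: [] for name in FUNCTIONS}
    for *fused, gold in rows.values():
        for name, value in zip(FUNCTIONS, fused):
            result[name].append(abs(value - gold) / gold)
    return result


for name, devs in deviations(ROWS).items():
    within = sum(d <= 0.05 for d in devs)
    print(f"{name}: mean deviation {sum(devs) / len(devs):.4f}, {within} of {len(devs)} within 5%")
```

On these three rows every fused value lies within 5% of the gold standard and the average has the lowest mean deviation. Three rows are far too few to be representative, so this does not have to agree with the result over the full gold standard.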

The <Sieve> element contains the extended, or "redundant", Sieve specification: for each property a list of fusion functions is defined, and the FPL selects the optimal one from the list with respect to the gold standard.

In lines 39-42, four fusion functions (keep the most recent value, keep the value added by the most active author, keep the most frequent value, take the average) are specified for the areaTotal property of a populated place. The FPL chooses only one of these functions, in accordance with the selection method defined in <Parameters>, and puts it into the final Sieve specification (line 12).
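
The reduction from a "redundant" specification to a final one can be sketched with Python's standard xml.etree.ElementTree module. The keep_selected helper and its selected argument are hypothetical and only illustrate the idea; this is not how the FPL itself is implemented, and the embedded specification is a trimmed copy of the sample above.

```python
import xml.etree.ElementTree as ET

NS = "http://www4.wiwiss.fu-berlin.de/ldif/"
ET.register_namespace("", NS)  # serialize with the same default namespace

SPEC = """<Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
  <Fusion>
    <Class name="dbpedia-owl:PopulatedPlace">
      <Property name="dbpedia-owl:populationTotal">
        <FusionFunction class="KeepFirst" metric="sieve:recency"/>
        <FusionFunction class="Average"/>
        <FusionFunction class="Maximum"/>
      </Property>
    </Class>
  </Fusion>
</Sieve>"""


def keep_selected(spec_xml, selected):
    """Keep one fusion function per property.

    selected maps a property name to the 0-based index of the function to keep.
    """
    root = ET.fromstring(spec_xml)
    for prop in root.iter(f"{{{NS}}}Property"):
        functions = prop.findall(f"{{{NS}}}FusionFunction")
        keep = selected[prop.get("name")]
        for index, function in enumerate(functions):
            if index != keep:
                prop.remove(function)
    return ET.tostring(root, encoding="unicode")


print(keep_selected(SPEC, {"dbpedia-owl:populationTotal": 2}))
```

Here index 2 keeps the Maximum function for populationTotal and drops the other two candidates.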

In the current FPL version, learning can be performed for only one class at a time, which means that learning an optimal fusion policy for the properties of another class, for example dbpedia-owl:CelestialBody, would require a separate specification file.

3. Output

The output of the tool consists of the final Sieve specification and the report. Below is an extract of the report corresponding to the FPL specification presented above.
*** Learning an optimal fusion function for dbpedia-owl:populationTotal property ***
Number of gold standard values = 493
According to the gold standard, dbpedia-owl:populationTotal is NUMERIC

Pool of fusion functions:
0 : <FusionFunction class="KeepFirst" metric="sieve:recency"/>
1 : <FusionFunction class="Average"/>
2 : <FusionFunction class="Maximum"/>

Errors per fusion function (functions identified by int ID):
0, mean absolute error : 0.02368750565457104, count : 493.0
0, number of 5.0% correct values : 318
1, mean absolute error : 0.022413082496238555, count : 493.0
1, number of 5.0% correct values : 301
2, mean absolute error : 0.021932866337685517, count : 493.0
2, number of 5.0% correct values : 329

MinAbsError: best fusion function ID, error %, count: 2, 2.193286633768552, 493.0
MaxCorrectValues: best fusion function ID, number of correct values : 2, 329

The report contains the number of gold standard values for the dbpedia-owl:populationTotal property (493), the detected property type (numeric), and the list of fusion functions as defined for populationTotal in the FPL specification. The fusion functions are assigned numeric IDs (0, 1, 2), which are then used to report the errors of each function with respect to the gold standard.

For numeric properties, errors are reported for both learning methods, MinAbsError and MaxCorrectValues (with the default 5% threshold), and the optimal function (referred to as "best" in the report) is listed for each method. In our example, both selection methods identify the Maximum fusion function (ID 2) as the optimal one. The final Sieve specification will include only the function that is optimal according to the MaxCorrectValues method, as defined in line 4 of our sample FPL specification.
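
The "best" entries can be re-derived from the per-function lines of the report. The Python sketch below does this for the extract above; parse_report is a hypothetical helper, and its regular expressions assume the line format seen in this extract, which may differ in other FPL versions.

```python
import re

# Per-function lines copied from the report extract above.
REPORT = """\
0, mean absolute error : 0.02368750565457104, count : 493.0
0, number of 5.0% correct values : 318
1, mean absolute error : 0.022413082496238555, count : 493.0
1, number of 5.0% correct values : 301
2, mean absolute error : 0.021932866337685517, count : 493.0
2, number of 5.0% correct values : 329
"""


def parse_report(text):
    """Return ({function ID: mean error}, {function ID: number of correct values})."""
    errors, correct = {}, {}
    for line in text.splitlines():
        match = re.match(r"(\d+), mean absolute error : ([\d.eE+-]+), count", line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
            continue
        match = re.match(r"(\d+), number of [\d.]+% correct values : (\d+)", line)
        if match:
            correct[int(match.group(1))] = int(match.group(2))
    return errors, correct


errors, correct = parse_report(REPORT)
best_by_error = min(errors, key=errors.get)     # MinAbsError: lowest mean error
best_by_count = max(correct, key=correct.get)   # MaxCorrectValues: most correct values
print(best_by_error, best_by_count)  # 2 2
```

Both rules pick function ID 2, matching the two "best" lines of the report.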

4. Quick Start and Examples

To demonstrate the functionality of the Fusion Policy Learner, a multilingual DBpedia example is distributed with the LDIF binaries (dbpedia-multilang directory); it can also be found in the LDIF repository. The example aims at fusing data for the same city from multiple language editions of DBpedia. In the example directory you will find the input specification (SieveFPL.xml) along with the data and provenance metadata dumps (dumps-3cities and dumps-nl directories; use one of the two when specifying dumpLocation in line 8) and the gold standard (in the gold directory).

To run the FPL with the multilingual DBpedia example, download and unpack the LDIF binaries, and run

java -jar lib\ldif-sieve-fpl-0.1.1-jar-with-dependencies.jar examples\dbpedia-multilang\SieveFPL.xml

from the directory into which you unpacked the binaries.

5. Source Code and Development

The latest source code is available from the LDIF development page on GitHub.

The framework can be used under the terms of the Apache Software License.

6. Support and Feedback

For questions and feedback please use the LDIF Google Group.

7. References

  • Volha Bryl, Christian Bizer. Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion. 4th Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality) @ WWW 2014.

8. Acknowledgments

This work was supported by the EU FP7 grant LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943).