The Sieve data quality assessment and data fusion framework provides a library of quality assessment scores and fusion functions, together with an XML-based input specification language for combining them. However, manually defining a fusion policy - that is, deciding which fusion function, with which parameters, to use for which data attribute - requires domain knowledge and an understanding of the input data, is time-consuming, and does not guarantee an optimal result.
Therefore, we introduce the Fusion Policy Learner (FPL), an extension of Sieve that automatically selects an optimal fusion function based on a gold standard. The user remains involved in the process, as the list of candidate fusion functions per property still has to be specified manually.
The learning algorithm implemented in the Fusion Policy Learner selects the fusion function that minimizes the error with respect to a gold standard. The list of fusion functions and the respective quality assessment scores for each property are defined in the input specification, which is written using an extended version of the XML-based Sieve specification language.
The learning algorithm first detects, based on the gold standard, whether the values to fuse are numeric or nominal; examples of nominal values are strings and URIs.
Then, for numeric properties, one of two learning strategies is applied: (1) the fusion function that minimizes the mean absolute error with respect to the gold standard is selected, or (2) given a maximum error threshold (e.g. 5%), the function that maximizes the number of fused values deviating from the gold standard by no more than that threshold is selected. For nominal values, the fusion function that produces the largest number of exact matches with the gold standard is selected.
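The selection logic can be summarized with the following sketch. It is only an illustration, not the actual FPL code: the function names are made up, and interpreting both the error and the threshold as deviations relative to the gold value is an assumption suggested by the sample report further below.

from statistics import mean

def mean_error(fused, gold):
    """Strategy 1 (MinAbsError): mean deviation from the gold standard
    (assumed here to be relative to the gold value)."""
    return mean(abs(f - g) / abs(g) for f, g in zip(fused, gold))

def correct_count(fused, gold, threshold=0.05):
    """Strategy 2 (MaxCorrectValues): how many fused values deviate from
    the gold value by no more than the threshold (e.g. 5%)."""
    return sum(1 for f, g in zip(fused, gold) if abs(f - g) <= threshold * abs(g))

def exact_matches(fused, gold):
    """Nominal values (strings, URIs): number of exact matches."""
    return sum(1 for f, g in zip(fused, gold) if f == g)

def select_fusion_function(candidates, gold, numeric, method="MaxCorrectValues", threshold=0.05):
    """candidates maps a fusion function label to the list of values it produced,
    aligned with the gold standard values; returns the label of the best function."""
    if not numeric:
        return max(candidates, key=lambda c: exact_matches(candidates[c], gold))
    if method == "MinAbsError":
        return min(candidates, key=lambda c: mean_error(candidates[c], gold))
    return max(candidates, key=lambda c: correct_count(candidates[c], gold, threshold))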
Below is an example of an input specification for the Fusion Policy Learner:
 1 <SieveFPL>
 2   <Parameters>
 3     <!--SelectionMethod name="MinAbsError"/-->
 4     <SelectionMethod name="MaxCorrectValues" error="0.05"/>
 5   </Parameters>
 6   <Input>
 7     <GoldStandard>gold\cities1000-Netherlands.gold.nt</GoldStandard>
 8     <dumpLocation>dumps-nl</dumpLocation>
 9     <SieveExec>c:\ldif-0.5.2\bin\ldif.bat</SieveExec>
10   </Input>
11   <Output>
12     <SieveSpec>sieve-optimal\sieve_optimal.xml</SieveSpec>
13     <FPLReport>FPL_report.txt</FPLReport>
14     <!--FPLReport valmatrix = "true">FPL_report.txt</FPLReport-->
15   </Output>
16   <Sieve xmlns="http://www4.wiwiss.fu-berlin.de/ldif/">
17     <Prefixes>
18       <Prefix id="dbpedia-owl" namespace="http://dbpedia.org/ontology/"/>
19       <Prefix id="ldif" namespace="http://www4.wiwiss.fu-berlin.de/ldif/"/>
20       <Prefix id="sieve" namespace="http://sieve.wbsg.de/vocab/"/>
21       <Prefix id="dbpedia-meta" namespace="http://dbpedia.org/metadata/"/>
22     </Prefixes>
23     <QualityAssessment>
24       <AssessmentMetric id="sieve:authactivity">
25         <ScoringFunction class="NormalizedCount">
26           <Param name="maxCount" value="4250000"/>
27           <Input path="?GRAPH/dbpedia-meta:autheditcnt"/>
28         </ScoringFunction>
29       </AssessmentMetric>
30       <AssessmentMetric id="sieve:recency">
31         <ScoringFunction class="TimeCloseness">
32           <Param name="timeSpan" value="500"/>
33           <Input path="?GRAPH/dbpedia-meta:lastedit"/>
34         </ScoringFunction>
35       </AssessmentMetric>
36     </QualityAssessment>
37     <Fusion>
38       <Class name="dbpedia-owl:PopulatedPlace">
39         <Property name="dbpedia-owl:areaTotal">
40           <FusionFunction class="KeepFirst" metric="sieve:recency"/>
41           <FusionFunction class="KeepFirst" metric="sieve:authactivity"/>
42           <FusionFunction class="Voting"/>
43           <FusionFunction class="Average"/>
44         </Property>
45         <Property name="dbpedia-owl:populationTotal">
46           <FusionFunction class="KeepFirst" metric="sieve:recency"/>
47           <FusionFunction class="Average"/>
48           <FusionFunction class="Maximum"/>
49         </Property>
50       </Class>
51     </Fusion>
52   </Sieve>
53 </SieveFPL>
The root tag is <SieveFPL>, under which four elements have to be specified: the parameters of the learning algorithm (lines 2-5), the input (lines 6-10) and output (lines 11-15) paths, and the extended Sieve specification (lines 16-52).
The <SelectionMethod> element (lines 3-4) specifies the parameters of the learning algorithm for numeric data values. Setting the name attribute to MinAbsError corresponds to the first learning strategy described above, while MaxCorrectValues corresponds to the second one, with the error attribute defining the maximum error threshold (0.05, i.e. 5%, in this example).
In the <Input> section the following parameters are specified:
- <GoldStandard> - the path to the gold standard file, i.e. the correct property values against which the candidate fusion functions are evaluated;
- <dumpLocation> - the directory containing the data and provenance metadata dumps to be fused;
- <SieveExec> - the path to the Sieve (LDIF) executable used by the learner.
The <Output> section contains the following two paths to the output files generated by the tool:
- <SieveSpec> - the path at which the final Sieve specification, containing the learned fusion functions, is written;
- <FPLReport> - the path of the learning report. The commented-out alternative in line 14 shows the valmatrix attribute, which makes the report also list, for each entity, the value produced by every candidate fusion function next to the gold standard value, for example:
http://dbpedia.org/resource/Buitenpost        5834 (en)    5777 (average)   5834 (en)    5764 (gold)
http://dbpedia.org/resource/Noordwijkerhout  15541 (it)   15460 (average)  15541 (ru)   15071 (gold)
http://dbpedia.org/resource/Harenkarspel     15922 (ru)   15973 (average)  16076 (it)   15941 (gold)
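To make the connection to the MaxCorrectValues criterion concrete, the relative deviation of each fused value from the gold value can be checked against the 5% threshold. The following small calculation over the rows above is purely illustrative and not part of the FPL itself:

# Rows from the value matrix above: (entity, fused values per candidate function, gold value)
rows = [
    ("Buitenpost",      [5834, 5777, 5834],    5764),
    ("Noordwijkerhout", [15541, 15460, 15541], 15071),
    ("Harenkarspel",    [15922, 15973, 16076], 15941),
]

THRESHOLD = 0.05  # the error threshold from the <SelectionMethod> element

for entity, fused, gold in rows:
    for value in fused:
        deviation = abs(value - gold) / gold
        verdict = "correct" if deviation <= THRESHOLD else "too far off"
        print(f"{entity}: {value} vs gold {gold} -> {deviation:.1%} ({verdict})")

In this excerpt all fused values stay within the 5% threshold; over the full gold standard the candidates do differ, e.g. for populationTotal the report below counts 318, 301 and 329 values within 5% out of 493.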
The <Sieve> element contains the extended, or "redundant", Sieve specification: for each property a list of candidate fusion functions is defined, and the FPL selects the optimal one from this list with respect to the gold standard.
In lines 40-43, four fusion functions - keep the most recent value, keep the value added by the most active author, take the most frequent value, take the average - are specified for the areaTotal property of a populated place. The FPL chooses only one of these functions, in accordance with the selection method defined in <Parameters>, and writes it into the final Sieve specification (the file given in line 12).
In the current FPL version, learning can be performed for only one class at a time, so learning an optimal fusion policy for the properties of, e.g., dbpedia-owl:CelestialBody would require a separate specification file.
A sample FPL report for the dbpedia-owl:populationTotal property looks as follows:

*** Learning an optimal fusion function for dbpedia-owl:populationTotal property ***
Number of gold standard values = 493
According to the gold standard, dbpedia-owl:populationTotal is NUMERIC
Pool of fusion functions:
0 : <FusionFunction class="KeepFirst" metric="sieve:recency"/>
1 : <FusionFunction class="Average"/>
2 : <FusionFunction class="Maximum"/>
Errors per fusion function (functions identified by int ID):
0, mean absolute error : 0.02368750565457104, count : 493.0
0, number of 5.0% correct values : 318
1, mean absolute error : 0.022413082496238555, count : 493.0
1, number of 5.0% correct values : 301
2, mean absolute error : 0.021932866337685517, count : 493.0
2, number of 5.0% correct values : 329
MinAbsError: best fusion function ID, error %, count: 2, 2.193286633768552, 493.0
MaxCorrectValues: best fusion function ID, number of correct values : 2, 329
The report contains the number of gold standard values for the dbpedia-owl:populationTotal property (493), the detected property type (numeric), and the list of fusion functions as defined for populationTotal in the FPL specification. The fusion functions are assigned numeric IDs (0, 1, 2), which are then used to report the error of each function with respect to the gold standard.
For numeric properties, errors are reported for both learning methods, MinAbsError and MaxCorrectValues (with the default 5% threshold), and the optimal function (referred to as "best" in the report) is listed for each method. In our example, both selection methods identify the Maximum fusion function (ID 2) as the optimal one. The final Sieve specification will include only the function that is optimal according to the MaxCorrectValues method, as defined in line 4 of our sample FPL specification.
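If you want to post-process the report programmatically, e.g. to collect the per-function scores across several runs, a simple parser along the following lines is enough. This is a sketch that assumes the report keeps exactly the line format shown above; it is not part of the FPL distribution:

import re

def parse_fpl_report(path):
    """Extract per-function scores from an FPL report file.

    Returns a dict: function ID -> {"mae": float, "correct": int},
    based on the 'mean absolute error' and 'correct values' lines."""
    scores = {}
    with open(path, encoding="utf-8") as report:
        for line in report:
            m = re.match(r"\s*(\d+), mean absolute error : ([\d.]+)", line)
            if m:
                scores.setdefault(int(m.group(1)), {})["mae"] = float(m.group(2))
                continue
            m = re.match(r"\s*(\d+), number of [\d.]+% correct values : (\d+)", line)
            if m:
                scores.setdefault(int(m.group(1)), {})["correct"] = int(m.group(2))
    return scores

# Example: pick the best function ID under each selection method.
scores = parse_fpl_report("FPL_report.txt")
best_min_abs_error = min(scores, key=lambda i: scores[i]["mae"])
best_max_correct = max(scores, key=lambda i: scores[i]["correct"])
print(best_min_abs_error, best_max_correct)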
To demonstrate the functionality of the Fusion Policy Learner, a multilingual DBpedia example is distributed with the LDIF binaries (the dbpedia-multilang directory) and can also be found in the LDIF repository. The example fuses data about the same city from multiple language editions of DBpedia. In the example directory you will find the input specification (SieveFPL.xml) along with data and provenance metadata dumps (the dumps-3cities and dumps-nl directories; use one of the two when specifying dumpLocation in line 8) and the gold standard (in the gold directory).
To run the FPL with the multilingual DBpedia example, download and unpack the LDIF binaries, and run
java -jar lib\ldif-sieve-fpl-0.1.1-jar-with-dependencies.jar examples\dbpedia-multilang\SieveFPL.xml
from the directory into which you have unpacked the binaries.
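Since the current version learns a policy for one class at a time (see above), running the learner for several classes can be scripted by invoking the jar once per specification file. A minimal sketch, assuming one SieveFPL specification has been prepared per class (the second file name below is hypothetical):

import subprocess

# One FPL specification per class to learn; each spec should define its own
# <SieveSpec> and <FPLReport> output paths so results are not overwritten.
specs = [
    r"examples\dbpedia-multilang\SieveFPL.xml",            # dbpedia-owl:PopulatedPlace
    r"examples\dbpedia-multilang\SieveFPL-celestial.xml",  # dbpedia-owl:CelestialBody (hypothetical)
]

for spec in specs:
    # Same invocation as shown above, once per specification file.
    subprocess.run(
        ["java", "-jar", r"lib\ldif-sieve-fpl-0.1.1-jar-with-dependencies.jar", spec],
        check=True,
    )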
The latest source code is available from the LDIF development page on GitHub.
The framework can be used under the terms of the Apache Software License.
For questions and feedback please use the LDIF Google Group.
This work was supported by the EU FP7 grant LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943).