June 24, 2015 @ 3:30 pm – 4:30 pm
Court Room, First Floor

3.1 Is Europe Falling Behind in Data Mining? Copyright Law’s Impact on Data Mining in Academic Research

Christian Handke, The Erasmus University, The Netherlands
Lucie Guibault and Joan Josep Vallbé, University of Amsterdam, The Netherlands

This paper discusses how different levels of copyright protection affect the text and data mining (TDM) performance of academic researchers in the main research areas.

Copyright protection is determined at a national level. The scope of rights and exceptions varies per country: in some countries, exceptions expressly allow TDM to take place, while in others such activities are restricted. In most countries, the law is unclear. Statutory copyright exceptions, where they exist, can be interpreted in different ways. The assessment of the lawfulness of TDM falls back on the judgment of the researcher. Depending on the knowledge or perception of the law, TDM may be deemed allowed, probably allowed, probably not allowed or restricted. This paper assesses the consequences of the different levels of copyright protection on TDM activities.

Our aim was to explain the comparative variation in research output about data mining. For this, we collected data from Thomson Reuters’ Web of Science. To identify the research output of interest, we extracted all research published in the years 1992 to 2014 that contained the expression ‘data mining’ in the extended abstract and whose authors resided in one of the 31 largest national economies, including 14 EU member states. To control for the total research output of the respective countries, our dependent variable was the quotient of this absolute academic TDM output and the total research output of these countries. Our unit of analysis was the country-year proportion of TDM research output.
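The dependent variable described above can be illustrated with a minimal sketch. The function name, the data layout and the counts are assumptions made for illustration only; they are not data or code from the study.

```python
from typing import Dict, Tuple

CountryYear = Tuple[str, int]  # (country label, publication year)


def tdm_share(tdm_counts: Dict[CountryYear, int],
              total_counts: Dict[CountryYear, int]) -> Dict[CountryYear, float]:
    """Return TDM output as a share of total output for each country-year."""
    shares: Dict[CountryYear, float] = {}
    for key, total in total_counts.items():
        if total > 0:  # skip country-years with no recorded output
            shares[key] = tdm_counts.get(key, 0) / total
    return shares


# Hypothetical counts, for illustration only (not figures from the study).
example_tdm = {("A", 2000): 5, ("B", 2000): 2}
example_total = {("A", 2000): 1000, ("B", 2000): 800}
example_shares = tdm_share(example_tdm, example_total)
```

Each entry of `example_shares` corresponds to one unit of analysis, that is, one country-year proportion.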

Other control variables included the rule of law (as reported by the World Bank), which reflects the level of enforcement of copyright, and the size and wealth of countries.

To estimate the effect of copyright law on the share of TDM in total research output, we fitted a multilevel linear regression model with varying intercepts for country and year.
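One common way to write a model of this kind is given below. The notation is an assumption of this sketch and is not taken from the paper: $y_{ct}$ is the TDM share for country $c$ in year $t$, $\mathrm{Copyright}_{c}$ is a measure of how restrictive the country's copyright law is (whether it also varies by year is not stated in the abstract), $\mathbf{x}_{ct}$ collects the control variables, and $\alpha_c$ and $\delta_t$ are the varying intercepts for country and year.

```latex
\begin{align}
  y_{ct} &= \beta_0 + \beta_1\,\mathrm{Copyright}_{c}
            + \boldsymbol{\gamma}^{\top}\mathbf{x}_{ct}
            + \alpha_c + \delta_t + \varepsilon_{ct}, \\
  \alpha_c &\sim \mathcal{N}(0, \sigma_\alpha^2), \quad
  \delta_t \sim \mathcal{N}(0, \sigma_\delta^2), \quad
  \varepsilon_{ct} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{align}
```

Under this formulation, the effect of copyright law on the TDM share is summarised by $\beta_1$.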

The data illustrate the rapid growth of TDM-related articles in total research output across all countries. We find a highly significant effect of copyright law: the more restrictive copyright law in most European countries is associated with a significantly lower share of TDM output. Data mining makes up a higher share of total research output in countries with more permissive copyright laws. Some Asian countries in particular over-perform in terms of their TDM research output. What is more, the share of TDM in total research output grows more rapidly in the less restrictive countries.

Lucie Guibault is Associate Professor at the Institute for Information Law at the University of Amsterdam (UvA). She studied law at the Université de Montréal (Canada) and in 2002 she received her doctorate from the University of Amsterdam, where she defended her thesis on ‘Copyright Limitations and Contracts: An Analysis of the Contractual Overridability of Limitations on Copyright’. She specialises in international and comparative copyright and intellectual property law, and has carried out research for the European Commission, Dutch ministries, UNESCO and the Council of Europe. Her main areas of interest include copyright and related rights in the information society, open content licensing, collective rights management, limitations and exceptions in copyright, and author’s contract law. She was coauthor of the Report of the Expert Group on Standardisation in the Area of Innovation and Technological Development, Notably in the Field of Text and Data Mining, written for the Directorate-General Research and Innovation, European Commission.

3.2 Can Computational Knowledge Discovery Tools Speed up Scientific Discovery?

Pinar Öztürk and Erwin Marsi, Norwegian University of Science and Technology, Norway
Natalia Manola, University of Athens, Greece

The inherent nature of environmental systems calls for interdisciplinary and collaborative research, which is in contrast with the traditional organisation of research around discipline-centric silos. The disconnectedness between marine biology, marine chemistry, socio-economics, etc. is the main barrier slowing the discovery of new knowledge about complex problems such as the impacts of climate change on nature and society. The main obstacle to scientific advancement on such problems has shifted from the production of discipline-centric knowledge to linking together pieces of existing knowledge across disciplines. The management of such a vast body of knowledge is far beyond the capabilities of individual scientists.

Computer tools have so far focused on keyword-based document retrieval, while Ocean-Certain (OC), an FP7 project, aims to develop tools for Literature-Based Discovery (LBD) of scientific knowledge about the role of oceans in the export of CO2 to the sediments. LBD is used in biomedicine, but its use in earth science is virtually unexplored. The principal idea is that, using text mining, computers can machine-read huge amounts of literature, extract the fertile pieces of knowledge (in the form of entities, events and relations) and link them together to identify potential gaps and missing links in the published work, while inferring new knowledge. The ongoing work also investigates the coherence among the published results on a particular question, e.g., whether the growth of phytoplankton leads to better CO2 export. The tool aims to retrieve instances of positive and negative answers, whose ratio would indicate the degree of uncertainty in the collective knowledge. The online collaborative platform will ultimately allow researchers to see one another’s questions and the system responses, and to give feedback on which of the system-inferred hypotheses seem plausible to pursue.
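The idea of reading uncertainty from the balance of positive and negative answers can be sketched as follows. This is a hypothetical illustration, not the Ocean-Certain tool: the function name and the particular disagreement score are assumptions of this sketch.

```python
from typing import Dict


def answer_balance(n_positive: int, n_negative: int) -> Dict[str, float]:
    """Summarise how evenly retrieved answers split between positive and negative."""
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("no answers were retrieved for this question")
    positive_share = n_positive / total
    # Assumed disagreement score: 0.0 when all answers agree,
    # 1.0 when positive and negative answers are evenly split.
    disagreement = 2 * min(n_positive, n_negative) / total
    return {"positive_share": positive_share, "disagreement": disagreement}
```

For example, a question with as many negative as positive instances in the literature would receive the highest disagreement score under this assumed measure, flagging it as poorly settled.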

For maximum benefit to research communities, the platform must overcome today’s legal and technical barriers. These are mostly publisher-imposed obstacles, which increase (i) uncertainty for end users, and hence low uptake, owing to unclear licensing, which in turn affects scientific reproducibility; (ii) technological complexity, since licensing terms must be verified constantly; and (iii) hidden costs, since many researchers need to repeat the same text mining processes in their own environments on non-shared content.

Pinar Öztürk is Associate Professor in the Department of Computer and Information Science at the Norwegian University of Science and Technology (NTNU). She received her PhD from the Department of Computer and Information Science at NTNU. Her focus is artificial intelligence methods, including rule-based and model-based methods and machine learning. She applies AI methods to distributed decision-making, ontology building and knowledge extraction from text. She has led and worked on various small and larger projects funded by the Norwegian Research Council, the EU and industry. Currently, she is leading a work package in the Smart Power Grid project (funded by Utilities) and a task in the EU-funded OCEAN-CERTAIN project (since November 2013) that focuses on text mining in Climate Science.