In the current era, we are producing data in far greater quantities than ever before.
Harnessing the data deluge has been recognised as having the potential to help find solutions for some of society’s biggest challenges, such as climate change, health and demographic change, depleting natural resources, and globalisation.
Whilst the benefits of access to data and the use of techniques such as Text and Data Mining (TDM) to analyse data have been widely acknowledged, the reality is that there are major barriers preventing access to and exploitation of data. These issues include a lack of legal certainty, restrictive licences provided by publishers, a skills gap and a lack of infrastructure.
This situation has created a need to foster agreement across disciplines and sectors about the real benefits of TDM. We need a strategy for the way forward in terms of creating the conditions for realising these benefits in a way that ensures a positive societal impact.
For this reason, 25 global experts from many different areas of specialisation — researchers, publishers, lawyers, lecturers — gathered in The Hague on 9-10 December 2014 to write the Hague Declaration. Their belief is that this Declaration will help shape ethical research practice, legislative reform and the development of open access policies and infrastructure. Below are profiles of each original participant.
Miguel Andrade received his Ph.D. in Biochemistry at the Universidad Complutense de Madrid in 1994. He trained at the post-doctoral level at the European Molecular Biology Laboratory in Heidelberg and Cambridge with Chris Sander and Peer Bork. His post-doctoral studies involved the development and application of computational methods for the analysis of gene and protein function and structure.
From 2003 to 2007, he was Assistant Professor in the Department of Medicine of the University of Ottawa and Scientist and Head of the Bioinformatics Group of the Ottawa Health Research Institute in Ottawa, Canada, where he was promoted to Senior Scientist in 2006.
In 2007 Miguel started the Computational Biology and Data Mining group, first at the Max Delbrück Center for Molecular Medicine in Berlin and since September 2014 at the Institute of Molecular Biology in Mainz. The group focuses on the development and application of computational methods that are used to research the molecular and genetic components of human disease.
Paul Ayris has been director of Library Services at University College London since 1997. He is also the UCL Copyright Officer, and Chair of the LERU (League of European Research Universities) Community of Chief Information Officers. He chairs the OAI Organising Committee for the CERN Workshops on ‘Innovations in Scholarly Communication’ and JISC Collections’ Electronic Information Resources Working Group. He served as President of LIBER for four years. On 1 August 2013, he became Chief Executive of UCL Press. He has a PhD in Ecclesiastical History and publishes on English Reformation Studies.
Lars Bjørnshauge is director of European Library relations at SPARC Europe. He is responsible for developing state-of-the-art digital library services and for establishing the first transnational library consortia. His work contributed to the success of the first electronic-only library. Lars founded the Directory of Open Access Journals, which is a comprehensive list of Open Access journals that use quality control systems for their content. Currently he is Vice-President of the Swedish Library Association. He was a past President of the Association of Danish Research Libraries and Senior Adviser to the National Library of Sweden. Lars is also a past member of the Open Access Working group of EUA (European University Association), Co-Chair of the LERU Working Group on Open Access, and co-author of the LERU Road Map on Open Access. He is also Chair of IFLA’s Open Access Taskforce and co-author of IFLA’s statement on Open Access.
Benjamin Bober has an MA in History from the University of Paris I Panthéon-Sorbonne and also graduated from École Nationale des Chartes and from École Nationale Supérieure des Sciences de l’Information et des Bibliothèques. He is currently a licensing manager and a metadata librarian at ABES, the French national bibliographical agency for higher education. He is involved in the ISTEX project, a €60 million e-content national licensing initiative aimed at creating a platform that will host content bought from academic publishers and let all members of the French academic community content mine. From this perspective, he is particularly interested in the quality of data and metadata provided by the publishers, and the impact it may have on content mining. He is also a member of SavoirsCom1, an association that focuses on knowledge commons policy.
Maja Bogataj Jančič is CEO and founder of Intellectual Property Institute (IPI) in Ljubljana, Slovenia, specialising in copyright, internet law, and intellectual property. The primary focus of IPI’s activities is to explore the challenges that digitisation has brought to intellectual property law and the importance of these for the progress of an information-based society. The Institute works in close cooperation with Slovenian universities, research institutions, industry, art centres and civil society groups. Its aim is to create a strong network of partnership with researchers and research institutions at home and abroad. Since its recent founding, the Institute has participated in many discussions regarding various intellectual property issues, and managed to mark the landscape of the debate with its perspective. Maja also heads up Creative Commons Slovenia.
Christoph Bruch is senior advisor for Strategy at Helmholtz Open Science Coordination Office. His focus is on legal aspects concerning research publications and research data. He is a member of the Working Group on Legal Aspect of the German Priority Initiative “Digital Information” and chairs the Science Europe Task Force on Legal Aspects, a subcommittee of the Science Europe Working Groups on Research Data and Open Access to Scientific Publications. He studied political sciences at Johann Wolfgang Goethe University in Frankfurt am Main and Free University Berlin. Before joining Helmholtz Association he held professional positions at the Free University Berlin, the German Institute for Urban Studies, and Max Planck Society. In an honorary capacity he advocates access to knowledge via his engagement with Coalition for Action “Copyright for Education in Research”, European Network for Copyright in support of Education and Science.
Chris Ferguson is Joint Chief Editor of PLOS Biology. Chris has a science background in Microbiology and Immunology and was awarded a PhD in Developmental Biology from the University of Cape Town. After 6 years of postdoctoral research at King’s College London, Chris entered the publishing world and trained as an editor on the Trends Review Journal series at Cell Press. Having run the journal Trends in Immunology as Editor for two and a half years, she resigned to join PLOS in 2007, as an editor on the flagship journal, PLOS Biology. As Chief Editor, she oversees the activities of the journal and its editorial staff, and serves on the senior editorial and the publication ethics teams at PLOS.
At PLOS we believe that TDM is an important research methodology that must be supported by the keepers of the scholarly literature. By making our content open access, PLOS is facilitating TDM. We hope to offer better access for TDM researchers moving forward. PLOS participates in industry efforts to further facilitate TDM and encourages all publishers to open their content stores to TDM efforts with minimal barriers or obstacles.
Will Greenacre is a Policy Officer at the Wellcome Trust, a global charitable foundation dedicated to achieving extraordinary improvements in human and animal health, where his areas of responsibility include research regulation and governance, European advocacy, data sharing and copyright. He has a background in biological sciences and science communication, with degrees from the Universities of Leicester and Bath, UK. The Wellcome Trust supports the brightest minds in biomedical research and the medical humanities. Our breadth of support includes public engagement, education and the application of research to improve health. We are independent of both political and commercial interests. The Wellcome Trust supports unrestricted access to the published outputs of research, and supports measures to enable text and data mining for non-commercial research purposes; we believe that enabling the use of text and data mining is a critical element in realising the value of content and data for economic and societal benefit, and to derive the maximum benefit from research literature and datasets generated through investment in publicly-funded research.
Lucie Guibault is associate professor at the Institute for Information Law (IViR) of the University of Amsterdam (UvA). She specialises in international and comparative copyright and intellectual property law. Lucie Guibault has been carrying out research for the European Commission, Dutch ministries, UNESCO and the Council of Europe. Her main areas of interest include copyright and related rights in the information society, open content licensing, collective rights management, limitations and exceptions in copyright, and author’s contract law. She has done extensive research on open content copyright licensing issues and open access in science, and is one of the co-authors of the report commissioned by DG Research and Innovation on the Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining (April 2014).
Melissa Hagemann is a Senior Program Manager at the Open Society Information Program where she heads the Open Access to Research and Open Educational Resources initiatives. She has been involved in the development of the Open Access and Open Education movements, having co-organised the meetings which led to the Budapest Open Access Initiative (BOAI), the BOAI 10 Recommendations and the meeting which led to the Cape Town Open Education Declaration. Melissa has held several positions within the Open Society Foundations, including managing the Foundations’ Regional Library Program in Budapest, as well as the Science Journals Donation Program. She currently sits on the Advisory Board of the Wikimedia Foundation.
This meeting is at the heart of the Open Society Information Program’s Access to Knowledge field and we are supporting it through a partnership between our Copyright Reform and Open Access Initiatives. Personally I am interested in participating in it, as I have worked in the Open Access movement for over a decade and one of the promises of OA is facilitating TDM. Thus I want to do all we can to ensure that strong recommendations are developed which will thwart any mechanisms created by the toll-access publishers or regulators to prohibit TDM.
Kristiina Hormia-Poutanen is the director of Library Network Services at the National Library of Finland. The department is responsible for the coordination of national library infrastructure services for the Finnish libraries. The services include coordination of consortia activities, library systems management and development, national licensing and development of the digital library for the libraries, archives and museums in Finland. All Finnish universities, universities of applied sciences, tens of research institutes, all public libraries, Finnish museums and archives are customers of the National Library.
In the Finnish research infrastructure evaluations in 2008 and 2013, two of the services now in production were selected for the research infrastructure roadmap. The services were the National Electronic Library, FinELib, and the National Digital Library user interface Finna (finna.fi). The term of the updated research infrastructure roadmap is 2014-2020.
Hormia-Poutanen has been president of the LIBER foundation since 2014, having served as Vice President from 2010 to 2014. She is a member of the Steering Committee on Scholarly Communication and Research Infrastructures under LIBER. She is a member of Europeana’s Board and Executive Committee. Hormia-Poutanen is a member of the Open Science Finland strategy group, the National Digital Library steering group and the ICT Management Steering Group for the field of activity of the Ministry of Education and Culture (OpIT), which the Ministry of Education and Culture nominates. She is also a member of the Public Interface consortium group and the National Ontology (finto.fi) steering group nominated by the board of the National Library.
Puneet Kishor is an independent practitioner and consultant on open science and data, and a senior researcher at the University of Wisconsin-Madison. Puneet’s current projects are focused on text and data mining, the role of social contracts in sharing beyond copyright, ethical and quality considerations of citizen-sourced information, and open science in the global south. Previously, Puneet was the manager of science and data policy at Creative Commons (CC) where he worked on all aspects of the scientific information lifecycle to make it systemically open and collaborative. It was in this context that he worked on The Hague Declaration. Puneet arrived at CC via a rural development NGO in New Delhi, the World Bank in Washington DC, and data research at the University of Wisconsin-Madison. Puneet is a data wrangler, environmental scientist, geospatial developer and Charter Member of the Open Source GeoSpatial Foundation.
Martin Krallinger works at the Spanish National Cancer Research Center (CNIO). He is an expert in the field of biomedical text mining and has authored over 50 research articles on topics such as protein-protein interaction extraction, text mining for model organisms and mutation extraction and chemical text mining from the literature. He is one of the main organizers of the BioCreAtIvE Challenge, a key event in natural language processing and text mining of life sciences and biomedical literature.
There is a pressing need to advance systematic access to information in full text articles by the biomedical and life sciences community. This is key to improving the experimental settings used in research, facilitating the interpretation of large-scale experiments and patient data, as well as to generating new hypotheses. Several studies have shown that using abstracts, as opposed to full text articles, can recover only up to 20-30% of relevant entities and relationships described in published articles.
Ignasi Labastida, PhD in Physics. Currently Head of the Office for Knowledge Dissemination and the Research Unit at the CRAI (Library) of the University of Barcelona. Public leader of Creative Commons in Spain. Member of the Copyright Working Group of LIBER and member of the Steering Committee of the CIO Community of LERU.
I am interested in TDM because part of my work is to assist researchers from my institution to do their work. They need clear policies, examples and best practices. They ask for guidelines to do their work and they want to avoid any legal trouble. At the same time I also work with the research vice-rectors drafting some policies and it could be a good chance to establish a position regarding this issue.
Natalia Manola is a Senior Software Engineer holding a B.Sc. in Physics from the University of Athens, Greece, and an M.Sc. in Electrical and Computer Engineering from the University of Wisconsin at Madison, USA. Her professional experience consists of several years of employment as a Software Engineer, Software Architect, and Project Manager by companies in various Information Technology sectors. She has participated in and technically managed several R&D projects funded by the EC (DIAS, DRIVER, DRIVER-II, CHESS, ESPAS) or by the national government. Since Dec 2008 she has been the director of the OpenAIRE infrastructure.
OpenAIRE operations extensively use text mining techniques on publicly funded research results that are harvested systematically from a wide range of sources (repositories, OA journals, publishers dbs, scholarly societies, funder dbs) to identify entities such as project IDs, people, organizations, citations and references to publications and data; classify and cluster publications and other project documents (e.g., abstracts, possibly deliverables, progress reports, etc.) for research analytics purposes, and to annotate and link entities/objects.
Repositories have not yet established clear policies and licenses for text mining, as this is an emerging trend, especially as an infrastructural service; OA Journals and Scholarly societies have not adopted uniform solutions (licenses, policies, APIs). OpenAIRE plans to extend its mining processes to patents, PSI data and any other source that may be related to research and scholarly communication.
Jo McEntyre is Team Leader for Literature Services at the European Bioinformatics Institute (EMBL-EBI), which runs Europe PubMed Central, the European database for life science research articles. Before joining the EMBL-EBI, Dr McEntyre was a scientist at the NCBI, NIH, USA, where she worked on various literature-related resources, and before that was the Editor of the journal Trends in Biochemical Sciences.
My primary interest in text and data mining is from the perspective of integrating the life sciences literature with related data, being based at the European Bioinformatics Institute (EMBL-EBI). As the service provider for Europe PMC, we seek to develop Europe PMC as a platform for text-mining groups to run cloud-based applications that deliver added-value content enrichment, cross-linking and information retrieval on the core literature content. Our group at the EMBL-EBI also undertakes TDM activities directly. The aspiration is that applications developed either by ourselves or others contribute to the workflows and discovery processes of data curators and end-users of bioinformatics resources (including the literature).
Eva Méndez Rodriguez has been a lecturer at Universidad Carlos III de Madrid since March 1997 and Tenured Professor since May 2008. She has also taught and carried out research at other Spanish and foreign universities and educational institutions. She has been an active member of several international working groups and research teams on various standards for the Web and description of electronic resources. She is a member of the US Academy Louis Round Wilson-Knowledge Trust and the Advisory Committee of the DCMI (Dublin Core Metadata Initiative), where she is also co-chair of the DCMI Social Tagging community. During the 2005-06 academic year she was awarded a Fulbright Research Scholarship, as part of the European Union programme, at the Metadata Research Center at the University of North Carolina at Chapel Hill (USA).
She has taken part in and led several research projects and acted as advisor to many more in the fields of normalisation, metadata, semantic web, open data, digital repositories and libraries, in addition to information policies for development in several countries such as the Dominican Republic. Since 2006 she has participated as an independent European Commission expert on the assessment and monitoring of various projects for a number of programmes such as 7PM, ICT-PST, eContentPlus, in the fields of the Europeana digital library, Technologies applied to education (TEL) and Open Science. From 2009 to 2012 she was Director of the University Master’s degree in Digital Information Libraries and Services, and since September 2011 she has been Deputy Vice Chancellor of Infrastructures and Environment.
Peter Murray-Rust is a chemist currently working at the University of Cambridge. As well as his work in chemistry, Murray-Rust is also known for his support of open access and open data. He was educated at Bootham School and Balliol College, Oxford. After obtaining a Doctor of Philosophy he became lecturer in chemistry at the (new) University of Stirling and was first warden of Andrew Stewart Hall of Residence. In 1982 he moved to Glaxo Group Research at Greenford to head Molecular Graphics, Computational Chemistry and later protein structure determination. He was Professor of Pharmacy in the University of Nottingham from 1996 to 2000, setting up the Virtual School of Molecular Sciences. He is now Reader in Molecular Informatics at the University of Cambridge and Senior Research Fellow of Churchill College, Cambridge. Peter is also known for his work on making scientific knowledge from literature freely available, and in doing so taking a stance against publishers that are not fully compliant with the Berlin Declaration on Open Access. In 2014 he actively raised awareness of glitches in the publishing system of Elsevier, where restrictions were imposed by Elsevier on the reuse of papers after the authors had paid Elsevier to make the paper freely available.
$500,000,000,000 of public funding goes into STEM research but most (perhaps 85%) is wasted to the world through poor or non-publication, duplication and flawed design. Machines can, in principle, liberate a significant part of this but we are prevented by apathy, 19thC attitudes and technology, and active opposition from vested interests. This makes me angry. In a blogpost in 2012, I asserted the mantra: “The right to read is the right to mine.” We have now, with the help of the Shuttleworth Foundation, built a universal, technical and social infrastructure, http://contentmine.org. This is Free/Open to anyone and is customisable to allow facile and massive extraction of the factual content of the STEM literature. We have fought for this right in the UK, and won it, so our activities are fully legal. We and others have also been fighting for this in Europe. We have a wide range of Free/Open resources: code, training, community, political and legal.
Pinar Öztürk is associate professor in the Department of Computer and Information Science at Norwegian University of Science and Technology, NTNU, Trondheim. She has been a project manager and participant in several national-level and European Union projects dealing with decision support systems.
Her main research area is artificial intelligence and she does multidisciplinary research linking AI with other parts of cognitive science. Her research activities lie in knowledge representation and modelling, case-based reasoning, multiagent systems, and recently text mining/information extraction areas. The last one is particularly in the context of literature-based scientific discovery related to an EU project in FP7 framework, Ocean Certain, focusing on understanding the impacts of climate change on the marine species and carbon sequestration.
Her group aims to develop text mining tools that can help to speed up the knowledge discovery through linking scientific knowledge across various disciplines.
Susan Reilly is executive director of LIBER, The Association of European Research Libraries. She has led LIBER’s advocacy activities in the areas of TDM, open access and copyright. She has also worked across a range of EU projects relating to open access, e-science, and digital libraries. She has recently contributed to the LERU Roadmap for Open Access to Research Data and has co-authored a study, for the European Commission, to identify recommendations for a single pan-European authorisation, authentication and accounting (AAA) infrastructure for research information resources. She holds an MSc in Information Management from the University of Sheffield, and has several years’ experience in library management.
Supporting TDM will become a core part of what research libraries do. The mission of libraries is to ensure access to information and support the generation of knowledge. In the digital age this means ensuring that digital content is available in a way that researchers can read and exploit it using current and innovative tools and practices.
Neil Richards is an internationally-recognized expert in the fields of privacy, First Amendment, and information law. His recent work explores the complex relationships between free speech and privacy in cyberspace.
Professor Richards’ articles have appeared or are forthcoming in the Harvard Law Review, Columbia Law Review, California Law Review, Virginia Law Review, and Georgetown Law Journal. His book, Intellectual Privacy, will be published by Oxford University Press in 2014. Professor Richards also co-directs both the Washington University-Cambridge University International Privacy Law Conference and the Washington University Free Speech Conference. Professor Richards is a recipient of the Washington University student body’s David M. Becker Professor of the Year Award.
Prior to joining the law faculty in 2003, he practiced law in Washington, D.C. with Wilmer, Cutler, and Pickering, where he specialized in appellate litigation and privacy law. He is also a former law clerk to Chief Justice William H. Rehnquist, and Judge Paul V. Niemeyer of the United States Court of Appeals for the Fourth Circuit. More recently, he successfully represented a St. Louis fantasy sports company in high-profile litigation against Major League Baseball. He was the inaugural Hugo Black Fellow at the University of Alabama Law School and a Temple Bar Fellow with the Inns of Court in London.
Nilu Satharasinghe is CTO of Sparrho, a recommendation engine for scientific content. We focus on connecting our users to the latest scientific content they need to see. People have the ability to make non-linear connections. When combined with others and amplified by technology they become exponentially more valuable. We do this by processing free-to-read information such as RSS feeds and HTML. We analyse this publicly available content in order to draw connections between it and our users.
Our interest in TDM is pragmatic: we are concerned about the added complexity that increased licensing brings. Restrictions hinder our ability to analyse sufficient data to be useful. Furthermore, requesting license permission can be a difficult and time-consuming process, let alone trying to determine who the license holders are. We believe the right-to-read is the right-to-mine. Our machines are simply reading content and making notes. This could be done by hand, though doing so for a reasonable dataset would take decades. Providing purely non-commercial exceptions is insufficient as in science the two are not mutually exclusive. Commercial resources aid in funding academic efforts which in turn can often lead to further commercialisation. This increase in business risk can force SMEs to consider relocation in order to survive.
Alek Tarkowski is Director of Centrum Cyfrowe Projekt: Polska, a think-and-do-tank building a digital civic society in Poland. Public Lead of Creative Commons Poland, the Polish branch of the global organization promoting flexible copyright models for creators, for which he also works in the individual capacity of a European Policy Advisor. Member of the Polish Board of Digitisation, an advisory body to the Polish Ministry of Administration and Digitisation. Member of the Administrative Council of Communia, a European association supporting the digital public domain. Vice-Chairman of the Polish Coalition for Open Education (KOED). In 2007-2011 member of the Board of Strategic Advisors to the Prime Minister of Poland, responsible for issues related to the development of digital society. Co-author of the report “Poland 2030” and the Polish official long-term strategy for growth. Centrum Cyfrowe is one of the key organisations in Poland advocating for copyright reform and open policies. Our particular focus is on reforms that support public interest goals, for education, libraries or research.
I have been involved for the last several years in policy debates on open education, open science, and copyright reform, both in Poland and in Europe (through Communia and the Copyright for Creativity coalition). My particular focus is on open policies for education, science and culture – and on policy approaches that seek general solutions to the issue of availability of content. I was involved, on behalf of Communia, in the TDM Working Group during the Licenses for Europe process. I am interested in policy approaches that combine support for open policies with copyright reform efforts (as demonstrated in the Creative Commons statement on copyright reform, which I co-authored). I consider TDM important as it is an issue that can be approached from both of the abovementioned perspectives; in light of a strong case for new exceptions and limitations for TDM; and as an argument for the need of strong, free licensing of content in face of a lack of such exceptions.
Staffan Truvé has helped launch more than a dozen software companies, including Spotfire (now part of Tibco), Appgate (now part of CryptZone), Axiomatics (secure, role-based access control), and Recorded Future (threat intelligence). In 2009, he co-founded Recorded Future, a pioneer in web intelligence. The company currently employs 45 persons in the US and Sweden and is funded by Google Ventures, Atlas Venture, Balderton, IA Ventures and IQT. The company’s goal is to organize the web for analysis, by doing linguistic analysis and allowing for quantitative and trend-oriented studies of web content, as well as enabling predictive analytics. From 2005 to 2009, he was CEO of the Swedish Institute of Computer Science (SICS), and Interactive Institute, managing an organization of about 200 researchers. From 1994 to 2003, he worked as CEO of CR&T, a Swedish research-oriented consulting company and technology incubator. Prior to that (1992-1994), he was Chief Architect at Carlstedt Elektronik, developing a novel computer system for distributed real-time systems. He holds a PhD in computer science from Chalmers University of Technology. He has been a visiting Fulbright Scholar at MIT and holds an MBA from Göteborg University. His research interests include parallel and distributed computing, computer architecture, compilers, computer vision, natural language processing, information visualization, and open source intelligence.
Benjamin White, IFLA. The International Federation of Library Associations and Institutions (IFLA) is the leading international body representing the interests of library and information services and their users, with over 1500 members in 150 countries. It is the global voice of the library and information profession, and is dedicated to promoting the highest standards of library and information provision. As part of this mission, and in particular through the workings of its Copyright and Other Legal Matters Committee, IFLA understands the importance of ensuring that copyright and related laws reflect how information can be used as technology changes. As the building blocks of knowledge and freedom of expression, it is vital that facts and data can be freely reused by the many differing societal and economic groups who use the services of libraries.