‘The most important prerequisite for applying state-of-the-art text analysis techniques is the availability and openness of information regarding foodborne diseases, outbreaks, food alerts and recalls.’
Giannis Stoitsis, Nikos Marianos and Nikos Manouselis
According to the WHO, foodborne and waterborne diseases kill an estimated 2.2 million people annually, most of whom are children. This has important health and economic implications. In Europe, for example, 1.2 million cases of foodborne disease are reported annually, leading to 350,000 hospitalisations and 5,000 deaths. The estimated economic cost to Europe is €117 billion/year.
The early detection of outbreaks in this area is a major challenge. It is hampered by the fact that decision makers in both the public and private sectors, as well as food scientists, microbiologists and epidemiologists working on food safety topics, cannot take full advantage of the existing data on foodborne diseases.
This is mainly because the data tends to be unstructured, kept in internal databases, or stored in customised formats that cannot easily be shared in an interoperable way.
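To illustrate the interoperability problem described above, the sketch below turns a free-text recall notice into a structured, shareable record. The notice format, field names and regular expression are invented for illustration; real recall feeds vary widely, which is precisely the difficulty.

```python
import json
import re

# Hypothetical free-text recall notice (invented format).
notice = ("RECALL 2024-117: Frozen spinach, lot A-5521, "
          "recalled on 2024-03-14 due to Listeria monocytogenes.")

# Extract structured fields with named groups.
pattern = re.compile(
    r"RECALL (?P<id>[\d-]+): (?P<product>[^,]+), lot (?P<lot>[\w-]+), "
    r"recalled on (?P<date>\d{4}-\d{2}-\d{2}) due to (?P<hazard>[^.]+)\."
)

record = pattern.match(notice).groupdict()
print(json.dumps(record, indent=2))
```

Once records share a common schema like this, they can be exchanged and aggregated across organisations; the hard part in practice is that every source requires its own extraction logic.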
Text mining tools and methods would help in numerous ways, including:
The most important prerequisite for applying state-of-the-art text analysis techniques is the availability and openness of information regarding foodborne diseases, outbreaks, food alerts and recalls.
‘There are serious legal hurdles in terms of using full-text scientific literature for text mining purposes, starting with the under-specified metadata associated with articles to be processed automatically. There are also no clear guidelines on how much content from full-text papers can be redistributed, either to carry out manual searches or additional text mining analysis. It is moreover a challenge to identify unambiguously the necessary published information and contact person in order to request text mining usage exemptions.’
Martin Krallinger, Alfonso Valencia
Spanish National Cancer Research Centre
One of the key issues for the implementation, advancement and practical usefulness of TDM is the construction of a suitable gold standard dataset to train and assess the performance of the methodologies used. These rely increasingly on supervised and semi-supervised machine learning algorithms. In this context we organised a series of international community challenges called BioCreative (Critical Assessment of Information Extraction systems in Biology). Our aims were to:
Four official BioCreative challenges and two side events have taken place so far, with more than 300 developers of text mining and natural language processing systems taking part. The organisation of these events was motivated by the increasing number of groups working in the area of text mining. Despite this increased activity, there were no common standards or shared evaluation criteria to enable comparison among the different approaches.
The various groups were addressing different problems, often using private data sets, and as a result it was impossible to determine how good the existing systems were, whether they would scale to real applications, and what performance could be expected. A common issue with those private datasets was copyright restrictions that prevented the text annotations from being made available to other research teams.
Since there is considerable difficulty in constructing suitable ‘gold standard’ data for training and testing new information extraction systems that handle life science literature, the data sets derived from the BioCreative challenges are particularly important. Biological database curators and domain experts have examined them, which makes the datasets useful resources for the development of new applications and the improvement of existing ones.
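The evaluation logic behind such challenges can be sketched in a few lines: predicted entity mentions are compared against expert gold-standard annotations and scored with precision, recall and F1. The document identifiers and gene names below are illustrative, not taken from any BioCreative corpus.

```python
# Gold-standard annotations and system predictions as (document, entity) pairs.
gold = {("PMID:1", "BRCA1"), ("PMID:1", "TP53"), ("PMID:2", "EGFR")}
predicted = {("PMID:1", "BRCA1"), ("PMID:2", "EGFR"), ("PMID:2", "KRAS")}

# Standard exact-match scoring.
true_positives = len(gold & predicted)
precision = true_positives / len(predicted)   # correct / all predicted
recall = true_positives / len(gold)           # correct / all annotated
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Shared scoring code of this kind is what makes results from different participating systems directly comparable.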
‘In many cases we are required to mine the full-text content in order to determine the licence of the content. The licence is typically not provided as part of the metadata. We are currently depending on the recent UK Copyright Exception for text-mining to do so, but a European-wide approach would be helpful.’
COnnecting REpositories (CORE) is a not-for-profit service run by the Knowledge Media institute, Open University. Our research into aggregating and text-mining of research papers, supported by a range of funders including Jisc and the European Commission, has resulted in the creation of a platform with a number of applications built on top of it, providing benefits to a range of stakeholders and the general public.
CORE contains over 20 million open access research papers from worldwide repositories and journals and is used by over 90,000 unique visitors every month. By processing both full-text and metadata, CORE serves three communities:
As part of its work, CORE uses text and data mining methods on its aggregated papers in order to:
‘Those creating models from corpora are generally unable to tell what license the model has. Still, it is a general practice to make such models publicly available for download. While this is convenient for researchers, it is a problem for builders of research infrastructures, because, in order to distribute language models through a federation of online repositories, it is important to know their license status.’
Richard Eckart de Castilho, Iryna Gurevych
UKP Lab, Technische Universität Darmstadt
Currently, TDM relies significantly on language models that describe language on an abstract level, often through absolute and relative frequencies and weights learned from a text corpus. Such models are required at every stage of language analysis, from relatively simple text segmentation up to entity extraction, fact extraction, sentiment analysis, and so on.
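A minimal sketch of the kind of frequency-based model described above: relative word frequencies are learned from a small corpus and then used to score how typical a new token sequence is. The toy corpus and the add-one smoothing choice are illustrative only.

```python
import math
from collections import Counter

# Learn unigram counts from a tiny illustrative corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
counts = Counter(corpus)
total = sum(counts.values())

def log_prob(tokens, smoothing=1.0):
    """Add-one smoothed unigram log-probability of a token sequence."""
    vocab_size = len(counts)
    return sum(
        math.log((counts[t] + smoothing) / (total + smoothing * vocab_size))
        for t in tokens
    )

print(log_prob("the cat sat".split()))       # relatively high: typical of the corpus
print(log_prob("quantum flux sat".split()))  # lower: mostly unseen words
```

Real models used in segmentation, entity extraction or sentiment analysis are far richer (n-grams, learned feature weights, neural representations), but the core idea of statistics derived from a text corpus is the same, and so is the licensing question that follows from the corpus used.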
The DKPro Core infrastructure aims to package these models, making them available as part of the infrastructure and deploying them along with various tools through a common application programming interface. It allows users to:
One of the major challenges currently faced by the DKPro Core infrastructure is a lack of licenses attached to models, even though these models are usually publicly available for download. This is a problem for builders of research infrastructures. In order to distribute language models through a federation of online repositories, it is important to know their license status. Without this information, our ability to make such models available as part of the infrastructure is significantly constrained.
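The constraint described above can be made concrete with a small sketch: before redistributing models through a repository federation, an infrastructure can filter on explicit licence metadata and hold back anything whose licence is unknown. The model names, the metadata layout and the set of acceptable licences are all invented for illustration.

```python
# Hypothetical model records as they might appear in a repository index.
models = [
    {"name": "en-token-model", "license": "Apache-2.0"},
    {"name": "de-ner-model",   "license": None},         # no licence attached
    {"name": "en-pos-model",   "license": "CC-BY-4.0"},
]

# Licences under which redistribution is known to be permitted (illustrative).
REDISTRIBUTABLE = {"Apache-2.0", "CC-BY-4.0", "MIT"}

distributable = [m["name"] for m in models if m["license"] in REDISTRIBUTABLE]
held_back = [m["name"] for m in models if m["license"] not in REDISTRIBUTABLE]

print("distributable:", distributable)
print("held back pending clarification:", held_back)
```

In this conservative policy, a missing licence is treated the same as a restrictive one, which is exactly why unlicensed models constrain what an infrastructure can offer.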
The legal status of language models is also a problem for commercial users. Questions about language model licenses come up often and typically remain without a clear answer or even with no answer at all.
‘We are interested in knowledge discovery related to the science of Climate Change. We are particularly interested in detecting events and the causes, possibly causal chains, leading to these events. Hence, entity and event detection and relation extraction are important parts of our work. For this, we need access to scientific publications and the rights to perform text mining on them.’
Pinar Öztürk, Erwin Marsi
Norwegian University of Science and Technology (NTNU)
Our goal is to design and develop a discovery support system to be used by researchers in various disciplines in the climate change domain. This is a cross-disciplinary domain in which scientists in marine chemistry, marine biology, geology, environmental sciences and related fields are trying to better understand the impacts of climate change on the marine food web and the related process of CO2 sequestration through the biological pump.
This complex problem attracts many researchers in various disciplines, leading to a huge number of publications. The volume of publications consequently makes it impossible for any researcher to find and read all publications relevant to his/her own research. Differences in the terminologies of various disciplines exacerbate this situation. With so much latent public knowledge waiting to be discovered, computational support for the process of scientific discovery seems an obvious direction in which to go.
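The entity and relation extraction mentioned above can be illustrated with a deliberately naive pattern-based sketch: causal trigger phrases are used to pull (cause, effect) pairs out of sentences. The example sentences and the trigger pattern are invented; real systems use parsing and learned models rather than a single regular expression.

```python
import re

# Invented example sentences of the kind found in climate science abstracts.
sentences = [
    "Ocean acidification leads to reduced calcification in plankton.",
    "Warming surface waters cause stratification of the water column.",
]

# Naive causal trigger pattern: "<cause> leads to / causes <effect>."
trigger = re.compile(r"^(?P<cause>.+?) (?:leads to|causes?) (?P<effect>.+?)\.$")

pairs = []
for sentence in sentences:
    match = trigger.match(sentence)
    if match:
        pairs.append((match.group("cause"), match.group("effect")))

for cause, effect in pairs:
    print(f"{cause!r} -> {effect!r}")
```

Chaining such pairs across many papers is one way to surface candidate causal chains that no single researcher would have read about in full, which is the motivation for requiring mining rights over the underlying publications.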
‘Through the Khresmoi project, a search and access system for biomedical information and documents was developed. This EC FP7 project has since been taken up by SMEs and commercialised via H2020 funding.’
University of Sheffield, United Kingdom
The project developed a system by which information could be automatically extracted from biomedical documents. This information was improved through manual annotation and active learning, as well as automated estimation of the level of trust and expertise of target users. Information extracted from unstructured or semi-structured biomedical texts and images was linked to structured information in knowledge bases. Khresmoi supported searches in many languages and returned machine-translated pertinent excerpts.
The project was important because:
Unfortunately, copyright and licensing posed problems. Although many web pages are clearly unrestricted, some carry explicit copyright restrictions. In the case of American sites, it could be argued that what Khresmoi was doing qualified as fair use for research under United States copyright law. However, this would not be the case if Khresmoi were to be commercially exploited. Additionally, some non-US sites carried restrictive copyright terms.