  • Read the first-hand experiences of companies, universities and researchers who use TDM, plus reports giving further background.


CASE #1 – Agroknow

‘The most important prerequisite for applying state-of-the-art text analysis techniques is the availability and openness of information regarding foodborne diseases, outbreaks, food alerts and recalls.’

Giannis Stoitsis, Nikos Marianos and Nikos Manouselis

According to the WHO, foodborne and waterborne diseases kill an estimated 2.2 million people annually, most of whom are children. This has important health and economic implications. In Europe, for example, 1.2 million cases of foodborne disease are reported annually, leading to 350,000 hospitalisations and 5,000 deaths. The estimated economic cost to Europe is €117 billion per year.

The early detection of outbreaks in this area is a major challenge. It is hampered by the fact that decision makers in both the public and private sectors, as well as the food scientists, microbiologists and epidemiologists working on food safety, cannot take full advantage of all the existing data on foodborne diseases.

This is mainly because data tends to be unstructured, is kept in internal databases or is stored in customised formats which cannot be easily shared in an interoperable way.

Text mining tools and methods would help in numerous ways, including:

  1. The extraction of vocabulary and ontology terms from databases, in order to create a common semantic vocabulary which can operate as the backbone for harmonizing the information.
  2. The extraction of structured data from trusted web sources containing information about foodborne diseases, outbreaks and recalls.
  3. The extraction of secondary data from publications and reports, such as tables of results, images and genetic information (DNA of bacteria).
  4. The automatic linking of food outbreak reports with food recall data.
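
To make the fourth item concrete, here is a minimal sketch of linking outbreak reports to recall notices by word overlap (Jaccard similarity). It is an illustration under stated assumptions, not the method used in the case study: the function names, the threshold of 0.2 and the example records are all invented for this sketch, and a production system would more likely match on extracted entities such as pathogen and product names.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_reports_to_recalls(reports, recalls, threshold=0.2):
    """Return (report_id, recall_id, score) for pairs scoring at or above the threshold."""
    links = []
    for report_id, report_text in reports.items():
        report_tokens = tokens(report_text)
        for recall_id, recall_text in recalls.items():
            score = jaccard(report_tokens, tokens(recall_text))
            if score >= threshold:
                links.append((report_id, recall_id, round(score, 2)))
    # Highest-scoring links first.
    return sorted(links, key=lambda link: -link[2])


# Invented example records, not real outbreak or recall data.
reports = {"R1": "Salmonella outbreak linked to raw almonds in two regions"}
recalls = {
    "C1": "Recall of raw almonds due to possible Salmonella contamination",
    "C2": "Recall of bottled water due to labelling error",
}
links = link_reports_to_recalls(reports, recalls)
```

Plain word overlap is crude, since common words such as "to" and "of" inflate the score. It does show why open, machine-readable report and recall texts are the prerequisite: without access to both sides, no linking is possible.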

The most important prerequisite for applying state-of-the-art text analysis techniques is the availability and openness of information regarding foodborne diseases, outbreaks, food alerts and recalls.

Download the full case study.

CASE #2 – Spanish National Cancer Research Centre

‘There are serious legal hurdles in terms of using full text scientific literature for text mining purposes, starting with the under-specified metadata associated with articles to be processed automatically. There are also no clear guidelines on how much content from full text papers can be redistributed either to carry out manual searches or additional text mining analysis. It is moreover a challenge to identify unambiguously the necessary published information and contact person in order to request text mining usage exemptions.’

Martin Krallinger, Alfonso Valencia
Spanish National Cancer Research Centre

One of the key issues for the implementation, advancement and practical usefulness of TDM is the construction of a suitable Gold Standard dataset to train and assess the performance of the methodologies used. These rely increasingly on supervised and semi-supervised machine learning algorithms. In this context we organised a series of international community challenges called BioCreative (Critical Assessment of Information Extraction systems in Biology). Our aims were to:

  1. Evaluate the performance and limitations of text mining and information extraction systems applied to the biomedical domain.
  2. Promote the construction of suitable training text corpora.
  3. Motivate the implementation of practical and useful biomedical text mining applications.
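
As a rough illustration of the first aim, a system's output is commonly scored against a gold standard using precision, recall and F1. The sketch below assumes annotations can be represented as (document, entity) pairs. The function name and the example annotations are hypothetical and are not taken from the BioCreative evaluation software.

```python
def evaluate(predicted, gold):
    """Score predicted annotations against gold-standard annotations.

    Both arguments are iterables of hashable items, e.g. (doc_id, entity) pairs.
    Returns (precision, recall, f1).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented annotations for illustration only.
gold = {("doc1", "BRCA1"), ("doc1", "TP53"), ("doc2", "EGFR"), ("doc2", "KRAS")}
predicted = {("doc1", "BRCA1"), ("doc2", "EGFR"), ("doc2", "MYC")}

precision, recall, f1 = evaluate(predicted, gold)
```

Scores computed this way are only comparable across systems when everyone is evaluated on the same gold standard, which is why shared, redistributable annotated corpora matter.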

Four official BioCreative challenges and two side events have taken place so far. More than 300 developers of participating text mining and natural language processing systems have taken part. The organization of these events was motivated by the increasing number of groups working in the area of text mining. However, despite increased activity in this area, there were no common standards or shared evaluation criteria to enable comparison among the different approaches.

The various groups were addressing different problems, often using private data sets. As a result, it was impossible to determine how good the existing systems were, whether they would scale to real applications, and what performance could be expected. A common issue with those private datasets was that copyright restrictions prevented the text annotations from being made available to other research teams.

Since it is considerably difficult to construct suitable ‘gold standard’ data for training and testing new information extraction systems that handle life science literature, the data sets derived from the BioCreative challenges are particularly important. Biological database curators and domain experts have examined them, which makes the datasets useful resources for the development of new applications and the improvement of existing data.

Download the full case study.

CASE #3 – COnnecting REpositories (CORE)

‘In many cases we are required to mine the full-text content in order to determine the licence of the content. The licence is typically not provided as part of the metadata. We are currently depending on the recent UK Copyright Exception for text-mining to do so, but a European-wide approach would be helpful.’

Petr Knoth

COnnecting REpositories (CORE) is a not-for-profit service run by the Knowledge Media Institute, Open University. Our research into the aggregation and text mining of research papers, supported by a range of funders including Jisc and the European Commission, has resulted in a platform with a number of applications built on top of it, providing benefits to a range of stakeholders and the general public.

CORE contains over 20 million open access research papers from worldwide repositories and journals and is used by over 90,000 unique visitors every month. By processing both full-text and metadata, CORE serves three communities:

  1. Developers, text-miners, scientometricians and others who need large-scale machine access to research papers.
  2. Researchers and the general public who need better, free access to research literature.
  3. Funders and government organisations needing to discover scientific trends and evaluate research impact.

As part of its work, CORE uses text and data mining methods on its aggregated papers in order to:

  • Extract information from research papers, including basic and advanced metadata, citations and unique identifiers.
  • Recommend content of related research papers.
  • Match papers to patents, funding opportunities and open courses to support a range of stakeholders.
  • Mine the licence of research papers to determine if they are compatible with the open access definition.
  • Support scientific knowledge discovery by improving access to research literature.
  • Categorise papers to determine the subject class and allow the monitoring of research trends.
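
To illustrate the licence-mining item in the list above, the following is a minimal sketch that looks for Creative Commons licence URLs in a paper's full text. It is not CORE's implementation: the regular expressions, the function name and the sample sentence are assumptions made for this example, and real papers often state licences in free text that simple URL patterns like these would miss.

```python
import re

# Matches URLs such as creativecommons.org/licenses/by-nc/4.0
CC_LICENCE_URL = re.compile(
    r"creativecommons\.org/licenses/([a-z]+(?:-[a-z]+)*)/(\d+\.\d+)",
    re.IGNORECASE,
)
# Matches the CC0 public domain dedication URL.
CC_ZERO_URL = re.compile(
    r"creativecommons\.org/publicdomain/zero/(\d+\.\d+)",
    re.IGNORECASE,
)


def detect_licence(full_text):
    """Return a label such as 'CC BY-NC 4.0', or None if no licence URL is found."""
    match = CC_LICENCE_URL.search(full_text)
    if match:
        return f"CC {match.group(1).upper()} {match.group(2)}"
    match = CC_ZERO_URL.search(full_text)
    if match:
        return f"CC0 {match.group(1)}"
    return None


# Invented sample sentence for illustration.
sample = (
    "This article is distributed under the terms of "
    "https://creativecommons.org/licenses/by-nc/4.0/ ."
)
licence = detect_licence(sample)
```

Note that the detector has to read the full text before it can learn the licence. This is the circularity described in the quote above: the content must be mined in order to find out whether mining it is permitted.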

Download the full case study.

CASE #4 – DKPro Core

‘Those creating models from corpora are generally unable to tell what license the model has. Still, it is a general practice to make such models publicly available for download. While this is convenient for researchers, it is a problem for builders of research infrastructures, because in order to distribute language models through a federation of online repositories, it is important to know their license status.’

Richard Eckart de Castilho, Iryna Gurevych
UKP Lab, Technische Universität Darmstadt

Currently, TDM relies significantly on language models that describe language on an abstract level, often through absolute and relative frequencies and weights learned from a text corpus. Such models are required at every stage of language analysis, from relatively simple text segmentation up to entity extraction, fact extraction, sentiment analysis, and so on. 
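
As a simplified illustration of such a model, the sketch below learns absolute and relative word frequencies from a tiny corpus. It is not how DKPro Core builds or packages its models. The function name and the two-sentence corpus are invented for this example, and real language models are trained on far larger corpora with far richer features.

```python
import re
from collections import Counter


def train_unigram_model(corpus):
    """Learn absolute and relative word frequencies from a list of documents."""
    counts = Counter()
    for document in corpus:
        counts.update(re.findall(r"[a-z]+", document.lower()))
    total = sum(counts.values())
    # Relative frequency = absolute count divided by total number of tokens.
    relative = {word: count / total for word, count in counts.items()}
    return counts, relative


# Invented toy corpus for illustration.
corpus = ["text mining needs text", "mining needs data"]
counts, relative = train_unigram_model(corpus)
```

Even this toy model is derived entirely from the corpus it was trained on, which is why the licence of the corpus raises the question of what licence, if any, applies to the resulting model.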

The DKPro Core infrastructure aims to package these models, making them available as part of the infrastructure and deploying them along with various tools through a common application programming interface. It allows users to:

  • Assemble TDM applications by integrating TDM tools into a common framework.
  • Create reproducible TDM research based on stable releases of the portable DKPro Core software infrastructure.
  • Deploy text analytics close to the text data to be processed. This allows the protection of potentially sensitive data and avoids transferring potentially large amounts of data over the network unnecessarily. 

One of the major challenges currently faced by the DKPro Core infrastructure is a lack of licenses attached to models, even though these models are usually publicly available for download. This is a problem for builders of research infrastructures. In order to distribute language models through a federation of online repositories, it is important to know their license status. Without this information, our ability to make such models available as part of the infrastructure is significantly constrained.

The legal status of language models is also a problem for commercial users. Questions about language model licenses come up often and typically remain without a clear answer or even with no answer at all.

Download the full case study.

CASE #5 – TDM For Climate Change Science

‘We are interested in knowledge discovery related to the science of Climate Change. We are particularly interested in detecting events and the causes, possibly causal chains, leading to these events. Hence, entity and event detection and relation extraction are important parts of our work. For this, we need access to scientific publications and the rights to perform text mining on them.’

Pinar Öztürk, Erwin Marsi
Norwegian University of Science and Technology (NTNU)

Our goal is to design and develop a discovery support system to be used by researchers in various disciplines in the Climate Change domain. This is a cross-disciplinary domain where scientists in marine chemistry, marine biology, geology, environmental sciences, etc. are trying to better understand the impacts of climate change on the marine food web and the related process of CO2 sequestration through the biological pump.

This complex problem attracts many researchers in various disciplines, leading to a huge number of publications. The volume of publications makes it impossible for any researcher to find and read everything relevant to his or her own research. Differences in the terminologies of the various disciplines exacerbate this situation. With so much latent public knowledge out there waiting to be discovered, computational support for the process of scientific discovery seems an obvious direction in which to go.

Download the full case study.

CASE #6 – Semantic Annotation of Health Information

‘Through the Khresmoi project, a search and access system for biomedical information and documents was developed. This EC FP7 project has since been taken up by SMEs and commercialised via H2020 funding.’

Angus Roberts
University of Sheffield, United Kingdom

The project developed a system by which information could be automatically extracted from biomedical documents. Improvements to this information were made through manual annotation and active learning, as well as automated estimation of the level of trust and expertise of target users. Information extracted from unstructured or semi-structured biomedical texts and images was linked to structured information in knowledge bases. Khresmoi supported searches in many languages and returned pertinent machine-translated excerpts.

The project was important because:

  1. Members of the general public frequently seek medical information online. This process is currently inefficient, unreliable and potentially dangerous. It is thus important that they are provided with reliable and understandable medical information in their own language.
  2. Medical doctors need rapid and accurate answers – a search of MEDLINE takes on average 30 minutes, while doctors have on average 5 minutes available for such a search. Furthermore, over 40% of searches do not yield the information required.

Unfortunately, copyright and licensing were an obstacle. Although many web pages are clearly unrestricted, some contain explicit copyright restrictions. In the case of American sites, it could be argued that what Khresmoi was doing was fair use for research under United States copyright law. However, this would not be the case if Khresmoi were to be commercially exploited. Additionally, some non-US sites carried restrictive copyright terms.

Download the full case study.