RESEARCH | Publications

For the complete list, please visit my research profiles on the following websites:


I have been, or currently am, PI on the following projects:

  • EPSRC Doctoral Training Program – Supervisor Led Project (Oct 2021 - Apr 2025, EPSRC), "Predicting the Spread and Damage of Hate Speech for Effective Prevention and Intervention of Cyberhate". Collaborator: Rotherham United Football Club
  • Data Science powered healthcare supply chain network monitoring system in the post-COVID and post-Brexit industrial landscape (May 2021 - Apr 2022, Innovate UK). Partner: Vamstar Ltd
  • AI-powered real-time healthcare supplier profile and COVID-19 supply risk matrix (Feb 2021 - Jan 2022, Innovate UK). Partner: Vamstar Ltd. See our featured article in the media
  • DoubleTapp: Crowdsourcing the Long Tail of Nano-influencers (Nov 2020 - Oct 2021, Innovate UK), partner: DoubleTapp Ltd.
  • Towards a Big-Data Driven Approach to Tackling Urban Waterlogging - A Scoping Study (Jan 2020 - Apr 2021, GCRF Networking Grants)
  • EPSRC Doctoral Training Program – Supervisor Led Project (Oct 2018 - Apr 2022, EPSRC), "Mining health information on the Social Web – towards an understanding of the influence of social media on public healthcare". Collaborator:
  • Early Detection of Cyber Hate on Social Media for Crime Prevention (PI, June - August 2017, Nuffield Foundation, UK)
  • KTP Web Mining for Just Giving Ltd (Oct 2014 - Jan 2015, JustGiving Ltd.)


RESEARCH | Professional Services


My PhD research focused on exploiting background knowledge from various resources to support supervised Named Entity Recognition - a fundamental task of Information Extraction that extracts named entities from unstructured texts. For details, see Named entity recognition: challenges in document annotation, gazetteer construction and disambiguation.

I am open to supervising students interested in any of the research topics listed at the top of this web page. I have supervised the following students to successful completion:

  • 2018-22 Zhixue (Cass) Zhao, 'Using Pre-trained Language Models for Toxic Comment Classification'
  • 2018-22 Jenny Hayes, 'The use of social media for sousveillance'

TEACHING | Modules

I teach the following modules on taught MSc programs in the school

  • INF6027 Introduction to Data Science
  • INF6024 Researching Social Media

Other modules I have taught in the past include

  • Data Visualisation, Introduction to Programming, Information Systems Project Management, Information Systems Modelling, Information Systems in Organisations

TEACHING | PhD students

PhD candidates: I am interested in supervising PhD students in the following topics (please also read my profile at the top of the page). If you have an idea, please feel free to email me to discuss it. Note that you need strong programming knowledge and skills, and it is desirable that you have knowledge of at least one of the following areas: machine learning, natural language processing, data mining, text mining, statistics.

  • Semantic Web, linked data
  • Information extraction, text mining, natural language processing
  • Social media data analytics, predictive analytics
  • Data mining in other disciplines, such as health, and bibliometrics

I am examiner for the following PhD students:

  • December 2022: Moritz Walter (chemical toxicity prediction), Information School, University of Sheffield
  • June 2022: Anastasios Lytos (argumentation mining), Department of Computer Science, University of Sheffield
  • Aug 2021: Ruizhe Li (topic modelling and dialogue), Department of Computer Science, University of Sheffield
  • Aug 2020: Jun Zhang (smart city technologies), Information School, University of Sheffield


Video Lecture profile page

Please contact me for detailed slides and/or content.

  • Invited talk at the 'UiTM Global Webinar on Data Science' (2021): Data Science Through the Lens of Text Mining
  • Invited talk at the 'CounterBalance Seminar Series' (2020) organised by the Santa Fe Institute.
  • Guest lecture for the Computing and Technology Research Showcase: Big data and how it is relevant to me. (2016)
  • Talk at the NTU School of Science and Technology research seminar: Automatic Knowledge Base Construction Using Text Mining. (2016)
  • Invited talk at Schwa lab, the University of Sydney. Aligning relations on Linked Data (2013)
  • Tutorial at ISWC2013 Web Scale Information Extraction: Gentile, A., Zhang, Z.
  • Tutorial at ECML/PKDD2013 Web Scale Information Extraction: Gentile, A., Zhang, Z.
  • Tutorial at ECML/PKDD2011: Ciravegna, F., Varga, A., Zhang, Z. 2011. Mining Complex Entities from Heterogeneous Information Networks, in the 22nd European Conference on Machine Learning (ECML) and the 15th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD).
  • Tutorial at EKAW2010: Zhang, Z., Cano, E., Elbedweihy, K., Dadzie, A. 2010. Introduction to Knowledge Acquisition from Social Networking Sites, in the conference on Knowledge Engineering and Knowledge Management by the Masses, EKAW2010.


I have a GitHub webpage for sharing datasets I used for research. These cover research in the areas of terminology extraction, ontology mapping, entity linking, scholarly data linking, Tweet classification, and procedural knowledge extraction.

I am also the creator of, and a contributor to, several open source research software projects, listed below.

JATE - Java Automatic Term Extraction library

JATE is the most extensively used library for state-of-the-art automatic term extraction (ATE). It can be used for benchmarking ATE algorithms, developing glossaries and supporting a wide range of Natural Language Processing tasks such as ontology engineering and machine translation. It also provides a generic development and evaluation framework for developing new term extraction algorithms.

The most recent, stable version is JATE 2.0, released under the LGPL license on GitHub.



  • Zhang, Z., Gao, J., Ciravegna, F. (2016). JATE 2.0: Java Automatic Term Extraction with Apache Solr. Proceedings of the Tenth International Conference on Language Resources and Evaluation

Semantic Table Interpretation (STI)

The project implements state-of-the-art semantic table interpretation algorithms, which take relational tables as input and create three types of semantic annotations on each table: a class for each table column, named entities for table cells, and relations between columns. It is currently hosted on GitHub and has been adapted to support a number of research projects such as Odalic.


  • Zhang, Z. (2017). Effective and Efficient Semantic Table Interpretation using TableMiner+. Semantic Web Journal. 8 (6) (in print)

ScholarlyData Link Discovery

The project implements state-of-the-art machine learning based link discovery/instance matching algorithms for linked data. It contains five well-known algorithms, which are tested on a record deduplication task for the project. The code is currently hosted on GitHub.



  • Zhang, Z., Nuzzolese, A., Gentile, A. (2017). Entity deduplication on ScholarlyData. In Proceedings of the Extended Semantic Web Conference, pp. 85-100