Crawling and Semantic Structuring of Scientific Publications in the Web

Crawling Manifest

If you are a website administrator, you may find our crawling manifest helpful.


We build a combined, focused web crawling system that collects relevant documents from the web and is particularly suited to harvesting publications in the educational domain. The publications are converted into text, and metadata (author, title, language, references, etc.) is extracted from their structure. Semantic structure will be induced along the following dimensions: sections of publications, topic distribution, semantic clustering of the vocabulary, and extraction of a semantic network. These structures will be used for subsequent processing in the other projects, as well as for faceted search.


The Heritrix crawler will be extended with several components: topic-focused crawling with filtering on document types (cf. Peters et al., 2010), filtering on authors' homepages (cf. SeerSuite, Teregowda et al., 2010), RSS subscriptions to relevant portals (cf. Farooq et al., 2008), and polling of search engines for known publication titles taken from the references of relevant documents. Crawling success is measured by precision (the share of found documents that are relevant) and recall (the share of known documents from existing databases that were found).
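The two evaluation measures can be illustrated with a minimal sketch. The function name, the document IDs, and the idea of a per-document relevance judgment are assumptions made for illustration; they are not part of the project's actual tooling.

```python
from typing import Dict, Set, Tuple


def crawl_precision_recall(
    crawled: Dict[str, bool], known: Set[str]
) -> Tuple[float, float]:
    """Return (precision, recall) for one crawl.

    crawled maps each found document ID to a relevance judgment (hypothetical);
    known holds IDs of documents listed in an existing database (hypothetical).
    """
    # Precision: share of found documents that were judged relevant.
    precision = sum(crawled.values()) / len(crawled) if crawled else 0.0
    # Recall: share of known documents that the crawl actually found.
    found_known = crawled.keys() & known
    recall = len(found_known) / len(known) if known else 0.0
    return precision, recall


# Toy data: four crawled documents, four documents known from a database.
crawled = {"doc-a": True, "doc-b": False, "doc-c": True, "doc-d": True}
known = {"doc-a", "doc-c", "doc-x", "doc-y"}
precision, recall = crawl_precision_recall(crawled, known)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, three of the four crawled documents are relevant and two of the four known documents were found.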

For conversion, we extend existing tools such as pdfToText and pdfToHtml so that metadata can be extracted reliably. Part of this functionality will be taken from SeerSuite and adapted.
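As a rough illustration of what metadata extraction from converted text can involve, the sketch below applies two simple heuristics to plain text that a converter has already produced. The heuristics (first non-empty line as title, everything after a "References" heading as the reference list) and all names are assumptions for illustration only; they do not describe the project's or SeerSuite's actual extraction logic.

```python
import re
from typing import Dict, List, Union

# Hypothetical heading pattern; real documents use many more variants.
REFERENCES_HEADING = re.compile(r"^(references|bibliography)$", re.IGNORECASE)


def extract_basic_metadata(text: str) -> Dict[str, Union[str, List[str]]]:
    """Guess a title and a reference list from converted plain text."""
    lines = [line.strip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    # Heuristic 1: treat the first non-empty line as the title.
    title = non_empty[0] if non_empty else ""
    # Heuristic 2: every non-empty line after the heading is one reference.
    references: List[str] = []
    for index, line in enumerate(lines):
        if REFERENCES_HEADING.match(line):
            references = [ref for ref in lines[index + 1:] if ref]
            break
    return {"title": title, "references": references}


sample = """
A Study of Focused Crawling

Some body text about crawling.

References
Author A (2010): First cited work.
Author B (2012): Second cited work.
"""
metadata = extract_basic_metadata(sample)
print(metadata["title"])
print(len(metadata["references"]))
```

Heuristics this simple break on multi-line titles and wrapped reference entries, which is one reason the conversion tools need extending.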

For semantic structure extraction, we employ a variety of Structure Discovery methods (Biemann, 2012), such as language models, topic models (cf. Riedl and Biemann, 2012), clustering, and distributional similarity.
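Of the methods listed, distributional similarity is the easiest to show in a few lines. The sketch below represents each word by the counts of its neighboring words and compares words by cosine similarity. The window size, the toy sentences, and the function names are illustrative assumptions, not the configuration used in the project.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def context_vectors(sentences: List[List[str]], window: int = 2) -> Dict[str, Counter]:
    """Count, for every word, the words that occur within `window` tokens of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus: "papers" and "articles" share contexts, "students" does not.
sentences = [
    "the crawler collects scientific papers".split(),
    "the crawler collects scientific articles".split(),
    "students download many files".split(),
]
vectors = context_vectors(sentences)
sim_close = cosine(vectors["papers"], vectors["articles"])
sim_far = cosine(vectors["papers"], vectors["students"])
print(sim_close > sim_far)
```

Words that appear in the same contexts receive a higher score than words that do not, which is the signal that clustering of the vocabulary can then build on.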


  • Pradeep B. Teregowda, Isaac G. Councill, R. Juan Pablo Fernández, Madian Khabsa, Shuyi Zheng, and C. Lee Giles (2010): SeerSuite: developing a scalable and reliable application framework for building digital libraries by crawling the web. In Proceedings of the 2010 USENIX conference on Web application development (WebApps'10). USENIX Association, Berkeley, CA, USA, pp. 14-14.
  • Umer Farooq, Craig H. Ganoe, John M. Carroll, Isaac G. Councill, C. Lee Giles (2008): Design and evaluation of awareness mechanisms in CiteSeer. Information Processing & Management, 44(2), pp. 596-612.
  • Chris Biemann (2012): Structure Discovery in Natural Language. In G. Hirst, E. Hovy and M. Johnson (Series Eds.): Theory and Applications of Natural Language Processing, Springer Verlag Heidelberg, Dordrecht, London, New York.
  • Sybille Peters, Claus-Peter Rückemann, Wolfgang Sander (2010): A New Approach towards Vertical Search Engines - Intelligent Focused Crawling and Multilingual Semantic Techniques. In Proceedings of the 6th International Conference on Web Information Systems (WEBIST 2010), Valencia, Spain.

Project Publications

  • Remus, Steffen (2014): Unsupervised Relation Extraction of In-Domain Data from Focused Crawls. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden.


The web contains a very large number of scientific publications. This project is concerned with finding publications on a particular topic and in particular languages, and with making their semantic structure explicit using data-driven methods of computational semantics. This structure enables various forms of retrieval over the data and serves as a preprocessing step for further processing.
