Crawling and Semantic Structuring of Scientific Publications in the Web

We build a combined, focused web crawling system that collects relevant documents from the web and is particularly suited to harvesting publications in the educational domain. The publications are converted into textual form, and metadata (author, title, language, references, etc.) is extracted from their structure. Semantic structure will be induced along the following dimensions: sections of publications, topic distribution, semantic clustering of the vocabulary, and extraction of a semantic network. These structures will be used for subsequent processing in the other projects, as well as for faceted search.


The Heritrix crawler will be extended with several components: topic-focused crawling with filtering on document types (cf. Peters et al., 2010), filtering on authors' homepages (cf. SeerSuite, Teregowda et al., 2010), RSS subscriptions to relevant portals (cf. Farooq, 2008), and polling of search engines for known publication titles taken from the references of relevant documents. Crawling success is measured by precision (the share of found documents that are relevant) and recall (the share of known documents from existing databases that have been found).
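The two evaluation measures above can be sketched as simple set operations. This is a minimal illustration, not the project's evaluation code: the function name `crawl_precision_recall`, the use of plain document identifiers, and the three input sets are assumptions made for the example.

```python
def crawl_precision_recall(crawled, relevant, known):
    """Compute precision and recall for one crawl (hypothetical helper).

    crawled  -- set of document ids the crawler found
    relevant -- set of document ids judged relevant (assumed to be given)
    known    -- set of document ids listed in an existing reference database
    """
    # Precision: share of found documents that are relevant.
    precision = len(crawled & relevant) / len(crawled) if crawled else 0.0
    # Recall: share of known documents that the crawler found.
    recall = len(crawled & known) / len(known) if known else 0.0
    return precision, recall


if __name__ == "__main__":
    crawled = {"a", "b", "c", "d"}
    relevant = {"a", "b", "x"}
    known = {"a", "c", "y", "z"}
    print(crawl_precision_recall(crawled, relevant, known))
```

Note that precision needs relevance judgments for the crawled documents, whereas recall only needs the list of known documents, so the two values may be computed from different reference sets.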

For conversion, we extend existing tools such as pdfToText and pdfToHtml so that they can reliably extract metadata. Part of this functionality will be taken from SeerSuite and adapted.

For semantic structure extraction, we employ a variety of Structure Discovery methods (Biemann, 2012), such as language models, topic models (cf. Riedl and Biemann, 2012), clustering, and distributional similarity.
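As an illustration of one of these methods, distributional similarity can be sketched by representing each word through the words that co-occur with it and comparing those representations. The sketch below assumes tokenized sentences, a symmetric context window, raw co-occurrence counts, and cosine similarity; the function names are hypothetical and the cited work may use different features and weighting.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target word itself
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    corpus = [
        ["students", "read", "papers"],
        ["students", "read", "articles"],
    ]
    vecs = context_vectors(corpus)
    # "papers" and "articles" occur in identical contexts in this toy corpus.
    print(cosine(vecs["papers"], vecs["articles"]))
```

Words that appear in similar contexts receive similar vectors, which is what makes such scores usable as input for clustering the vocabulary.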


Project Publications

  • Remus, Steffen (2014): Unsupervised Relation Extraction of In-Domain Data from Focused Crawls. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden.


The web contains a very large number of scientific publications. This project is concerned with finding publications on a particular topic and in particular languages, and with making their semantic structure explicit using data-driven methods of computational semantics. This structure enables various forms of retrieval on the data and serves as a preprocessing step for further processing.

