Fast and Scalable Pattern Mining for Media-Type Focused Crawling

Publication Type:

Conference Paper

Source:

KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning (2009)

URL:

http://lwa09.informatik.tu-darmstadt.de/pub/KDML/WebHome/kdml09_J.Umbrich_et_al.pdf

Abstract:

Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
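
The idea of predicting a media type from the resource identifier alone can be illustrated with a minimal Python sketch. It is not the method of the paper: the tokenization rule, the vote-counting classifier, the function names (uri_tokens, train, predict), the labels, and the example.org URIs are all assumptions chosen for illustration.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse


def uri_tokens(uri):
    """Split a URI into lowercase tokens drawn from host, path, and query."""
    parts = urlparse(uri)
    text = " ".join([parts.netloc, parts.path, parts.query]).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


def train(labelled_uris):
    """Count how often each token co-occurs with each media-type label."""
    counts = defaultdict(Counter)
    for uri, label in labelled_uris:
        for token in set(uri_tokens(uri)):
            counts[token][label] += 1
    return counts


def predict(counts, uri, default="unknown"):
    """Sum the per-token label counts and return the top-scoring label."""
    votes = Counter()
    for token in set(uri_tokens(uri)):
        votes.update(counts.get(token, {}))
    return votes.most_common(1)[0][0] if votes else default


# Hypothetical training data: (URI, media-type label) pairs.
TRAINING = [
    ("http://example.org/media/talk.mp3", "audio"),
    ("http://example.org/podcast/episode1.mp3", "audio"),
    ("http://example.org/video/clip.mpg", "video"),
    ("http://example.org/video/intro.avi", "video"),
    ("http://example.org/docs/index.html", "text"),
]

model = train(TRAINING)
print(predict(model, "http://example.net/files/song.mp3"))
```

No content is downloaded at prediction time; only the URI string is inspected, which is what makes this style of classification cheap enough to run ahead of the crawler's fetch queue.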

Notes:

* Non-Clique Member, Jointly funded by Lion-2 and Clique
