Fast and Scalable Pattern Mining for Media-Type Focused Crawling
Publication Type: Conference Paper
Source: KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning (2009)
Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without downloading its content. The idea is to apply a learning approach to features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources of specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.
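To make the idea of learning from patterns in the resource identifier concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a simple tokenization of the URI and a multinomial naive Bayes classifier with add-one smoothing; the names `uri_features`, `NaiveBayesURIClassifier`, and `TOY_EXAMPLES`, as well as the example URIs and labels, are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def uri_features(uri):
    """Split a URI into lowercase alphanumeric tokens; no content is fetched."""
    return [tok for tok in re.split(r"[^a-z0-9]+", uri.lower()) if tok]


class NaiveBayesURIClassifier:
    """Multinomial naive Bayes over URI tokens (illustrative assumption)."""

    def __init__(self):
        self.class_counts = Counter()             # training URIs per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()                        # all tokens seen in training

    def train(self, examples):
        for uri, label in examples:
            self.class_counts[label] += 1
            for tok in uri_features(uri):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, uri):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = uri_features(uri)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + vocab_size
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data; a real crawler would use labeled crawl logs.
TOY_EXAMPLES = [
    ("http://example.org/media/song01.mp3", "audio"),
    ("http://example.org/podcast/episode2.ogg", "audio"),
    ("http://example.org/video/clip.mpg", "video"),
    ("http://example.org/movies/trailer.avi", "video"),
    ("http://example.org/index.html", "text"),
    ("http://example.org/about/contact.html", "text"),
]

if __name__ == "__main__":
    clf = NaiveBayesURIClassifier()
    clf.train(TOY_EXAMPLES)
    print(clf.predict("http://example.net/files/track.mp3"))
```

Because prediction only tokenizes a string and sums a few log probabilities, a classifier of this kind can run before any download takes place, which is the property the abstract relies on for throughput.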
* Non-Clique Member, Jointly funded by Lion-2 and Clique