<?xml version="1.0" encoding="UTF-8"?>
<XML><RECORDS>
<RECORD>
	<REFERENCE_TYPE>3</REFERENCE_TYPE>
	<AUTHORS>
		<AUTHOR>J. Umbrich*</AUTHOR>
		<AUTHOR>M. Karnstedt</AUTHOR>
	</AUTHORS>
	<YEAR>2009</YEAR>
	<TITLE>Fast and Scalable Pattern Mining for Media-Type Focused Crawling</TITLE>
	<SECONDARY_TITLE>KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning</SECONDARY_TITLE>
	<ABSTRACT>&lt;p&gt;Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without the need for downloading the content of the resource. The idea is to use a learning approach on features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.&lt;/p&gt;</ABSTRACT>
	<NOTES><p>* Non-Clique Member, Jointly funded by Lion-2 and Clique</p></NOTES>
	<URL>http://lwa09.informatik.tu-darmstadt.de/pub/KDML/WebHome/kdml09_J.Umbrich_et_al.pdf</URL>
</RECORD>
</RECORDS></XML>