
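Posted by: ssos (Being and Nothingness; quitting alcohol, quitting the Net), Board: Algorithm
Title: Web Content Mining
Posting site: HIT Lilac BBS (哈工大紫丁香) (Monday, October 29, 2001, 18:27:10), in-site post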

The heterogeneity and lack of structure that permeate many of the
ever-expanding information sources on the World Wide Web, such as hypertext
documents, make automated discovery, organization, and management of
Web-based information difficult. Traditional search and indexing tools of
the Internet and the World Wide Web, such as Lycos, Alta Vista, WebCrawler,
ALIWEB [Kos94], MetaCrawler, and others, provide some comfort to users, but
they do not generally provide structural information, nor do they
categorize, filter, or interpret documents. A recent study provides a
comprehensive and statistically thorough comparative evaluation of the most
popular search tools [LS97].
In recent years these factors have prompted researchers to develop more
intelligent tools for information retrieval, such as intelligent Web agents,
as well as to extend database and data mining techniques to provide a higher
level of organization for semi-structured data available on the Web. We
summarize some of these efforts below.
Agent-Based Approach
The agent-based approach to Web mining involves the development of
sophisticated AI systems that can act autonomously or semi-autonomously on
behalf of a particular user, to discover and organize Web-based information.
Generally, the agent-based Web mining systems can be placed into the
following three categories:
1. Intelligent Search Agents
Several intelligent Web agents have been developed that search for relevant
information using characteristics of a particular domain (and possibly a
user profile) to organize and interpret the discovered information. For
example, agents such as Harvest [BDH94], FAQ-Finder [HBML95], Information
Manifold [KLSS95], OCCAM [KW96], and ParaSite [Spe97] rely either on
pre-specified, domain-specific information about particular types of
documents, or on hard-coded models of the information sources to retrieve
and interpret documents. Other agents, such as ShopBot [DEW96] and ILA
(Internet Learning Agent) [PE95], attempt to interact with and learn the
structure of unfamiliar information sources. ShopBot retrieves product
information from a variety of vendor sites using only general information
about the product domain. ILA, on the other hand, learns models of various
information sources and translates these into its own internal concept
hierarchy.
2. Information Filtering/Categorization
A number of Web agents use various information retrieval techniques [FBY92]
and characteristics of open hypertext Web documents to automatically
retrieve, filter, and categorize them [CH97,BGMZ97,MS96,WP97,WVS96]. For
example, HyPursuit [WVS96] uses semantic information embedded in link
structures as well as document content to create cluster hierarchies of
hypertext documents, and to structure an information space. BO (Bookmark
Organizer) [MS96] combines hierarchical clustering techniques and user
interaction to organize a collection of Web documents based on conceptual
information.
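
To make the idea of a content-based cluster hierarchy concrete, here is a
minimal sketch of average-link agglomerative clustering over bag-of-words
vectors. It is not the algorithm used by HyPursuit or BO (HyPursuit also
exploits link structure, and BO involves user interaction); the function
names, the whitespace tokenizer, and the sample documents are all
assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs):
    """Average-link agglomerative clustering.

    Returns the hierarchy as nested tuples of document indices.
    """
    vecs = [vectorize(d) for d in docs]
    # Each cluster is (tree, list of member document indices).
    clusters = [(i, [i]) for i in range(len(docs))]

    def sim(c1, c2):
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the most similar pair of clusters and merge it.
        a, b = max(
            ((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
            key=lambda p: sim(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical bookmark titles, used only to exercise the sketch.
DOCS = [
    "python code tutorial",
    "python code examples",
    "cheap flights hotels",
    "cheap hotels deals today",
]
tree = cluster(DOCS)
```

With these sample titles the two programming bookmarks end up in one
subtree and the two travel bookmarks in another. A real system would use
better term weighting (for example TF-IDF) and a stopping criterion rather
than merging down to a single root.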
3. Personalized Web Agents
Another category of Web agents includes those that obtain or learn user
preferences and discover Web information sources that correspond to these
preferences, and possibly those of other individuals with similar interests
(using collaborative filtering). A few recent examples of such agents
include WebWatcher [AFJM95], PAINT [OPW94], Syskill & Webert [PMB96], and
others [BSY95]. For example, Syskill & Webert is a system that utilizes a
user profile and learns to rate Web pages of interest using a Bayesian
classifier.
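
As a rough illustration of rating pages with a Bayesian classifier, the
sketch below trains a multinomial naive Bayes model on pages a user has
labeled and then rates a new page. It is a simplification and not the
actual Syskill & Webert implementation; the class name PageRater, the
"hot"/"cold" labels, the add-one smoothing, and the training snippets are
assumptions chosen for the example.

```python
import math
from collections import Counter

class PageRater:
    """Multinomial naive Bayes over page words, two classes (assumed labels)."""

    LABELS = ("hot", "cold")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.page_counts = Counter()

    def train(self, text, label):
        """Record one page the user has rated."""
        self.word_counts[label].update(text.lower().split())
        self.page_counts[label] += 1

    def score(self, text, label):
        """Log of P(label) * product of P(word | label), add-one smoothed."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_pages = sum(self.page_counts.values())
        logp = math.log((self.page_counts[label] + 1)
                        / (total_pages + len(self.LABELS)))
        for word in text.lower().split():
            # The extra +1 in the denominator reserves mass for unseen words.
            logp += math.log((self.word_counts[label][word] + 1)
                             / (total_words + len(vocab) + 1))
        return logp

    def rate(self, text):
        """Return the label with the higher posterior score."""
        return max(self.LABELS, key=lambda label: self.score(text, label))

# Hypothetical user feedback, used only to exercise the sketch.
rater = PageRater()
rater.train("machine learning neural network", "hot")
rater.train("data mining machine learning", "hot")
rater.train("celebrity gossip fashion", "cold")
rater.train("football scores gossip", "cold")
```

Calling rater.rate("learning about data mining") then favors "hot", while
rater.rate("celebrity gossip") favors "cold". The user profile here is
nothing more than the accumulated word counts per label.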
Database Approach
The database approaches to Web mining have generally focused on techniques
for integrating and organizing the heterogeneous and semi-structured data on
the Web into more structured and high-level collections of resources, such
as relational databases, and using standard database querying mechanisms and
data mining techniques to access and analyze this information.
1. Multilevel Databases
Several researchers have proposed a multilevel database approach to
organizing Web-based information. The main idea behind these proposals is
that the lowest level of the database contains primitive semi-structured
information stored in various Web repositories, such as hypertext documents.
At the higher level(s), metadata or generalizations are extracted from lower
levels and organized in structured collections such as relational or
object-oriented databases. For example, Han et al. [ZH95] use a
multi-layered database where each layer is obtained via generalization and
transformation operations performed on the lower layers. Khosla et al.
[KKS96] propose the creation and maintenance of meta-databases at each
information-providing domain and the use of a global schema for the
meta-database. King & Novak [KN96] propose the incremental integration of a
portion of the schema from each information source, rather than relying on a
global heterogeneous database schema. The ARANEUS system [PA97] extracts
relevant information from hypertext documents and integrates it into
higher-level derived Web hypertexts, which are generalizations of the notion
of database views.
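
The layering idea can be sketched with an in-memory relational database.
Layer 0 holds raw page records, layer 1 holds metadata extracted from them,
and layer 2 is a generalization computed from layer 1. This is only an
illustration of the general scheme, not the design of any cited system;
the table names, columns, and sample pages are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse

# Layer 0: primitive semi-structured items (hypothetical sample pages).
PAGES = [
    {"url": "http://cs.example.edu/db/intro.html",
     "title": "Intro to Databases", "text": "relational model sql"},
    {"url": "http://cs.example.edu/db/mining.html",
     "title": "Data Mining Notes", "text": "association rules clustering"},
    {"url": "http://shop.example.com/books.html",
     "title": "Book Catalog", "text": "books prices orders"},
]

def build_layers(pages):
    """Build layers 1 and 2 from layer 0 and return the database handle."""
    db = sqlite3.connect(":memory:")
    # Layer 1: one metadata row per page, extracted from layer 0.
    db.execute(
        "CREATE TABLE doc_meta "
        "(url TEXT, site TEXT, title TEXT, n_words INTEGER)"
    )
    for page in pages:
        db.execute(
            "INSERT INTO doc_meta VALUES (?, ?, ?, ?)",
            (page["url"], urlparse(page["url"]).netloc,
             page["title"], len(page["text"].split())),
        )
    # Layer 2: a generalization of layer 1, one summary row per site.
    db.execute(
        "CREATE TABLE site_summary AS "
        "SELECT site, COUNT(*) AS n_docs, SUM(n_words) AS n_words "
        "FROM doc_meta GROUP BY site"
    )
    return db

db = build_layers(PAGES)
summary = db.execute(
    "SELECT site, n_docs, n_words FROM site_summary ORDER BY site"
).fetchall()
```

Queries can then be answered at whichever layer is cheapest: a question
about how many documents a site holds needs only layer 2, while a title
lookup goes to layer 1.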
2. Web Query Systems
There have been many Web-based query systems and languages developed
recently that attempt to utilize standard database query languages such as
SQL, structural information about Web documents, and even natural language
processing to accommodate the types of queries that are used in World Wide
Web searches. We mention a few examples of these Web-based query systems
here. W3QL [KS95] combines structure queries, based on the organization of
hypertext documents, and content queries, based on information retrieval
techniques. WebLog [LSS96] is a logic-based query language for restructuring
extracted information from Web information sources. Lorel [QRS95] and UnQL
[BDS95,BDHS96] query heterogeneous and semi-structured information on the
Web using a labeled graph data model. TSIMMIS [CGMH94] extracts data from
heterogeneous and semi-structured information sources and correlates them to
generate an integrated database representation of the extracted information.
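
A labeled graph data model can be pictured as nodes connected by labeled
edges, with atomic values at the leaves, and a query as a sequence of edge
labels to follow. The sketch below shows that idea only; it does not
reproduce Lorel or UnQL syntax or semantics, and the graph contents and
the follow() helper are hypothetical.

```python
# Edge-labeled graph: node id -> list of (label, child).
# A child is either another node id or an atomic value (a plain string
# that is not a key of the graph). All data here is made up.
GRAPH = {
    "root": [("page", "p1"), ("page", "p2")],
    "p1": [("title", "Web Mining"), ("link", "p2")],
    "p2": [("title", "Query Languages"), ("author", "a1")],
    "a1": [("name", "A. Author")],
}

def follow(graph, start, path):
    """Return everything reachable from start via the labels in path."""
    frontier = [start]
    for label in path:
        next_frontier = []
        for node in frontier:
            # Atomic values and unknown nodes have no outgoing edges.
            for edge_label, child in graph.get(node, []):
                if edge_label == label:
                    next_frontier.append(child)
        frontier = next_frontier
    return frontier

titles = follow(GRAPH, "root", ["page", "title"])
authors = follow(GRAPH, "root", ["page", "author", "name"])
linked_titles = follow(GRAPH, "root", ["page", "link", "title"])
```

Note that page p1 has no author edge and the query still succeeds; it
simply contributes nothing to the result. This tolerance for missing or
irregular fields is what makes a graph model a natural fit for
semi-structured data, where no fixed schema can be assumed.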

--


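"The Social Contract" is a good book and deserves several readings.
The house-style pork knuckle tastes good; I would like to have it again.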

※ Source: · HIT Lilac BBS (哈工大紫丁香) bbs.hit.edu.cn · [FROM: 202.118.230.220]
