
From: ssos (存在与虚无·戒酒戒网), Board: Algorithm
Subject: Conceptual Indexing in Precision Content Retrieval
Site: HIT Lilac BBS (哈工大紫丁香) (Mon Oct 29 18:34:42 2001), local post

Improving your ability to find information online
Conceptual Indexing for Precision Content Retrieval
=====================================================
Can't find what you want?
How often have you failed to find what you wanted in an online search
because the words you used failed to match words in the material that you
needed? Concept-based retrieval systems attempt to reach beyond the standard
keyword approach of simply counting the words from your request that occur
in a document. The Conceptual Indexing Project is developing techniques
that use knowledge of concepts and their interrelationships to find
correspondences between the concepts in your request and those that occur
in text passages. Our goal is to improve the convenience and effectiveness
of online information access.
The Paraphrase Problem
The central focus of this project is the "paraphrase problem," in which the
words used in a query are different from, but conceptually related to, those
in material that you need. For example, in a collection of articles by James
Fallows, then Washington editor of the Atlantic Monthly, the query, "change
in the deficit," results in several relevant passages including "Last year's
reductions in tax rates are part of the reason for the deficits, as are the
administration's plans for a sustained military buildup." Based on this
passage a user can then decide whether to read the rest of the article.
While the query and the passage convey similar ideas, the wording in each is
different, a typical case of the paraphrase problem.
In addressing the paraphrase problem, three challenges must be met:
What information is required to connect the terms in a query to those in a relevant passage?
How can this information be organized and used efficiently?
To what extent can descriptions of the content of a document be automatically extracted from the document itself?
Our approach to the paraphrase problem is to identify and extract concepts
(meaningful words and phrases), and relate similar concepts to one another.
This information is organized in a structured conceptual taxonomy. Presented
with a query, the technology searches through the taxonomy for similar, but
not necessarily identical, ideas. With special retrieval algorithms, the
taxonomy can be scanned efficiently.
Elements of the Technology
The technology, which is called "Precision Content Retrieval," is composed
of two parts:
Conceptual Indexing -- builds a structured conceptual taxonomy of words and phrases extracted from the indexed material
Specific Passage Retrieval -- finds specific passages and ranks them according to relevance to the query
Key Ideas behind the Technology
===============================
Making a difference
We have found that techniques from knowledge representation and natural
language processing can make a useful contribution to solving the paraphrase
problem. By searching a structured conceptual taxonomy of the words and
phrases extracted from a collection of documents, our algorithms can
effectively connect terms in a query with appropriate related terms in
document passages.
The problem with synonyms
A common approach to the paraphrase problem is to use tables of synonyms to
automatically expand queries by adding terms that are recorded as
"synonymous." However, there are few real synonyms in English, so the common
practice is to include related words as if they were synonyms. Treating
terms this way when they are not really synonyms introduces a level of
granularity that trades off precision for recall. There is no a priori
correct level for this tradeoff - different information needs require
different levels of generality - so this technique often degrades retrieval
rather than improving it.
As an alternative to synonym classes, we use taxonomic subsumption
algorithms that exploit generality (subsumption) rather than synonymy to
connect terms in queries with passages that contain more specific terms as
well as the requested terms. These algorithms do not automatically explore
more general terms, so the level of generality is controlled by your choice
of query terms. For example, if you ask for "motor vehicles" you would get
trucks, buses, cars, etc., but if you ask for "automobiles" you would get
cars and taxicabs, but not trucks and buses.
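The downward-only behaviour described above can be sketched in a few lines
of Python. This is only an illustration, not the project's algorithm: the
toy taxonomy, the NARROWER table, and the function name expand_downward are
all assumptions made for this example.

```python
from collections import deque

# Hypothetical toy taxonomy (an assumption for illustration): each concept
# maps to its direct, more specific concepts.
NARROWER = {
    "motor vehicle": ["automobile", "truck", "bus"],
    "automobile": ["car", "taxicab"],
}


def expand_downward(concept, narrower=NARROWER):
    """Return the concept plus everything it subsumes.

    The walk only follows links to more specific concepts, so a query is
    never widened to something more general than what was asked for.
    """
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With this table, expand_downward("motor vehicle") reaches trucks and buses,
while expand_downward("automobile") stops at cars and taxicabs. The caller
chooses the level of generality simply by choosing the starting concept.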
Taxonomies
Using knowledge bases of general semantic facts, structured conceptual
taxonomies (a type of semantic network) can be constructed from words and
phrases. These words and phrases can be extracted automatically from text
and parsed into conceptual structures. The taxonomy can be organized by the
most-specific-subsumer (MSS) relationship, where each concept is linked to
the most specific concepts that subsume it - i.e., that are more general
than it is. Terms in a query are individually matched with corresponding
concepts in the taxonomy together with their subconcepts.
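To make the MSS relationship concrete, here is a small Python sketch.
It assumes that the full set of subsumers of each concept is already
known; the ALL_SUBSUMERS table, the "hired vehicle" concept, and the
function name are hypothetical and exist only for this example.

```python
# Hypothetical data (an assumption for illustration): for each concept,
# every concept that is strictly more general than it.
ALL_SUBSUMERS = {
    "vehicle": set(),
    "motor vehicle": {"vehicle"},
    "hired vehicle": {"vehicle"},
    "automobile": {"motor vehicle", "vehicle"},
    "car": {"automobile", "motor vehicle", "vehicle"},
    "taxicab": {"automobile", "motor vehicle", "hired vehicle", "vehicle"},
}


def most_specific_subsumers(concept, all_subsumers=ALL_SUBSUMERS):
    """Keep only the subsumers that are not more general than another one.

    A candidate is dropped when it also subsumes some other candidate,
    because that other candidate is then the more specific link.
    """
    candidates = all_subsumers[concept]
    return {
        c
        for c in candidates
        if not any(c in all_subsumers[other] for other in candidates)
    }
```

Here "car" links only to "automobile", while "taxicab" links to both
"automobile" and "hired vehicle"; the more general concepts are reached
indirectly through those links.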
For example, given the general semantic facts that "washing" is a kind of
"cleaning" and "car" is a kind of "automobile", an algorithmic
classification system can automatically classify "car washing" as a kind of
"automobile cleaning". A query for "automobile cleaning" or "automobile
washing" will immediately retrieve hits for "car washing".
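The "car washing" example can be approximated with the sketch below. It is
a deliberate simplification: phrases are compared word by word in order,
whereas the text describes parsing phrases into conceptual structures. The
two facts in KIND_OF come from the example above; the function names are
assumptions.

```python
# The two "kind of" facts from the example; a real knowledge base would
# hold many more and allow several parents per word.
KIND_OF = {"washing": "cleaning", "car": "automobile"}


def word_subsumes(general, specific, kind_of=KIND_OF):
    """True if `general` equals `specific` or lies above it in KIND_OF."""
    seen = set()
    current = specific
    while current is not None and current not in seen:
        if current == general:
            return True
        seen.add(current)
        current = kind_of.get(current)
    return False


def phrase_subsumes(general, specific):
    """Simplified test: same length and each word subsumes its counterpart."""
    g_words, s_words = general.split(), specific.split()
    return len(g_words) == len(s_words) and all(
        word_subsumes(g, s) for g, s in zip(g_words, s_words)
    )
```

Under these assumptions, both "automobile cleaning" and "automobile
washing" subsume "car washing", but "car washing" does not subsume
"automobile cleaning".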
Examples
========
color change
The conceptual taxonomy makes it possible to find specific concepts that are
subsumed by a general request. For example, in a taxonomy of bug
descriptions, the query "color change" subsumes the concepts:
becomes black
reset bitmap colors
color disruption
This technology integrates general linguistic information with taxonomic
subsumption. For example:
linguistic morphological analysis -- "disruption" is derived from "disrupt"
lexical taxonomic subsumption -- to disrupt is to damage; to damage is to change
Using this kind of knowledge to connect query phrases with phrases occurring
in text makes it much easier to find what you're looking for.
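The chain from "disruption" to "change" can be traced with a short Python
sketch. The facts in the two tables are the ones listed above; the table
names and the function generalizations are assumptions, and a real system
would use a full morphological analyzer and lexicon instead of lookup
tables.

```python
# Facts taken from the example above.
DERIVED_FROM = {"disruption": "disrupt"}             # morphological analysis
KIND_OF = {"disrupt": "damage", "damage": "change"}  # lexical subsumption


def generalizations(word):
    """Return the word followed by each more general term it leads to."""
    chain = [word]
    current = DERIVED_FROM.get(word, word)
    if current != word:
        chain.append(current)
    while current in KIND_OF:
        current = KIND_OF[current]
        if current in chain:  # guard against accidental cycles in the data
            break
        chain.append(current)
    return chain


def term_matches(query_word, text_word):
    """True if the query word is the text word or one of its generalizations."""
    return query_word in generalizations(text_word)
```

Because "change" appears in the chain for "disruption", the query word
"change" matches the text word "disruption", which is the step that lets
"color change" reach "color disruption".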
SunExpress Catalog
Querying a conceptual taxonomy to find relevant phrases and then looking at
passages where those phrases occur can be an effective way to find what you
need. For example, with a conceptual index of the SunExpress Catalog, a
catalog of products for Unix(TM) computers, the query "add memory" returned
the following structure of subsumed concepts:
Query: (ADD MEMORY)
(ADD MEMORY)
|-k- (ADDITIONAL MEMORY)
| |-k- (ADDITIONAL A MEMORY)
| |-k- (ADDITIONAL G MEMORY)
| |-k- (ADDITIONAL K MEMORY)
| |-k- (ADDITIONAL STORAGE)
| |  |-k- (ADDITIONAL DISK)
| |    |-k- (ADDITIONAL DISKS)
| |    | |-k- (TWO ADDITIONAL HARD DISKS)
| |    |
| |    |-k- (ADDITIONAL MULTI-DISK)
| |      |-k- (ADDITIONAL 4.2-GB MULTI-DISK)
| |      |
| |      |-k- (ADDITIONAL SMCC MULTI-DISK)
| |
| |-k- (PURCHASE ADDITIONAL MEMORY)
|
|-k- (ECONOMICALLY ADDING LOCAL STORAGE)
|-k- (SOLDERED-IN MEMORY)
Each of these concepts (with the possible exception of the query phrase
itself) is a phrase that occurs somewhere in the SunExpress catalog and was
automatically extracted and indexed. The display shows the organization of
these concepts according to their subsumption relationships, with more
specific concepts occurring lower and to the right. In particular the query
subsumes the concept:
(ECONOMICALLY ADDING LOCAL STORAGE)
based on the fact that a disk is a kind of storage.
Looking up this phrase in the text results in a display of the following
passage (with the relevant phrase highlighted with italics):
Sun's new 535 MB 3.5-Inch SCSI-2 Disk Drive offers much higher performance
and more capacity at a lower cost/MB than the 424 MB drive it replaces. It
is ideal for economically adding local storage to desktop SPARCsystems.
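The final lookup step, going from an indexed phrase back to the passage
that contains it, might look like the sketch below. It is an assumption
about how such a display could be produced, not a description of the
actual system; the function name and the "*" marker standing in for
italics are invented for this example.

```python
import re

# The catalog passage quoted above.
PASSAGE = (
    "Sun's new 535 MB 3.5-Inch SCSI-2 Disk Drive offers much higher "
    "performance and more capacity at a lower cost/MB than the 424 MB "
    "drive it replaces. It is ideal for economically adding local storage "
    "to desktop SPARCsystems."
)


def highlight(passage, phrase, marker="*"):
    """Return the passage with the phrase marked, or None if it is absent.

    Matching is case-insensitive because taxonomy entries are shown in
    upper case while the catalog text is not.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    if not pattern.search(passage):
        return None
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", passage)
```

Calling highlight(PASSAGE, "ECONOMICALLY ADDING LOCAL STORAGE") marks the
matching words in the second sentence, while a phrase that does not occur
literally, such as "add memory", returns None. That gap is what the
taxonomy is there to bridge.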
Benefits
========
Precision Content Retrieval provides users with three key benefits.
Specific passage retrieval -- finds specific passages of information content that are responsive to queries by users
Intuitive ranking of hits -- produces a scored list of specific passages within documents
Conceptual navigation -- structured taxonomy is suitable for efficient browsing and navigation of concepts found in documents
Comparison with Traditional Information Retrieval
Crucial differences exist between traditional information retrieval and
Precision Content Retrieval.
Traditional Information Retrieval:
Is designed for a scholar/analyst who wants comprehensive coverage
Retrieves entire documents
Is focused on coarse-grained topics
Works best for large queries and targets
Depends on word counting
Precision Content Retrieval:
Is designed for those wanting quick access
Retrieves relevant passages within a document
Is intended for fine-grained, specific information needs
Works best for small queries and targets
Exploits knowledge of language and meaning
Papers
======
For more information on precision content retrieval, knowledge
representation, semantic networks, and structured conceptual taxonomies,
see:

Ambroziak, Jacek and William A. Woods, "Natural Language Technology in
Precision Content Retrieval," Proceedings of the International Conference on
Natural Language Processing and Industrial Applications (NLP+IA 98), August
18-21, 1998, Moncton, New Brunswick, Canada. (reprint available online at:
http://www.sun.com/research/techrep/1998/abstract-69.html).

Woods, W. A., "Conceptual Indexing: A Better Way to Organize Knowledge,"
Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, Mountain
View, CA, April, 1997. (available online at:
http://www.sun.com/research/techrep/1997/abstract-61.html).

Kuhns, Robert J., "A Survey of Information Retrieval Vendors," Technical
Report SMLI TR-96-56, Sun Microsystems Laboratories, Mountain View, CA,
October, 1996. (available online at:
http://www.sun.com/research/techrep/1996/abstract-56.html).

Woods, W. A., "Finding Information on the Web: A Knowledge Representation
Approach," presented at the Fourth International World Wide Web Conference,
Boston, MA (December 1995). (For a summary, see:
http://www.ai.mit.edu/projects/iiip/conferences/www95/woods.html)

Woods, W. A. and James Schmolze, "The KL-ONE Family," Computers &
Mathematics with Applications, Vol. 23, Nos. 2-5 (January-March, 1992),
special issue on Semantic Networks in Artificial Intelligence, Part 1, pp.
133-177. Also reprinted in Fritz Lehmann (ed.), Semantic Networks in
Artificial Intelligence, Pergamon Press, 1992, pp. 133-177.

Woods, W. A., "Understanding Subsumption and Taxonomy: A Framework for
Progress," in John Sowa (ed.), Principles of Semantic Networks: Explorations
in the Representation of Knowledge, San Mateo: Morgan Kaufmann, 1991, pp.
45-94.

Woods, W. A., "Important Issues in Knowledge Representation," Proceedings of
the IEEE, Vol. 74, No. 10 (October, 1986), pp. 1322-1334. Reprinted in Peter
G. Raeth (ed.), Expert Systems: A Software Methodology for Modern
Applications, Los Alamitos: IEEE Computer Society Press, 1990, pp. 180-204.
