From: elife (Set your heart on it and you will succeed), Board: Npsos
Subject: Microsoft® .NET Speech Platform
Posted at: HIT Lilac BBS (Thursday, May 29, 2003, 21:39:08), on-site post

Microsoft® .NET Speech Platform:
Making Speech Mainstream
White Paper
Published: June 2002

Contents

Executive Summary
The Interaction Challenge
The Role of Speech
The Business Case for Speech
Benefits of Implementing Speech
Challenges to Implementing Speech
Microsoft's Vision for Speech
SALT Architecture
Microsoft .NET Speech Platform and Tools
Beyond Telephony: Multimodal Computing
Conclusion

Executive Summary
The growth of e-business over the past five years has led to an explosion in
the number of interactions a company has with its customers. A company must
now plan for interactions not only through a call center but also through
multiple touch points including the Web, cell phones, PDAs, and e-mail.
Indeed, enterprises with successful e-commerce Web sites are scrambling to
back these sites up with increased and improved personalized service.

Speech recognition can significantly enhance interactions with customers and
partners as well as among employees. It delivers multiple benefits: cost
reduction in the call center, revenue enhancement for e-commerce, and
productivity gains when used with an enterprise portal. Built on Microsoft®
.NET technologies, the Microsoft .NET speech platform is a scalable speech
recognition-based platform that can be deployed on standard PC hardware and
that takes advantage of the Web programming model for application
development. The Microsoft .NET speech platform delivers the following
benefits:

•   Lower cost of application development and deployment through the use of
open standards.
•   Ease of application development via the Web programming model. Speech
interfaces can be developed using a familiar markup-style development
language.
•   Lower investment in training and maintenance, since existing Web
developers can be used for speech application development.

Further, Microsoft is driving the development of multimodal platforms.
Multimodal refers to the next-generation interface technology that enables
communication with the user through different modalities including text,
images, and speech, as appropriate for a specific application and device.
Multimodal technology enables an enterprise to create rich user interactions
without needing to create separate applications for each mode.

The Interaction Challenge
Over the past five years, enterprises have invested heavily in a variety of
e-business initiatives including informational Web sites, transactional
e-commerce sites, Web-based applications, and partner- and internal
employee-facing portals. In doing so, enterprises have successfully expanded
the number of channels through which they can sell their products and the
number of modes used to interact with customers, partners, and employees.

•   Customer interaction costs are increasing. Before the Internet, external
interactions (i.e., with customers and partners) happened via only one
mode: the phone. With the growth of e-commerce, customers and partners now
interact with an enterprise via phone, Web, and e-mail. In fact, the rise of
e-commerce has done little to shift call center volume toward more-economical
modes such as e-mail and the Web. If anything, call centers have become the
premier support mechanism for successful e-commerce sites. In response,
enterprises are increasing investments in customer call centers, primarily
through adding more call center agents. However, while adding call center
agents enables enterprises to handle phone requests, call centers cannot
effectively provide a consistent service and support capability across the
various modes mentioned above. In addition, increasing call center agents to
handle increasing volume is simply not economically viable.
•   Employee interaction systems are failing to keep up. Creating an
effective interaction solution for company employees is also a tremendous
challenge compounded by the increasing use of a variety of mobile devices by
employees. Increasingly, employees rely on mobile devices to access
enterprise messaging systems and applications, requiring companies to invest
in adding mobile capabilities to their applications. Despite significant
investments, productivity gains have been elusive, primarily due to the lack
of better input capability on many mobile devices.
•   Telephony and the Web are converging. A key trend all IT managers must
consider is the convergence of Web and telephony. This is being driven by two
factors:
1.  The availability of mobile devices capable of browsing the Web. Devices
such as PDAs that started out as mobile devices for accessing the Web have
also added voice capability. Similarly, phones that were primarily voice
devices now enable users to browse the Web and access enterprise applications
and messaging systems. By 2004, analysts expect more users to access Internet
and Web-based services via mobile devices than via PCs. IT managers must
therefore invest to support rich user experiences across these devices.
2.  The evolution of wireless networks to 2.5G and 3G. Use of wireless
networks for Internet and applications access has been thus far hampered by
the low data throughput. Next-generation wireless networks will make
broadband speeds possible, making these networks a viable option for
accessing robust Web-based applications and services.

Why focus on enhancing interactions? The e-business investments made in the
past few years have enabled rich interactions, but only for users on PCs. As
shown above, customers, partners, and employees are interacting across
multiple channels and modes. Providing consistent levels of interaction
across these modes will have a direct impact on key business metrics such as
customer satisfaction, retention, and productivity. Ultimately, a company's
ability to handle interaction complexity in a scalable manner may well define
its competitive position. As the interaction complexity and scale
requirements increase, enterprises must look to technologies that will enable
them to deploy highly effective interaction systems.

The Role of Speech
Speech recognition is a critical element in addressing the interaction
challenge mentioned above. Over the past several years, advances in speech
recognition have increased the accuracy level to over 90 percent. In
addition, the power of microprocessors has increased tremendously. As a
result, automated speech recognition (ASR)-based systems are now able to
handle a large number of complex queries, making them a viable solution for
businesses.

Speech recognition has already been widely deployed. Most telecommunications
carriers use speech recognition for handling 411 queries and directory
assistance; Sprint PCS provides a voice-activated dialing service that uses
speech recognition. Banks and brokerage houses use ASR to deliver stock
quotes and account information.

The following are the most common examples of speech-based applications:

•   Call center. Speech recognition is employed in the call center to handle
routine and common inquiries (e.g., shipping information and stock quotes).

•   E-commerce. An e-commerce Web site provides its users with a rich set of
information. Visitors to e-commerce sites can place and track orders, view
shipping information, and navigate through an extensive catalog. Enabling the
same functionality over a non-speech-based (i.e., touchtone) interactive voice
response (IVR) system is nearly impossible: while IVR handles scale very well,
it fails to handle complexity. The rich information available to users
browsing the Web can be extended to users on the phone by speech-enabling Web
sites.

•   Voice portals. Voice portals are speech-enabled Web sites intended for
use by business partners and employees. In the case of employees, the voice
portal should be an extension of an existing enterprise intranet. Employees
can have voice-based access to enterprise applications such as sales force
automation as well as their personal information such as 401(k) status,
payroll records, etc.

The Business Case for Speech
Implementing speech can have a significant impact on a company's business.
Depending on the context in which speech is being considered, implementing
speech recognition can yield benefits in one or more of the following areas:

1.  Cost reduction. Over the past few years, enterprises faced with
increasing call center costs have turned to speech recognition to achieve
cost reduction in two areas:
a.  The number of call center agents
b.  The cost per call

Speech recognition will never eliminate the need for live call center agents.
However, it does make it possible to handle increasing call volume without a
linear increase in the number of call center agents. Most call center agents
handle very common inquiries. For example, call center agents at a large bank
may, for the most part, answer inquiries regarding account status that could
be automated using speech technologies. In addition, by reducing the length
of calls, enterprises can reduce the cost per call. This means that by
automating such common requests, an enterprise call center can continue to
provide a high level of customer service and satisfaction while maintaining
control on call center costs. Additional benefits of implementing speech in a
call center include:

•   Reducing call and hold times
•   Increasing the number of calls a call center can handle with the same
number of agents
•   Reducing queue time by allowing users to get their own information
•   Reducing usage of ports dedicated to an IVR system by speeding up caller
interaction

2.  Revenue enhancement. Savings from the reduced need for call center agents
can be applied toward more revenue-enhancing efforts. These agents can be
either diverted to handle direct revenue generation functions or repurposed
to focus on the highest value (most profitable) customers, who require a
high-touch approach. For example, by implementing speech recognition in the
call center, a large financial services company has been able to retrain and
redeploy its customer service representatives to provide high-value advice
that directly results in revenue generation. Increased customer satisfaction
will also contribute to higher revenues. Enterprises can leverage speech
recognition to handle complex requests over the phone and expose more
services to their customers. This rich experience increases customer
satisfaction and retention, leading to higher revenues.
3.  Productivity and business agility. Multiple locations, mobile employees,
and increasing use of portable devices such as PDAs and cell phones lead to
demand for anytime, anywhere access to enterprise information. When logging
into the corporate network via a graphical user interface is not possible,
enterprises can improve employee productivity via a voice portal, a
speech-recognition-based service that enables employees to log in and access
applications through spoken prompts and commands. Examples of resources that
could be accessed through telephony-enabled applications include e-mail,
401(k) information, and meetings and schedules.

Benefits of Implementing Speech

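              | Speech-enabled applications
Benefit       | Call Center         | E-Commerce           | Enterprise Voice Portals
--------------+---------------------+----------------------+-------------------------
Customer      | Shorter call wait   | B2C portals can      |
satisfaction  | times               | provide richer       |
              |                     | information to phone |
              |                     | customers            |
--------------+---------------------+----------------------+-------------------------
Customer      | Complex inquiries   | Anytime, any-mode    |
retention     | quickly and         | access delivers a    |
              | reliably handled    | better value         |
              |                     | proposition          |
--------------+---------------------+----------------------+-------------------------
Revenue       | Agents repurposed   | Additional revenue   |
enhancement   | toward              | opportunities due to |
              | revenue-generating  | expanded access      |
              | activities          |                      |
--------------+---------------------+----------------------+-------------------------
Business      |                     |                      | Speech-enabled B2B
agility       |                     |                      | portals provide
              |                     |                      | expanded access
--------------+---------------------+----------------------+-------------------------
Productivity  |                     |                      | Expanded access to
              |                     |                      | information and
              |                     |                      | messaging
--------------+---------------------+----------------------+-------------------------
Cost          | Fewer agents and    |                      | Common tasks (such as
reduction     | lower cost per call |                      | 401(k) info) can be
              |                     |                      | automated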

Challenges to Implementing Speech
While the return on investment for speech is clear, implementing speech today
is expensive. Speech recognition is typically implemented as part of a voice
response system. When implementing speech, enterprises must consider three
critical factors:
1.  Cost of hardware and software. Companies implementing speech-based IVR
face an average fully loaded cost (hardware, software and services) of
between $2,000 and $3,500 per port. This is primarily because of the cost of
specialized hardware and software.
2.  Cost of moves, adds and changes (MACs). The high cost of MACs has plagued
the industry for a number of years. This is primarily due to the fact that
most speech recognition-based IVR systems do not take advantage of open
standards.
3.  Investment in training and "portability" of developers. Additional costs
are incurred from the fact that the development environment for speech
recognition-based IVR systems (and IVR systems in general) is
vendor-specific. Developers must be trained for specific systems and often
cannot be repurposed for other projects. This lack of portability
dramatically increases the total cost of ownership of speech. Further,
trained developers are hard to find.

Microsoft's Vision for Speech
The speech industry today is akin to the computer industry 30 years ago, when
the industry consisted primarily of mainframes and only a few large companies
had the financial wherewithal to buy and support such systems. While some
advances in the use of standard PC hardware for speech systems have been
made, the industry is still hampered by the following:

1.  Proprietary hardware and software
2.  Lack of software standards
3.  Lack of a large developer base
4.  Lack of broad distribution for speech products

All these factors combine to make adding speech a very expensive proposition.

Microsoft's vision for speech centers around delivering a speech recognition
platform that will make speech a truly mainstream application. Microsoft is
focusing its efforts on tools and technologies that will let enterprises add
speech to Web sites and Web-enabled applications, thereby extending rich Web
site functionality to non-PC users. In doing so, Microsoft is addressing key
customer needs such as:

1.  Lower cost of deployment
2.  Lower cost of application development and integration
3.  Ease of application development
4.  Developer portability

Microsoft is delivering these benefits via five key initiatives:

1.  Open standards. Speech Application Language Tags (SALT) is a set of
lightweight extensions to HTML and XHTML that make it possible to add speech
to Web services. SALT is an emerging standard, backed by leading companies,
that provides a better alternative to VoiceXML.
2.  Industry partnerships. In October 2001, Microsoft launched the SALT Forum
along with industry leaders such as Cisco, Intel, Philips, SpeechWorks, and
Comverse. Since then a further 18 Contributors have joined the Forum. The
SALT Forum's goal is to advance the standard for speech with industry
participants including telephony application vendors and voice hardware
platform and device manufacturers.
3.  Common platform. The Microsoft .NET speech platform is based on the same
Microsoft .NET technologies that are being deployed by enterprises in other
areas. The benefit of having a single platform for Web and voice applications
is a significant reduction in costs related to application development and
maintenance.
4.  Developer excitement. To date, the application of speech has been
hindered in part due to the limited availability of trained developers.
Microsoft is delivering a speech platform based on widely adopted Microsoft
.NET technologies, bringing the Web programming model via Microsoft Visual
Studio® .NET to speech application development, and making significant
investments in training its more than 6 million developers in the area of
speech application development. This will enable enterprises to draw from a
large base of trained developers, thereby lowering their costs. In addition,
since the applications are based on familiar Web programming models, Web
developers can be retrained to develop speech user interfaces.
5.  Great applications. Microsoft is actively working with the ISV community
to develop SALT-based applications. Over the next few years, applications
that are ready to take advantage of speech will become readily available.

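Microsoft Speech Offering Answers Customer Needs
[Table: the five initiatives (Open Standards, Industry Partnerships, Common
Platform, Developer Base, Applications) as rows against the four customer
needs (Lower Cost of Deployment, Lower Cost of Development, Ease of
Development, Portability) as columns. The cell entries did not survive in
this copy.]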

SALT Architecture
Microsoft has greatly simplified the development of speech user interfaces
for Web-based applications through the use of the Speech Application Language
Tags (SALT) developed by the SALT Forum. SALT is a small set of XML elements
that apply a speech interface to a document using HTML, enabling developers
to add a spoken dialog interface to Web applications and Web services. Web
application developers can use SALT equally effectively with HTML, XHTML,
CHTML, and WML. Using SALT, the applications can be written to support both
voice-only and multimodal browsers.
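
To make the "small set of XML elements" concrete, the following Python sketch embeds a hypothetical SALT-enabled XHTML page as a string and separates the ordinary HTML elements from the SALT layer. The element names (prompt, listen, grammar, bind), their attributes, and the namespace URI are assumptions based on the published SALT specification and should be checked against it; the page content, ids, and grammar file name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed SALT namespace URI; verify against the SALT Forum specification.
SALT_NS = "http://www.saltforum.org/2002/SALT"

# Hypothetical page: ordinary XHTML plus a few SALT tags.
# The ids, prompt text, and grammar file name are invented for this sketch.
PAGE = f"""<html xmlns:salt="{SALT_NS}">
  <body>
    <input name="txtCity" type="text" />
    <salt:prompt id="askCity">Which city would you like?</salt:prompt>
    <salt:listen id="listenCity">
      <salt:grammar src="city.grxml" />
      <salt:bind targetelement="txtCity" value="//city" />
    </salt:listen>
  </body>
</html>"""


def split_layers(page):
    """Return (html_tags, salt_tags), each in document order."""
    prefix = "{" + SALT_NS + "}"
    html_tags, salt_tags = [], []
    for element in ET.fromstring(page).iter():
        if element.tag.startswith(prefix):
            # Strip the namespace part so only the local tag name remains.
            salt_tags.append(element.tag[len(prefix):])
        else:
            html_tags.append(element.tag)
    return html_tags, salt_tags


html_tags, salt_tags = split_layers(PAGE)
print("HTML layer:", html_tags)
print("SALT layer:", salt_tags)
```

The sketch only inspects the markup. In a real browser the page's script would start the prompt and the listen element, and the bind element would copy the recognized value into the text input; none of that runtime behavior is simulated here.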

SALT is a great solution for adding speech because these tags leverage the
scripting and event model inherent in HTML to implement the interactive flow
with the user. Using SALT has a number of benefits:
•   Reuse of application logic. Because the speech interface is a thin markup
layer that applies presentation logic only, the code used for the business
logic of the application itself can be used across different modalities and
devices without modification.
•   Rapid development. Mastering SALT is a rapid process because there is
little new information to be learned. Developers can use existing Web
development tools for the development of SALT applications.
•   Speech+GUI. The simple addition of speech capabilities to the visual page
provides a way of instantly creating multimodal applications, either from
existing Web applications or from scratch. In other words, enterprises do not
have to discard their investments in the development of Web applications.

There are four key components to speech-enabling a Web application using SALT:
1.  Web server. The Web server generates Web pages containing HTML, SALT, and
embedded script. The script controls the dialog flow for voice-only
interactions.
2.  Telephony services. The telephony server connects to the phone network.
The server incorporates a voice browser interpreting the HTML, SALT, and
script.
3.  Speech services. The speech platform recognizes speech and plays audio
prompts and responses back to the user.
4.  Client device. For voice-only applications a phone is all that is
required. For multimodal applications, clients include a Pocket PC or desktop
PC running a SALT-enabled version of Microsoft Internet Explorer browser
software.
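
The division of labor among the four components can be sketched as a toy walk-through of a single voice-only turn. Everything in this Python sketch is a hypothetical stand-in: the class and method names, the dictionary that takes the place of an HTML + SALT page, and the string matching that takes the place of real speech recognition. It is meant only to show which component does what, not how the Microsoft platform is implemented.

```python
class WebServer:
    """Generates the page; a dict stands in for HTML + SALT + script."""

    def get_page(self, url):
        return {
            "prompt": "Which city?",
            "grammar": {"boston", "seattle"},
            "field": "city",
        }


class SpeechServices:
    """Plays prompts and recognizes speech (both simulated with strings)."""

    def play(self, text):
        return f"[audio] {text}"

    def recognize(self, utterance, grammar):
        # Stand-in for recognition: accept only words the grammar allows.
        word = utterance.strip().lower()
        return word if word in grammar else None


class TelephonyServer:
    """Hosts the voice browser that interprets the page for a phone caller."""

    def __init__(self, web, speech):
        self.web = web
        self.speech = speech

    def handle_call(self, url, caller_says):
        page = self.web.get_page(url)
        audio_log = [self.speech.play(page["prompt"])]
        result = self.speech.recognize(caller_says, page["grammar"])
        form = {page["field"]: result}  # value bound back into the page
        return audio_log, form


# The client device is a plain phone, represented by the caller's utterance.
server = TelephonyServer(WebServer(), SpeechServices())
audio_log, form = server.handle_call("/flights", "Seattle")
print(audio_log, form)
```

An utterance outside the grammar, such as "Paris" here, leaves the field empty (None), which is the point at which a real dialog script would re-prompt the caller.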

Microsoft .NET Speech Platform and Tools
Microsoft provides a speech application platform and related tools that will
enable enterprises to deploy SALT-based speech applications. The speech
application platform includes the following:
•   Microsoft .NET Speech SDK. The SDK is a set of developer tools, samples,
and documents that allow Web developers to create, debug and deploy
speech-enabled applications. The speech-authoring environment is seamlessly
integrated into Visual Studio .NET, allowing the developer to leverage the
strengths of Visual Studio .NET such as toolboxes, graphical development
environment, and multiple views. Visual Studio .NET includes the ASP.NET
speech controls, prompts, and grammars that can be used with any other
ASP.NET controls to speech-enable Web pages.
•   Speech application servers. The server software supports standard
telephony hardware and can be deployed on standard Intel-based servers,
significantly lowering the cost of deployment. It consists of two components:
    -   Telephony server software, which allows users to interact with HTML
and SALT-enabled Web applications from any phone. The software will run on
standard Windows-based hardware and uses a telephony board to connect to the
telephone network.
    -   Speech server software, which connects to the telephony server on a
LAN and serves multiple speech recognition and prompt/text-to-speech playback
requests for the telephony server. It will also be able to serve speech
recognition and audio prompt playback requests to Pocket PCs and other
devices. The platform can be scaled by simply adding more speech servers in a
"speech farm" environment, very similar to the deployment of Web servers in a
Web farm.
•   Client software for PCs and Pocket PCs running the SALT add-on for
Internet Explorer or Pocket Internet Explorer that enables these devices to
connect to a remote speech platform using an IP connection. This means that
users can speech-enable devices that are not otherwise capable of running
speech software locally by installing a small add-in telephony server.
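
The "speech farm" idea in the speech server item above, scaling by adding servers as in a Web farm, can be illustrated with a small dispatcher. The white paper does not say how requests are distributed across speech servers, so the round-robin policy in this Python sketch is an assumption, and the class, method, and server names are hypothetical.

```python
from itertools import cycle


class SpeechFarm:
    """Round-robin dispatcher over a pool of speech servers (assumed policy)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._rotation = cycle(self.servers)

    def add_server(self, name):
        # Scaling out: add capacity, then restart the rotation over the new pool.
        self.servers.append(name)
        self._rotation = cycle(self.servers)

    def dispatch(self, request):
        # Pair the request with the next server in the rotation.
        return next(self._rotation), request


farm = SpeechFarm(["speech-1", "speech-2"])
before = [farm.dispatch(f"reco-{i}")[0] for i in range(4)]

farm.add_server("speech-3")
after = [farm.dispatch(f"reco-{i}")[0] for i in range(3)]

print(before)
print(after)
```

With two servers the requests alternate between them; after a third is added, the same dispatcher spreads requests over all three without any change to the callers, which is the property the Web farm comparison relies on.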

Beyond Telephony: Multimodal Computing
Multimodal computing refers to the ability for users to input data in various
forms (speech, keyboard, and pen) and receive a response in the mode most
appropriate for the application and the device. For example, a customer using
a multimodal device would be able to speak into the device ("show me quotes
for Microsoft, IBM, …") and receive an answer that is displayed on the
device.

The emergence of converged mobile devices, such as PDAs that allow users to
make voice calls and cellular phones that allow users to access Web-based
applications and services, combined with the rapid deployment of wireless
broadband networks, creates new opportunities for enterprises to deliver a
rich interaction experience. Already the number of wireless phones equals the
wireline installed base (approximately 1 billion). As the capability of
wireless networks improves, industry analysts expect wireless devices to
consume a greater portion of voice and data traffic.

Why consider multimodal technology? As shown in the earlier part of this
document, enriching the user experience will have significant impact on a
company's fundamental business objectives. Multimodal technology delivers the
following benefits:
•   End users. Multimodal applications enable common behavior and services
across all devices. This means that end users experience rich interactions
using whatever mode they wish. Speech interaction in a multimodal environment
provides the end user with a natural, easy-to-use mode of input without
sacrificing the informational benefits of a visual output. As a result,
enterprises will realize greater customer satisfaction and increased usage.
•   Enterprises. Enterprises will benefit from a single deployment
infrastructure that spans Web and voice applications. This approach leads to
significant cost savings in development, deployment, and maintenance of
applications.
•   Developers. The use of familiar Web development techniques and tools in
the platform makes it easier for developers to speech-enable applications.

Conclusion
Speech is an essential element in the road to multimodal computing, and the
Microsoft .NET speech platform and products provide a smooth migration path.
All the tools needed for multimodal computing are in place: The SALT standard
described above is designed from the ground up to enable development and
deployment of multimodal applications. SALT-enabled Internet Explorer and
Pocket Internet Explorer browsers will be widely available, creating a large
base of devices that are capable of rendering multimodal applications.
Additionally, developers will be able to speech-enable existing or new
applications via standard development tools such as Visual Studio .NET.
Enterprises that license Microsoft speech products will find a smooth
transition to multimodal computing, and will be able to retain their existing
investments in speech application development.

© 2002 Microsoft Corp. All rights reserved.

The information contained in this document represents the current view of
Microsoft Corp. on the issues discussed as of the date of publication.
Because Microsoft must respond to changing market conditions, it should not
be interpreted to be a commitment on the part of Microsoft, and Microsoft
cannot guarantee the accuracy of any information presented after the date of
publication.

This document is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

Microsoft and Visual Studio are either registered trademarks or trademarks of
Microsoft Corp. in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the
trademarks of their respective owners.


※ Source: HIT Lilac BBS bbs.hit.edu.cn [FROM: 210.46.72.211]