Monday, September 10, 2012

Talking and Listening Technology : Media & Entertainment ...

August 20, 2012, Churchill Club, Palo Alto, CA - A diverse group of technocrats participated in a panel discussion on the future of speech in interfacing with technology. Quentin Hardy from the New York Times moderated the panel. Panelists included Sheryl Connelly from Ford Motor Company, Ron Kaplan from Nuance, Dan Miller from Opus Research, and Steve Wozniak from Fusion-io.

Voice changes the interface with devices and is an ongoing trend. The history of human relations with computers is one of increasing intimacy. We started with punch cards and patch panels, then moved to a command line interface. After Xerox invented the mouse and GUI, we got icons on the screen and could behave like a two-year-old child by pointing at something. Now we are moving past touch to a changed relationship with the computer.

Although the latest touch interfaces are easy to learn and use, they are still childish. What we really want is more understanding from our computers, which may come through voice recognition technologies. Voice input is more natural than other current input types and makes the machine seem more lifelike and personal. Consider the computer in Star Trek or the malevolent HAL in "2001: A Space Odyssey".

Computers are moving into all areas of our lives and increasingly, they are in the cloud. These changes imply access to infinite storage and compute resources and will change civilizations. The potential for a universal translator could reduce frictions in business and political interactions as well as changing the industrial landscape.

The potential for instantaneous contact appeals to many segments of business. Large companies such as Apple, Nuance, and Google are working on voice interfaces. There are good and bad aspects of these developments. Are the benefits overpromised? What about ownership and security? Is this technology going to be too disruptive?

Forecasts versus reality?
Connelly offered consumer insights on trends and indicators. To many, voice is a buy versus demand issue. Consumers seem to want the technology when it is easily accessible. Voice increases access, but we also need to consider information addiction. More options increase our influence on the world and our spontaneity, but the appetite for just-in-time information may become insatiable.

For example, BlackBerrys are considered online oxygen by their users. As the market grows, we need to consider more issues like security. Universal online access empowers and engages people to change. Increased spontaneity will reduce poverty over time. In some areas, information becomes a status symbol and makes its owner a go-to person for access to that information.

At the same time, we don't understand the psychological effects of getting information. Studies have noted changes in brain function and activity, based on computer inputs, and voice adds to the access and interactions.

Increased velocity of information?
Kaplan stated that voice technologies have been promising since the '70s. PARC enabled computing to become more personal, allowing computers to deliver a complex and diverse array of features. The technology, however, was not accessible to the average person. There is a danger that an increasing number of functions can overwhelm the user with complexity.

Therefore, we have to apply computerization to create the illusion of simplicity. Simplifying interfaces, however, is not easy. When Xerox created windows, mouse, and icons for a graphical user interface, they made it seem easy to get to the underlying features without a lot of complexity. It took Apple 10 years to move these interfaces to a computer. The user interface organized information that had held a lot of complexity. The user only had to deal with icons which presented the underlying functions and the associated computer commands to access those functions.

Connelly interjected that a conversational user interface, using ordinary language to get to desired results, is a natural and easy way to interact. What is really wanted is a personal assistant, a friend, to help out with things. The GUI was okay for the last 20 or 30 years, but now has become limited.

A command is not the same as intention?
Kaplan continued by noting that a verbal interface should provide a "what I want" versus "what I say" capability. Most information transfer does not involve verbalization. Seven years ago, voice recognition systems were like talking to your one-year-old. Now, this has evolved to talking to a five-year-old, and is starting to get more interaction. Perfection in the interface is not required; it's more important to get easy and natural interactions, like talking to your spouse, who may register up to 80 percent of what you say.

Why move to voice?
Miller responded that it is a mixture of hope and hype. The possible applications include conversational commerce and people-to-people, people-to-machine, and even machine-to-machine interactions. The technology is getting better, but there are still gaps. In the last eight or nine years, word and gesture interfaces have been getting better. There is lots of energy in the various recognition areas.

Recombinatory communications have been helped by moves from client/server architectures to a Web base, and everything is becoming much more simplified. Developers can change functions more easily when the software is not on individual machines. We are in a space where expectations create their own demands, and these demands are limited only by the imagination of users.

Wozniak added that everyone prefers a light interface. Apple has been trying to keep the interface as natural as possible. They worked hard to make things easy to use, and put in the software to make a machine work like a human for the Apple II. A personal feeling is to get things that are what I want, devices that make life nice.

Now government and military organizations are using consumer technology for their internal applications. These technologies are familiar to the users. These technologies are moving to the desktop, and the devices are becoming closer to the people, involving touch, hearing, and robust speech. The speech functions have to employ the redundancies and other human characteristics of communication to improve the human-computer interface.

Siri has been described as a dodo?
Wozniak responded that before Apple acquired it, Siri did well and gave results in expected formats. Now it gives bad information and is not natural.

Is Siri developing a knowledge base?
Kaplan answered that they are enabling technology advances and are working on enabling intent through lots of users and their experiences.

Consequences of the change in interfaces?
Miller noted that there is no question about the many consequences. Most users are just trying them but they are not necessary for the computer.
Wozniak added that technology should know how to sense intent. People want things, and just have to know how to use them, but don't need to understand the underlying technology.

Voice in cars?
Connelly stated that vendors must understand the context of the interface. For phones, the standard use model is to hold the phone in both hands. For in-vehicle use, if a person still needs to have connectivity, it has to be without a screen or other normal phone functions. Because Ford is not an expert in phone systems design, it partnered with leaders like Nuance and Microsoft to make the Bluetooth technology work for people.

Connected devices in the cloud?
Connelly noted that young people prefer phones over cars, because the extra devices increase their productivity more than just talking. The phone can pair through Bluetooth with any other devices. This changes the way people interact with their cars. For example, the system could read incoming text messages through the radio.

This is not really a new way to do this; talking back as part of the interface has been around for a while?
Miller offered that now the ads will talk to us. This is one of the differences between Apple and Facebook: to deliver interactive ads on the phone, the system must also understand location, time, and current activities. If this function is convenient, people will use it. Otherwise, they'll ignore it.
Connelly added that acceptance will require entertainment, information, and education as a reward for the interaction.
Wozniak appended that personal ads and location information will matter.
Miller then suggested that the cloud will enable natural language processing for much more accurate speech recognition. Developers will need hooks and APIs etc. to carry out commerce. At some point, the systems will infer intent so that ads can provide information that leads to a transaction.
Kaplan opined that the phone in the car enables communications and becomes a universal interface to many devices. Many devices exist in the market, but people don't know how to use them together because of the high levels of complexity. Universal indications, for example, could interface with your home thermostat, but no one wants to get an ad from the thermostat.
The context is critical, so a relevant message might be "chicken is ready to eat" from your oven when you're commuting home.
Miller offered the concept of universal communications with tentacles versus single use functions. The context has to be identified to be able to fix problems.

Lots of brute force compute?
Connelly agreed that some parts of the car, from the cell phone link to connections with other devices, require a lot of computing power. But just as tablet technology has drawn from computers, the convergence of technologies changes the nature of companies and makes competitors change.
Wozniak said that Detroit is starting to move to Silicon Valley because of this need to link to technology creators.
Connelly admitted that Ford is opening an office in Silicon Valley because of the need to be here to change its own business. Voice brings about an intimacy with technology that creates the illusion of personalization. A study at MIT looked at what this means to society, and one of its findings was that people use technology to escape loneliness.

A technology trick to pretend understanding?
Connelly mentioned that solitude is not the same as loneliness. As we move further along this path, what happens to quiet spaces and introspection?
Wozniak noted that quiet people will still do introspective things.

Technology has to be good. To start, we need the cloud, a good user interface, and?
Wozniak said we have to include other senses and make the system into our own best friend. You and I watch faces and expressions, smell things, and understand that words have multiple meanings and various other communications attributes. The devices in our pocket have to become this best friend.
Miller added that we need to consider security such as voice biometrics. It now appears that voices are relatively unique, so a natural user interface would recognize you as an entitled person.
Kaplan suggested that technology has to close the gap between speech and intent. Context and information must acknowledge you, who you're talking to, situations, and all other information resources in context. Integration is both the key and the challenge.

Lots of computer or?
Kaplan noted that change happens over time. The mainframe evolved to the PC which moved to the cloud which is again a mainframe. Private information in devices and other sources will be compartmentalized for security and personalization.
Wozniak added that the cloud and information in the cloud take lots of networks. These networks are replacing parts of our brains and create a mixture of local, cloud, etc. for compute and storage services. The cloud is currently too techie and complicated and can create more problems than it solves. The cloud is not the same as a trusted friend because users have no control over the cloud.
Connelly suggested that this compute power presents an opportunity. When technology works, it is exquisite. But it doesn't deliver a lot of the time. Given information overload, accuracy, credibility, reliability, and engagement have to deliver value. Consumers buy products based on the number of features, and their greatest disconnect is when those features don't work.

Intent and basic words plus sets of tasks that work together?
Wozniak said that people can just write their own programs and change their own lives. Mobile makes computers available everywhere and can link everything and everyone seamlessly. The apps businesses are integrating information to solve issues.

This integration is being worked on now. Nuance versus Google or Microsoft?
Kaplan answered that we have two separate functions - communications and execution. Basic vocabularies are based on the concept that English is English. Speech is moving towards language interpretation and also developing abilities for planning and other intent-based functions. The challenge is to get all the pieces to connect, develop and adopt standards like semantic Web to enable annotation with apps. Many of these pieces exist but are not integrated. The system doesn't always have to understand correctly, but just has to figure out how to repair any issues. The ease of change is important.

Microsoft has a relatively simple and intuitive interface that depends on VoIP uniqueness, and masks other illiteracies?
Connelly suggested that a solution is to flatten language. The interface must move from a directed research into natural language. Ford changed its navigation interface to accept natural language. This change creates the illusion that the technology is getting smarter and gives the software greater latitude to decide what to do.
Miller decried the state of simultaneous translations. Lots of stuff doesn't work well yet, and many translations need a review cycle. For example, starting with a phrase in one language, translating it to another, and then translating it back to the original language doesn't result in the same statement. Adding in context and intent only complicates this issue.
Wozniak agreed that better programs are a long way off.
Connelly observed that the technology for speech recognition is migrating to everyday use. The technology is becoming common, moving into everyday language, but it's still not intuitive.

Latency?
Wozniak observed that a GUI is slower than a single command line function, but is faster than the totality of all the commands needed for complex functions. The user doesn't need to memorize as much when the device can take care of it. The implication is that as other inputs take over, speech may degrade, as evidenced by the increase in texting and grammar issues in schools.
Connelly appended that teachers now have to tell students that no text abbreviations are allowed in a written test. There are extraordinary implications relating to the user interface issues, because 90 percent of all communications are nonverbal.

Stakes for machines due to raised expectations?
Miller offered that people have a greater sense of betrayal.
Connelly added that if the system misses something, people yell, which makes the system's job more difficult, and then start talking very slowly, which also exacerbates the system's problems.
Kaplan added that people generally take this first mistake as a predictor of future issues. Confidence in the responses will allow forgiveness.

Intimacy factor and frustration? Emotional sensor?
Wozniak answered that it's too complex and will never be as good as a person.
Miller responded that speech analytics are getting more investments and one part of this is emotion detection. So far, most efforts are in detecting increases in pitch.
Connelly noted that current systems are capable of detecting drowsy drivers. NASA is investigating tools to measure mental states.

So nothing will tick you off more than being told you're pissed off?
Connelly responded that the technology is less than 10 years old. Currently this interface is a luxury and people have the ability to discount the inputs.
Kaplan noted that use modes are accelerating. This Siri effect is causing more investments into the area. The industry needs to create an image of what may be possible then create a reality that matches that. All the efforts and advances will change with technology.

Personalized inputs and technology?
Kaplan didn't know exactly what this would be. Systems can address needs and be enablers but there is a gap in how the systems will learn your personal preferences. Personalization will require more data for inferences.
Wozniak quipped that some personalizations can be really bad. For example, Pandora is too good at identifying music you'll like, and it keeps you up to all hours of the night listening to new music.

Source: http://mandetech.com/2012/09/09/talking-and-listening-technology/
