At last week’s Austin GDC, we had the opportunity to sit down with Vivox, a company that’s been implementing advanced voice communication technology in many popular online games, including EVE Online. Read on for our comprehensive interview, along with thoughts from our own experience with the technology.
Kiersten: Now one thing with Second Life: when you do encounter small groups of people, are you hearing their channels as you walk past them, and then the next channel as you walk on, almost like a crowd you’re walking through?
Monty: What we did with Second Life is deploy a wide open audio space so the whole world is in open audio. And that works for them because it’s very social. Over your head you see the talking indicator which looks like sound waves and we’re sending intensity events so I talk like this (gestures with his hands) because it is genetics. When my character talks, he gestures as he speaks and he does all this sort of stuff, so if you’re in a social area in game, that sort of stuff makes it much more engaging than a bunch of avatars standing there like statues.
Getting that richness is a big part of it, so for us we’re really working with the game developers to give them all the tools to make it work the way they want it to work, and then for the players it really means the game can integrate voice in a way that matters and makes it easier to play the game.
For EVE, they do a lot of fleet combat, like massive raid parties with 200 people. The fleet is broken up into wings and squadrons, so we essentially assigned voice channels to all of those, and the fleet commander just points and says I want to talk to this squadron, I want to talk to this group, that group, and gets the message across.
Kiersten: Meanwhile they can be talking amongst themselves without interrupting what’s going on.
Monty: Yes, without annoying anybody else. And we can do intelligent things, like when the fleet commander speaks, everybody goes silent. So it’s all these sorts of little things that bit by bit, game developers are integrating into these games and it just creates a much better environment to play in, and it’s simpler. Everybody here [at the conference] wants to get more casual users, they want the game to reach a wider base of gamers. Well, the way to do that is to make it simple for them. The simpler it is, the friendlier it is, the better.
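To make the fleet setup Monty describes a little more concrete, here is a minimal sketch in Python of one voice channel per squadron with a commander override. It is only an illustration of the idea: the class names, method names, and the rule that squadron chatter is muted while the commander addresses that squadron are assumptions, not Vivox's or EVE's actual API or behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """A hypothetical voice channel for one squadron."""
    name: str
    members: set = field(default_factory=set)


class FleetVoice:
    """Toy model: one channel per squadron, plus a commander override."""

    def __init__(self, commander):
        self.commander = commander
        self.channels = {}              # squadron name -> Channel
        self.commander_targets = set()  # squadrons currently being addressed

    def add_member(self, squadron, pilot):
        self.channels.setdefault(squadron, Channel(squadron)).members.add(pilot)

    def commander_speaks_to(self, *squadrons):
        """The commander 'points' at one or more squadrons."""
        self.commander_targets = set(squadrons)

    def commander_stops(self):
        self.commander_targets = set()

    def listeners(self, speaker):
        """Return the set of pilots who hear `speaker` right now."""
        if speaker == self.commander:
            heard = set()
            for name in self.commander_targets:
                if name in self.channels:
                    heard |= self.channels[name].members
            return heard
        for channel in self.channels.values():
            if speaker in channel.members:
                # Assumed rule: squadron chatter goes silent while the
                # commander is addressing that squadron.
                if channel.name in self.commander_targets:
                    return set()
                return channel.members - {speaker}
        return set()
```

In this sketch a pilot in one squadron is never heard by another squadron, and only the squadrons the commander has selected go quiet while the commander talks.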
Kiersten: Now what about accents? How does it handle accents and in the future will you be able to give yourself an accent?
Monty: We’re actually, well, there’s a lot of research going on. We’re partners with IBM, so there’s a couple of things we’re doing with text to speech and speech to text, and there’s a group that is devoted to removing accents. The voice font team is actually looking at some ways of adding accents. So our whole thing is we want an immersive event where you can really tweak your voice to the point where you’re happy with the sound.
It’s great for roleplay, the worst thing about roleplay right now is even the people who most want to do it, end up sounding like Minnie Mouse. So it’s that sort of thing that breaks it, and, for example, I have a ten year old son and I can make him sound like his sixteen year old cousin to the point where my brother can’t tell the difference between the two.
Kiersten: I actually can’t wait till you do more with accents. I mean, I have an accent, a faint one, but I’d love to sound Irish or something different, I think that would be awesome.
Monty: Yeah, we’ve done some lilting stuff, and the biggest thing is we can’t do it in advance of where you are, but rhythmically we can. Some tests we’ve done have been uncanny, and we can’t figure out if it’s uncanny just because of what the person said or if it really did hit the right cadence.
Kiersten: Now that’s something I was wondering about, but right now the police are using chat rooms to trap predators. If you’ve got an adult and you can make him sound like a 10 year old girl, and it’s convincing, is that something you’re looking into?
Monty: We’re not looking into it, but we’d be happy to help.
Monty: To talk more about the text to speech stuff mentioned earlier, one of the new things we’re working on is speech to text. Our ultimate desire is that whether you’re speaking or typing, it’s all just mixed. So we go speech to text and text to speech.
Kiersten: In real time?
Monty: Yeah relatively real time. Actually text to speech real time, very little delay. Speech to text is still going to be a little while. So this is a sample I’ll play for you. This is two people trying to talk a third person out of a station in EVE to fight them.
At this point, I realize the reader is at a loss, but I’ve contacted Monty about providing a sample wav file for you to listen to. When it arrives, I’ll flash it in the news so keep watch!
Quick synopsis: I could clearly hear at least three people talking, and it was easy to tell which of the three was talking based on their voice patterns. One was definitely female. While there was still a slightly unnatural sense to it, due to lack of inflection, it was very, very well done, and no words were mispronounced or misplaced.
Kiersten: I am impressed that not everyone sounds mechanical.
Monty: The voice quality is very good, and actually later on they start swearing. We’re working with IBM on this piece, and IBM, in their database, they’ve got swear words done better than I’ve heard out of most 10 year olds. Which is cool, like they mean it. Which is really the key element.
The big thing we’re going to start to experiment with on this is starting to work on what we call the loader voices. Right now all of the database is essentially converting text into sound, and all the databases are based on happy voices, but sometimes you want angry voice, pissed off voice, drunken voice, sarcastic voice. So the next effort is going to be to link in a series of these for each voice to have different libraries, so then you just stick an emoticon in front of the text and it will pick out of the right database and make you sound emotionally that way. As you start to add those things, it starts to get more real, it sounds better, it drives a whole different environment.
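The emoticon idea Monty describes could look roughly like the Python sketch below: a chat line is checked for a leading emoticon, which selects the voice library to synthesize from. The emoticons, the library names, and the `pick_library` function are all made up for illustration; the interview does not say how Vivox or IBM actually map emoticons to databases.

```python
# Hypothetical emoticon-to-library mapping; neither the emoticons nor the
# library names come from Vivox or IBM.
EMOTION_LIBRARIES = {
    ":)": "happy",
    ">:(": "angry",
    ";)": "sarcastic",
}

# The interview says today's databases are all based on happy voices,
# so that is used as the assumed fallback.
DEFAULT_LIBRARY = "happy"


def pick_library(message):
    """Split a chat line into (library name, text to synthesize)."""
    # Try longer emoticons first so a longer match wins over a shorter
    # emoticon that happens to share its prefix.
    for emoticon in sorted(EMOTION_LIBRARIES, key=len, reverse=True):
        if message.startswith(emoticon):
            text = message[len(emoticon):].lstrip()
            return EMOTION_LIBRARIES[emoticon], text
    return DEFAULT_LIBRARY, message
```

A line with no emoticon falls back to the default library, so plain typed chat would still be spoken.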
Kiersten: One of the things I’ve always found that you miss in text is the inflection, you can take a sentence a lot of different ways when it’s written in text which is why there’s always the smileys at the end to indicate I’m just kidding and other indicators.
Monty: Yes that’s where this definitely comes in, it’s those sorts of things that will start to bring it in. And again that starts to impact things like accents and other stuff.
Kiersten: With your positional sound, you’ve got it set up so that all my healers are on my left so I know that’s who’s talking. Is there also relative positional sound? For example, this person is behind me on the right at the moment. Will he sound like he’s behind me on the right in my headphones?
Monty: Attenuation, yes, and not only will he sound like he’s behind you on the right, but he will get fainter as he moves away. And you can change that on the fly. So I walk into Ironforge, I want my attenuation range in, so I get a little buzz around me, but unless someone is very close, I don’t hear anyone distinctly. In a party scenario, I want that range to be large, so in the span of where we’re playing I get the sense that you’re in the distance over here but I can also still hear you clearly. To whisper we pop you into a private channel. It’s all up to the game.
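As a rough illustration of the attenuation Monty describes, the Python sketch below computes left and right volume for a speaker from the listener's position, heading, and an adjustable range. Everything in it is an assumption: the function name, the linear falloff, and the simple constant-power stereo pan are stand-ins, not Vivox's actual algorithm. Plain stereo panning like this only captures left versus right and distance; making a voice sound like it is behind you would need additional audio processing that is not shown here.

```python
import math


def positional_gain(listener, facing_deg, speaker, max_range):
    """Return (left_gain, right_gain) for one speaker, each in [0, 1].

    listener, speaker: (x, y) positions in game units.
    facing_deg: listener heading, counter-clockwise from the +x axis.
    max_range: distance at which the speaker becomes inaudible (assumed
               to be adjustable on the fly, as described in the interview).
    """
    dx = speaker[0] - listener[0]
    dy = speaker[1] - listener[1]
    distance = math.hypot(dx, dy)
    if distance >= max_range:
        return 0.0, 0.0

    # Assumed linear falloff: full volume at distance 0, silent at max_range.
    volume = 1.0 - distance / max_range

    if distance == 0:
        pan = 0.0  # speaker is on top of the listener: keep it centered
    else:
        # Angle of the speaker relative to where the listener is facing;
        # a positive angle means the speaker is to the listener's left.
        relative = math.atan2(dy, dx) - math.radians(facing_deg)
        pan = -math.sin(relative)  # -1 = hard left, +1 = hard right

    # Constant-power pan so perceived loudness stays even across the arc.
    angle = (pan + 1.0) * math.pi / 4.0
    return volume * math.cos(angle), volume * math.sin(angle)
```

Pulling `max_range` in gives the "little buzz" effect in a crowded city, since only very close speakers get meaningful volume, while a large `max_range` keeps a distant party member audible but still placed to one side.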