Austin GDC 07: Interview with Monty Sharma of Vivox

Date: September 10, 2007
Author(s): K. Samwell

At last week’s Austin GDC, we had the opportunity to sit down with Vivox, a company that’s been implementing advanced voice communication technology in many popular online games, including EVE Online. Read on for our comprehensive interview and also thoughts from our experiences with the technology.


Vivox provides online games, virtual worlds and other online communities with robust, integrated voice chat. Vivox delivers superior quality voice chat, video, Instant Messaging (IM) and presence – all of which greatly improve gameplay and social interaction.

Today, Vivox is bringing voice to over one million subscribers in more than 180 countries. Currently, Vivox is being used in EVE Online, Second Life, K2’s War Rock, and Icarus’ Fallen Earth, among other applications. I was shown a video and audio demo which, while tough to outline in text, I will attempt to convey in such a way as to not leave the reader too confused.

First I was taken into the world of EVE, where they had a demonstration for me.

Monty: We’re currently live with EVE Online, and we actually just issued a new release with them. There are a few things in the new release, including improved audio quality. The audio quality is, I’ll say, fabulous, but you can be the judge of that.

Kiersten: Even for the lower broadband users?

Monty: Yeah, we actually cut the amount of bandwidth we use by more than 50%; in fact, the effective rate is now about 30% of what we were using before. So it’s pretty good. And the audio quality is amazing.

At this point, we both donned headphones.

Monty: We’re in game here now with a couple of guys who are back at our Boston office, and in Russia.

What I heard through the headphones sounded as clear as if they were standing beside me; clearer, in fact, considering how noisy the conference floor was at the time. I was able to completely understand both of them, which was pretty impressive. I don’t personally use voice in games, but the quality I experienced here made me realize that I may have to consider it.

Monty: So you can hear the audio quality in the sound and the other thing we have got going on in this build is multi-channeling. We can do up to six channels at the same time so you can scroll through, but you can see how one panel has a red light on it, that is the one I’m speaking on. The one with the blue light, I’m listening to. So in combat there’s a little window that you can pop up that you can click on channels and choose who you’re speaking to that quickly. So you’re hearing all these different channels but you speak to just the channel you want to.
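As a rough illustration of the listen-to-many, speak-to-one arrangement Monty describes, here is a minimal Python sketch. It is not Vivox's or CCP's code; the class name, method names, and channel names are hypothetical, and only the six-channel limit and the red/blue lights come from the interview.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_CHANNELS = 6  # the interview mentions up to six channels at once


@dataclass
class ChannelPanel:
    """Hypothetical client state: hear several channels, speak on one."""
    listening: List[str] = field(default_factory=list)
    speaking: Optional[str] = None

    def join(self, channel: str) -> None:
        """Start listening to a channel, up to the six-channel limit."""
        if channel in self.listening:
            return
        if len(self.listening) >= MAX_CHANNELS:
            raise ValueError("already listening to the maximum number of channels")
        self.listening.append(channel)

    def speak_on(self, channel: str) -> None:
        """Pick the single channel the microphone is routed to."""
        if channel not in self.listening:
            raise ValueError("join a channel before speaking on it")
        self.speaking = channel

    def light(self, channel: str) -> str:
        """Red for the channel being spoken on, blue for listen-only."""
        if channel == self.speaking:
            return "red"
        if channel in self.listening:
            return "blue"
        return "off"


panel = ChannelPanel()
panel.join("fleet")
panel.join("squadron")
panel.speak_on("squadron")
print(panel.light("squadron"), panel.light("fleet"))
```

Clicking a different channel in the combat window would correspond to another `speak_on` call; the set of channels being heard does not change.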

Kiersten: And right now you can only speak to one channel at a time?

Monty: Yes you speak to one channel at a time, but we’re looking at a broadcast channel.

(We put the headphones back on.)

Monty: So one of the other things that we’re dealing with is what we call voice fonts. So we’re speaking to an echo channel, you’re going to hear your voice come back and you’ll hear me speak. So this is my base voice, now I take it over and I’m bigger, deeper. The voice is bigger, the resonances are different, we can take you higher, do gender changes, make you sound older, all of that stuff.

Kiersten: Is going higher easier than going lower?

Monty: Uh no, it’s actually the other way around.

So I’m speaking into an echo channel, and Monty selects “orc voice” from a drop-down. I speak into the mic, and I hear a rather gruff male voice answer me back.

Monty: Nobody would think you were a female with that voice.

Kiersten: Well possibly because it is a higher pitched male voice because there is still a difference between your voice in orc and my voice in orc.

Monty: Yes. Some of the stuff we’re doing in game implementation would give you the ability to control that. So you’d say, you know, I want it to be a little deeper than the base set, and adjust it as such. We announced a deal at Gen Con with Wizards of the Coast for DDI, Dungeons & Dragons, and for an old pen and paper guy, the ability of the dungeon master to flip his voice between various character races on the fly really changes that whole experience. It gives you a lot of versatility.

So we spent a year working on the voice font stuff. It’s some crazy math, but you know, the impact is smooth voices; they don’t come out as being really metallic and processed. And there’s the ability to do things like, if I wanted to sound like an emotionless space faring race, without naming any names, I can reduce my pitch range down. Now what we do with gender flip fonts: women speak in a wider pitch range than men do, so what we would do is take my voice and increase the range, or brighten my voice. It would give us a more interesting sounding voice. So all sorts of things like that.

We’ve done some experiments now with lilting voices and changing some of the cadence and rhythm of the voice and stuff like that. So there’s a lot of neat things that voice fonts can do.
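To make the "reduce my pitch range" and "increase the range" remarks a little more concrete, here is a toy Python sketch. It assumes the voice has already been reduced to a pitch contour (one fundamental-frequency value in Hz per frame, with 0 for unvoiced frames) and simply scales each value's distance from the average pitch. This is an editor's illustration with hypothetical names, not the math Vivox spent a year on, and it does not touch resonances or resynthesize audio.

```python
def scale_pitch_range(f0_hz, factor):
    """Scale each voiced pitch value's distance from the mean pitch.

    factor < 1 flattens the contour (a more monotone voice);
    factor > 1 widens it (a more animated voice).
    Frames with a pitch of 0 are treated as unvoiced and left alone.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return list(f0_hz)
    mean = sum(voiced) / len(voiced)
    return [mean + (f - mean) * factor if f > 0 else 0.0 for f in f0_hz]


contour = [100.0, 120.0, 0.0, 140.0]      # made-up values for illustration
flatter = scale_pitch_range(contour, 0.5)  # narrower range
wider = scale_pitch_range(contour, 2.0)    # wider range
print(flatter, wider)
```

A real system would presumably work in a perceptual scale such as semitones and clamp the result, since a large factor here can push values below zero.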

Kiersten: Are you able to let the user fine tune a little more than just choosing orc from a menu?

Monty: Yes, the way we work, everything is done with the game company so it’s going to be integrated into the game. The way CCP has integrated it [into EVE] works for their game.

At the time, Second Life was down, so the conversation turned a little.

Monty: For Second Life, we’re finding with voice, more people are staying longer in the game, more who try it are converting over to regular users to Second Life. And the whole thing behind that is Second Life is big, and you’ve got to find where something interesting is, and it’s sorta hard to sit there typing. “Hi there, anyone seen anything neat?” But if you’re walking past a group and they’re talking about this cool tree house thing, or this party going on, all of that is changing social dynamics.

What we’re starting to see is the whole recruitment side of it. When I start a game, and I’ve been a game player since Pong, anyway, when I buy a game, it takes me about half an hour before I say “I’m not playing this anymore”, so for a game developer the big issue is about when I get somebody in, can I really make them stay.

Interview Cont.

Kiersten: Now, one thing with Second Life: when you do encounter small groups of people, are you hearing their channels as you walk past them, and then the next channel as you walk on, almost like a crowd as you’re walking through?

Monty: What we did with Second Life is deploy a wide open audio space so the whole world is in open audio. And that works for them because it’s very social. Over your head you see the talking indicator which looks like sound waves and we’re sending intensity events so I talk like this (gestures with his hands) because it is genetics. When my character talks, he gestures as he speaks and he does all this sort of stuff, so if you’re in a social area in game, that sort of stuff makes it much more engaging than a bunch of avatars standing there like statues.

Getting that richness is a big part of it, so for us we’re really working with the game developers to give them all the tools to make it work the way they want it to work, and then for the players it really means the game can integrate voice in a way that matters and makes it easier to play the game.

For EVE, they do a lot of fleet combat, like massive raid parties with 200 people. The fleet is broken up into wings and squadrons, and we essentially assigned voice channels to all of those, so the fleet commander just points and says I want to talk to this squadron, I want to talk to this group, that group, and gets the message across.

Kiersten: Meanwhile they can be talking amongst themselves without interrupting what’s going on.

Monty: Yes, without annoying anybody else. And we can do intelligent things, like when the fleet commander speaks, everybody goes silent. So it’s all these sorts of little things that, bit by bit, game developers are integrating into these games, and it just creates a much better environment to play in, and it’s simpler. Everybody here [at the conference] wants to get more casual users, they want the game to reach a wider base of gamers; well, the way to do that is to make it simple for them. The simpler it is, the friendlier it is, the better.
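The fleet arrangement described here, with channels mapped onto wings and squadrons and a commander who overrides everyone else, can be sketched in a few lines of Python. The fleet layout, pilot names, and function names are all made up for illustration; how EVE and Vivox actually wire this up is not described in the interview.

```python
# Hypothetical fleet layout: wings contain squadrons, squadrons contain pilots.
FLEET = {
    "wing-1": {"squad-a": ["ana", "bo"], "squad-b": ["cy"]},
    "wing-2": {"squad-c": ["di", "ed"]},
}


def members(target):
    """Resolve 'fleet', a wing name, or a squadron name to a list of pilots."""
    if target == "fleet":
        return [p for wing in FLEET.values() for squad in wing.values() for p in squad]
    if target in FLEET:
        return [p for squad in FLEET[target].values() for p in squad]
    for wing in FLEET.values():
        if target in wing:
            return list(wing[target])
    raise KeyError(target)


def audible_speakers(active, commander):
    """If the commander is talking, only the commander is heard."""
    if commander in active:
        return [commander]
    return sorted(active)


print(members("wing-1"))                         # who hears an order to wing-1
print(audible_speakers({"bo", "cy"}, "ana"))     # normal squadron chatter
print(audible_speakers({"ana", "bo"}, "ana"))    # commander cuts in
```

The commander "pointing" at a squadron or wing corresponds to calling `members` with that group's name to decide who receives the audio.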

Kiersten: Now what about accents? How does it handle accents and in the future will you be able to give yourself an accent?

Monty: We’re actually, well, there’s a lot of research going on. We’re partners with IBM, so there’s a couple of things we’re doing with text to speech and speech to text, and there’s a group that is devoted to removing accents. The voice font team is actually looking at some ways of adding accents. So our whole thing is we want an immersive event where you can really tweak your voice to the point where you’re happy with the sound.

It’s great for roleplay, the worst thing about roleplay right now is even the people who most want to do it, end up sounding like Minnie Mouse. So it’s that sort of thing that breaks it, and, for example, I have a ten year old son and I can make him sound like his sixteen year old cousin to the point where my brother can’t tell the difference between the two.

Kiersten: I actually can’t wait till you do more with accents. I mean, I have an accent, a faint one, but I’d love to sound Irish or something different, I think that would be awesome.

Monty: Yeah, we’ve done some lilting stuff and the biggest thing is we can’t do it in advance of where you are but rhythmically we can. Some tests we’ve done have been uncanny and we can’t figure out if it’s uncanny just because of what the person said or it really did hit the right cadence.

Kiersten: Now that’s something I was wondering about, but right now the police are using chat rooms to trap predators. If you’ve got an adult and you can make him sound like a 10 year old girl, and it’s convincing, is that something you’re looking into?

Monty: We’re not looking into it, but we’d be happy to help.

Monty: To talk more about the text to speech stuff mentioned earlier, one of the new things we’re working on is speech to text. Our ultimate desire is whether you’re speaking or typing, it’s all just mixed. So we go speech to text and text to speech.

Kiersten: In real time?

Monty: Yeah relatively real time. Actually text to speech real time, very little delay. Speech to text is still going to be a little while. So this is a sample I’ll play for you. This is two people trying to talk a third person out of a station in EVE to fight them.

At this point, I realize the reader is at a loss, but I’ve contacted Monty about providing a sample wav file for you to listen to. When it arrives, I’ll flash it in the news so keep watch!

Quick synopsis: I could clearly hear at least three people talking, and it was easy to tell which of the three was talking based on their voice patterns; one was definitely female. While there was still a slightly unnatural sense to it, due to lack of inflection, it was very, very well done; no words were mispronounced or misplaced.

Kiersten: I am impressed that not everyone sounds mechanical.

Monty: The voice quality is very good. Later on they actually start swearing. We’re working with IBM on this piece, and the way IBM has swear words done in their database is better than I’ve heard out of most 10 year olds. Which is cool, like they mean it. Which is really the key element.

The big thing we’re going to start to experiment with on this is starting to work on what we call the loader voices. Right now all of the database is essentially converting text into sound, and all the databases are based on happy voices, but sometimes you want angry voice, pissed off voice, drunken voice, sarcastic voice. So the next effort is going to be to link in a series of these for each voice to have different libraries, so then you just stick an emoticon in front of the text and it will pick out of the right database and make you sound emotionally that way. As you start to add those things, it starts to get more real, it sounds better, it drives a whole different environment.
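The emoticon idea is easy to picture as a lookup step in front of the text to speech engine. The Python sketch below is only a guess at the shape of that step: the emoticon table, the emotion names, and the function name are hypothetical, and the actual speech libraries are IBM's and not shown here.

```python
# Hypothetical mapping from a leading emoticon to a voice library name.
EMOTION_BY_EMOTICON = {
    ":)": "happy",
    ">:(": "angry",
    ";)": "sarcastic",
}
DEFAULT_EMOTION = "happy"  # the interview says current databases use happy voices


def pick_voice_library(message):
    """Return (emotion, text), stripping a leading emoticon if one is present."""
    stripped = message.lstrip()
    for emoticon, emotion in EMOTION_BY_EMOTICON.items():
        if stripped.startswith(emoticon):
            return emotion, stripped[len(emoticon):].lstrip()
    return DEFAULT_EMOTION, stripped


print(pick_voice_library(">:( come out of the station"))
print(pick_voice_library("hello there"))
```

The returned emotion would then select which recorded library the engine draws from when speaking the remaining text.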

Kiersten: One of the things I’ve always found that you miss in text is the inflection, you can take a sentence a lot of different ways when it’s written in text which is why there’s always the smileys at the end to indicate I’m just kidding and other indicators.

Monty: Yes that’s where this definitely comes in, it’s those sorts of things that will start to bring it in. And again that starts to impact things like accents and other stuff.

Kiersten: With your positional sound, you’ve got it set up so that all my healers are on my left so I know that’s who’s talking. Is there also relative positional sound? For example, if this person is behind me on the right at the moment, will he sound like he’s behind me on the right in my headphones?

Monty: Attenuation, yes, and not only will he sound like he’s behind you on the right, but he will get fainter as he moves away. And you can change that on the fly. So I walk into Ironforge, I want my attenuation range in, so I get a little buzz around me, but unless someone is very close, I don’t hear anyone distinctly. In a party scenario, I want that range to be large, so in the span of where we’re playing I get the sense that you’re in the distance over here but I can also still hear you clearly. To whisper we pop you into a private channel. It’s all up to the game.
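For readers curious what distance attenuation with an adjustable range might look like, here is a small Python sketch. It assumes a flat 2D map (x east, y north), a compass-style facing angle, linear volume falloff, and simple equal-power stereo panning; all of those choices and all names are the editor's assumptions, not details from Vivox.

```python
import math


def stereo_gains(listener, facing_deg, speaker, audible_range):
    """Return (left, right) gains for a speaker heard by a listener.

    listener, speaker: (x, y) positions; facing_deg: 0 = north, 90 = east.
    audible_range: distance at which the speaker becomes inaudible.
    """
    dx = speaker[0] - listener[0]
    dy = speaker[1] - listener[1]
    distance = math.hypot(dx, dy)
    if distance >= audible_range:
        return 0.0, 0.0
    volume = 1.0 - distance / audible_range  # fainter as the speaker moves away
    if distance == 0:
        pan = 0.0  # same spot: keep the voice centered
    else:
        bearing = math.degrees(math.atan2(dx, dy))           # compass bearing
        pan = math.sin(math.radians(bearing - facing_deg))   # -1 left, +1 right
    angle = (pan + 1.0) * math.pi / 4.0  # equal-power pan law
    return volume * math.cos(angle), volume * math.sin(angle)


# Crowded city: short range, so a speaker 5 units away is already faint.
print(stereo_gains((0, 0), 0, (5, 0), audible_range=6))
# Party play: large range, so the same speaker is clearly audible.
print(stereo_gains((0, 0), 0, (5, 0), audible_range=50))
```

Plain left/right panning like this cannot tell "behind" from "in front"; conveying that would take additional processing that this sketch leaves out.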

Interview Conclusion

Kiersten: How soon are you going to be getting onto cell phones?

Monty: We actually do some stuff on cell phones right now. There’s a phone booth in Second Life right now and you can walk into the phone booth and use the telephone and make any regular phone ring. You can dial in from a regular phone. We’re working with a couple of handset carriers right now and taking the code and putting it onto the handset. So you’re actually native in EVE but I’m not paying minutes or anything like that.

We also do SMS out of this, for example in EVE if a base is under attack, you can get people to log back into the world. We have an IM client as well, and we’re actually able to do things like show that you’re in game, or show that you’re online but not in the game. We’re looking at that and breaking it down by identity, I know you’re online and in this game, and I can talk to you while I’m in game and playing something else, or I’m at work, so a lot of this proxying of identity and massive gaming and all of that. So from the game to the desktop to the cellphone, people can link it all up.

Kiersten: Now something about the cell phone aspect of it, does using and giving your cell phone info to the system in any way compromise your phone number?

Monty: One of the first things we started doing is called Anonymous Caller. So in all of the integrations we do, if you call me on my cell phone, you don’t get to see my number. The system does it, the system knows my number, the system knows who you are and patches me through. You send me an SMS, you don’t know my number, I reply and you STILL don’t know my number. So what we’ve done is mediated in between, so we’ve handled and anonymized the whole thing.
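The mediation Monty describes boils down to the system holding the only copy of the handle-to-number mapping. A minimal Python sketch of that idea follows; the class name, handles, and phone numbers are invented for illustration, and nothing here reflects how the real service talks to carriers.

```python
class AnonymousRelay:
    """Hypothetical mediator: players address each other by handle only."""

    def __init__(self):
        self._number_for = {}  # handle -> real phone number, never shared

    def register(self, handle, phone_number):
        self._number_for[handle] = phone_number

    def send_sms(self, sender, recipient, text):
        """Return (routing, shown).

        routing is what the system uses to deliver the message;
        shown is everything the recipient gets to see.
        """
        routing = {"deliver_to": self._number_for[recipient]}
        shown = {"from": sender, "text": text}
        return routing, shown


relay = AnonymousRelay()
relay.register("kiersten", "+1-555-0199")  # made-up numbers
relay.register("monty", "+1-555-0100")

routing, shown = relay.send_sms("kiersten", "monty", "base under attack")
print(shown)   # contains the sender's handle, not her number
reply_routing, reply_shown = relay.send_sms("monty", "kiersten", "on my way")
print(reply_shown)
```

Because replies go back through the same relay, neither side ever needs the other's number.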

Kiersten: Are there plans to release stand alone Vivox clients for players who’d like the technology in games that are not officially supported? For example, someone like me who doesn’t necessarily play Second Life, but wants to talk to people who play Second Life?

Monty: We can see that happening. I think initially it will be in partnership with games. It’s not slated but I wouldn’t doubt it would happen. We went live with Second Life a month ago and we’re doing three hundred million minutes a month. It’s huge.

Kiersten: Is Second Life the only one that provides visual clues as to who is speaking in game?

Monty: No, EVE does as well; the ship glows red. War Rock also has visual clues, but they all do them slightly differently. War Rock does it based on an indicator over your head or off the screen to show who is speaking now.

Kiersten: For games that are rated T for Teen, are you going to offer any type of language filtering, both in text and in speech?

Monty: No you can’t do it in real time. Instead we’ve got multiple layers of behaviour controls. The first step of behaviour controls is self moderation. I can mute you personally. Now I can then take that up to the next level to moderator controls so the party leader can say ‘hey you’re out of line, you’re out of the channel’. Then we kick that up to what we call reporting, so some of these games that we’re talking to are targeting younger demographics and want to record every channel. Now we only keep a short loop, kind of like TiVo, we record everything then throw it away, but with the reporting we can grab a loop, hold it, and then a GM can read it and say, ‘yeah, you were a jerk, you’re outta here’.
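The "short loop, kind of like TiVo" approach is essentially a fixed-size ring buffer that is only copied out when someone files a report. Here is a minimal Python sketch using the standard library's `collections.deque`; the class and method names are hypothetical, and the "frames" are stand-ins for chunks of audio.

```python
from collections import deque


class LoopRecorder:
    """Hypothetical per-channel recorder that keeps only the latest frames."""

    def __init__(self, max_frames):
        # A deque with maxlen silently drops the oldest frame when full.
        self._loop = deque(maxlen=max_frames)
        self.held = []  # snapshots preserved for a GM to review

    def record(self, frame):
        self._loop.append(frame)

    def report(self):
        """Grab the current loop and hold it so it is not thrown away."""
        snapshot = list(self._loop)
        self.held.append(snapshot)
        return snapshot


recorder = LoopRecorder(max_frames=3)
for frame in ["f1", "f2", "f3", "f4", "f5"]:
    recorder.record(frame)
print(recorder.report())   # only the most recent three frames survive
```

Anything recorded after the report keeps cycling through the loop, while the held snapshot stays untouched.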

Kiersten: You’ve also got your speech to text file, it logs. It’s a big file but it logs it all.

Monty: Yes that’s exactly where we’re headed with that. And the final stage of that and we’re talking to a couple of companies about this, is we’ve developed models on reputation tagging. So we started out with just the essential basic concept of ‘I tag this guy because he curses all the time’ and you tag him because he curses all the time but a third person doesn’t tag him because she doesn’t think he curses too much. And a fourth person also doesn’t think he curses. So when he comes up against the data, it doesn’t show his name as someone who curses, because the tags balance.

Now if everyone says this guy doesn’t curse but I say he does, and I’m the only one who says he curses or my clan is the only one who says he curses, we throw that out because you’re ganging up on him. So there are a lot of ways to integrate that for anti-griefing. With tagging it’s also age filtered. I’ve got kids and I as a parent can say I want to put the age filter on. They’re not talking to anybody more than three years older than they are. So we have a mind numbing list of controls but it should really fit the game. It should be the environment, the atmosphere, the experience that the game devs want.
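The tagging rules Monty outlines (tags that balance out do not stick, and a lone accuser or a single clan gets thrown out) can be written down as a short Python sketch, along with the three-year age filter. The function names, the strict-majority rule, and the way clans are counted as one source are the editor's assumptions; the interview does not give the actual model.

```python
def shows_as_curser(votes, clan_of):
    """Decide whether a player is displayed as someone who curses.

    votes: {voter: True if they tagged the player, False if they did not}
    clan_of: {voter: clan name}; voters missing from it count as independent.
    """
    accusers = [v for v, tagged in votes.items() if tagged]
    defenders = [v for v, tagged in votes.items() if not tagged]
    # Accusers from the same clan count as a single source.
    sources = set()
    for voter in accusers:
        clan = clan_of.get(voter)
        sources.add(("clan", clan) if clan else ("player", voter))
    if len(sources) < 2:
        return False  # one person or one clan ganging up: thrown out
    return len(accusers) > len(defenders)  # balanced tags do not stick


def age_filter_allows(child_age, other_age, max_gap=3):
    """True if the other player is at most max_gap years older than the child."""
    return other_age - child_age <= max_gap


# Two people tag him, two do not: the tags balance.
print(shows_as_curser({"me": True, "you": True, "she": False, "he": False}, {}))
# Three accusers, all from one clan, against one defender: thrown out.
print(shows_as_curser({"a": True, "b": True, "c": True, "d": False},
                      {"a": "reds", "b": "reds", "c": "reds"}))
print(age_filter_allows(10, 13), age_filter_allows(10, 14))
```

A production system would likely weight tags rather than count them, but the two discard rules are the part the interview actually spells out.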

Kiersten: How hard is this going to be to retrofit into a game that’s quite a bit older than, say, EVE, which is only four years old?

Monty: It’s not hard. From all the stuff we’ve seen and all the companies we’ve worked with, core integration is straightforward; the biggest effort ends up being how much UI tweaking they want to do. Other than that, it’s not an issue.

Kiersten: Is there anything you want to cover that maybe we haven’t touched on yet?

Monty: No, I think we’ve hit it all; very thorough. I think that companies are hearing that we can integrate voice well. We’ve heard of poor integrations that the companies have picked up but they don’t use. When you do it right, it actually drives game revenue and that’s huge. Makes my day.

Kiersten: Thank you very much for your time, Monty.