Netflix Prize Concept + Google 411 Data

I’ve really enjoyed watching the Netflix Prize develop. Amazingly, over 3600 teams have submitted a prediction, which makes Netflix the big winner in this contest. The company will undoubtedly end up with a better product due to the amount of interest and research in collaborative filtering they have generated.

But ultimately, better movie recommendations don’t matter a whole lot to me. I’m more interested in the fact that by providing a unique set of data and a prize, they’ve been able to stimulate so much interest. The other day I was thinking about which companies are in a position to sponsor contests in other fields that might have a bigger impact on my life, and one thought jumped into my head: Google’s 411 phoneme collection service. Marissa Mayer says:

You may have heard about our [directory assistance] 1-800-GOOG-411 service. Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model … that we can use for all kinds of different things, including video search.

The speech recognition experts that we have say: If you want us to build a really robust speech model, we need a lot of phonemes, which is a syllable as spoken by a particular voice with a particular intonation. So we need a lot of people talking, saying things so that we can ultimately train off of that.

Presumably, Google has already done the heavy lifting to manually transcribe a large number of these samples so that they can train their own algorithms. Why not create a contest that lets teams submit an algorithm that gets trained on a subset of the data and then tested against the rest? Speech recognition is more complicated than movie recommendations, but making it easy to train and test an algorithm against an interesting number of samples would certainly lower the barrier to entry.
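To make the idea concrete, here is a rough sketch of what a train-on-a-subset, test-on-the-rest harness might look like. Everything in it is my own assumption rather than anything Google has described: the `Sample` shape, the `split_samples` and `grade` helpers, the 80/20 split, and the choice of word error rate (a common speech recognition metric) as the score.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample shape: (raw audio bytes, human transcript).
Sample = Tuple[bytes, str]
Recognizer = Callable[[bytes], str]
# An entry is a function that trains on labeled samples and returns a recognizer.
TrainFn = Callable[[Sequence[Sample]], Recognizer]


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def split_samples(samples: Sequence[Sample], test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def grade(train_fn: TrainFn, samples: Sequence[Sample]) -> float:
    """Train an entry on the training split; return its word error rate
    on the held-out split (lower is better)."""
    train, test = split_samples(samples)
    recognizer = train_fn(train)
    edits = words = 0
    for audio, transcript in test:
        ref = transcript.lower().split()
        # The entry only ever sees the audio for held-out samples.
        hyp = recognizer(audio).lower().split()
        edits += edit_distance(ref, hyp)
        words += len(ref)
    return edits / max(words, 1)
```

A team would only need to supply something shaped like `train_fn`; the sponsor would run `grade` on its own data and publish the score.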

Google would benefit from this in hiring, if nothing else. It would give them a chance to realistically evaluate the work of all kinds of grad students and researchers, and demonstrate to the candidates the advantages of working for the company with the biggest databases.

4 Responses to “Netflix Prize Concept + Google 411 Data”

  • This is an awesome concept. I never really thought about why they would give a service like that away for free when it ends up costing 25 cents or more for normal 411.

  • Netflix was able to anonymize their data (although according to some reports, not very well) but Google can’t do that with the phonemes dataset. So, they would have to release all or a big subset of the data that was presumably not the easiest or cheapest thing to collect.

  • @Abhik: ah, but they don’t have to actually release any of their data. If they allowed teams to submit a program that accepted an audio file and returned a string of text, they could “grade” the submissions without releasing anything.

    They could take this even further and run each submitted program in a training mode on a subset of the data, where the submission is given both the audio and the expected output. All of this could be done internally on Google’s world-famous server farms, likely much faster than teams could manage on their own.

  • I would enjoy a future of more contest-based employment. Put all those mental cycles to good use. Instead of being the 80,000th person to write a Fibonacci routine in Scala, a person could DO something cool in their spare time…
