Blog by Sumana Harihareswara, Changeset founder
Yahoo! Labs Research Presentations, February 2011: Part I
Hi, reader. I wrote this in 2011 and it's now more than five years old. So it may be very out of date; the world, and I, have changed a lot since I wrote it! I'm keeping this up for historical archive purposes, but the me of today may 100% disagree with what I said then. I rarely edit posts after publishing them, but if I do, I usually leave a note in italics to mark the edit and the reason. If this post is particularly offensive or breaches someone's privacy, please contact me.
Ken Schmidt of Academic Relations started off by talking about all those other industry research labs and their SAD DECLINE. Bell/Lucent/Alcatel, AT&T Labs -- he'd seen them wither into irrelevance. (Despite that awesome Bell Labs Innovation song.) But Yahoo! Labs, started in 2005, is evidently hella relevant and vibrant. [Update: see comment below for clarification.]
Schmidt discussed the various groups within the labs, including Advertising Sciences or "AdLabs," which seems new. As he put it, there's a disconnect between the number of dollars spent on ads and how much time people spend clicking on them. (*cough* social media *cough*)
The recruiting bit: you can pick which lab location to work in. They have a research lab in Haifa! And Schmidt assured us that Yahoo is woman-friendly, led by CEO Carol Bartz and featuring 500+ women in Yahoo's women's group. Mostly the same spiel as last year, including Schmidt's reminder of the Key Scientific Challenges program that gives grad students money, secret datasets, and collaboration.
I have the phrase "'billions & billions' of pageviews, etc." in my notes here and assume it's because Schmidt was pointing out how much data Yahoo! folks get to work with, and what a huge impact they make.
Then: student intros! There were about ten of us there, mostly grad students. Their interests ranged through social networks, data mining, spoken and natural language processing, privacy & security, compilers for heterogeneous architectures, and economics.
First presentation: Elad Yom-Tov's and Fernando Diaz's "Out of sight, not out of mind: On the effect of social and physical detachment on information need". Let's take three example events: the San Bruno gas pipeline explosion, the New York City tornado, and the Alaska elections. People might be interested in searching for information about them because they live nearby, or because they have friends who do. So you measure physical and social attachment/detachment: physical distance and number of local friends.
What data did they use? The query log: userID, time, date, text of query, results, what they clicked (with term-matching, inclusion + exclusion). The location data: the ZIP code the user provided during initial registration. I'm skeptical of that but the researchers say it's fairly reliable, although "more than we [would] expect live in Beverly Hills." And the social network: the number of instant messaging contacts you have who live in the relevant location.
There's a strong correlation with physical detachment, an exponential fit. As for the social network: the more local friends you have, the more likely you are to ask Yahoo! about the event. Beyond 5 friends, the data is noisy, and we don't have a lot of data there, but overall it's a very nice fit, strongest with the San Bruno data. And if you're local, you have more local friends, but that isn't strong enough to explain what we see.
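To make the "exponential fit" concrete, here is a minimal sketch of how one might fit such a curve. This is not the researchers' code: the function name `fit_exponential`, the distances, and the query rates are all made up for illustration. It assumes the relationship is rate = a * exp(-b * distance) and fits a straight line through log(rate).

```python
import math

def fit_exponential(distances, rates):
    """Least-squares fit of rate = a * exp(-b * distance).

    Works by fitting a straight line to log(rate) versus distance,
    so every rate must be strictly positive.
    """
    logs = [math.log(r) for r in rates]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_d
    return math.exp(intercept), -slope

# Hypothetical data: fraction of users at each distance who searched for the event.
distances_km = [10, 50, 100, 200, 400]
query_rates = [0.02 * math.exp(-0.01 * d) for d in distances_km]

a, b = fit_exponential(distances_km, query_rates)
print(f"rate ~ {a:.4f} * exp(-{b:.4f} * distance_km)")
```

Real query-log data would be noisy (the talk said as much about the 5+ friends range), so a fit like this would come with residuals worth checking rather than an exact recovery of the parameters.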
The researchers compared people's queries along the social and physical distance axes, looking for unusual phrases -- phrases that trended up around the dates of the incidents. If searchers are physically close to the incident, whether their friends live there or not, they use the same words. Like, for the NYC tornado: queens, picture, storm, city, nyc. If they're physically far away but have friends in New York, they use terms like brooklyn. [Also mentioned: new(s) and york, but that might be a stemming fail.] But a term that people both physically and socially detached from New York City searched for: statue of liberty. Is that grand lady all right after the tornado? America wants to know! (Yes.)
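Here is one simple way to look for terms that "trended up" around an event, as a rough sketch. I don't know what method the researchers actually used; this version just compares how often each term appears in an event-window sample against a baseline sample. The function names, the thresholds, and the example queries are all hypothetical.

```python
from collections import Counter

def term_counts(queries):
    """Count whitespace-separated, lowercased terms across a list of query strings."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts

def trending_terms(baseline_queries, event_queries, min_ratio=2.0, min_count=2):
    """Return terms whose relative frequency rose by at least min_ratio, strongest first."""
    base = term_counts(baseline_queries)
    event = term_counts(event_queries)
    base_total = sum(base.values()) or 1
    event_total = sum(event.values()) or 1
    scored = []
    for term, count in event.items():
        if count < min_count:
            continue  # skip rare terms, which are too noisy to trust
        # Add-one smoothing so terms unseen in the baseline don't divide by zero.
        base_rate = (base[term] + 1) / (base_total + 1)
        event_rate = count / event_total
        ratio = event_rate / base_rate
        if ratio >= min_ratio:
            scored.append((term, ratio))
    return [term for term, _ in sorted(scored, key=lambda item: (-item[1], item[0]))]

# Made-up queries, loosely modeled on the NYC tornado example.
baseline = ["weather nyc", "nyc pizza", "yankees score", "weather forecast"]
during_event = [
    "tornado queens", "tornado picture", "nyc tornado",
    "queens storm", "weather nyc", "queens tornado damage",
]

trending = trending_terms(baseline, during_event)
print(trending)
```

Running the same comparison separately for each group (physically near, socially near, detached from both) would be one way to get the per-group word lists described above.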
(The presenter then spoke about clustering queries by words, since different words signify different informational needs, but this bit had pretty bad graphs and I didn't understand.)
So Yahoo! wants to learn to identify relevant queries. "Pacific Gas & Electric" is a legitimate query that people search for on any given day, but it would be possible to programmatically tell that it has a lift due to current news. So Yahoo! wants to act relevantly. In the PG&E example: most days, the search result is and should be the PG&E homepage, but on that day, a top result or sidebar should be a news item about the explosion. A more abstract way of saying this: "learn a retrieval mode for each detachment level." Using social knowledge gets ranking results that are better than just geotargeting -- the difference is statistically significant -- and better even than combining a person's geographic and social detachment levels. "There's a problem with 'both,' probably because of the way we trained our models."
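As a sketch of what "programmatically tell that it has a lift" could look like: compare today's count for a query against its recent daily history and flag it when it sits far above the usual range. This is my own illustration, not Yahoo!'s method; `has_news_lift`, the threshold of 3 standard deviations, and the counts are all assumptions.

```python
from statistics import mean, stdev

def has_news_lift(daily_counts, today_count, threshold=3.0):
    """Return True if today's count is at least `threshold` standard deviations above the mean.

    daily_counts needs at least two days of history for stdev to be defined.
    """
    baseline = mean(daily_counts)
    spread = stdev(daily_counts)
    if spread == 0:
        # Perfectly flat history: any increase counts as a lift.
        return today_count > baseline
    return (today_count - baseline) / spread >= threshold

# Made-up daily counts for a query like "pacific gas & electric".
history = [980, 1010, 995, 1020, 1005, 990, 1000]

ordinary_day = has_news_lift(history, 1015)
explosion_day = has_news_lift(history, 4200)

# On a lift day the results page might lead with news; otherwise the homepage.
top_result = "news item" if explosion_day else "homepage"
print(ordinary_day, explosion_day, top_result)
```

A detector like this only says that something is up with the query. Deciding which results to show to which detachment level is the part the researchers were training models for.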
Previous studies suggested that the further a news source is from where something is happening, the less likely the news source is to report the event. And Yom-Tov & Diaz's work bore this out. News as a function of distance drops exponentially, as do tweets. Interestingly, when you look at the temporal spike orders (which came first?) on Twitter mentions/searches, the Yahoo! query log, and mainstream news coverage, you see different spike orders for the different events. Sorry, I don't have more details, but I bet the researchers do.
One interesting side effect: we can look at people's queries and infer their location (although you can of course usually also do that with IP addresses). You can even track a hurricane by tracking where the queriers are over time.
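A toy version of the hurricane-tracking idea: take the queries that mention the storm, group them by day, and average the queriers' locations. This is purely a sketch under my own assumptions. It presumes each querier's ZIP code has already been mapped to a latitude and longitude, the name `daily_centroids` and the coordinates are invented, and a plain average of coordinates is only reasonable over a fairly small region.

```python
from collections import defaultdict

def daily_centroids(query_events):
    """Average querier location per day.

    query_events: iterable of (day, latitude, longitude) tuples, one per
    storm-related query. Returns a list of (day, mean_lat, mean_lon)
    sorted by day.
    """
    by_day = defaultdict(list)
    for day, lat, lon in query_events:
        by_day[day].append((lat, lon))
    track = []
    for day in sorted(by_day):
        points = by_day[day]
        mean_lat = sum(p[0] for p in points) / len(points)
        mean_lon = sum(p[1] for p in points) / len(points)
        track.append((day, mean_lat, mean_lon))
    return track

# Made-up storm-related queries: the querier population drifts north-east.
events = [
    ("day 1", 34.0, -77.0),
    ("day 1", 36.0, -76.0),
    ("day 2", 39.0, -75.0),
    ("day 2", 41.0, -73.0),
]

track = daily_centroids(events)
for day, lat, lon in track:
    print(day, lat, lon)
```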
These take a while to write up, so I'll save David Pennock's microecon research overview, Sebastien Lahaie's "Advertiser Preference Modeling in Sponsored Search," and Sharad Goel's confidence calibration project for later posts. [Edited 9 June 2014 to say: no, sorry, I will not be adding Part II.]
16 Mar 2011, 15:32 p.m.