- Install Highlighter for your Chrome browser from the Google store here.
- Highlighter uses your recent browsing history to get to know you and build your Interest Graph based on the content you’ve been reading.
- As you surf the web, Highlighter shows this handy dandy tab when it thinks there are stories on the current site you’re visiting that you’ll like.
- Click the tab to see the whole newsfeed of content Highlighter has picked just for you.
- The more you use it, the smarter Highlighter gets.
As an American male living in Los Angeles, I am legally obligated to feign enthusiasm when baseball is discussed. Unfortunately, the limited extent of my knowledge on the matter can be easily summed up with my go-to baseball talking points:
- Go Dodgers! (or another local team in the event that I move)
- How about them Yankees? (I’m not sure exactly what this means, but it is generally received with thoughtful nods)
- The Giants suck. (I have no idea if this is true, but they are the Dodgers’ rivals and so are subject to some level of local derision).
Based on the success of these points in faking my way through repeated baseball exchanges, rivalries seem to be core to understanding baseball and the motivations of its many devotees. So let’s take a look at a couple of the biggest rivalries in the game: Red Sox versus Yankees, and Dodgers versus Giants.
For this study, interest in a specific baseball team was used to define a set of users across the Gravity personalization network. A total of 2.25 million Dodgers fans, 1.25 million Giants fans, 3.75 million Red Sox fans, and 3.5 million Yankees fans met our selection criteria. Their individual Interest Graphs had previously been created for the purposes of personalization of content; every click on content or advertising within the network is analyzed semantically and augments the nodes and edge weights for the engaged user.
When analyzing audiences, the individual graphs are coalesced into a single graph reflecting the aggregated interests and attachment levels for the entire set. This allows for a holistic view of the audience which can then be compared against another audience or the general population of hundreds of millions of users. In this way, we are able to establish what defines an audience and what makes them special. For our baseball analysis, we’ve created Interest Graphs for each fan base, and have compared some of their more compelling attributes against each other and the global set. Let’s check out some of the highlights.
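For the curious, here is a minimal sketch of what "coalescing" individual graphs into an audience graph could look like. It is an illustration under loud assumptions, not our production method: each Interest Graph is flattened to a plain topic-to-attachment mapping (edges are ignored), attachments are assumed to sit between 0 and 1, and the function name `coalesce_graphs` and the sample users are invented for the example.

```python
from collections import defaultdict


def coalesce_graphs(user_graphs):
    """Average per-topic attachment across users.

    Assumption: a topic missing from a user's graph counts as 0 attachment.
    """
    if not user_graphs:
        return {}
    totals = defaultdict(float)
    for graph in user_graphs:
        for topic, weight in graph.items():
            totals[topic] += weight
    count = len(user_graphs)
    return {topic: total / count for topic, total in totals.items()}


# Hypothetical individual graphs: topic -> attachment in [0, 1].
dodgers_fans = [
    {"Dodgers": 0.9, "Giants": 0.6, "Technology": 0.4},
    {"Dodgers": 0.7, "Health and beauty": 0.5},
]
audience = coalesce_graphs(dodgers_fans)
```

Averaging is only one possible choice. It keeps a handful of very attached users from dominating the audience picture, at the cost of washing out niche interests.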
- Dodgers fans…
- Tend to be the most focused on health and beauty of the four groups.
- They are interested in technology generally, but less focused on specific websites or gadgets.
- They are obsessed with the Giants, but are also more focused on the Yankees than Red Sox fans are.
- Dodgers fans are 80% more likely to love the sparkly Twilight vampires than an average Internet user.
- Giants fans…
- They love technology, and are quite specific about the brands and properties that are important to them. There is a surprising level of interest in Yahoo comparable to that of Google, but Facebook reigns supreme.
- Far outstrip the other teams and the general public in their love of ESPN.
- Are primarily interested in hobbies that require participation rather than spectating.
- Put the other teams to shame with their staggering love of Booze, Cannabis, and Cocaine.
- They are 50% more likely to get all screamy about Justin Bieber than an average Internet user.
- Red Sox fans…
- Are the most socially minded of the four fan bases.
- They have the greatest focus on local and cultural traditions (hurray for St. Paddy’s Day).
- Edge out the competition in their love of beef and video gaming (a very different Saturday night than Giants fans apparently are having).
- Are more than 5 times more likely to put on their Brony gear and watch My Little Pony than an average Internet user.
- Yankees fans…
- Have an unusual Interest Graph composition. While they have a relatively low number of topics in which they are highly interested, the volume of topics they cover is well above normal levels for a cohort of this size. This is likely a function of the general appeal of the Yankees to fans that are more geographically and demographically diverse.
- They lead the pack in their interest in exotic dancers and nude recreation.
Gooooooo, local sports franchise!
Let’s play a game. Say the top five things you’re interested in out loud. GO!
If you’re like me, and the vast majority of people I’ve tried this with over the years, your process went something like this:
- Draw a complete blank
- Repeat “Ummmmm” a few times
- Say some vague words like TV or internet.
Welcome to the wonderful world of explicit user data! Explicit feedback comes in light servings and is corrupted by bashful and boastful self-reflection. It’s a problem we dealt with extensively in the social context at MySpace, and in the personalization space here at Gravity. After reading about Netflix’s experience with explicit data, I thought it might be helpful to dive into some of the things we’ve learned over the years.
While there is infinite variation in the quality of explicitly collected data, problems tend to fall within four broad categories. Let’s go through them.
Blank box syndrome
People are not great at coming up with data about themselves. I am complex, how do I represent myself in a list? It is not uncommon to watch users in testing completely freeze when presented with a blank “Describe yourself” field. These sorts of open-ended questions tend to produce data sets that are quite incomplete (I can only come up with two things off the top of my head) or overly vague (I say “sports” when all I really like is baseball). How you collect explicit data is critical to final quality. Help the user make quality responses.
*Note: Also beware of stale data when dealing with blank boxes. At MySpace we found that you could pretty accurately peg the creation date of a profile by what movies were listed as favorites. People tended to pull whatever was currently in theaters and never update that list again.
Peacocking
Peacocking, the introduction of spurious data as “decoration” to create an idealized public persona, is a byproduct of the socialization of the Internet. As more of our interaction online is visible to our peers, there is a tendency to present the idealized self. This is rampant with data that users volunteer knowing it will be public, like social profiles. I don’t like “The Notebook”, but putting it on my profile sure makes me look sensitive.
Self censorship
This is the flip side of the coin from peacocking. “Party in the USA” is a great song, but I probably won’t be sharing it (for an amazing example check out the last.fm list of songs most deleted from public scrobbles). This is also primarily a problem when dealing with data that will be available to a user’s peers. Where peacocking introduces false data, self censorship prevents what is often critical information from being made available.
The Aspirational Self
While peacocking is the intentional introduction of spurious data for public view, the aspirational self is trickier. Even in entirely private venues, users will often provide data reflecting the person they want to be rather than who they really are. As the Netflix guys put it to Wired: “People rate movies like Schindler’s List high, as opposed to one of the silly comedies I watch, like Hot Tub Time Machine. If you give users recommendations that are all four- or five-star videos, that doesn’t mean they’ll actually want to watch that video on a Wednesday night after a long day at work. Viewing behavior is the most important data we have.”
So there you go. Explicit data has to be carefully managed on both the collection and interpretation fronts or it can easily lead to incorrect conclusions and courses of action.
Pop quiz: You’re driving home from work and suddenly overcome by hunger. Where do you stop for food?
If Taco Bell was the first thing to pop into your head, let us tell you a bit about yourself:
- Consuming content in its myriad forms is a huge part of your day. The NFL, Netflix, Spotify and Reddit are your standard fare.
- Personal finance is actually top of mind if you crave Doritos shell tacos. E*Trade, unemployment and Dow Jones are important topics to you.
- You’re actually 77 percent more likely to be interested in venture capital than people who would have picked Whole Foods for a snack.
Speaking of Whole Foods, if that was your first thought:
- You care about healthy living. You are especially interested in vegetarianism and veganism compared to the general population, and you’re 2.8 times more likely to check ingredients than your Taco Bell-choosing peers.
- You are nearly five times more likely to be interested in alcohol than Taco Bell fans.
- Playing sports is more your bag than watching sports, but if you are going to watch a game, it would most likely be Major League Baseball or hockey.
These insights, as well as tens of thousands of others associated with each audience, are available through Interest Graph analysis. Unlike the social graph (who you know) or retargeting (sites you’ve been to), the Interest Graph quantifies motivations (what you care about and how that is trending). It is the digital representation of what drives the behavior of a single human or audiences of millions, and it’s going to change everything.
Consider our Taco Bell and Whole Foods fans. In this case, a minimal level of interest in a specific food retailer was used to define a set of users across the Gravity personalization network. A total of 150,000 Whole Foods fans and 70,000 Taco Bell aficionados met our selection criteria. Their individual Interest Graphs had previously been created for the purposes of personalization of content; every click on content or advertising within the network is analyzed semantically and augments the nodes and edge weights for the engaged user.
When analyzing audiences, the individual graphs are coalesced into a single graph reflecting the aggregated interests and attachment levels for the entire set. This allows for a holistic view of the audience which can then be compared against another audience or the general population of hundreds of millions of users. In this way, we are able to establish what defines an audience and what makes them special. The results are often quite surprising. Take a look at the infographic to see how these two almost mutually exclusive groups compare.
As an American, I take a certain amount of pride in the high standard of living made possible via an economy built on liquified dinosaurs (sweet sweet fossil fuels). There are some subversive elements in our society, however, that would like their grandchildren to have breathable air and non-poisonous water. They have elected to drive electric cars. While this post is not sponsored by the kind, almost Colonel Sanders-esque, big oil companies, I am sure they would appreciate our attempt to shine a light on these alternative energy deviants. So let us consider the Prius, the Tesla Model S, and their devotees.
In this case, a minimal level of interest in a specific automobile model was used to define a set of users across the Gravity personalization network. A total of 900,000 Prius fans and 183,000 Tesla Model S aficionados met our selection criteria. Their individual Interest Graphs had previously been created for the purposes of personalization of content; every click on content or advertising within the network is analyzed semantically and augments the nodes and edge weights for the engaged user.
When analyzing audiences, the individual graphs are coalesced into a single graph reflecting the aggregated interests and attachment levels for the entire set. This allows for a holistic view of the audience which can then be compared against another audience or the general population of hundreds of millions of users. In this way, we are able to establish what defines an audience and what makes them special. The results are often quite surprising.
Let’s take a look at the Prius cohort highlights.
The first data set is the overall Interest Graph on the left of the infographic. It describes what folks interested in the Prius care most about.
- They really like technology, with a strong emphasis on Apple, Facebook, Twitter, and Bioengineering.
- Among their favorite media are The Atlantic and the New York Times. They are closely following matters related to international security and right-wing politics (not necessarily because they agree with them).
- Perhaps unsurprisingly, they are keenly interested in Eco-friendly subject matters and Social Change.
The second data pivot highlights some of the areas of interest where Prius fans differ substantially from the global set of users. Basically, it’s what makes Prius fans special compared to everyone else in the world. The results are expressed as an over/under index compared to the global set. Some of the results are surprising (this is why we like data mining). Compared with the global set, Prius fans are:
- 1.2 times more likely to enjoy physical exercise
- 2.5 times more likely to like Yoga
- 1.3 times more likely to be interested in McDonald’s
- 9.1 times more likely to dig Tyler Perry
- 13.6 times more likely to love Apple
- Slightly less likely to be interested in NASCAR, Football, and Waffles (their Sundays must be so empty)
- 6.3 times more likely to be concerned about Sustainability
- Just as likely to enjoy Ninjas as everybody else
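If you want to see how an over/under index like the ones above might be computed, here is a rough sketch. It assumes the index is simply the share of the cohort interested in a topic divided by the share of the global set interested in that topic; the function name `over_under_index` and every count below are invented to keep the arithmetic easy to follow, and none of them come from our data.

```python
def over_under_index(cohort_interested, cohort_size, global_interested, global_size):
    """Ratio of the cohort's interest rate to the global interest rate.

    Above 1 means the cohort over-indexes on the topic, below 1 means it
    under-indexes, and exactly 1 means it looks like everybody else.
    """
    cohort_rate = cohort_interested / cohort_size
    global_rate = global_interested / global_size
    return cohort_rate / global_rate


# Invented counts: 25% of a 1,000-person cohort vs. 10% of a 10,000-person global set.
over = over_under_index(250, 1_000, 1_000, 10_000)

# Invented counts: 5% of the cohort vs. 10% of the global set.
under = over_under_index(50, 1_000, 1_000, 10_000)

# Invented counts: identical rates in both groups (the "Ninjas" case).
same = over_under_index(100, 1_000, 1_000, 10_000)
```

Read this way, "2.5 times more likely" is an index of 2.5 and "just as likely" is an index of 1.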
By the time we got the Prius data together we were in a full data mining extravaganza. There was no way we could let the fun stop with just one infographic. How about people who are interested in the Tesla Model S? It’s another electric car. How different could that audience be from the Prius folks?
Turns out that people interested in the Tesla Model S are materially different in their interests than the Prius crowd. Let’s look at some of the highlights of the Model S fans’ aggregated Interest Graph.
- Environmentalism is not a substantial area of interest in the Tesla Model S Interest Graph. This may indicate that Tesla interest is driven by the technological or aspirational aspects of the brand rather than its environmental benefits. This is supported by the fact that, while the Tesla audience is 8.5 times more likely to be interested in fuel efficiency than the general population, Prius fans are 4x more likely to be interested than the Tesla folks (34x general population).
- Business and finance were dominant categories of interest. Interest rates, economics, the economy of Saudi Arabia, and bubbles, both real estate and economic, were high on the list.
- They are thirsty. Maker’s Mark, breweries, and Jim Beam are Tesla aficionado favorites.
- They like things that go. Private transportation, Ferraris, and motorcycles were of high interest.
- Their lifestyle interests are a mixed bag. While family topped the category, Tesla fans do exhibit abnormally high levels of interest in cannabis, erotic dance, strip clubs, and Lululemon. It is unclear from the data available whether the Model S/”big night out” relationship is causal or simply correlated.
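The fuel efficiency comparison in the list above is a nice example of how two indexes against the same baseline can be turned into a cohort-versus-cohort figure. A small sketch, assuming both numbers are measured against the same general population (the helper name `relative_index` is ours for illustration):

```python
def relative_index(a_vs_global, b_vs_global):
    """How much more likely cohort A is than cohort B, given a shared baseline."""
    return a_vs_global / b_vs_global


tesla_fuel_efficiency = 8.5   # Tesla fans vs. the general population (from the text)
prius_fuel_efficiency = 34.0  # Prius fans vs. the general population (from the text)

# Dividing the two indexes cancels out the shared baseline.
prius_vs_tesla = relative_index(prius_fuel_efficiency, tesla_fuel_efficiency)
```

That ratio is the "4x" quoted above: 34 divided by 8.5.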
On the left, you’ll see some of the traits that differentiate Tesla fans from the global set of users. As with the Prius set, the results are expressed as an over/under index compared to the global set. Compared with the global set, Tesla fans are:
- 2.5 times more likely to be wired on coffee
- 90% more likely to be interested in social issues
- 5.2 times more likely to be interested in cannabis
- 4.1 times more likely to like magazines
- 41.8 times more likely to be interested in SpaceX
- 3.8 times more likely to like Evernote
- 10.7 times more likely to dig Marissa Mayer
- 96 times more likely to be concerned about traffic congestion
Let’s take it home with some Tesla versus Prius comparisons.
- Prius fans are 28.1 times more likely to be interested in wearable computers than Tesla fans.
- Prius fans are 3.3 times more likely to be interested in cannabis than the general population, but Tesla fans are 44% more likely than Prius folks to enjoy it.
- Prius fans are 1.62 times more likely to be interested in sustainable transportation than Tesla fans.
- Tesla and Prius fans are both more than twice as likely to be into music as the general population.
I hope you enjoyed these audience analyses as much as we enjoyed doing them. Let us know if you have ideas on data you’d like to see in the future.
I came across a really interesting article the other day on Mashable (Google Knowledge Graph Could Change Search Forever). Google SVP Amit Singhal lays out their efforts around a more semantic understanding of the web leveraging their purchase of Freebase a couple of years ago. The gist is that by leveraging a proprietary Knowledge Graph, Google will be able to return search results based on the meaning of documents rather than simply the presence of particular text strings. It’s a really compelling vision and well worth reading. Personally, I’m terribly excited about the prospect of not only a truly semantic search, but the proliferation of data systems that are backed by large scale ontologies. The power of ontology based semantics is a basic tenet of everything we do at Gravity, and it always feels good to see folks like Google moving in the same direction. For those of you not thoroughly enmeshed in this sort of tech (which is just about everyone), a bit of explanation is probably in order.
What is an ontology?
The simplest way to imagine an ontology is as a graph that shows how things are connected to each other (if you’re already familiar with the nuances of graph theory, RDF, and convergence algos, feel free to skip ahead). Take the example below from our ontology:
This is a small subset of the many things Kobe Bryant is actually connected to. An ontology allows you to not only crawl a page and recognize that “Kobe Bryant” is contained in the text and is an entity of note, but also to imbue that article with additional meaning. Kobe’s presence in a document may be indicative of a web page being conceptually about famous people, basketball, the Lakers, or celebrities who cheat. We can now move past simply understanding what’s on a web page and grasp more concretely what it’s about.
Now that was a single entity in the ontology. Google’s ontology and our own have millions of entities and abstract concepts all interconnected with hundreds of millions of edges. Topics run the gamut from every person of note throughout history to every song ever recorded to diseases of every flavor. I can’t speak for Google’s system, but we maintain various weights on those interconnections (Kobe is more tightly bound to “Los Angeles Lakers Players” than “American expatriates in Italy”). In this way we are able to more easily infer document aboutness.
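To make the weighted-edge idea concrete, here is a toy sketch of inferring aboutness from recognized entities. Everything in it is an assumption for illustration: the two-entity `ONTOLOGY` slice, the edge weights, and the scoring rule (summing edge weights per concept) are made up and are not a description of our system or Google's.

```python
# Hypothetical slice of an ontology: entity -> {concept: edge weight}.
ONTOLOGY = {
    "Kobe Bryant": {
        "Los Angeles Lakers Players": 0.9,
        "Basketball": 0.8,
        "American expatriates in Italy": 0.2,
    },
    "Los Angeles Lakers": {
        "Basketball": 0.9,
        "Los Angeles": 0.6,
    },
}


def infer_aboutness(entities_found, ontology=ONTOLOGY):
    """Sum edge weights from each recognized entity to its connected concepts.

    Returns (concept, score) pairs, strongest first.
    """
    scores = {}
    for entity in entities_found:
        for concept, weight in ontology.get(entity, {}).items():
            scores[concept] = scores.get(concept, 0.0) + weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A document mentioning both entities leans toward "Basketball".
ranked = infer_aboutness(["Kobe Bryant", "Los Angeles Lakers"])
```

Because both entities point at "Basketball", that concept ends up on top, while the weakly bound "American expatriates in Italy" sinks to the bottom.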
What’s the point?
Per Mr. Singhal, Google is applying this semantic understanding of content to search. Would you like results about Kobe as a basketball player, or would you rather see pertinent celebrity gossip? The ontology allows Google and the user to make that distinction when applied to the set of content that includes Kobe as a component. You can also introduce any number of semantically proximate suggestions to searchers. Searchers for “surfing” could easily be presented with the opportunity to explore relevant results for the more abstract “water sports” or the more specific “longboards”. With an ontology we can place topics in their proper context within the set of everything else that exists.
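The "surfing" example can be sketched with nothing more than a table of broader-concept links. This is a deliberately tiny illustration: the `BROADER` table and the `suggestions` helper are hypothetical, and a real ontology would allow many parents per topic and carry weights on each link.

```python
# Hypothetical hierarchy: topic -> its broader concept.
BROADER = {
    "surfing": "water sports",
    "water polo": "water sports",
    "longboards": "surfing",
}


def suggestions(topic, broader=BROADER):
    """Return the more abstract and more specific neighbors of a topic."""
    narrower = sorted(child for child, parent in broader.items() if parent == topic)
    return {"broader": broader.get(topic), "narrower": narrower}


surfing_suggestions = suggestions("surfing")
```

Here `surfing_suggestions` offers "water sports" as the broader option and "longboards" as the narrower one.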
We leverage similar technology to a very different end. By understanding what every article is actually about, we can consider what pages you engage with to build a holistic picture of those topics and concepts that actually matter to you (your Interest Graph). That then can be used to present you with content, ads, and other people that you’ll probably enjoy (see a lot more about that here).
For those of you that are just discovering ontologies, I hope this was a helpful introduction. If you’re in the space, we always love talking shop. Drop us a line.
Gravity was born under interesting circumstances. Amit, Jim, and I had joined MySpace early on, and by the end of 2008 were running the business, tech, and product initiatives respectively (at a time when that was a good thing to run). The three of us had been operating as a team for years and always knew that we’d start a company together at some point. The real question was what to build.
Social was the obvious choice given our backgrounds (we’d gotten on the social train at a time when you had folks willing to violently argue with you that no one would ever put their picture online). But by the end of 2008 it was pretty clear that social was on a fairly well established trajectory, and, to a certain degree, a solved problem. Sure, the particulars were still in flux and market dominance was very much in dispute, but the web of “us” was no longer uncharted territory. The foundational behavioral frameworks were all in place. So if the problem of “us” had been tackled, what other ridiculously audacious project could we tackle?
I’m not sure exactly which of us suggested it, but the idea of personalizing the whole web for every user came up and seemed appropriately audacious. We founded Gravity, and here we are bringing that dream to fruition with our first implementations. I’m reminded of those early MySpace days I spent explaining to every major web company that social networking was going to change everything and getting only blank stares in return. So let me say something along those lines about personalization. Personalization is not a feature; it is an infrastructure. The power of the social web isn’t widgets or share buttons, it is the ability to see the world through the lens of your friends. The power of the personalized web is not about recommendations, it is the ability to see the web through a lens that is as utterly unique as you are.
All of that being said, it turns out that personalizing the web is pretty tricky, and not simply from a technical execution perspective. Rather, one of the biggest hurdles of the endeavor is pinning down exactly what is entailed in “personalization.” What qualifies one particular match candidate (piece of content, potential friend, ad, etc.) as a better personalization result than another? Having spent a healthy chunk of time thinking about exactly that problem, we have some thoughts to share.
The Gravity Approach
After a lot of meditation and a number of failed attempts, we’ve settled on what we believe to be the right way to go about personalization. Our method relies on a number of signals to value an object’s inherent worth and then combines that with a holistic picture of a user to render a set of personalized results that should optimize for user happiness. That’s a mouthful, so we’ll break it down.
The Interest Graph
The foundational component of our system is the Interest Graph. This is a digital representation of the things you care about and the relative levels of attachment to those things. I, for instance, am very attached to surfing, start ups, and parenting. I’m only moderately interested in poodles, iPhone apps, and 3D printing. Not to be confused with simple behavioral targeting that puts me in binary interest buckets, the Interest Graph has attachment gradients, and a memory that allows for calculation of trajectories and trends. Interests wax and wane (looking at you, LA Gear fans), and properly projecting patterns at the individual or aggregate levels can be very useful.
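As a rough picture of what "attachment gradients and a memory" might mean in practice, here is a small sketch. It assumes attachment is a number between 0 and 1 recorded at a given day, and it measures trend as a simple change per day between the first and last observation. The `Interest` class and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Interest:
    """One topic in a user's graph, with a history of attachment levels."""

    topic: str
    history: list = field(default_factory=list)  # (day, attachment) pairs

    def record(self, day, attachment):
        self.history.append((day, attachment))

    def current(self):
        """Most recently recorded attachment, or 0.0 if nothing was recorded."""
        return self.history[-1][1] if self.history else 0.0

    def trend(self):
        """Change in attachment per day between the first and last observation."""
        if len(self.history) < 2:
            return 0.0
        (first_day, first_level), (last_day, last_level) = self.history[0], self.history[-1]
        if last_day == first_day:
            return 0.0
        return (last_level - first_level) / (last_day - first_day)


surfing = Interest("surfing")
surfing.record(day=0, attachment=0.8)
surfing.record(day=30, attachment=0.9)

la_gear = Interest("LA Gear")
la_gear.record(day=0, attachment=0.6)
la_gear.record(day=30, attachment=0.3)
```

A binary interest bucket would treat both topics the same. With a history, `surfing.trend()` comes out positive and `la_gear.trend()` negative, which is the waxing and waning described above.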
Building the Interest Graph can be done in a few ways. It can be explicitly volunteered by a user (What are you interested in?). It can be implicitly derived (What are you reading on my site?). It can be inferred from the things you say (Connect your Facebook/Twitter and let’s have a look at what you’ve been liking, status’ing, and tweeting). Really, any signal of user interest can be employed to increase or decrease a user’s attachment to any topic under the sun. And if you handle your ontologies correctly, you can infer attachment to the larger related concepts (Love the Lakers? Here’s what else is hot in the NBA…).
This approach seems simple enough, but, of course, you have to be able to derive the essential meaning of the things with which a user interacts in order to be able to imbue a user with an attachment to the appropriate interests. This is the hard-core semantic science of what we do, and well beyond this simple product guy. I’ll leave that to the tech gang to explain more competently in another post.
Learn More about Gravity’s Technology here.
Let’s review. To calculate the Interest Graph for any human:
- Understand the objects they create or interact with
- Divine the meaning of those objects
- Modify their attachment to those meanings based on the type of behavior over time
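The third step in the list above can be sketched in a few lines. This assumes the first two steps are already done, so the topic arrives as a ready-made string. The behavior names, the signal strengths, the single-parent `PARENTS` table, and the damping factor are all invented for illustration, and time decay is left out to keep the example short.

```python
# Hypothetical signal strengths per behavior type.
SIGNAL_WEIGHTS = {"read": 0.1, "share": 0.3, "dismiss": -0.2}

# Hypothetical ontology links: topic -> larger related concept.
PARENTS = {"Los Angeles Lakers": "NBA"}

# Assumption: a parent concept receives half of the signal its child receives.
PARENT_DAMPING = 0.5


def _bump(graph, topic, delta):
    """Adjust attachment for a topic, clamped to the range [0, 1]."""
    graph[topic] = min(1.0, max(0.0, graph.get(topic, 0.0) + delta))


def apply_signal(graph, topic, behavior):
    """Raise or lower attachment to a topic and, more weakly, to its parent."""
    delta = SIGNAL_WEIGHTS[behavior]
    _bump(graph, topic, delta)
    parent = PARENTS.get(topic)
    if parent is not None:
        _bump(graph, parent, delta * PARENT_DAMPING)
    return graph


my_graph = {}
apply_signal(my_graph, "Los Angeles Lakers", "read")
apply_signal(my_graph, "Los Angeles Lakers", "share")
```

After one read and one share, the user is attached to the Lakers and, to a lesser degree, to the NBA, which is what makes the "here’s what else is hot in the NBA" suggestion possible.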
Great, now we have Interest Graphs. Hurray! Hold your horses, little buckaroo. Having an Interest Graph is like having a map: it tells you where to go, but you still have to get there. Cue the section on personalization.
Discovery, executed correctly, is a beautiful thing. The books you didn’t intend to buy, the people you didn’t set out to meet, these are the serendipitous discoveries that add color to our lives. This is the ultimate goal of personalization, to show you the things you’ll love that you didn’t know you should be looking for (all needles, no haystacks). To accomplish this goal, you have to consider a pretty broad set of signals. Together, they produce a composite score indicative of correctness for a particular user. Here’s what we consider:
Interest Graph Proximity
Remember our process for calculating a human’s Interest Graph? We do a similar process for every content object. Comparing every user to every object, we can confidently say that a particular object is closely relevant to this person’s interests. The results are actually very good and exceedingly relevant. The problem with deploying a solution using solely this approach is the lack of serendipity. It’s predictable and, to a certain degree, boring. Read a lot about Apple? Here’s more Apple. Mostly reading about iPhones from that set? Now it’s mostly iPhone. The process tends to winnow results to an unacceptable level of specificity over time. It’s almost like having a set of saved searches that slowly morph based on their own self-referential activity. This was one of our early learnings leveraging the Interest Graph, and one we took to heart. Truly excellent personalization must be something more than this. Enter content value as a tunable serendipity measure.
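One simple way to picture Interest Graph proximity is as a similarity score between two topic-weight mappings. The sketch below uses cosine similarity, which is our stand-in for illustration and not a statement of how the production comparison works. The `proximity` function and the sample graphs are hypothetical.

```python
import math


def proximity(user_graph, object_graph):
    """Cosine similarity between two topic -> weight mappings (0.0 if either is empty)."""
    shared = set(user_graph) & set(object_graph)
    dot = sum(user_graph[topic] * object_graph[topic] for topic in shared)
    user_norm = math.sqrt(sum(weight * weight for weight in user_graph.values()))
    object_norm = math.sqrt(sum(weight * weight for weight in object_graph.values()))
    if user_norm == 0 or object_norm == 0:
        return 0.0
    return dot / (user_norm * object_norm)


# Hypothetical reader who mostly reads about Apple and iPhones.
reader = {"Apple": 0.9, "iPhone": 0.7}
iphone_article = {"iPhone": 1.0}
tsunami_article = {"Tsunami": 1.0}

iphone_score = proximity(reader, iphone_article)
tsunami_score = proximity(reader, tsunami_article)
```

The tsunami article scores zero for this reader, which is exactly the serendipity problem described above: proximity alone will never surface it.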
If you can effectively determine the inherent value of a content object, this value can be combined with Interest Graph proximity. Together they give you a set of content that is relevant to your interests and serendipitously important. The set of things that you both want to see and ought to see. Not particularly interested in tsunamis? Doesn’t mean that you won’t be when they happen. So how do you determine the value of an object? A few vectors are considered:
- Editorial weight – There are people out there paid to know what is important. Call them tastemakers, pundits, or editors, their opinions matter. Recognizing and weighting their guidance can strongly indicate an object’s importance.
- Virality – Every share, tweet, digg, link, and search is an indication of collective interest in an object or its associated semantic topic. Where once there was only the linking behavior of webmasters, the universe of user generated content has enabled each of us to indicate what links matter within the superset. Monitoring the public streams and meta data provides pure signal of the things that matter right now. The Twitter firehose, among others, is a great mechanism for teasing the gold from the stream if you know how to properly parse the vastness that these data sets represent. We combine all of these signals into our virality calculations.
- Interaction feedback – What happens when an object is presented in a personalization context? Even when properly targeted based on the combined graph proximity and content value, some content just falls flat while others unexpectedly surge. Constant tuning based on the interaction of users with the targeted content optimizes the results for everyone.
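Here is a rough sketch of how the three value vectors above might be blended and then combined with Interest Graph proximity. The weights, the `serendipity` dial, and the linear blend are all assumptions for illustration; the real scoring has many more moving parts, and every input is assumed to be normalized to the range 0 to 1.

```python
def content_value(editorial, virality, feedback, weights=(0.4, 0.4, 0.2)):
    """Blend the three value vectors into one score (weights are hypothetical)."""
    editorial_weight, virality_weight, feedback_weight = weights
    return (
        editorial_weight * editorial
        + virality_weight * virality
        + feedback_weight * feedback
    )


def composite_score(proximity, value, serendipity=0.3):
    """Mix relevance with inherent value.

    serendipity=0 ranks purely on Interest Graph proximity;
    serendipity=1 ranks purely on content value.
    """
    return (1 - serendipity) * proximity + serendipity * value


# Hypothetical candidates for a reader who mostly follows Apple news.
iphone = {"proximity": 0.9, "value": content_value(0.2, 0.2, 0.2)}
tsunami = {"proximity": 0.05, "value": content_value(1.0, 1.0, 0.75)}


def rank(serendipity):
    """Order the two candidates by composite score, best first."""
    scored = {
        "iphone": composite_score(iphone["proximity"], iphone["value"], serendipity),
        "tsunami": composite_score(tsunami["proximity"], tsunami["value"], serendipity),
    }
    return sorted(scored, key=scored.get, reverse=True)
```

With the dial at zero, `rank(0.0)` puts the iPhone story first. Turn it up to `rank(0.6)` and the tsunami story wins, even though this reader has never shown any interest in tsunamis.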
See what Gravity personalization looks like here.
So where does all this take us? We imagine a web where every experience is personal, viewed through the lens of my own interests with a healthy dollop of serendipity on top. Where not only the presentation of content is informed by my Interest Graph, but the production of content is informed by our collective interests. Editors are not replaced, but rather they operate with a level of transparency and sophistication previously unheard of. Where each of us is able to exorcise the noise from our view and focus only on the gems scattered across the web. It won’t be easy and it won’t be fast, but that’s the future as we see it.