The words people use on social media can reveal hidden meaning to those who know where to look.
Linguists have long been fascinated by this notion, connecting a person’s words to age, gender, even socioeconomic status. Now computer scientists from the University of Pennsylvania and elsewhere have gone a step further, linking the online behavior of more than 5,000 Twitter users to their income bracket. They published their results in the journal PLOS ONE.
Daniel Preotiuc-Pietro, a post-doctoral researcher in Penn’s Positive Psychology Center in the School of Arts & Sciences, led the research, collaborating with Svitlana Volkova of Johns Hopkins University, Vasileios Lampos and Nikolaos Aletras of University College London and Yoram Bachrach of Microsoft Research.
The team took the opposite approach from the one psychologists and linguists have historically used: Rather than asking direct questions, the scientists looked at participants’ social media posts, which are often full of intimate details despite the lack of privacy these outlets afford. Researchers from Penn’s World Well-Being Project, of which Preotiuc-Pietro is a part, are curious about social media as a research tool that can support, or even replace, expensive, limited and potentially biased surveying.
For this experiment, the researchers started by looking at Twitter users’ self-described occupations.
In the United Kingdom, a job code system sorts occupations into nine classes. Using that hierarchy, the researchers determined the average income for each code, then sought a representative sample of users from each. After manually removing ambiguous profiles (for example, listings that referenced the film Coal Miner’s Daughter and had been grouped under “coal miner” as a profession), the team ended up with 5,191 Twitter users and more than 10 million tweets to analyze.
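To make the sampling step concrete, here is a minimal Python sketch of one plausible reading of it: group users by job code, draw a fixed number from each, and attach that code's average income. It is not the researchers' actual procedure. The function name `sample_users`, the user IDs, the three job codes shown and the income figures are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical job codes and mean incomes (illustrative numbers only,
# not taken from the study or from any official statistics).
MEAN_INCOME_BY_CODE = {1: 42000, 5: 24000, 9: 15000}


def sample_users(users, per_code, seed=0):
    """Pick up to `per_code` users from each job code and attach
    that code's mean income to every selected user."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_code = defaultdict(list)
    for user_id, code in users:
        by_code[code].append(user_id)

    sample = []
    for code in sorted(by_code):
        ids = by_code[code]
        chosen = rng.sample(ids, min(per_code, len(ids)))
        sample.extend((uid, code, MEAN_INCOME_BY_CODE[code]) for uid in chosen)
    return sample


# Toy input: (user id, job code) pairs.
users = [("u1", 1), ("u2", 1), ("u3", 1), ("u4", 5), ("u5", 9), ("u6", 9)]
sample = sample_users(users, per_code=2)
```

Every user in `sample` carries the average income of their job code rather than a self-reported salary, which mirrors how the article describes income being assigned.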
“It’s the largest dataset of its kind for this type of research,” said Preotiuc-Pietro. “The dataset enabled us to do something no one has really done before.”
From there, they created a statistical natural language processing algorithm that identified the words people in each job-code class use distinctively. Most people tend to use the same or similar words, so the algorithm’s job was to “understand” which were most predictive for each class. Humans then analyzed these word groupings and assigned them qualitative labels.
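The article does not say which statistical model the team used. As a rough illustration of what finding words that are distinctive for a class can mean, the Python sketch below ranks words by a smoothed log-odds ratio: how much more likely a word is in one class's tweets than in everyone else's. This is a swapped-in technique, not the authors' algorithm, and the function name `distinctive_words`, the class labels and the toy tweets are all hypothetical.

```python
import math
from collections import Counter


def distinctive_words(tweets_by_class, top_n=3, smoothing=1.0):
    """Rank words by how much more often one class uses them than the rest,
    using a smoothed log-odds ratio (an assumed, illustrative scoring rule)."""
    # Word counts per class, using simple lowercase whitespace tokenization.
    counts = {
        cls: Counter(w for tweet in tweets for w in tweet.lower().split())
        for cls, tweets in tweets_by_class.items()
    }
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    vocab_size = len(total)
    total_tokens = sum(total.values())

    result = {}
    for cls, cnt in counts.items():
        n_in = sum(cnt.values())      # tokens written by this class
        n_out = total_tokens - n_in   # tokens written by everyone else
        scores = {}
        for word, all_count in total.items():
            in_count = cnt[word]
            out_count = all_count - in_count
            # Additive smoothing keeps unseen words from producing log(0).
            p_in = (in_count + smoothing) / (n_in + smoothing * vocab_size)
            p_out = (out_count + smoothing) / (n_out + smoothing * vocab_size)
            scores[word] = math.log(p_in / p_out)
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cls] = ranked[:top_n]
    return result


# Toy data, invented for illustration only.
toy = {
    "class_a": ["board meeting today", "policy meeting notes"],
    "class_b": ["lol see you today", "lol so tired"],
}
top_words = distinctive_words(toy, top_n=1)
```

On this toy input, a word both groups use, such as "today", scores near zero for each class, while a word only one group repeats rises to the top of that group's list. That is the sense in which shared vocabulary is uninformative and only the distinctive words are predictive.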
Some of the results validated what’s already known: for instance, that a person’s words can reveal age and gender, and that these are tied to income. But Preotiuc-Pietro said there were also some surprises. For example, those who earn more tend to express more fear and anger on Twitter, and users perceived as optimists have a lower mean income. Text from those in lower income brackets includes more swear words, whereas those in higher brackets more frequently discuss politics, corporations and the nonprofit world.
Aletras noted an overall picture that emerged about Twitter use.
“Lower-income users or those of a lower socioeconomic status use Twitter more as a communication means among themselves,” he said. “High-income people use it more to disseminate news, and they use it more professionally than personally.”
Strong correlations like these, between what the researchers describe as online expression and offline demographics — for example, occupation grouping or income level — also proved intriguing, Lampos added. “This work attempts to highlight some of the potential causal factors in these relationships.”
Such findings will act as a baseline for future work, some of which will investigate how perceptions about user income align with reality.