Demographic representativeness of Twitter data: Is it valid for public health surveillance?
Objective: Twitter data have frequently been used for public health purposes such as surveillance. Compared with traditional data sources, Twitter data are less expensive, can support spatio-temporal analysis, can track population health outcomes, outbreaks, and health behaviors and attitudes, and can be collected in real time. However, surveillance requires three basic kinds of information for analysis: person, place, and time. Twitter can provide the place and time data, but many have questioned the availability of the "person" data and the demographic representativeness of the data.
Content: Missing information on gender, age, and socioeconomic status is the main reason for skepticism about Twitter data as a surveillance tool. Advanced methods in machine learning and artificial intelligence could address this skepticism. The M3 model (multimodal, multilingual, and multi-attribute), which combines the profile image, username, screen name, and biography of Twitter users, predicted gender with 91.8% accuracy. A language detection method that matches words in Twitter posts against a lexicon from the World Well-Being Project (WWBP), combined with metadata such as number of followers, number of friends, and tweeting frequency, predicted age groups with 74% accuracy. Social network analysis of Twitter users found that users tend to show homophily when selecting friends. Combining social network analysis with other data (language analysis and user profiles) and machine learning methods could predict users' occupational class (52% accuracy) and income. Twitter, as a big data source, has potential for health and disease surveillance. As long as the demographic profiles of Twitter data are reported, the data will be valid for surveillance purposes. Furthermore, the growth of advanced methods in artificial intelligence may diminish current skepticism about the usability of Twitter for surveillance and other public health activities.
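As an illustration of what reporting the demographic profile of a Twitter sample might involve, the following is a minimal sketch, not the procedure of any study cited here. It assumes that an upstream classifier has already assigned each user a predicted age group, and it compares the resulting sample shares with reference population shares. The function names, the age group labels, and all numbers are hypothetical placeholders chosen for illustration only.

```python
from collections import Counter


def demographic_profile(labels):
    """Return the share of each category among the predicted labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def representation_ratio(sample_profile, reference_profile):
    """Sample share divided by reference share (1.0 means proportional)."""
    return {
        category: sample_profile.get(category, 0.0) / share
        for category, share in reference_profile.items()
    }


# Hypothetical predicted age groups for ten users (illustrative values only).
predicted_age_groups = ["18-29"] * 5 + ["30-49"] * 3 + ["50+"] * 2

# Hypothetical reference population shares (placeholders, not census figures).
reference_shares = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}

profile = demographic_profile(predicted_age_groups)
ratios = representation_ratio(profile, reference_shares)

for group in reference_shares:
    print(f"{group}: sample share {profile.get(group, 0.0):.2f}, "
          f"ratio to reference {ratios[group]:.2f}")
```

A ratio above 1.0 would suggest that a group is over-represented in the sample relative to the reference population, and a ratio below 1.0 that it is under-represented. Because the labels are themselves predictions, any such profile also inherits the classifier's error rate.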