
How social scientists elicit truthful responses on sensitive matters

Understanding women’s healthcare-seeking involves examining attitudes, behaviors, and preferences. In traditional surveys with direct questions, respondents may choose to conceal their true opinions. Researchers employ various techniques to address this issue.


In recent blogs on The Care Gap, largely focused on the demand side of women’s healthcare, we have explored how decisions about seeking healthcare for women are made within households. Various factors come into play – gendered norms and their internalization by women, male backlash against women’s empowerment, the bargaining power of different household members, religious beliefs, stigmas and taboos, and so on. However, these are sensitive topics, and it is not straightforward to obtain honest responses through traditional surveys with direct questioning. Respondents may answer according to what they believe others would view favorably rather than on the basis of their actual beliefs, or they may simply choose not to respond. In this blog, we focus on “social desirability bias” in social science research, and some of the key ways in which researchers seek to measure attitudes and preferences accurately.

 

Social desirability bias

Based on a review of over 30 years of research on sensitive issues, Tourangeau and Yan (2007) demonstrate that misreporting is a major source of bias in surveys: only about half of abortions performed (according to clinic data) are reported in the National Survey of Family Growth, while over 20% of non-voters (according to voting records) reported in surveys of the American National Election Studies that they had voted. There are similar patterns of underreporting for teen smoking, drug use, alcohol consumption, criminal behavior, and racist attitudes, and of overreporting for church attendance, exercise, and library card ownership. Importantly, the misreporting is systematic rather than random – for instance, very few people who test negative for drug use claim otherwise. Further, errors run mainly in the socially desirable direction rather than in both directions, as would occur if they stemmed from memory or cognition.

 

Untrue answers combined with non-response can undermine the credibility of self-reported measures used in empirical research, leading to inaccurate inferences. Researchers have developed methods to address “sensitivity bias”1, but each such approach comes with costs in terms of development and testing, survey duration, and statistical power. Researchers ought to determine whether sensitivity bias is likely to be a problem, and if so, how best to address it. Blair et al. (2020) outline four necessary conditions for sensitivity bias: (i) a social referent that the respondent has in mind when considering their response to a survey question – one or more people, organizations, or the respondent themself; (ii) respondent perception that the social referent can infer their response, either exactly or approximately; (iii) respondent perception about what response (or non-response) the social referent prefers; and (iv) respondent perception that failing to provide the response preferred by the social referent would entail costs to themselves or to other individuals/groups – costs that may be social (embarrassment), monetary (fines), or physical (jail time or personal violence).

 

List experiments

List experiments, or item count techniques, are widely used to address social desirability bias in research. Blair and Imai (2012) explain the method by describing the 1991 National Race and Politics Survey – the first use of list experiments in political science. The sample of respondents is randomly allocated to ‘treatment’ (subject to the intervention) and ‘control’ (not subject to the intervention) groups. A list of items is read out to the control group, noting that these are things that may evoke anger or other negative emotions, such as “large corporations polluting the environment”. Respondents are asked to think about which statements upset them, and then to count those statements and report only the total number to the interviewer. The same process is followed for the treatment group, except that the sensitive item of interest – for example, a black family moving next door to you – is added to the list. The difference in the average responses between the two groups is used to estimate the proportion of respondents who are upset by the sensitive item.2 However, within this basic application, it is not possible to assess the link between respondent characteristics and their attitude towards the sensitive item in question (Gingerich 2012).
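To make the estimator concrete, here is a minimal sketch of the difference-in-means calculation in Python; the data and variable names are illustrative, not taken from the 1991 survey:

```python
import numpy as np

# Illustrative (made-up) item counts: each entry is the number of upsetting
# items one respondent reported. The control list has J non-sensitive items;
# the treatment list has the same J items plus the sensitive one.
control_counts = np.array([1, 2, 0, 3, 2, 1, 2, 1])
treatment_counts = np.array([2, 3, 1, 3, 2, 2, 3, 1])

# Difference in mean counts estimates the share of respondents upset by
# the sensitive item.
estimate = treatment_counts.mean() - control_counts.mean()

# Standard error of the difference in means (independent samples).
se = np.sqrt(treatment_counts.var(ddof=1) / len(treatment_counts)
             + control_counts.var(ddof=1) / len(control_counts))

print(f"Estimated prevalence: {estimate:.2f} (SE {se:.2f})")
```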

 

For list experiments to work as intended, two assumptions must be satisfied (Blair and Imai 2012). The first is that respondents do not lie about the sensitive item when aggregating their responses; the second is that there are no design effects, that is, adding the sensitive item to the list does not change the way respondents engage with the other items. Another aspect of design that warrants attention is “floor” and “ceiling” effects. These arise when respondents agree with none of the statements or with all of them, thereby revealing their attitude towards the sensitive item of interest (Glynn 2013).

 

Kramon and Weghorst (2019) also draw attention to some limitations of list experiments, and recommend simple modifications to improve their performance. While list experiments may reduce response bias, they can introduce other forms of error on account of the length and complexity of the question format. In their Kenya study measuring support for political violence, they observe that responses are inconsistent across direct questions and list experiments even for non-sensitive items like daily activities – particularly among less-educated respondents. Further, they find lower rates of support for political violence under list experiments than under direct questioning, suggesting that the difficulty of the technique causes it to break down. The researchers improve the performance of list experiments by giving respondents a way (say, a whiteboard that can be erased at the end of the exercise) to privately tick off each item and aggregate with greater ease. Another modification, especially helpful for respondents who are not literate, is to provide visual aids for participants to look at while the enumerator reads out the list of items.

 

Blair et al. (2020) advise that the choice between list experiments and direct questions ought to be made based on the trade-off between the magnitude of sensitivity bias and the loss of precision that comes with list experiments. Under typical conditions, list experiments are about 14 times ‘noisier’ than direct questions. If the bias is large enough to warrant a list experiment, attempts should be made to have a large sample of respondents for better estimation. 

 

In empirical analysis generally, employing multiple measurement strategies helps enhance the credibility of conclusions. This is even more important when using indirect techniques to measure sensitive attitudes given that, by design, these elicit less information and, as discussed, measurement errors may creep in. In their Afghanistan study, Blair et al. (2014) demonstrate that endorsement experiments reveal patterns similar to list experiments, even in challenging research environments. They measure civilian attitudes towards the NATO-led mission – an organization spending millions in the country, but one viewed by many locals as an occupying army. On the one hand, civilians have incentives to continue receiving its support; on the other, they risk backlash from neighbors or insurgent organizations. Under the endorsement experiment, control group respondents are asked to rate their level of support for particular policies. The same is done for the treatment group, except that the policies are said to be endorsed by an actor of interest. The difference in responses between the two groups is interpreted as support, or lack thereof, for the actor.
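The estimation logic mirrors that of the list experiment. A minimal sketch with made-up ratings (not data from the Afghanistan study):

```python
import numpy as np

# Support for a policy on a 1-5 scale. Treatment respondents heard the same
# policy described as endorsed by the actor of interest (illustrative data).
control_ratings = np.array([4, 3, 5, 2, 4, 3, 4, 2])
treatment_ratings = np.array([3, 2, 4, 2, 3, 2, 3, 2])

# A positive difference suggests the endorsement raises support (the actor is
# viewed favorably); a negative difference suggests the opposite.
endorsement_effect = treatment_ratings.mean() - control_ratings.mean()
print(f"Endorsement effect: {endorsement_effect:+.2f} points on the support scale")
```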

 

Lab in the field

Also known as artefactual, framed, or extra-lab experiments, lab-in-the-field experiments harness the benefits of randomization in an environment that captures important features of the real world (Gangadharan et al. 2022). These may be utilized in contexts like discrimination, dishonesty, or fairness, and can complement data from field surveys. These experiments are less likely than traditional surveys to elicit socially desirable behavior because the participants may not know the true aim of the experiment. Further, the design often includes financial incentives associated with the choices made by the participants. This helps link choices within the experimental setting to behavior in the real world as the latter often involves payoff consequences. Monetary payments also encourage participants to take their decision seriously and consider it relevant, making them experience real emotions. These factors make the participants more likely to reveal their true preferences than in the case of traditional surveys.

 

For instance, Krupka and Weber (2013) use a coordination game to measure social norms. Participants are asked to guess what they think others believe is socially appropriate in a given context – say, whether it is socially appropriate for men to be homemakers. This works because subjects do not have to reveal what they themselves believe. The researchers can also vary the reference group, allowing them to elicit, for example, what respondents think females believe as compared to what males believe.
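A minimal sketch of how such elicited beliefs might be tabulated by reference group; the rating scale, the data, and the detail that payment rewards matching the modal answer are illustrative assumptions rather than material from the original paper:

```python
from collections import Counter

# Guesses about what a reference group considers socially appropriate for
# "men being homemakers", on a 4-point appropriateness scale (illustrative data).
guesses = {
    "what respondents think women believe": [
        "somewhat appropriate", "very appropriate",
        "somewhat appropriate", "somewhat inappropriate",
    ],
    "what respondents think men believe": [
        "somewhat inappropriate", "very inappropriate",
        "somewhat inappropriate", "somewhat appropriate",
    ],
}

# The modal guess per reference group summarizes the perceived norm; in an
# incentivized version, guesses that match the mode are rewarded, which is
# what makes this a coordination game.
for group, answers in guesses.items():
    modal, count = Counter(answers).most_common(1)[0]
    print(f"{group}: perceived norm = '{modal}' ({count}/{len(answers)})")
```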

 

However, certain factors are to be taken into account when extrapolating lab results to the real world. Levitt and List (2007) provide a checklist: (i) The fact that subjects in lab experiments are aware that their behavior is being monitored, recorded and analyzed, may exaggerate the importance of pro-social behaviors relative to environments without such scrutiny. (ii) The experimenter is able to link choices to the identity of the subject and this lack of anonymity may be associated with greater pro-social behavior. (iii) A complex set of contextual factors influence the actions taken by people and while the experimenter may be able to control some aspects (payoffs, how the game is described, etc.), the participants’ past experiences and internalized norms can also affect outcomes. (iv) In experiments that have both a morality and wealth component, it is important to account for the role played by the size of the stake. (v) Study participants may differ in systematic ways from the actors of interest in real-world settings. For example, several lab experiments are conducted with students who ‘self-select’ into the experiment. (vi) In experiments, researchers define the set of actions that the subject is allowed to take and response modes are restricted to a single dimension. However, in the real world, the choice set is often almost unlimited and there are multiple response modes. For instance, in a game, an agent who is inclined to help others may give money, while in the real world, they can be generous in other ways such as by volunteering their time. (vii) Lab experiments usually only last for at most a few hours, and it is known that behaviors vary in short- and long-run decision-making. 

 

According to Levitt and List (2007), theory permits us to take results from one environment and make predictions in another, and the same applies to generalizing evidence from lab experiments. They call for a model of lab behavior that describes the data-generating process and how it may relate to other contexts. In the lab-in-the-field methodology, the key is to anticipate the type of bias that may occur, its sign (underestimation or overestimation of a behavior), and its plausible magnitude. Such an analysis can help the researcher tweak the design of experiments to minimize bias, and interpret findings in a meaningful way. Even if empirical findings are believed to have limited generalizability, qualitative insights can improve understanding of the issue and the underlying mechanisms.

 

The interpretation of findings from artefactual experiments is well-illustrated by Mani (2020). The study explores the impact of a social norm that ‘a man should not earn less than his wife’ on household investment efficiency. In the experiment, spouses are asked to make individual investment decisions under four scenarios that vary how much control they would personally have over the realized household income. When husbands are assigned a fixed share of household income that is smaller than the wife’s, they make less efficient investment decisions – that is, they are willing to undercut their own income to ensure that the wife does not earn too much more than they do. In terms of the experiment design, it is noted that participants’ actions are observable by the experiment coordinators, and hence participants may be more likely to behave in a way that the coordinators would approve of. In a context where norms are such that people would approve of a family with little domestic conflict, participants would act more efficiently. Hence, the outcomes should be interpreted as a ‘lower bound’ on the financial inefficiency, driven by social conditioning, that may exist in real-world households.

 

Randomized response surveys

The core idea of randomized response surveys is that random ‘noise’ is introduced such that individual responses cannot be identified and respondent privacy is protected; yet the researcher is able to obtain accurate estimates at the population level. One type of randomized response survey is the forced response design, which works as follows. The respondent is asked to secretly roll a die. If they roll a 1, the answer is recorded as “no”, regardless of the truth. If they roll a 6, the answer is recorded as “yes”, regardless of the truth. If they roll 2-5, they are asked to provide their truthful answer. The interviewer only observes “yes” or “no” but cannot know whether it was a forced response or a real one. Subsequently, the known probabilities of the die outcomes are used to recover the proportion of the population that would truthfully answer the question a certain way.
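As a sketch of the estimation step, assuming a fair die so that forced “no” and forced “yes” each occur with probability 1/6 and truthful answers with probability 4/6 (the function name and numbers are illustrative):

```python
def forced_response_estimate(n_yes, n_total, p_forced_yes=1/6, p_truthful=4/6):
    """Recover the population prevalence of 'yes' from a forced response design.

    The observed yes-rate mixes forced and truthful answers:
        observed_rate = p_forced_yes + p_truthful * prevalence
    so prevalence = (observed_rate - p_forced_yes) / p_truthful.
    """
    observed_rate = n_yes / n_total
    prevalence = (observed_rate - p_forced_yes) / p_truthful
    # Sampling noise can push the raw estimate slightly outside [0, 1].
    return min(max(prevalence, 0.0), 1.0)

# Example: 300 "yes" answers out of 1,000 respondents implies a true
# prevalence of roughly (0.30 - 1/6) / (4/6) = 0.20.
print(forced_response_estimate(n_yes=300, n_total=1000))
```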

 

Gingerich (2010) presents an example of using the randomized response method to measure the prevalence of vote buying among 100 legislative candidates. Each candidate receives a questionnaire, to be completed privately, and a coin. They are asked to privately toss the coin. If the outcome is heads, they are asked to mark “yes” on the questionnaire whether or not they had engaged in vote buying. If it is tails, they are asked to give their honest response. Now, suppose 62 of the 100 candidates marked “yes”. Since heads forces a “yes” with probability 0.5, about 50 of these affirmative responses are automatic. The remaining 12 come from the roughly 50 candidates in the truthful tails group, so the prevalence of vote buying is estimated at 12/50, or 24%.
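The same back-of-the-envelope calculation, written out for the coin design described here (heads forces a “yes” with probability 0.5; tails elicits the truth):

```python
# Coin-flip forced response design as described above.
n_yes, n_total = 62, 100
p_forced_yes, p_truthful = 0.5, 0.5

observed_rate = n_yes / n_total                           # 0.62
prevalence = (observed_rate - p_forced_yes) / p_truthful  # (0.62 - 0.5) / 0.5
print(f"Estimated vote-buying prevalence: {prevalence:.0%}")  # 24%
```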

 

The method is applied to a survey of about 3,000 public employees across institutions in Bolivia, Brazil, and Chile, to examine whether political career ambition leads to the misuse of institutional resources. A spinner-based design produced a non-response rate of only 4.6%, compared with 68-95% under direct questioning in a corruption survey in Cambodia.

 

However, the method is more complicated than list experiments, and it may be relatively harder for participants to understand how the anonymity of their responses is being ensured (Blair et al. 2020).

 

Marlowe-Crowne social desirability scale

Noting that social desirability bias is a key concern with self-reported outcomes such as gender attitudes, Dhar et al. (2022) leverage the Marlowe-Crowne social desirability scale (MC-SDS) (Crowne and Marlowe 1960). The context is an intervention in secondary schools wherein adolescents are engaged in classroom discussions about gender equality, and prompted to reflect on their own and society’s views on the matter. The researchers then seek to evaluate the impact of the intervention on gender attitudes. However, a concern is that participants in the treatment group are expected to be even more susceptible to social desirability bias on account of ‘experimenter demand effects’. This means that participants may disingenuously express more gender-progressive views to present themselves in a positive light to the surveyors.3 

 

To address the issue in a rigorous manner, the researchers use the MC-SDS – a survey module developed by social psychologists to measure a person’s propensity to give socially desirable answers. In the baseline survey of the schools study, a module4 was included that asked respondents about several too-good-to-be-true personality traits that people are unlikely to actually have, such as never being jealous of another person’s good fortune, always being a good listener, or never being irritated by people who ask favors of them. Those who report more of these traits are scored as having a higher tendency to give socially desirable answers. However, a caveat is that some of the variation might reflect real differences in possessing these desirable traits.
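A minimal sketch of how such a score might be constructed from the short-form items; the item wording follows the examples above, while the function and the respondent data are illustrative:

```python
# A few "too good to be true" statements in the spirit of the MC-SDS short form
# described above (illustrative subset of the 13-item module).
items = [
    "I have never been jealous of another person's good fortune.",
    "I am always a good listener, no matter whom I am talking to.",
    "I have never been irritated by people who ask favors of me.",
]

def social_desirability_score(endorsements):
    """Count the socially desirable endorsements; `endorsements` holds one
    boolean per item, True meaning the respondent claims the implausible trait."""
    return sum(endorsements)

# Illustrative respondent: endorses two of the three statements.
respondent = [True, False, True]
print(f"Social desirability score: {social_desirability_score(respondent)}"
      f" out of {len(items)}")
```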

 

The analysis shows that respondents with a higher propensity for social desirability bias do express more support for gender equality, but they do so similarly across the treatment and control groups. The positive effects of the intervention on self-reported attitudes and behaviors are similar in magnitude for those with a low versus high propensity for social desirability bias. It is noted that techniques such as list experiments must focus on a narrower set of outcomes, whereas the MC-SDS allows bias to be tested for any and all outcomes.
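One way to run the comparison described here is to interact treatment status with an indicator for an above-median social desirability score. A sketch assuming a pandas DataFrame with these hypothetical column names (the data below are simulated placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data; in practice these columns would come from the survey.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender_attitudes_index": rng.normal(size=500),      # outcome
    "treatment": rng.integers(0, 2, size=500),           # 0/1 assignment
    "sds_score": rng.integers(0, 14, size=500),          # MC-SDS short-form score
})
df["high_sds"] = (df["sds_score"] > df["sds_score"].median()).astype(int)

# If the treatment:high_sds interaction is small and insignificant, the measured
# treatment effect does not differ by propensity to give socially desirable answers.
model = smf.ols("gender_attitudes_index ~ treatment * high_sds", data=df).fit()
print(model.summary().tables[1])
```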

 

Vu et al. (2011) test the reliability of the MC-SDS in the context of evaluating HIV prevention programs, which typically rely on self-reported surveys of HIV knowledge, attitudes, and sexual practices. Their study covers four African countries – Ethiopia, Kenya, Mozambique, and Uganda. The researchers conclude that the MC-SDS is a reliable instrument for assessing the effect of social desirability when soliciting self-reported information on sensitive matters such as sexual behavior, and can help improve the interpretation of data. It can effectively be adapted and implemented in diverse settings in sub-Saharan Africa.

 

Research on women’s healthcare-seeking behaviors can utilize the types of methods described here, drawing on studies conducted on other sensitive topics. This can help disentangle the various influences on the demand for women’s healthcare, and create evidence to aid the design of targeted interventions. 

 

In our next blog in December, we turn to the supply side of healthcare – examining whether there are elements of health systems that keep women away, such as women being taken less seriously by medical personnel relative to male patients. 

 

FOOTNOTES


 

1 Blair et al. (2020) label a subset of social desirability bias as sensitivity bias.

 

2 An alternative way to implement the method is to embed the sensitive item within a list of non-sensitive items for the treatment group, with the difference in the average responses between the two groups being attributable to the sensitive item (Kramon and Weghorst 2019).

 

3 Such an effect is likely to fade out over time. Indeed, in this study, it was seen that the impact of the intervention on gender attitudes persisted two years after the program ended, providing reassurance that the change in attitudes was genuine. However, such longer term follow-up may not always be feasible.

 

4 The researchers implement a 13-item version of the original 33-item module, and combine the responses into an index or social desirability score.

 

 

REFERENCES


 

Blair, G., & Imai, K. (2012). Statistical analysis of list experiments. Political Analysis, 20(1), 47–77. 

 

Blair, G., Coppock, A., & Moor, M. (2020). When to worry about sensitivity bias: A social reference theory and evidence from 30 years of list experiments. American Political Science Review, 114(4), 1297–1315. https://onlinelibrary.wiley.com/doi/10.1111/ajps.12086

 

Dhar, D., Jain, T., & Jayachandran, S. (2022). Reshaping adolescents’ gender attitudes: Evidence from a school-based experiment in India. American Economic Review, 112(3), 899–927. https://www.aeaweb.org/articles?id=10.1257/aer.20201112

 

Gangadharan, L., Jain, T., Maitra, P., & Vecci, J. (2022). Lab-in-the-field experiments: Perspectives from research on gender. The Japanese Economic Review, 73(1), 31–59. https://link.springer.com/article/10.1007/s42973-021-00088-6

 

Gingerich, D. W. (2010). Understanding off-the-books politics: Conducting inference on the determinants of sensitive behavior with randomized response surveys. Political Analysis, 18(3), 349–380. https://www.jstor.org/stable/25792017

 

Glynn, A. N. (2013). What can we learn with statistical truth serum? Design and analysis of the list experiment. Public Opinion Quarterly, 77(S1), 159–172. https://academic.oup.com/poq/article/77/S1/159/1878470

 

Kramon, E., & Weghorst, K. R. (2019). (Mis)measuring sensitive attitudes with the list experiment: Solutions to list experiment breakdown in Kenya. Public Opinion Quarterly, 83(S1), 236–263. https://academic.oup.com/poq/article/83/S1/236/5525050

 

Krupka, E. L., & Weber, R. A. (2013). Identifying social norms using coordination games: Why does dictator game sharing vary? Journal of the European Economic Association, 11(3), 495–524. https://academic.oup.com/jeea/article-abstract/11/3/495/2300029

 

Levitt, S. D., & List, J. A. (2007). What do laboratory experiments measuring social preferences reveal about the real world? Journal of Economic Perspectives, 21(2), 153–174. https://www.aeaweb.org/articles?id=10.1257/jep.21.2.153

 

Mani, A. (2020). Mine, yours or ours? The efficiency of household investment decisions: An experimental approach. World Bank Economic Review, 34(3), 575–596. https://academic.oup.com/wber/article-abstract/34/3/575/5717658

 

Tourangeau, R., & Yan, T. (2007). Sensitive questions in surveys. Psychological Bulletin, 133(5), 859–883. https://psycnet.apa.org/doi/10.1037/0033-2909.133.5.859

 

Vu, A., Tran, N., Pham, K., & Ahmed, S. (2011). Reliability of the Marlowe–Crowne Social Desirability Scale in Ethiopia, Kenya, Mozambique, and Uganda. BMC Medical Research Methodology, 11, 162. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-11-162