Survey Classification

Data Generation Process Workshop

Yue Hu

Tsinghua University

Goal

What to Achieve

  • Filter relevant questions from the survey pool
  • Classify the questions into themes
  • Verify assigned themes

Question Filter

Survey Selection

  • Download the pool of surveys from the DCPOtools website

  • Have a look at the downloaded spreadsheet

    • Select Freeze Panes from the Window menu
    • Start from the cross-national surveys (country_var is empty)
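Picking out the cross-national rows can be sketched with the standard library alone. The column names `survey` and `country_var` follow the spreadsheet described above; the sample rows and survey names are made up for illustration:

```python
import csv
import io

def cross_national_rows(csv_text):
    """Return the rows whose country_var cell is empty,
    i.e., the cross-national surveys to start from."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not row.get("country_var", "").strip()]

# Toy spreadsheet (column names follow the downloaded file):
sample = """survey,country_var,archive
wvs5,,icpsr
afrobarometer3,KE,dataverse
ess9,,ukds
"""
print([r["survey"] for r in cross_national_rows(sample)])  # ['wvs5', 'ess9']
```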

Survey Anchoring

  • Start from the archive column
    • Check first for the dataverse, gesis, icpsr, roper, or ukds labels
    • If dataverse, use the data_link to retrieve the questionnaire
    • If others, go to the survey website
      • Search the file_id in the search box
        • If nothing comes back, try the numeric part only
      • More clicks may be needed
        e.g., clicking “Studies” (icpsr) or “Studies/Datasets” (roper)

Codebook Downloading

Question Selection

  1. Search keywords relating to your topic
    • e.g., for gender egalitarianism, you can search for “wom,” “husband,” “wife,” etc.
  2. Look at the questions surrounding the ones you found
  3. Look over the index of questions
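The keyword step above can be sketched as a case-insensitive substring search over the codebook text, so a stem like “wom” catches “woman”, “women”, and so on. The question lines below are illustrative, not drawn from a specific codebook:

```python
import re

def keyword_hits(codebook_lines, stems):
    """Flag codebook lines containing any keyword stem
    (case-insensitive substring match)."""
    pattern = re.compile("|".join(map(re.escape, stems)), re.IGNORECASE)
    return [(i, line) for i, line in enumerate(codebook_lines, 1)
            if pattern.search(line)]

lines = [
    "q55. A working mother can establish a warm relationship with her child.",
    "q56. When jobs are scarce, men should have more right to a job than women.",
    "q57. On the whole, men make better political leaders than women do.",
    "q58. How interested would you say you are in politics?",
]
hits = keyword_hits(lines, ["wom", "husband", "wife"])
print([i for i, _ in hits])  # [2, 3]
```

Remember step 2: once a hit is found, read the neighboring questions too, since related items are often grouped in the codebook.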

Question Recording

Download the template from the DCPOtools website.

  1. survey: the survey name as it appears in the surveys_data spreadsheet;
  2. variable: The question index, e.g., “q56,” “v122”;
  3. question_text: The complete sentences read to the people taking the survey, or as close to that as you can find;
  4. response_categories: The number and the label of each of the options, e.g., “1. Strongly agree, 2. Agree, 3. Neither agree nor disagree, 4. Disagree, 5. Strongly disagree”.

If you’re sure there are no relevant questions in the survey, enter the survey name, put “NA” under variable, and move on.
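Filling the template can be sketched as building rows with the four columns above, defaulting any missing field to “NA” (which also covers the no-relevant-questions case). The survey names and values below are made up for illustration:

```python
import csv
import io

# Column names follow the recording template described above.
FIELDS = ["survey", "variable", "question_text", "response_categories"]

def record_question(rows, **entry):
    """Append one question entry, filling missing fields with 'NA'
    (e.g., variable = 'NA' when a survey has no relevant question)."""
    rows.append({f: entry.get(f, "NA") for f in FIELDS})

rows = []
record_question(
    rows,
    survey="wvs5",
    variable="q56",
    question_text="When jobs are scarce, men should have more right to a job than women.",
    response_categories="1. Agree, 2. Neither, 3. Disagree",
)
record_question(rows, survey="evs2008")  # no relevant question found

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[2])  # evs2008,NA,NA,NA
```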

Question Clustering

Classification

Three people per group, one group per topic.

  • Immigration
  • Inequality
  • LGBTQIA+
  1. Read through the questions (at least 1/3 of an archive each)
  2. Categorize them into three topics
    • Use a term to represent each topic
    • Mark the topic for each question
  3. Talk with your partners to justify and, if needed, modify your categorization systems so that they are consistent
  4. Mark all the questions
  5. Measure the intercoder reliability (ICR) with Fleiss’ κ to determine whether you need to categorize again
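Fleiss’ κ has a simple closed form, so a minimal sketch needs no libraries. Here `ratings[i][j]` counts the coders who assigned question i to topic j; the toy table (three coders, three topics, disagreement on the last question) is made up:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number
    of coders who put question i into category j."""
    n = len(ratings)      # number of questions
    m = sum(ratings[0])   # coders per question
    k = len(ratings[0])   # number of categories
    # mean per-question agreement
    p_bar = sum((sum(c * c for c in row) - m) / (m * (m - 1))
                for row in ratings) / n
    # chance agreement from overall category proportions
    props = [sum(row[j] for row in ratings) / (n * m) for j in range(k)]
    p_e = sum(p * p for p in props)
    return (p_bar - p_e) / (1 - p_e)

# Three coders, three topics; perfect agreement on 2 of 3 questions:
table = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
]
print(round(fleiss_kappa(table), 3))  # 0.55
```

Even with agreement on two of three questions, κ here is only 0.55, below the 0.8 target discussed in the tips, so this toy group would need another categorization round.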

Tips

Make sure to record the full text of each question.

Following Landis & Koch (1977), let’s aim for κ ≥ 0.8 (“almost perfect” agreement).

A high κ is not the ultimate goal in itself; in other words: no! fake! consistency!

Communicate with your partners twice:

  1. After you are familiar with the data, to nail down the categorization system
  2. After calculating the ICR, to figure out the problem if the κ is low

Make sure you record the data in the same way.

Outcome Example

Be Prepared and Good Luck