Documenting Data in Practice

In this lesson you will learn

  • How to create templates for different types of documentation
  • What functions particular types of documentation can serve
  • What “metadata” are, and how they relate to data documentation

Initial questions

  • How will you organize the documentation you create for your data?
  • Have you ever wondered how library catalogs and search engines actually find stuff?

Examples of Data Documentation

Some examples help illustrate how you can effectively document different types of data, and how doing so helps you to assess the quality and evidentiary value of the data. These examples are neither exhaustive nor prescriptive. Different research projects and different types of data will benefit from different types and formats of documentation. You should adapt the examples we offer to fit your project and your data.

Archive Research Logs

As you visit archives, you’ll review lots of materials. A research log is an easy way to make sure you keep track of what materials you have requested / reviewed, where you found them, and whether and how those materials are relevant to your project.

There is no single “right way” to keep a research log. You can use a simple Word template or an Excel spreadsheet, or you can adapt software such as Zotero (as described here). What matters is that you develop a system you find easy to use, and that you stick to it.

To offer an example, such a log might include information at two levels – the archive level and the item level – and could have the following types of information:

  • Archive level
    • Date(s) on which you visited the archive
    • How you searched its holdings
    • Which boxes and folders you requested/consulted
  • Item level
    • Item name and/or code
    • Box / folder information
    • Date reviewed
    • If / how you captured content (e.g., digital photos or scans, photocopy, notes)
    • Relevance to your project
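
If you prefer a spreadsheet-style log, the item-level fields above could be sketched as rows in a CSV file. This is only a minimal illustration: the field names, the `LogItem` class, and the example entry are hypothetical, and you should adapt them to your own project.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class LogItem:
    """One item-level row of a research log (field names are illustrative)."""
    archive: str
    item_code: str
    box: str
    folder: str
    date_reviewed: str   # ISO date, e.g. "2024-03-05"
    capture_method: str  # e.g. "photo", "scan", "photocopy", "notes"
    relevance: str


def write_log(items, handle):
    """Write log rows as CSV, with a header row taken from the dataclass fields."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(LogItem)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))


# Hypothetical example entry, for illustration only.
items = [
    LogItem("Example Archive", "EA-0001", "12", "3", "2024-03-05",
            "photo", "Possibly relevant to chapter 2"),
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_log(items, buffer)
log_csv = buffer.getvalue()
print(log_csv)
```

Archive-level information (visit dates, how you searched the holdings) could live in a second sheet or file that uses the same archive name as its key.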

If you are able to create digital images of archival materials, be sure to follow a consistent file and folder naming strategy that allows you to easily shift back and forth between your digital images and your research log. Some of this item-level identifying information (e.g., name, box, folder) should be included in file names. (Alternatively, you can develop a coding system for these items, and include that in, or use it as, the file name.) Each folder could represent a particular day in the archives, or your folders could mimic the archive’s organizational structure. See this excellent blog post by Donna Campbell for additional suggestions on effective archival research.
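
One way to keep file names consistent is to generate them from the same identifiers you record in your log. The sketch below is a hypothetical naming scheme, not a standard: the `image_filename` helper and its pattern are assumptions, and it assumes box, folder, and page are numbered.

```python
import re


def slug(text):
    """Lowercase text and replace runs of non-alphanumeric characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def image_filename(item_code, box, folder, page, ext="jpg"):
    """Build a file name that carries the log's item-level identifiers.

    Zero-padding the numbers keeps files in the right order when sorted by name.
    """
    return (
        f"{slug(item_code)}_box{int(box):03d}"
        f"_folder{int(folder):03d}_p{int(page):03d}.{ext}"
    )


name = image_filename("EA 0001", box=12, folder=3, page=1)
print(name)  # ea-0001_box012_folder003_p001.jpg
```

Because the item code, box, and folder appear in every file name, you can move from an image back to its row in the log with a simple search.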

Keeping track of this information serves three purposes. First, it helps you (and your research team) avoid duplicating effort. Second, it helps you to correctly reference the materials you consulted as you write up your research down the line. Finally, keeping a log allows you to present a list of all of the materials you consulted, helping to substantiate a claim that certain information cannot be found in the records. We strongly suggest that you create your log as you go, keeping it as current as possible.

Documentation for Interactive Data Collection

Many researchers who generate their own qualitative data do so through interacting with human beings – engaging in ethnographic work, conducting interviews, holding focus groups, and so on. No matter whether you conduct three interviews or embed yourself in a field site for a year, you need a way to keep track of the people with whom you interact, the content of those interactions, and your observations about those interactions.

We discuss transcribing audio / video recordings of such interactions in the lesson on Transforming Data. Here we consider creating “informal metadata” for such interactions: logging practical information, and your observations and reactions. You can include this information as part of the record of the content of the interaction – as a header at the beginning, and/or notes at the end, of your transcription of or notes from the interaction. We strongly suggest that you create these metadata as soon after the interaction as possible.

Creating this documentation serves two purposes. First, transcripts and notes from interactive data collection are much richer sources of data when you can vividly remember the exchange: being able to “re-attach” the sights and sounds and emotion of an interaction to a document containing its text helps you to contextualize and interpret the data. Second, recording such metadata makes it easier for you to assess your confidence in, and the evidentiary value of, the data, and thus more effectively and appropriately deploy them as support for claims and conclusions in your research product.

With regard to practical information, the “metadata” for each of your interactions could include information such as:

  • How each respondent was identified / selected
  • Code (number) or pseudonym for each respondent
  • Date of interaction
  • Start time of the interaction
  • End time of the interaction
  • Location
  • Language
  • Format of exchange (e.g., semi-structured in-person interview; focus group)
  • What you promised to each respondent with regard to confidentiality through the process of soliciting their informed consent to participate in your study
  • Whether you took notes, recorded, or both
  • Suggestions made about your study / other human participants
  • Any follow-up that needs to be done
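
The practical information listed above could be turned into a reusable header for your transcripts or notes. The following is a rough sketch under stated assumptions: the `InteractionRecord` class, its field names, and the example values are all hypothetical, and you would add or drop fields to suit your project.

```python
from dataclasses import dataclass, fields


@dataclass
class InteractionRecord:
    """Practical 'informal metadata' for one interaction (illustrative fields)."""
    respondent_code: str
    selection_method: str
    date: str
    start_time: str
    end_time: str
    location: str
    language: str
    exchange_format: str
    confidentiality_promised: str
    recording: str        # "notes", "audio", or "both"
    suggestions: str = ""
    follow_up: str = ""


def render_header(record):
    """Render the record as 'Label: value' lines to paste above a transcript."""
    lines = []
    for f in fields(record):
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {getattr(record, f.name)}")
    return "\n".join(lines)


# Hypothetical example values, for illustration only.
record = InteractionRecord(
    respondent_code="R-014",
    selection_method="Referral from R-009",
    date="2024-03-05",
    start_time="14:00",
    end_time="15:10",
    location="Respondent's office",
    language="English",
    exchange_format="Semi-structured in-person interview",
    confidentiality_promised="Pseudonym only",
    recording="both",
    follow_up="Send copy of consent form",
)
header = render_header(record)
print(header)
```

Filling in the same fields for every interaction makes it easier to spot gaps (for example, a missing end time) while your memory is still fresh.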

With regard to observations and reactions about the exchange, your “metadata” could describe, for example:

  • The context of the exchange
    • E.g., “the very tall congressman’s very messy office, with overflowing bookshelves, three desks, and a view to the capitol out the window behind where he sat”
  • The overall rapport or tone of the exchange
  • Particular points in the exchange when the respondent(s) displayed any sort of emotion (sadness, exuberance, stress, anxiety, frustration, etc.)
  • Particular points in the exchange at which you were not sure the respondent(s) was/were revealing the full story; could not recall the full story; or may have been inventing things
  • The level of access the respondent(s) had to what they were discussing or describing
    • E.g., whether they had first-hand knowledge of the events or phenomena being described or had heard about them second- or third-hand.
  • Any information you have (perhaps from other human participants) regarding the respondents’ track record in terms of reliability
  • Key take-away points from the exchange


Creating Templates

  1. Create a template for an archive log, or for “informal metadata” for a form of interactive data collection in which you are engaging or anticipate engaging. Make sure to think hard about what information particular to your project you want to systematically collect. Then choose one of the options below:
    • If you are working independently, consider whether, realistically, you will use your template – is it so complicated that you are unlikely to stick with it? Also, imagine yourself six to nine months in the future trying to remember the data-collection context. What is your template missing that would help you re-conjure that context?
    • If you are taking this course at the same time as someone else whom you know, exchange templates, and critique each other’s. What do you find particularly valuable / useful about your partner’s template? What is missing? What is extraneous?
  • Solution
    1. There is no “solution” to this exercise. As you look at your (or your partner’s) templates, consider the following: (1) will it be easy to use during your research? (2) will you be able to find relevant information quickly and easily? (3) will the information in your template help someone else to understand your data? (4) what other criteria are important?

    From Documentation to Metadata

    Metadata are, as the name suggests, data about data. Another way to think of them is as highly structured documentation. You interact with metadata in your research (and beyond) all the time:

    • When you search a library catalog for works by a specific author, you use MARC (MAchine-Readable Cataloging) metadata, generated by a librarian.
    • When you post an article to social media and it appears with a small image and a short description, this relies on “Open Graph” metadata, embedded into the webpage you’re posting.
    • When you search Google Scholar for an article, you rely on a mix of metadata (“Highwire”) embedded by the publisher in the article’s webpages and Google’s automatic processing of that metadata and additional information it retrieves from the article.
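
To make the last two examples more concrete, here is a small sketch of how embedded metadata can be read out of a webpage. Open Graph tags use `<meta property="og:...">` and Highwire-style tags use `<meta name="citation_...">`; the page content below is invented for illustration, and the `MetaTagParser` class is our own helper, not part of any service's tooling.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags keyed by their `property` or `name` attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        if key and "content" in attributes:
            # Simplification: a repeated key (e.g. several authors) is overwritten.
            self.meta[key] = attributes["content"]


# Hypothetical page head, for illustration only.
page = """
<html><head>
<meta property="og:title" content="An Example Article">
<meta property="og:description" content="A short description.">
<meta name="citation_title" content="An Example Article">
<meta name="citation_author" content="Doe, Jane">
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(page)
open_graph = {k: v for k, v in parser.meta.items() if k.startswith("og:")}
citation = {k: v for k, v in parser.meta.items() if k.startswith("citation_")}
print(open_graph)
print(citation)
```

The point is not that you need to write such code, but that metadata are structured enough for software to read them reliably.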

    To quickly reassure you: no one expects you to learn about or create metadata in the formats discussed above; indeed, no one does. Just as libraries generate the metadata for the books in their catalogs, data repositories and their curators generate the metadata for your data once you deposit them. Repositories do this in two ways:

    1. All repositories prompt you to input some information about your data project and your data in different “fields” of the deposit form when you are depositing your digital data. These fields are then automatically mapped to relevant metadata formats.

    2. For data repositories that curate your data (i.e., domain repositories and some institutional repositories), curators will work to improve the metadata for your project by ensuring that categories are filled out correctly, by soliciting additional information, and by enhancing existing information, e.g., by adding systematic keywords.
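
The automatic mapping described in the first step could look roughly like the sketch below. It uses Dublin Core element names (`dc:title`, `dc:creator`, and so on) as the target format purely as an example. The form field names and the `map_deposit_form` function are hypothetical, and actual repositories may map their fields differently.

```python
# Hypothetical deposit-form fields mapped to Dublin Core elements.
FIELD_TO_DUBLIN_CORE = {
    "project_title": "dc:title",
    "depositor_name": "dc:creator",
    "summary": "dc:description",
    "collection_date": "dc:date",
    "keywords": "dc:subject",
}


def map_deposit_form(form):
    """Translate filled-in form fields into Dublin Core elements.

    Blank fields and fields without a mapping are skipped.
    """
    record = {}
    for field_name, value in form.items():
        element = FIELD_TO_DUBLIN_CORE.get(field_name)
        if element and value:
            record[element] = value
    return record


# Hypothetical deposit, for illustration only.
form = {
    "project_title": "Interviews on Local Governance",
    "depositor_name": "Doe, Jane",
    "summary": "",
    "keywords": "local government; interviews",
    "internal_note": "not mapped",
}
metadata = map_deposit_form(form)
print(metadata)
```

Note that the empty summary produces no metadata at all, which is one reason curators follow up to solicit additional information.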

    The most important metadata format in the social sciences is the “Data Documentation Initiative” (DDI). DDI provides detailed categories for describing most social science studies. Originally designed for survey research, its most distinctive feature is the ability to include variable-level metadata. Using DDI metadata, it is thus possible to search the catalogs of repositories such as ICPSR for specific variables.
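
To see why variable-level metadata matter for search, consider this simplified sketch. It does not use the actual DDI format; the catalog is a plain Python list, and the studies, variable names, and the `find_variables` function are invented for illustration.

```python
# A toy catalog: each study lists its variables with a name and a label.
catalog = [
    {
        "study": "Hypothetical Study A",
        "variables": [
            {"name": "AGE", "label": "Age of respondent in years"},
            {"name": "INCOME", "label": "Household income last year"},
        ],
    },
    {
        "study": "Hypothetical Study B",
        "variables": [
            {"name": "TRUSTGOV", "label": "Trust in national government"},
            {"name": "HHINC", "label": "Total household income"},
        ],
    },
]


def find_variables(catalog, term):
    """Return (study, variable name) pairs whose name or label contains term."""
    term = term.lower()
    hits = []
    for study in catalog:
        for var in study["variables"]:
            if term in var["name"].lower() or term in var["label"].lower():
                hits.append((study["study"], var["name"]))
    return hits


matches = find_variables(catalog, "income")
print(matches)
```

A search on study-level descriptions alone would miss `HHINC`, whose name does not contain the word "income"; the variable label is what makes it findable.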

    While creating metadata is not your responsibility, the better and more complete your data documentation is, the better and more useful the metadata that can be created based on that documentation. The quality of your data’s documentation and metadata matters in several ways. Well-structured metadata make data more findable. The availability of detailed information about how data were generated makes them more trustworthy for secondary users. The availability of background information on the project and the data makes the data more understandable.


    Comparing Documentation / Metadata

    1. Compare these data on figshare with these data on QDR. How does the available documentation / metadata differ? How do these differences matter?
    • Solution
      1. You likely noticed that the QDR project includes more detailed information about the content and generation of the data and more detailed metadata. Here are some specific differences:
        • You can tell when the data were collected and where.
        • You can follow the logic of the social media excerpts Clarke has collected. On the other hand, can you follow the meaning of the “Subject” codes in the first column of the Patel et al. data?
        • The detailed information about the generation of the data on QDR means you can better decide whether and how much to trust the data.