Where to Share Data

In this lesson you will learn

  • The various ways in which you can share your research data
  • About data repositories and their benefits
  • What kinds of data repositories there are
  • Some issues to consider when deciding where to share your research data

Initial questions

  • If you have already shared some of your research data, where did you do so and what were the main reasons behind your decision about where to share?
  • When someone says “data repository,” what sort of image does that conjure up for you?

Ways to Share Data

Once you’ve decided which of your research data you will share, and while thinking about how you will address any challenges that sharing your data might pose, you need to figure out how and where you will share your data. There are lots of options.

  • You could share your data on an ad hoc basis with whomever requests them – maybe just sending them to the requestor attached to an email.
  • You could post your data on your own website.
  • If the data emanated from a broader research project, you could post them on that research project’s website.
  • If the data are associated with a research article, you could include the data as supplementary material for the journal’s publisher to store.

These solutions, however, pose various types of problems.

  • Sharing your data on an ad hoc basis with people who request them means your data aren’t really accessible. Don’t you want as many people as possible to be able to learn from and reuse your data?
  • Sharing your data on your own website is not a long-term solution: do you have the technical knowledge and resources to guarantee the accessibility and long-term preservation of your data when, for instance, your institution changes technologies or you change institutions?
  • Then there’s the issue of link rot (i.e., hyperlinks pointing to web resources that have become unavailable). To give an example of the prevalence of this problem, more than half of the reproducibility links in articles from the American Political Science Review between 2000 and 2013 couldn’t be accessed in 2016 (Gertler and Bullock 2017, 167).
  • Finally, while publishers have significant experience in preserving digital publications, they rarely extend their preservation promises and practices to supplementary material and may not prioritize making such materials accessible to the broader research community.

The bottom line is, you’re a scholar! You’re equipped to generate and analyze data. You don’t have, and shouldn’t need to have, the skills necessary to properly care for and preserve research data for the long term. The same goes for journals, whose core mandate is to publish articles, not preserve data. Instead of exercising the options above, we encourage you to store your data in an institution specifically created for this purpose, i.e., in a data repository. As you’ll learn below, a repository is “a final destination” for data that allows for their long-term publishing and preservation. There are a few kinds to choose from, and lots of things to consider when you’re making your choice.

Exercise

Link Rot

  1. The list of broken links that Gertler and Bullock found in the American Political Science Review (see the article cited above) is part of the data accompanying the article, which they shared through Harvard Dataverse in the file “brokenReproducibilityURLs.tab”. Download the file, open it in Excel, and find the first five links marked as “Did not find resource”. What sorts of websites are these? Can you still find the linked content using the Internet Archive’s Wayback Machine?
  • Solution
    1. You’ll have noted that the broken links lead to all sorts of different websites: personal, institutional, even government websites. In some cases (such as Erik Voeten’s Princeton page and the Polish Center for Public Opinion Research CBOS), the Wayback Machine is not able to find any archived copy. You may still be able to track down this information by looking at the current sites – or you may not. In other cases, such as Richard Tucker’s page at Vanderbilt, you can access the archived webpage, but the actual data aren’t archived there. For the Correlates of War project, you are able to download the full data from the Internet Archive. You could also go to the project’s current site, but would you be able to find the version of the data corresponding to the 2000 article there? The internet offers you powerful tools to retrieve old information, but even with those tools, the content behind some links – containing key parts of the scholarly record – may be lost forever. You should have strategies for tracking down such lost data, but more importantly, you should make sure that the same fate does not befall your data and links. The best way to do this is to deposit your data in a repository.
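
If you would rather check links programmatically than one at a time in the browser, the lookup in this exercise can be sketched in a few lines of Python. This is a minimal sketch, not part of the original exercise: it assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and its JSON response layout, and the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's Wayback availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the availability query URL for one possibly broken link."""
    params = {"url": url}
    if timestamp:
        # Optional target date, written as digits (e.g. "20000101").
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_link(url, timestamp=None):
    """Ask the Wayback Machine about one link (requires network access)."""
    with urlopen(availability_query(url, timestamp), timeout=30) as response:
        return closest_snapshot(response.read().decode("utf-8"))
```

Calling `check_link` on each broken URL returns either the address of an archived copy or `None`. As the solution notes, an archived page does not guarantee that the data files behind it were archived too.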

    What Data Repositories Do

    Most data repositories curate research data and preserve them for the long term. Curation is an umbrella term that describes a set of activities: managing, maintaining, validating, and adding value to research data. Curation increases the value and quality of data as a research product, augmenting their understandability and usability by the scientific community. Preservation refers to protecting and prolonging the integrity of research data and their associated metadata for the long term. Data repositories are generally stable over time, helping to ensure the long-term preservation and security of your data.

    Given the characteristics of qualitative research data, and the practical, ethical, and legal challenges that sharing such data sometimes poses, even if you carefully manage your data, they often need to be curated by experts in order to be findable by and accessible to other scholars. The staff of repositories that often work with qualitative data are familiar with and alert to the special challenges posed by sensitive data, and help depositors to make decisions on how to share their data ethically and legally. In the next section we discuss types of data repositories. While not all repositories carry out each of the tasks listed below, most full-service repositories offer the following services:

    • Work to make your data easier for other scholars to discover and access
      • Repositories are fully searchable and often indexed broadly.
      • Repositories assign digital object identifiers (DOIs) to data.
        • A DOI is a unique, persistent character string that can be assigned to any digital object (e.g., a dataset, a 15-second interview, or a piece of documentation). It’s basically a tag that remains fixed over the lifetime of the digital object. Referring to an online document by its DOI provides more stable linking than simply referring to it by its URL.
    • Work to make your data easier for other scholars to use and cite
      • Repositories publish documentation and other materials that facilitate the interpretation and re-use of data.
      • Repositories develop and standardize metadata, and associate relevant and important metadata with your shared data.
        • Metadata are structured information describing your data. Some examples of metadata that might be associated with your data set in a repository are your name, the dates during which data were collected, the title of the dataset, and other characteristics of your data.
        • Separate files in your dataset may also have additional item-level metadata associated with them.
    • Provide tools that allow you some control over who accesses, downloads, and uses your data, and how they do so
      • Repositories offer authenticated online access to data.
      • Repositories allow you to establish user-access controls for your data when these are necessary to protect confidentiality and personal privacy as required by law and the ethical standards of your research community.
    • Collaborate across disciplines to achieve interoperability among scientific communities
      • Repositories often allow the “harvesting” of their metadata through a dedicated protocol, facilitating the development of search interfaces covering many data repositories.
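
To make DOIs, metadata, and harvesting a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any particular repository works: the record fields, the example DOI `10.1234/example-dataset`, and the repository address are all made up. It also assumes that the "dedicated protocol" is OAI-PMH, a widely used harvesting protocol; individual repositories may use something else.

```python
from dataclasses import dataclass, asdict
from urllib.parse import urlencode

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


@dataclass
class DatasetRecord:
    """Dataset-level metadata of the kind described above (hypothetical fields)."""
    title: str
    creator: str
    collected_from: str
    collected_to: str
    doi: str

    def doi_url(self):
        """Stable link: it goes through the resolver, not the current web address."""
        return DOI_RESOLVER + self.doi


def harvest_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for a repository's harvesting endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Made-up example record; none of these values refer to a real dataset.
record = DatasetRecord(
    title="Interviews on Local Governance",
    creator="Doe, Jane",
    collected_from="2019-01-15",
    collected_to="2019-06-30",
    doi="10.1234/example-dataset",
)

print(asdict(record))   # the structured metadata as a plain dictionary
print(record.doi_url())
print(harvest_url("https://repository.example.org/oai"))
```

The DOI link keeps working even if the repository reorganizes its website, because the resolver is updated to point to the new location. That is what makes it more stable than a plain URL.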

    Exercise

    Data Sharing SNAFU

    1. Watch this short video and write down the various types of problems created by the ad hoc sharing arrangement for which the bear on the left opted. How would these problems have been mitigated if the bear had deposited his data in a data repository? While the bears are discussing quantitative data, would the challenges be the same for qualitative data?
    • Solution
      1. Some of the problems created by the left bear’s ad hoc sharing arrangement include: (1) lots of inter-researcher communication, which is inefficient; (2) the passage of a long period of time before the data are actually shared; (3) the obsolete format of the data, which temporarily prevents their use; (4) the secondary user’s inability to interpret the data; (5) the fact that the original researcher, not having documented the data, does not remember how to interpret them, and his co-author, who may remember, cannot be located. Had the left bear deposited his data (and complete documentation) in a data repository, they would have been more quickly accessible to, and more easily interpreted by, Dr. Benign (the panda); for instance, curation would have entailed updating the format. Dr. Benign would likely face similar challenges if the data were qualitative.

      Types of Data Repositories

      There are several different kinds of data repositories.

      Self-service repositories are the newest type. Most were founded after 2000. They are typically open to all research data (and in some cases other materials). Such venues probably hold the largest number of individual datasets worldwide.

      • Advantages
        • Convenience: they allow easy upload of data of any kind for all researchers.
        • Cost: both deposit and download are free of charge.
      • Disadvantages
        • Heavy reliance on the expertise and efforts of depositors.
        • Typically, deposits are either not reviewed/curated or are only minimally reviewed/curated by staff.
        • Depositors are responsible for supplying cataloging information.
        • The repository neither checks that files are valid (i.e., open correctly in the specified software) nor protects against file-format obsolescence (the inability to open old files with currently available software).
      • Examples:
        • Figshare (a for-profit company)
        • Zenodo (run by CERN, the European Organization for Nuclear Research)
        • Harvard Dataverse (run by the Institute for Quantitative Social Science at Harvard University)

      Institutional repositories are generally operated by college and university libraries, or by other research institutions. They accept pre-prints, working papers and, increasingly, research data generated by affiliates of the institution with which they are associated.

      • Advantages
        • The proximity of your institutional repository makes it easy for you to contact it with questions across the data lifecycle.
        • You may have a great deal of trust in this repository right at your home institution.
        • Libraries have a lot of expertise in the preservation of digital formats.
      • Disadvantages
        • Such repositories have traditionally been more concerned with holding and making available researchers’ publications than with facilitating access to the data underlying their research.
        • Libraries (and thus institutional repositories) may lack sufficient information technology and subject-specific capabilities to provide specific curation, preservation, and dissemination guidance and services (Johnston et al. 2017).
      • Examples

      Domain repositories focus on a specific discipline or group of disciplines (e.g., “social science” or “earth science”) and provide specialized services for data commonly used in those disciplines. These types of repositories have the longest history of the ones considered here.

      • Advantages
        • Domain repositories help you deposit your data, and they then curate and preserve the data.
        • Domain repositories are commonly best-equipped to store, preserve, and provide access to sensitive data.
        • Domain repositories link data to publications in which they are used and/or cited and showcase data holdings via blogposts, press releases, or infographics.
        • Domain repositories often monitor how the data they store are used; this makes it easier for you to get credit when the data you generated are used by others.
      • Disadvantages
        • Sometimes domain repositories put data behind a paywall that may limit the number of people in the U.S. and around the world who have access to your data.
        • Sometimes domain repositories charge for depositing data. The relevant costs may be borne by institutions, but sometimes you might have to pay them.
      • Examples

      Other repositories are “hybrids” – some combination of the types just mentioned.

      • Dryad began as a domain repository for bio-medical data, but now accepts data from many disciplines, and performs curation services.
      • Re-Share (UK Data Service) – a social science self-publishing repository run by and alongside a fully curated domain repository. This venue offers reduced but still significant curation work.
      • Open ICPSR (ICPSR) – a social science self-publishing repository run by and alongside a fully curated domain repository. Data published on Open ICPSR are minimally curated; a metadata review is performed after publication.

      Exercise

      Getting to Know Your Repository

      1. Find the website of the repository at your institution and poke around a bit. (If your institution doesn’t seem to have an institutional repository, try to find the institutional repository at another institution that is similar to yours or with which you’re familiar.) Then find the website of another kind of repository (choosing from one of the categories above) and poke around a bit. What similarities and differences do you notice?
      • Solution
        1. Your answers might concern (1) cost of accessing data or depositing data; (2) focus (on research publications vs. research data); (3) how much published guidance the repository provides for depositors; (4) whether, and what level of, curation services the data repository provides. If you tried to access data, your answers might consider (5) whether you had to become a registered user before seeing any data and (6) how easy it seems to be to find data in the repository and to deposit data.

        Evaluating Options and Choosing

        Now that you know a bit more about some of the venues in which you can share your research data, how will you decide among them? There are lots of issues to consider when making your choice. You might think about:

        • How much does it cost to deposit your data in the different venue choices?
        • How important is it to you to be able to post your data quickly and efficiently without interacting much with repository personnel, answering questions about curation, etc.?
        • How findable and accessible do you want your data to be for other scholars?
        • How much do you want and need your data to be carefully curated?
        • How sensitive are your data?
        • Will you need to place some kinds of access controls on your data?
        • How well-described and automated do you want the methods for accessing your data to be?
        • How easily and effectively do you want other scholars to be able to reuse your data?
        • How important is it to you to be able to interact face-to-face with the people who are handling your data?
        • What kinds of reputations do the different venue choices have?
        • How much do you trust the different venue choices to keep your data safe?
        • To what types of scholars do the different venue choices seem to cater, and does that match the audience with whom you would like to share your data?
        • Does the venue have CoreTrustSeal certification?
        • Can the venue issue DOIs?

        It could be helpful to actually write out the answers to these questions, ideally as part of working on your data management plan, and then compare what seems to be important to you with the advantages/disadvantages of the various types of venues listed above. Do your answers seem to point you towards one type of venue or another?
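
One way to make that comparison explicit is a simple weighted score. The Python sketch below is only an illustration of the idea: the criteria names, the weights (how much each question matters to you), and the 1-to-5 ratings are all made-up numbers, not an assessment of any real venue, and you would replace them with your own answers.

```python
def score_venues(weights, ratings):
    """Return (venue, weighted score) pairs, highest score first."""
    totals = {}
    for venue, venue_ratings in ratings.items():
        # Multiply each rating by how much that criterion matters, then add up.
        totals[venue] = sum(weights[c] * venue_ratings[c] for c in weights)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: this researcher cares most about curation and access controls.
weights = {"low_cost": 2, "speed_of_deposit": 1, "curation": 3, "access_controls": 3}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
ratings = {
    "self-service": {"low_cost": 5, "speed_of_deposit": 5, "curation": 1, "access_controls": 2},
    "institutional": {"low_cost": 4, "speed_of_deposit": 3, "curation": 3, "access_controls": 3},
    "domain": {"low_cost": 3, "speed_of_deposit": 2, "curation": 5, "access_controls": 5},
}

for venue, total in score_venues(weights, ratings):
    print(f"{venue}: {total}")
```

With these invented numbers the domain repository scores highest (38, against 29 and 24), but a researcher who weights cost and speed more heavily would get a different ranking. The point is to make your priorities visible, not to automate the decision.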