In the early stages of a visualization project, we often start with two interrelated questions: Where can I find reliable data? What does this data truly represent?
Information does not magically appear; people collect and publish data with explicit or implicit purposes. When working with data, pause to inquire more deeply about “whose stories are told?” and “whose perspectives remain unspoken?”, and favor evidence-based reasoning over less-informed alternatives.
What is the question you are seeking to answer with data? Write it down, in the form of a question, to clarify your own thinking and to clearly communicate it to others who can assist you. Look back at the data visualization projects that made a lasting impression on you to identify the underlying question that motivated them. Also, it is perfectly normal to revise your question as your research evolves.
What Types of Organizations May Have Collected or Published the Data You Seek?
If a governmental organization may have been involved, at what level: local, regional, state/provincial, national, or international? Or has the data you seek been compiled by a non-governmental organization, such as an NGO, an academic institution, journalists, or a for-profit corporation? Figuring out which organizations may have collected and published the data can point you to the digital or print materials they typically publish and to the most appropriate tools for focusing your search in that particular area.
What Level(s) of Data are Available?
Is information disaggregated into individual cases or aggregated into larger groups? Smaller units of data allow you to make more granular interpretations, while larger units can help you identify broader patterns.
Have Prior Publications Drawn on Similar Data?
Some of our best ideas begin while reading an article or book that describes its source of evidence, when we imagine new ways to visualize that data. We might stumble across a data table in a print publication, or an old webpage that sparks our interest in tracking down a newer version to explore further. Even outdated data helps by demonstrating how someone previously collected and worked with it.
What if No One Has Collected the Data You are Looking For?
Searching for data involves much more than googling keywords. Deepen your search by reflecting on the questions above, which will help guide you toward the data you need.
Public and Private Data
When searching for data, you also need to be informed about debates over public and private data.
The first question is: “To what extent should anyone be allowed to collect data about private individuals?”
Due to the rise of digital commerce, powerful technology companies own data that you and I consider to be private:
- Google knows what words you searched for, and its Chrome browser tracks your web activity through cookies.
- Amazon records your conversations with its Alexa home assistants.
- Facebook follows which friends and political causes you favor, and tracks your off-Facebook activity to improve its targeted advertising.
In some cases, the “big data” collected by large corporations can offer public benefits. For example, Apple shared its aggregated mobility data collected from iPhone users to help public health officials compare which populations stayed home rather than traveling during Covid-19. In other cases, corporations largely set their own terms for how they collect data and what they can do with it. US state and federal governments have not fully entered this policy arena; one exception is the California Consumer Privacy Act, which went into effect in 2020 and promises individuals the right to review and delete the data that companies collect about them. If you work with data that was collected from individuals by public or private organizations, learn about these policies and make wise and ethical choices about what to include in your visualizations.
The second question is: “When our government collects data, to what extent should it be publicly available?”
In most cases, individual-level data collected by US federal and state governments is considered private, except in cases where our governmental process has determined that a broader interest is served by making it public. Here are two cases where US federal law protects the privacy of individual-level data:
- Patient-level health data is generally protected under the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA).
- Similarly, student-level education data is generally protected under the Family Educational Rights and Privacy Act (FERPA).
On the other hand, here are three cases where government has ruled that the public interest is served by making individual-level data widely available:
- Individual contributions to political candidates are public information in the US Federal Election Commission database and in related databases maintained by nonprofit organizations.
- Individual property ownership records are public, and increasingly hosted online by many local governments.
- Individual salaries for officers of tax-exempt organizations are public; these organizations are required to report them on Internal Revenue Service (IRS) 990 forms each year.
Social and political pressures are continually shifting the boundary of what types of individual-level data collected by government should be made publicly available. Everyone who works with data needs to be informed about key debates over what should be public or private.
Mask or Aggregate Sensitive Data
Even if individual-level data is legally and publicly accessible, each of us is responsible for making ethical decisions about if and how to use it when creating data visualizations. When working with sensitive data, some ethical questions to ask are:
- What are the risks that publicly sharing individual-level data might cause more harm than good?
- Is there a way to tell the same data story without publicly sharing details that may intrude on individual privacy?
Every situation is different and requires weighing the risks of individual harm against the benefits of broader knowledge about vital public issues. Let’s look at how to mask and aggregate individual-level data when working with sensitive information.
For example, suppose you are working with crime data and wish to create an interactive map showing the frequency of different types of 911 police calls across several neighborhoods. In many US states, information about victims of sexual crimes or child abuse (such as the address) is considered confidential and exempt from public release. But some police departments publish open data about calls with the full address for other types of crimes, in a format like this:
| Date | Full Address | Category |
|------|--------------|----------|
| Jan 1 | 1234 25th Ave | Aggravated Assault |
While this information is publicly available, it’s possible that you could cause some type of physical or emotional harm to the victims by redistributing detailed information. One alternative is to mask details in such sensitive cases, like hiding the last few digits of street addresses while showing the general location, in a format like this:

| Date | Full Address | Category |
|------|--------------|----------|
| Jan 1 | 1YYY 25th Ave | Aggravated Assault |
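Here is a minimal sketch of one way to apply this kind of masking with pandas; the column names, sample rows, and Y-masking pattern are illustrative assumptions, not a standard practice.

```python
import re
import pandas as pd

# Hypothetical call-level rows; the column names are illustrative.
calls = pd.DataFrame({
    "date": ["Jan 1", "Jan 2"],
    "full_address": ["1234 25th Ave", "5678 Main St"],
    "category": ["Aggravated Assault", "Burglary"],
})

def mask_street_number(address: str) -> str:
    """Keep the first digit of the street number and mask the rest."""
    return re.sub(
        r"^(\d)(\d+)",
        lambda m: m.group(1) + "Y" * len(m.group(2)),
        address,
    )

calls["full_address"] = calls["full_address"].map(mask_street_number)
print(calls)  # "1234 25th Ave" becomes "1YYY 25th Ave"
```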
Another strategy is to aggregate individual-level data into larger groups. In the above example, you can group individual 911 calls into larger geographic areas, such as neighborhoods, in a format like this:
| Neighborhood | Crime Category | Frequency |
|--------------|----------------|-----------|
| East Side | Aggravated Assault | 13 |
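A minimal sketch of this aggregation step in pandas, assuming each call has already been assigned to a neighborhood (for example, by a spatial join); the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical call-level data; the neighborhood column is assumed to
# have been assigned already.
calls = pd.DataFrame({
    "neighborhood": ["East Side", "East Side", "West End"],
    "category": ["Aggravated Assault", "Aggravated Assault", "Burglary"],
})

# Count calls per neighborhood and category, dropping street-level detail.
summary = (
    calls.groupby(["neighborhood", "category"])
    .size()
    .reset_index(name="frequency")
)
print(summary)
```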
Aggregating individual-level details into larger yet meaningful categories is also a better way to tell data stories about the bigger picture.
Now, let’s see how to explore datasets that governments and nongovernmental organizations have shared with the public.
Open Data Repositories
Open data repositories often include these features:
- View and export: At minimum, open data repositories allow users to view and export data in common spreadsheet formats such as CSV and XLSX.
- Built-in visualization tools: Several repositories offer built-in tools for users to create interactive charts or maps on the platform site.
- Application programming interfaces (APIs): Some repositories provide endpoints with code instructions that allow other computers to pull data directly from the platform into an external site or online visualization. When repositories continuously update data and publish an API endpoint, it can be an ideal way to display live or “almost live” data in your visualization, as in the sketch below.
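As a sketch of what pulling from such an endpoint can look like: the URL below is hypothetical, and the $limit parameter follows the Socrata open data convention; other platforms use different query parameters.

```python
import pandas as pd
import requests

# Hypothetical endpoint; substitute the API URL published by your repository.
ENDPOINT = "https://data.example.gov/resource/police-calls.json"

# Request the most recent records; $limit is a Socrata-style parameter.
response = requests.get(ENDPOINT, params={"$limit": 1000}, timeout=30)
response.raise_for_status()

# Load the JSON records into a DataFrame for charting or mapping.
calls = pd.DataFrame(response.json())
print(calls.head())
```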
Source Your Data
Once you find the data, write the source information inside the downloaded file or a new file you create. Add key details about its origins, if possible in two places: the spreadsheet filename and a source notes tab. The first step is to label every data file that you download or create. Avoid bad filenames like these:

- dat.csv
- file.ods
- download.xlsx

Instead, write a short but meaningful filename like:

- netflix_titles.csv
- census2010
The second step is to save more detailed source notes about the data on a separate tab inside the spreadsheet. Add a new tab named “notes” that describes the origins of the data, a longer description for any abbreviated labels, and when it was last updated. Add your own name and give credit to collaborators who worked with you.
The third step is to make a backup of the original data before cleaning or editing it. This simple strategy helps you avoid making irreversible edits to your original data. Make a habit of using all three strategies (filenames, notes, and backups) to increase the credibility and replicability of your data visualizations.
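A minimal sketch of these habits in Python, assuming a pandas workflow with an Excel engine such as openpyxl installed; the filenames and note text are placeholders, not requirements.

```python
import shutil
from pathlib import Path

import pandas as pd

# A meaningful filename (placeholder for illustration).
original = Path("netflix_titles.csv")

# Backup: copy the original before any cleaning or editing.
backup = original.with_name(f"{original.stem}_backup{original.suffix}")
shutil.copyfile(original, backup)

# Source notes: save the data plus a separate "notes" tab in one workbook.
df = pd.read_csv(original)
notes = pd.DataFrame({
    "note": [
        "Source: downloaded from <repository URL> on <date>",
        "Labels: <longer descriptions for any abbreviated labels>",
        "Prepared by: <your name>, with credit to <collaborators>",
    ]
})
with pd.ExcelWriter(original.with_suffix(".xlsx")) as writer:
    df.to_excel(writer, sheet_name="data", index=False)
    notes.to_excel(writer, sheet_name="notes", index=False)
```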
Recognize Bad Data
One key step when working with any data is to open the file, quickly scroll through the content, and look for any warning signs that it might contain “bad data”. If you fail to catch a problem in your data at an early stage, it could lead to false conclusions and diminish the credibility of your data visualization. Watch out for the following (a few of these checks appear in code below):

- Missing values: If you see blank or null entries, does that mean data was not collected? Or did a respondent decline to answer? If you are unsure, find out from the data creator. Also beware of using a 0 or -1 to represent a missing value without thinking about its consequences for spreadsheet calculations such as SUM or AVERAGE.
- Missing leading zeros: Be cautious when converting a column of zip codes to numeric data; the conversion strips out leading zeros and the resulting values are incorrect.
- 65,536 rows or 255 columns: These are the maximum number of rows supported by older-style Excel spreadsheets and the maximum number of columns supported by Apple Numbers, respectively. If your spreadsheet stops exactly at either of these limits, you probably have only partial data.
- Inconsistent date formats: For example, September 29, 2023 is commonly entered in the US as 9/29/2023, while people in other parts of the world commonly type it as 29/9/2023. Check your source to confirm which format it uses.
- Dates such as January 1 of 1900, 1904, or 1970: These are default timestamps in Excel spreadsheets and Unix operating systems, and may indicate that the actual date was blank or was overwritten.
- Dates similar to 43891: When you type March 1, 2020 into Excel, it displays as 1-Mar but is saved using Excel’s internal date system as 43891. If someone converts this column from date to text format, you’ll see the five-digit number, not the date you expect.
In addition to the above, beware of bad data due to poor geocoding, when locations have been translated into latitude and longitude coordinates that cannot be trusted.
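To illustrate a few of these checks in code, here is a minimal sketch assuming a pandas workflow; the sample values are hypothetical.

```python
import io
import pandas as pd

# Missing values: a 0 is not the same as "no answer". Zero is counted in
# an average, while a true missing value (NaN) is excluded from it.
print(pd.Series([10, 20, 0]).mean())     # 10.0 -- zero drags the mean down
print(pd.Series([10, 20, None]).mean())  # 15.0 -- missing value excluded

# Leading zeros: read zip codes as text so 06106 does not become 6106.
csv = io.StringIO("zip\n06106\n90210\n")
zips = pd.read_csv(csv, dtype={"zip": str})
print(zips["zip"].tolist())  # ['06106', '90210']

# Excel serial dates: 43891 counts days from Excel's 1899-12-30 origin.
print(pd.to_datetime(43891, unit="D", origin="1899-12-30"))  # 2020-03-01

# Ambiguous date formats: tell the parser which convention the source uses.
print(pd.to_datetime("29/9/2023", dayfirst=True))  # 2023-09-29
```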
What should you do when you discover bad data in your project? Sometimes small issues are relatively straightforward and do not call into question the integrity of the entire dataset. Larger issues can be more problematic. Follow your data stream back to its source to try to identify where the issue began. If you cannot find and fix the issue on your own, contact the data provider to ask for their input. If they cannot resolve the issue, then you need to pause and think carefully. Is it wiser to continue with the problematic data and add a cautionary note to readers, or should you stop using the dataset entirely and call attention to its underlying problems? These are not easy decisions, and you should ask colleagues for their opinions. Never ignore the warning signs of bad data.
Finally, you can help prevent bad data from occurring by following the key steps outlined above. Give meaningful names to your data files, and add source notes in a separate tab about when and where you obtained the data, along with any definitions or details about what it claims to measure and how it was recorded. Explain what any blank or null values mean, and avoid replacing them with zeros or other symbols. Always watch out for formatting issues when entering data or running calculations in spreadsheets.