Acquisition of Data In Analytics/Science

Photo by Carlos Muza on Unsplash

“With data collection, ‘the sooner the better’ is always the best answer.” — Marissa Mayer

This saying underscores that one must procure a dataset before performing analysis or building models from it.

This article provides an overview of the various kinds of data and how one can go about acquiring such datasets.

What is Data and Data Acquisition?

Wikipedia notes that data is usually organized into structures such as tables that provide additional context and meaning, and which may themselves be used as data in larger structures.

Data can also be defined as raw facts and figures that can be processed to extract useful information.

Data acquisition (also called data collection) is the process of gathering data. Ideally, we have a question in mind before we collect the data, but that is not always the case.

Sometimes data is gathered before we know what to do with it. When that happens, it is important to take a step back and define what questions can be answered with the available data.

In addition, some things to consider when acquiring data are:

  • What data is needed to achieve the goal?

  • How much data is needed?

  • Where and how can this data be found?

  • What legal and privacy concerns should be considered?

Types of Data/Data Sources

First, we explore the different types of data, which include structured data and unstructured data.

Data is information of any kind. In the context of business and computing, we’ll deal (mostly) with information that’s in a machine-readable format. This is known as structured data.

Structured data

Structured data adheres to a pre-defined data model. This model describes how data is recorded, and it defines the attributes and provides information about the data type (e.g. name, date, number) and restrictions on their values (e.g. number of characters).

This level of organisation means that data can be entered, stored, queried, or analysed by machines.
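
As a minimal sketch of what such a pre-defined data model might look like in code (the field names, types, and the 100-character restriction here are hypothetical, purely for illustration):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:
    """A pre-defined data model: each attribute has a fixed type."""
    name: str          # text attribute
    signup_date: date  # date attribute
    phone: str         # stored as text to preserve leading zeros
    balance: float     # numeric attribute

    def __post_init__(self):
        # A restriction on values, e.g. a maximum number of characters
        if len(self.name) > 100:
            raise ValueError("name must be at most 100 characters")

# A record that conforms to the model
row = Customer(name="Ada Lovelace", signup_date=date(2023, 5, 1),
               phone="+44 20 7946 0000", balance=125.50)
```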

Structured data includes:

  • names

  • dates

  • phone numbers

  • currency or prices

  • heights or weights

  • latitude and longitude

  • word count or file size of a document

  • credit-card numbers

  • product names or numbers

  • transaction information.

You’ll often find structured data arranged in a tabular format (sometimes described as rectangular), where columns represent attributes (or variables) and each row represents a record.

The intersection of a column and row (usually called a cell), contains the value (or observation) about that attribute for that record.
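
For example, a small rectangular data set can be represented with the pandas library (the column names and values below are made up for illustration):

```python
import pandas as pd

# Columns are attributes (variables); each row is a record.
df = pd.DataFrame({
    "product_name": ["Widget", "Gadget", "Gizmo"],
    "price": [9.99, 14.50, 3.25],
    "units_sold": [120, 87, 431],
})

# A single cell holds the value of one attribute for one record:
print(df.loc[1, "price"])   # 14.5, the price of the second record
```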

Unstructured data

Unlike structured data, unstructured data requires human interpretation. Consider a block of text. Computers can read each word, or sentence, but they can’t (yet) determine the meaning or tone of the text without human intervention.

As you’ll discover later, data scientists are trying to solve this problem with machine learning and other types of artificial intelligence.

Other examples of unstructured data include:

  • images (human- and machine-generated)

  • video files

  • audio files

  • social-media posts

  • product reviews

  • messages sent by SMS or through online services.

Some data, such as email, is considered to be semi-structured. Email headers contain metadata such as the date, language, and recipient’s email address, which are all structured data. But the email body, which contains your message, is unstructured.
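
As a rough illustration of that split (the message text below is made up), Python’s standard email module can separate the structured headers from the unstructured body:

```python
from email import message_from_string

raw = """From: alice@example.com
To: bob@example.com
Date: Mon, 06 May 2024 09:30:00 +0000
Subject: Quarterly report

Hi Bob, the numbers look great this quarter. Let's catch up on Friday.
"""

msg = message_from_string(raw)

# The headers are structured metadata with well-defined fields...
print(msg["From"], msg["Date"])

# ...while the body is free text that needs further interpretation.
print(msg.get_payload())
```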

Data Sources

Data can be acquired from many different sources. Broadly, they can be categorized into primary data and secondary data.

Primary data is data collected by the individual or organization who will be doing the analysis. Examples include:

  • Experiments (e.g., wet lab experiments like gene sequencing)

  • Observations (e.g., surveys, sensors, in situ collection)

  • Simulations (e.g., theoretical models like climate models)

  • Scraping or compiling (e.g., webscraping, text mining)

Secondary data is data collected by someone else and is typically published for public use. Most data you will use falls into this category.

Examples include:

  • Any primary data that was collected by someone else

  • Institutionalised data banks (e.g., census, gene sequences)

As you can imagine, collecting your own primary data can be time consuming. However, the closer you are to the data, the better you understand it and its nuances.

On the other hand, secondary data is much easier to find. Even with secondary data, understanding how the data was created is essential in order to correctly utilise and analyse it.

This includes reading any available data methodology or README files.

Data file formats

Data can come in a variety of different file formats, depending on the type of data. Being able to open and convert between these file types opens a whole world of data that is otherwise inaccessible.

Examples of file formats include:

  • Tabular (e.g., .csv, .tsv, .xlsx)

  • Non-tabular (e.g., .txt, .rtf, .xml)

  • Image (e.g., .png, .jpg, .tif)

  • Agnostic (e.g., .dat)

Further, some file formats are proprietary and can only be opened by software developed by a particular company. Opening these in another program usually requires converting them to a universal format.

Proprietary formats include Excel or MS Access files that are designed to be opened by Microsoft Office applications, as opposed to more generic types like .csv files.

There are also other file formats that store metadata, such as SPSS and STATA files that contain information on data labels.

It is best practice to store data in a way that is most easily accessible to everyone. Generally, this means storing data in a non-proprietary, openly documented format, using a standard character encoding (such as UTF-8), and keeping data files uncompressed if space allows.

There are various methods, including online tools, that can be used to convert between formats if necessary.
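
As a sketch of such a conversion (the file names are hypothetical, and the pandas library with an Excel engine such as openpyxl is assumed to be installed), a proprietary Excel workbook can be re-saved as a plain, UTF-8-encoded CSV file:

```python
import pandas as pd

# Read a proprietary Excel file (hypothetical file name)...
df = pd.read_excel("survey_results.xlsx")

# ...and write it back out in a non-proprietary, openly documented format,
# using a standard character encoding.
df.to_csv("survey_results.csv", index=False, encoding="utf-8")
```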

Where to procure data?

Primary data

Conducting research and experiments is typically out of scope for data scientists, but surveys and simulations are common methods for acquiring primary data.

Web scraping is also a special case of primary data collection, in which data is extracted or copied directly from a website. To learn more, check out our full course on web scraping and this article demonstrating how to web scrape MLB stats using Python.
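
A minimal scraping sketch might look like the following; the URL and the tag being extracted are placeholders, and the requests and BeautifulSoup libraries are assumed to be installed:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL) and parse the returned HTML.
response = requests.get("https://example.com/stats")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract every table cell; the tag or selector to use depends
# on the structure of the page being scraped.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells[:10])
```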

Secondary data

Secondary data can be obtained from many different websites. Some of the most popular repositories include:

  • GitHub

  • Kaggle

  • KDnuggets

  • UCI Machine Learning Repository

  • US Government’s Open Data

  • Five Thirty Eight

  • Amazon Web Services

  • BuzzFeed

  • Data is Plural

  • Harvard HCI

Each repository or individual dataset has its own terms of use and method for downloading. Be sure to read the description on the website you are using for how to access the data.

Secondary data can sometimes be obtained via an application programming interface (API). APIs are built around the HTTP request/response cycle.

A client (you) sends a request for data to a website’s server through an API call. Then, the server searches its database and responds either with the data, or an error stating that the request cannot be fulfilled.
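
A bare-bones example of this cycle, using Python’s requests library and a hypothetical endpoint, might look like this:

```python
import requests

# The client sends a request to the server through an API call
# (the endpoint and parameters here are hypothetical).
response = requests.get(
    "https://api.example.com/v1/observations",
    params={"station": "KSFO", "limit": 100},
    timeout=10,
)

if response.status_code == 200:
    data = response.json()       # the server responded with the data
    print(len(data), "records received")
else:
    # ...or with an error stating the request cannot be fulfilled
    print("Request failed:", response.status_code, response.reason)
```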

Big Data

With advancements in digital technology, data can be collected and shared more easily than ever. The result is the era of big data, which gets its name because the number of records and variables involved is often too large to manage locally.

The term ‘big data’ is used to describe large, complex data sets. Big data sets have been around since the 1960s; however, in the last 20 years there has been a considerable increase in the amount of data being generated, or made available, especially by large online services (YouTube, Netflix, Salesforce, etc.).

On top of this, the Internet of Things (IoT) is a new source of big data, as connected devices capture and collate data on customer use and product performance.

Big data has three key properties: volume, variety, and velocity.

Each of these three presents unique challenges.

  • Volume: Data sets contain vast quantities of information that put high demands on systems used for storing, manipulating, and processing the information.
  • Variety: Until recently, spreadsheets, text files, and databases were the main sources of data for most applications. The increase in big data has brought about a diversity in the type and structure of data being analysed.

    It’s common for systems to process data from many sources, including emails, images, video, audio, readings from IoT devices, and even scanned PDF documents. This variety can pose issues when storing data, extracting information (‘mining’), and for analysis.

  • Velocity: Vast quantities of data are being generated faster than ever, presenting challenges for analysts as more industries use this information. The ability to make instant decisions based on up-to-date information can make or break a business.

Conclusion

Data acquisition is integral to the whole process of data analytics and data science. Building models and performing exploratory data analysis may be fun and exciting, but one must keep track of where the data was procured and through what medium to ensure proper documentation and sound decision making.