The difficulties of collecting good data

Working with data is hard!

Data comes in all shapes and sizes, with as much variation as you can imagine and probably more. There is one constant: working with data is hard!

[Image: data pipeline]

Data Collection

Data collection is time-consuming and often resource-intensive. We can buy data, but this is usually expensive and the data is not always suitable for what we need, so even purchased data has to be cleaned and processed.

Data Documentation

Data is often poorly documented. What was collected? Where did it come from? Has it already been processed in any way? All of this is important information, as it helps us understand how accurate the data is.
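
As a rough illustration, the answers to those questions could be kept next to the dataset in a small provenance record. This is only a sketch: the DatasetRecord class, its field names, and the example values are all made up for this post, not a standard of any kind.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical provenance record stored alongside a dataset."""
    name: str
    what_was_collected: str
    source: str
    # Each entry describes one processing step already applied to the data.
    processing_steps: list[str] = field(default_factory=list)

    def is_raw(self) -> bool:
        # No recorded steps means we assume the data is untouched.
        return not self.processing_steps


# Example values are invented for illustration.
record = DatasetRecord(
    name="store_visits",
    what_was_collected="Daily visitor counts per store",
    source="Door sensors",
)
record.processing_steps.append("Removed days with sensor outages")
print(record.is_raw())  # False
```

Even a record this small tells the next person whether they are looking at raw or already-processed data.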

Data Quality

Data is often of poor quality, and we only really see just how poor when we start analysing it.
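
One quick way to get a first impression of quality is to count missing values per column before doing any analysis. The sketch below uses only the Python standard library. The count_missing function, the set of "missing" markers, and the sample data are assumptions made for illustration; real data will need its own list of markers.

```python
import csv
import io
from collections import Counter

# Assumed markers for a missing value; adjust to the data at hand.
MISSING = {"", "na", "n/a", "null", "none"}


def count_missing(rows):
    """Count missing values per column in an iterable of dict rows."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # csv.DictReader uses None for fields absent from a short row.
            if value is None or value.strip().lower() in MISSING:
                missing[column] += 1
    return dict(missing)


# Invented sample data: two rows have no usable age, one has no city.
sample = io.StringIO("id,age,city\n1,34,Leeds\n2,,York\n3,N/A,\n")
print(count_missing(csv.DictReader(sample)))  # {'age': 2, 'city': 1}
```

Missing values are only one symptom of poor quality, but they are usually the cheapest one to measure.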

Data Volume

We hear the term Big Data, along with claims that companies have so much data they don't know what to do with it. Yet when it comes to data science, there is often simply not enough data.

We need to collect more data, and this takes time.

Combining new data with old data can be complicated.
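
A small sketch shows where the complication comes from: the new collection may repeat records that already exist, and it may carry columns the old one never had. The combine function, the "id" key, and the sample rows below are hypothetical, and letting the newer row win is just one possible policy.

```python
def combine(old_rows, new_rows, key="id"):
    """Merge two lists of dict records; newer rows win on a shared key."""
    merged = {row[key]: dict(row) for row in old_rows}
    for row in new_rows:
        # Update an existing record, or start a new one if the key is unseen.
        merged.setdefault(row[key], {}).update(row)
    # Give every record the same columns, using None where a value is unknown.
    columns = sorted({column for row in merged.values() for column in row})
    return [{column: row.get(column) for column in columns}
            for row in merged.values()]


# Invented sample data: the new collection adds an "age" column.
old = [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}]
new = [{"id": 2, "city": "Hull", "age": 41}, {"id": 3, "city": "Bath", "age": 29}]
for row in combine(old, new):
    print(row)
```

Record 1 ends up with no age at all, which is exactly the kind of gap that combining old and new data tends to create.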

Data Ethics

This probably deserves a post of its own, but the ethics around data is a minefield. Where did this data come from? Do the people it describes know we collected it? Are we processing it in line with what they agreed to at the point of collection? Did they even read the terms and conditions (likely not), and so do they even know that we collected the data in the first place?