Data Design Principles

Scalability, Integrity, Consistency.

Scalability

Allow for scalability when building a dataset: it needs to be able to grow both in the number of observations and in the number of features observed.

We should consider both our current needs and our future needs (this means documenting the data!).

Be wary of conditioning data to meet one tool's requirements. If we engineer the data to suit Tool X, we may later find that Tool Y is a better option, but that the data was engineered in such a way that Tool Y cannot fully consume it. Try to separate the processing from the data storage.
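One way to picture this separation is the minimal sketch below. The data is stored once in a tool-neutral form, and each tool gets its own preparation step applied at read time. The sample data, the column names, and the two preparation functions are hypothetical and only there for illustration.

```python
import csv
import io

# Hypothetical raw data, stored once in a tool-neutral form (plain CSV text).
RAW_CSV = "id,height_cm,smoker\n1,170,yes\n2,182,no\n"


def load_raw(text):
    """Read the stored data without altering it."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare_for_tool_x(rows):
    """Hypothetical Tool X wants purely numeric feature vectors."""
    return [
        [float(r["height_cm"]), 1.0 if r["smoker"] == "yes" else 0.0]
        for r in rows
    ]


def prepare_for_tool_y(rows):
    """Hypothetical Tool Y wants records keyed by id, with labels kept as text."""
    return {
        r["id"]: {"height_cm": float(r["height_cm"]), "smoker": r["smoker"]}
        for r in rows
    }


rows = load_raw(RAW_CSV)
tool_x_input = prepare_for_tool_x(rows)
tool_y_input = prepare_for_tool_y(rows)
```

Because the stored rows are never modified, switching from Tool X to Tool Y only means writing a new preparation function, not re-engineering the dataset.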

Integrity

Validate the data at the point of entry, and ensure that transformations are documented and traceable.
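As a rough sketch of what this could look like, the example below checks each record before it is accepted and keeps a simple log of every transformation applied afterwards. The field names, the validation rules, and the log structure are assumptions made for illustration, not a prescribed format.

```python
from datetime import datetime, timezone


def validate_record(record):
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    if not isinstance(record.get("id"), int):
        problems.append("id must be an integer")
    age = record.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 120:
        problems.append("age must be a number between 0 and 120")
    return problems


# Each entry records what was done and when, so the step can be traced later.
transformation_log = []


def log_transformation(name, details):
    transformation_log.append({
        "step": name,
        "details": details,
        "applied_at": datetime.now(timezone.utc).isoformat(),
    })


# Hypothetical incoming records: one valid, two invalid.
incoming = [{"id": 1, "age": 34}, {"id": "2", "age": 34}, {"id": 3, "age": 150}]

accepted = []
rejected = []
for record in incoming:
    problems = validate_record(record)
    if problems:
        rejected.append((record, problems))
    else:
        accepted.append(record)

# Apply a transformation to the accepted records and document it.
for record in accepted:
    record["age_months"] = record["age"] * 12
log_transformation("age_to_months", "age_months = age * 12")
```

Rejected records are kept alongside the reasons for rejection rather than silently dropped, which makes it easier to correct them at the source.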

We will likely apply transformations such as normalisation, standardisation, or encoding. If we are going to do so, we must make sure we do it correctly!
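To make the distinction concrete, here is a small sketch of the two scaling operations. It assumes that normalisation means min-max scaling to the range 0 to 1 and that standardisation means z-scoring with the population standard deviation; both terms are used loosely in practice, so it is worth documenting which definition was applied. The sample values are invented.

```python
import statistics


def min_max_normalise(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(v - mean) / sd for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # invented sample values
normalised = min_max_normalise(heights)  # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = standardise(heights)
```

Both functions refuse a constant feature instead of dividing by zero, which is one of the easy ways to get these transformations wrong.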

It's good practice to be able to change parameters without impacting the data quality. We should always think not only about how we can use the data today, but also about how someone else might want to use it tomorrow.

Consistency

Use standard patterns and consistent binary notation: if 1 means true in one part of the data, it should mean true throughout the rest of the data.
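A minimal sketch of enforcing one binary convention is shown below. It assumes the convention is 1 for true and 0 for false, and the lists of accepted spellings are illustrative rather than exhaustive.

```python
# Spellings we are willing to interpret (an illustrative, not exhaustive, list).
TRUE_VALUES = {"1", "true", "t", "yes", "y"}
FALSE_VALUES = {"0", "false", "f", "no", "n"}


def to_binary(value):
    """Map a recognised true/false spelling to 1 or 0; fail loudly otherwise."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    raise ValueError(f"unrecognised binary value: {value!r}")


raw_flags = ["Yes", "no", 1, "TRUE", 0, "n"]
clean_flags = [to_binary(v) for v in raw_flags]  # [1, 0, 1, 1, 0, 0]
```

Raising an error on an unknown spelling, rather than guessing, keeps a stray value from quietly ending up encoded as false.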

Use a common, sensible vocabulary. This is important because the data needs to make sense to the next person who reviews it, and it also needs to make sense to the future you.

Similar features should be stored using consistent data types.
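One way to keep types consistent is to declare the expected type of each feature once and coerce every record against that declaration. The sketch below assumes a hypothetical schema and invented records; the feature names are not taken from any real dataset.

```python
# Hypothetical schema: similar measurements share the same type (float).
SCHEMA = {"height_cm": float, "weight_kg": float, "visit_count": int}


def coerce_record(record, schema=SCHEMA):
    """Return a copy of the record with every feature cast to its schema type."""
    return {name: kind(record[name]) for name, kind in schema.items()}


# Invented records in which the same feature arrives as text or as a number.
records = [
    {"height_cm": "170", "weight_kg": 65, "visit_count": "3"},
    {"height_cm": 182.5, "weight_kg": "80.2", "visit_count": 4},
]
typed = [coerce_record(r) for r in records]
```

After coercion, every record stores height_cm and weight_kg as floats and visit_count as an integer, regardless of how the values arrived.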