Data analysis fourth step: data pre-processing

The road so far

In the previous posts, we saw how important it is to define a goal and the question behind it, how to collect data to get insights, and how to explore the data to better understand the situation.

Now, the next step is to bring order to the chaos of your raw datasets.

Figure 1: Data Analysis main steps: focus on data pre-processing

The road ahead

The main point of the pre-processing phase is to make sure your data can be used effectively by the model you chose. This phase is equivalent to preparing the ingredients of your recipe: you select, cut, modify, and transform so that the data become inputs suited to that model's specifics.

Thus, depending on the model, you will have to prepare your data in different ways. There are some typical actions to be taken, and they can be schematically divided into four areas (a minimal code sketch follows the list):

  • A. Data cleaning
    1. Remove not-useful variables
    2. Remove non-useful subsamples (e.g., I decided that the 2017 dataset is not needed for this analysis, so I drop it)
    3. Remove or replace missing data (aka imputation)
    4. …
  • B. Non-nominal data detection and tagging
    1. Outliers
    2. «Frozen» data
    3. Data outside reasonable boundaries
    4. …
  • C. Data munging (wrangling)
    1. Data type transformation (e.g., from categorical to numeric)
    2. Grouping (e.g., group by the hour of the day, compute a running average…)
    3. Index creation (some models work faster with indexed structures)
    4. Creation of new variables as functions of other variables (energy from power, kW from W, a non-linear combination…)
    5. …
  • D. Data transformation
    1. Dimensionality reduction
    2. Feature selection (filter, wrapper, and embedded methods, Random Forests…)
    3. Feature extraction (PCA, LDA, ISOMAP…)
    4. Data normalization (sometimes strictly required by the model, for example in many machine learning methods)
    5. …
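To make the four areas concrete, here is a minimal Python/pandas sketch. The column names, thresholds, and toy values are hypothetical illustrations of the kind of operations listed above, not taken from any actual dataset:

```python
# Minimal sketch of the four areas (A-D) on a toy dataset with hypothetical columns.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy raw data: a power reading in W, a categorical label, and an unused column.
df = pd.DataFrame({
    "timestamp": pd.date_range("2018-01-01", periods=8, freq="30min"),
    "power_W": [0, 0, 1500, np.nan, 1800, 90000, 1700, 1600],
    "status": ["ok", "ok", "ok", "ok", "ok", "fault", "ok", "ok"],
    "notes": ["-"] * 8,                      # not useful for the analysis
})

# A. Data cleaning: drop non-useful variables, impute missing values.
df = df.drop(columns=["notes"])
df["power_W"] = df["power_W"].interpolate()

# B. Non-nominal data detection and tagging: out-of-boundary and frozen values.
df["tag_out_of_range"] = (df["power_W"] < 0) | (df["power_W"] > 5000)
df["tag_frozen"] = df["power_W"].diff().eq(0)

# C. Data munging: type conversion, derived variables, grouping by hour.
df["status_code"] = df["status"].astype("category").cat.codes
df["power_kW"] = df["power_W"] / 1000.0
hourly_mean = df.groupby(df["timestamp"].dt.hour)["power_kW"].mean()

# D. Data transformation: normalization, then PCA on the numeric features.
clean = df.loc[~(df["tag_out_of_range"] | df["tag_frozen"]), ["power_kW", "status_code"]]
scaled = StandardScaler().fit_transform(clean)
components = PCA(n_components=1).fit_transform(scaled)
```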

For the sake of the description, I have schematically divided the activities into four areas, but they are usually tightly connected, and sometimes you do not really need all of them (nor in that specific order). In this sense, pre-processing is a continuously refining procedure, going back and forth depending on what you find at each step.

Lessons learned become tips and tricks!

Very important: always report how the dataset changes at each step (e.g., after removing missing data, what percentage of the dataset was lost?). It is an easy double-check to see whether you made some unintended error (yes, everybody makes mistakes, and no, there is no shame in double-checking…).
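As a small illustration of this bookkeeping (the helper name and the toy numbers are mine, not from the post), a sketch in Python/pandas could look like this:

```python
# Hypothetical helper: log how many rows survive each pre-processing step.
import pandas as pd
import numpy as np

def report_step(df_before: pd.DataFrame, df_after: pd.DataFrame, step: str) -> None:
    lost = len(df_before) - len(df_after)
    pct = 100.0 * lost / len(df_before) if len(df_before) else 0.0
    print(f"{step}: {len(df_before)} -> {len(df_after)} rows ({pct:.1f}% lost)")

df = pd.DataFrame({"irradiance": [0.0, 350.0, np.nan, -5.0, 420.0]})

step1 = df.dropna()
report_step(df, step1, "missing removal")          # 5 -> 4 rows (20.0% lost)

step2 = step1[step1["irradiance"] >= 0]
report_step(step1, step2, "negative-value removal")  # 4 -> 3 rows (25.0% lost)
```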

Real-world example:

We use the same data as in the previous post to describe a real-world situation. Data from two irradiance sensors and satellite data from the corresponding period are visualized in the following figure as they were acquired (raw data).

Caption: raw data plots. On the top, from left to right: time vs sens_1, time vs sens_2, time vs sate. On the bottom, from left to right: sens_2 vs sens_1, sate vs sens_1, sens_1 vs sens_2, sate vs sens_2. Colored points relate to the same events in all plots.

As you can observe, we highlighted in three different colors (red, green, and purple) three situations where the data are not reasonable and need pre-processing.

The red cluster corresponds to sensor_1 data frozen at 0; looking at the top time-trend plot, these points concentrate in the first part of the dataset. The second group of non-regular events is the one highlighted in green: data where sensor_2 values are zero. Finally, the third interesting group is the one colored in purple: negative irradiance values recorded by sensor_1. With the colored display, the dataset is easier to read and understand. We can then tag the non-regular data using simple mathematical rules (tagging) and proceed to removal (cleaning), as in the following figure (much better-looking now, don't you think?).
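As an illustration of what such simple tagging rules might look like, here is a minimal pandas sketch; the toy values are invented, and only the column names (sens_1, sens_2, sate) come from the figure captions:

```python
# Sketch of the three tagging rules described above, on a hypothetical DataFrame.
import pandas as pd

df = pd.DataFrame({
    "time": pd.date_range("2018-06-01", periods=6, freq="h"),
    "sens_1": [0.0, 0.0, 480.0, -12.0, 510.0, 495.0],
    "sens_2": [430.0, 450.0, 0.0, 470.0, 500.0, 480.0],
    "sate":   [440.0, 455.0, 475.0, 465.0, 505.0, 490.0],
})

# Red cluster: sens_1 frozen at zero (value 0 and no change from the previous sample).
df["tag_red"] = df["sens_1"].eq(0) & df["sens_1"].diff().fillna(0).eq(0)
# Green cluster: sens_2 stuck at zero.
df["tag_green"] = df["sens_2"].eq(0)
# Purple cluster: negative irradiance recorded by sens_1.
df["tag_purple"] = df["sens_1"] < 0

# Cleaning: keep only untagged rows; the tags themselves stay available for reporting.
clean = df.loc[~(df["tag_red"] | df["tag_green"] | df["tag_purple"])]
```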

Caption: preprocessed data plots. On the top, from left to right: time vs sens_1, time vs sens_2, time vs sate. On the bottom, from left to right: sens_2 vs sens_1, sate vs sens_1, sens_1 vs sens_2, sate vs sens_2.

NOTE: tagging those data as unreliable allows us to look up the corresponding datetimes and try to find the reason: maybe on those days there were on-site inspections that caused malfunctions in the data acquisition, and we could contact the people responsible to avoid the same problem next time.

Non-regular data are not just garbage to be removed, and a crisis can also mean an opportunity (危机, even if this translation is not entirely appropriate)!


For the curious customer

At i-EM S.r.l., we think that a long journey starts with a single, smart step; we also know that the devil is in the details, and a good procedure helps to take them into account.

For the keen reader

Some further readings I found interesting (though not as much as my post, sorry for you…).

Author

Fabrizio Ruffini, PhD

Senior Data Scientist at i-EM