Data analysis sixth step: data validation

The road so far

So, you started with a goal, you found some related datasets, preprocessed and processed them, and now you have results.

Are we (really) done? Not at all!

The road ahead

As said in the very first post, the goal of the Validation step is to understand whether your work is reasonable and whether the target performance has been reached. The questions are: How do I validate the results? Should I compare them with similar internal/external activities? Are there lessons learned from the literature?

Thus, there are several questions to keep in mind: which models can be applied successfully in this situation? Is an Artificial Intelligence approach feasible? Is a simpler mathematical method better suited?

Figure 1: Data Analysis main steps: focus on data validation

That is, before the processing step, even before thinking about the model to be used, you need to have a (preliminary) idea of how you will validate it.

Schematically, I suggest three areas of validation.
The most intuitive one is a “statistical area” validation:

You certainly know evaluation metrics such as MAE, bias, relative error, RMSE, MAPE, Kolmogorov–Smirnov tests, confusion-matrix-derived specificity, sensitivity and precision, likelihood ratios, chi-squared tests… they give you a quantitative score of the goodness of your results. There are a lot of them, and you will probably want to use the ones that are typically understood and easily interpreted by the “audience” of your results.
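As a minimal sketch (Python with NumPy, purely illustrative numbers), this is how a few of these scores could be computed from arrays of observed and predicted values; the function and variable names here are hypothetical, not part of any specific library:

```python
import numpy as np

def error_scores(y_true, y_pred):
    """Compute a few common validation metrics (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true  # signed forecast errors

    return {
        "bias": err.mean(),                         # systematic over/under-estimation
        "MAE":  np.abs(err).mean(),                 # mean absolute error
        "RMSE": np.sqrt((err ** 2).mean()),         # penalizes large errors more
        "MAPE": 100 * np.abs(err / y_true).mean(),  # in %, only meaningful if y_true has no zeros
    }

# Toy example: observed vs. predicted values
print(error_scores([10, 12, 15, 9], [11, 10, 15, 12]))
```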

The second is the “systematic area” validation:

Usually, you have a-priori knowledge from the literature or from previous similar works: results, considerations, best practices. You can use this kind of benchmark to compare your results with reasonable expectations. However, especially if you are doing something innovative, there may be no similar results to compare with. In that case, you can always fall back on the final and most powerful validation technique.

The final validation technique is the “cum grano salis” technique:

Look at your results and use your common sense, that is, the gut feeling built from the experience accumulated over years of hard work. If something seems fishy, even if you cannot explain why, an additional check is not a bad idea.

At the end of the validation, hopefully you will have your results and some kind of associated uncertainty. Now the next question is: is this performance good enough to answer my initial question? If not, go on with the fine-tuning step. If you are happy with your performance, go on with the final dissemination and deployment.

Please note: if you have no idea how to validate the results of a particular model, you should change the model, because you will not be able to interpret it, defend it, explain it, or re-use it: you just get results, but you do not really know how or why. So, the next time, you will have no idea whether your old model is going to work, and with what kind of performance.

Lessons learned become tips and tricks!

  1. Be communicative and precise: if you are using the NMAE without night hours as a score function, say so. Otherwise, the people you are talking to (customers, colleagues…) will not be able to tell whether your results are reasonable or not. Typically, everyone will think of their “preferred” score function and use it for comparison. You will be misunderstood, and you will have to repeat everything again.
  2. Consider who you are talking to: if they are accustomed to using a specific score function (let’s say, the relative error), they will have a hard time comparing it with a different one (let’s say, the MAE) to understand whether your results are good or not so good.
  3. Remember the difference between errors and their meaning. For example, from a user’s point of view, the mean absolute error (MAE) is appropriate for applications with linear cost functions, i.e., when the costs caused by a wrong forecast are proportional to the forecast error. RMSE is more sensitive to large forecast errors and hence suitable for applications where small errors are more tolerable and larger errors cause disproportionately high costs, which is the case for many applications in the energy market and for grid management issues [1] (see the short sketch after this list).
  4. If you are comparing two approaches: are the datasets used exactly the same? If not, do I understand the reason why?
  5. Remember: some misunderstandings can be very relevant, as in the figure below!
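Coming back to point 3, here is a toy illustration (plain Python/NumPy, made-up error values) of why MAE and RMSE tell different stories: two error series with the same MAE can have very different RMSE when one of them contains a single large error.

```python
import numpy as np

# Two hypothetical forecast-error series with the same MAE:
# "spread out" has only small errors, "spiky" concentrates them in one large miss.
spread_out = np.array([2.0, 2.0, 2.0, 2.0])  # MAE = 2.0, RMSE = 2.0
spiky      = np.array([0.0, 0.0, 0.0, 8.0])  # MAE = 2.0, RMSE = 4.0

for name, err in [("spread out", spread_out), ("spiky", spiky)]:
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    print(f"{name}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

If large errors are disproportionately costly, as in grid management, the “spiky” series is the worse one, and only the RMSE makes that visible.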

Real-world example

As an example, let us take the Performance Ratio (PR) calculation for your photovoltaic plants. The model is quite simple, it is just basic math, and everybody knows that the higher the PR, the better. The question is: what value would you expect from your power plant? For example, you could take the previous year’s PR and compare it with the current one, if the weather conditions were similar. If the difference is much greater than the expected ~1% yearly decrease, you could consider investigating deeper. Or you could compare your PR with some external reference; in this way, you would be able to tell whether there is a large problem in your internal data processing. For example, in Italy you could take PR values from the “Gestore Servizi Energetici” (GSE), as in the following table, and compare them with yours (see the interactive plot for an example of i-EM monitored PV plants). Again, if the discrepancy between your values and the GSE values is too large, there could be a problem somewhere.

Figure 2: 2017 Performance ratio analysis of Italian PV power plants, source: GSE
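For readers who want to reproduce this kind of check, here is a minimal sketch of the usual PR formula (actual yield divided by reference yield); the plant numbers and the alert threshold are purely hypothetical and are not GSE figures:

```python
def performance_ratio(energy_kwh, nominal_power_kwp, poa_irradiation_kwh_m2, g_stc_kw_m2=1.0):
    """Performance Ratio = actual yield / reference yield.

    energy_kwh             : energy produced in the period [kWh]
    nominal_power_kwp      : plant nominal power at STC [kWp]
    poa_irradiation_kwh_m2 : plane-of-array irradiation in the period [kWh/m2]
    """
    reference_yield = poa_irradiation_kwh_m2 / g_stc_kw_m2  # equivalent sun hours
    actual_yield = energy_kwh / nominal_power_kwp           # kWh produced per kWp installed
    return actual_yield / reference_yield

# Hypothetical yearly numbers for a 1 MWp plant
pr = performance_ratio(energy_kwh=1_280_000,
                       nominal_power_kwp=1_000,
                       poa_irradiation_kwh_m2=1_600)
print(f"PR = {pr:.2f}")  # 0.80 with these made-up inputs

# Illustrative sanity check against last year's value
pr_last_year = 0.81
if abs(pr - pr_last_year) > 0.02:  # threshold is arbitrary, chosen for the example
    print("PR changed noticeably: worth a deeper look")
```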

Notes

[1] M. Sengupta et al., “Best Practices Handbook for the Collection and Use of Solar Resource Data for Solar Energy Applications: Second Edition”, NREL Technical Report NREL/TP-5D00-68886, December 2017.


For the curious customer

At i-EM S.r.l., we think that a long journey starts with a single, smart step; we also know that the devil is in the details, and a good procedure can help take them into account.

For the keen reader

Some further readings I found interesting (though not as much as my post, sorry…)

Author

Fabrizio Ruffini, PhD

Senior Data Scientist at i-EM