“Limitations” of Data

Hi everyone 🙂 Now that my CS midterm is finally over, I can finally catch my breath and post my thoughts on Meaningful Indicator for this week 😛

To start things off, I shall introduce the most interesting data set of all time, which is none other than one about grades, something we can all relate to.

When I was first introduced to this data set, the color coding seemed to send an obvious signal, yet the score listed under the “Total” row exerted an irresistible force that pulled my eye towards it.

To me, a data set is similar to a car, with the metrics resembling the individual car parts. With that in mind, I analyzed this data set and found that things aren't as obvious as they first appear.

Before we consider the “limitations” of the current data set, let's consider its utility first. In terms of utility, the color coding paints a clear picture at both extreme ends of the spectrum. As usual, the intermediate part of the spectrum is hard to differentiate (going by my QR experience).

As for the value of differentiating, I would say red holds more value than green, as it seems to show a clear direction for improvement. But what I am more excited about (or maybe not) is cleaning this particular data set by introducing new variables, collapsing existing ones, and perhaps merging some of the variables that, from my perspective, may not be that meaningful.
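As a rough illustration of those cleaning moves (merging two variables, collapsing one into bands, and dropping one that adds nothing), here is a minimal Python sketch. The column names, thresholds and scores are all made up for illustration; the real data set may look quite different.

```python
# Hypothetical rows: student labels, columns and scores are invented for illustration.
rows = [
    {"student": "A", "quiz1": 8, "quiz2": 6, "essay": 70, "attendance": 10},
    {"student": "B", "quiz1": 4, "quiz2": 5, "essay": 55, "attendance": 10},
    {"student": "C", "quiz1": 9, "quiz2": 9, "essay": 88, "attendance": 10},
]


def essay_band(mark):
    """Collapse a 0-100 essay mark into three coarse bands (assumed cut-offs)."""
    if mark >= 80:
        return "high"
    if mark >= 60:
        return "mid"
    return "low"


def clean(row):
    """Return a cleaned copy of one row."""
    return {
        "student": row["student"],
        # Merge the two quizzes into a single new variable.
        "quiz_total": row["quiz1"] + row["quiz2"],
        # Collapse the detailed essay mark into a band.
        "essay_band": essay_band(row["essay"]),
        # "attendance" is dropped: it is identical for everyone here,
        # so it cannot differentiate anybody.
    }


cleaned = [clean(r) for r in rows]
for r in cleaned:
    print(r)
```

Whether a variable is worth dropping or merging is still a judgment call; the code only makes that decision explicit so it can be discussed.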

It seems that this inquiry module builds on what we learned in our QR module. Initially, it honed our skills in analyzing data, asking critical questions and making data visualizations. After that, it transitioned to collecting data on more “foreign” topics such as meritocracy. Now it seems to be presenting the opportunity not only to clean data but to combine data collection and data cleaning to make the data set more “meaningful”.

Before I set off to clean the data with my peers, I would like to reiterate one more key takeaway from today's class: when in doubt, always define first. Just as we used Reddit as a platform to gauge public acceptance and feedback through upvotes and comments, I feel that to classify this data set as meaningful, perhaps it should be aired to the class, with its meaningfulness gauged by how well the class accepts it.

Furthermore, to me the “perfect” data set is one that not only differentiates the extremes but also lets the viewer separate the intermediate portion clearly, such as who is “better” or scores “higher” than the other. That may be a bit hard, though, as a variety of metrics have to be weighed and factored into this “black hole” equation before it can happen.
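One simple way to picture that “black hole” equation is a weighted composite score: scale every metric to the same 0 to 1 range, multiply by a weight, and add them up. The sketch below is only an assumption about how it could be done; the metric names, maximum marks, weights and scores are hypothetical and not taken from the actual data set.

```python
# Hypothetical metrics, maximum marks and weights; none come from the real data set.
MAX_MARKS = {"quiz": 20, "essay": 100, "project": 50}
WEIGHTS = {"quiz": 0.2, "essay": 0.5, "project": 0.3}  # chosen so they sum to 1

# Three made-up students who all sit in the "intermediate" part of the spectrum.
students = {
    "A": {"quiz": 14, "essay": 70, "project": 35},
    "B": {"quiz": 16, "essay": 65, "project": 40},
    "C": {"quiz": 12, "essay": 72, "project": 30},
}


def composite(scores):
    """Weighted sum of each metric after scaling it to the 0-1 range."""
    return sum(WEIGHTS[m] * scores[m] / MAX_MARKS[m] for m in WEIGHTS)


# Rank students from highest to lowest composite score.
ranking = sorted(students, key=lambda s: composite(students[s]), reverse=True)
for name in ranking:
    print(name, round(composite(students[name]), 3))
```

Changing the weights can change the ranking, which is exactly why the weights need to be defined and agreed on first.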

For my next Medium post, I shall be analyzing the meritocracy data set which my group collected over the span of 2 weeks. (It should be done soon, hopefully; we are trying to increase the sample size.) After that, I am going to dig into the cleaning of this midterm data set. Stay tuned, and all the best to those who still have midterms next week 🙂 [Including myself, of course, last paper next Friday 🙁 ]

