DSC Weekly Digest 12 October 2021

Facebook, Social Media, and Jumping Sharks

Announcements
  • Build statistical and analytical expertise as well as the management and leadership skills necessary to implement high-level, data-driven decisions in Northwestern’s Online MS in Data Science. Earn your degree entirely online in classes that are led by industry experts who are redefining how data is used to boost efficiency and effectiveness in a wide range of fields.

  • Get to know TIBCO’s enterprise analytics platform that allows data scientists and business users to collaborate on advanced analytics using massively scalable in-database and in-cluster processing.



Spooky Scary Data Science Skeletons

October is the spookiest time of the year, when the ghosts and witches are out in force, there’s a chill in the air as gray clouds gather, and pumpkin-flavored, well, just about anything anymore seems ubiquitous. I blame a particular Seattle coffee chain for the last one, but there’s something about moving into Fall that focuses one’s mind on the spooky scary skeletons lurking underneath the bed.

In the realm of the data scientist, there are more than a few skeletons hiding in the closets as well. These are the things that keep analysts up at night, and no matter how well prepared you may be, these jump scares are enough to send anyone screaming.

Data Quality Demons. The business manager assured you that their data’s great and has everything you could ever need. Yet when you pry the lid off the coffin and stare at the mouldering remains of software projects past, you get the creeping sensation that perhaps the manager was a bit … optimistic … in their estimates. Inconsistencies in spelling, the use of arbitrary placeholders, lists of items stored as single strings, differing date and currency conventions, and data type errors can usually be dispelled with intelligent analysis software. The bigger demons come about due to cardinality misunderstandings, a failure to account for change in data over time, duplications with subsequent edits creating phantom information, and similar errors that can be difficult to catch and even harder to fix.
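
The smaller demons in that list (placeholders, lists stored as single strings, mixed date conventions) are the ones that routine cleaning code can usually handle. The following is a minimal sketch in Python using only the standard library. The placeholder tokens, the separator, and the date layouts are all hypothetical assumptions for illustration; a real dataset will have its own.

```python
from datetime import date, datetime

# Hypothetical placeholder tokens; real systems will have their own.
PLACEHOLDERS = {"", "n/a", "na", "none", "null", "unknown", "-", "tbd"}

# Hypothetical date layouts, tried in order. Note that a value such as
# "03/04/2021" is ambiguous between US and European conventions; this
# sketch assumes US month-first order for slash-separated dates.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def clean_value(raw):
    """Return None for placeholder tokens, else the stripped string."""
    if raw is None:
        return None
    text = raw.strip()
    return None if text.lower() in PLACEHOLDERS else text


def split_list(raw, sep=";"):
    """Split a list stored as one string into cleaned, non-empty items."""
    text = clean_value(raw)
    if text is None:
        return []
    items = (clean_value(part) for part in text.split(sep))
    return [item for item in items if item is not None]


def parse_date(raw):
    """Try each known layout in turn; return None if none of them match."""
    text = clean_value(raw)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Returning None rather than guessing keeps unparseable values visible for later review. Nothing here addresses the bigger demons (cardinality, change over time, phantom duplicates), which generally need knowledge of how the data was produced.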

Sparse Metadata Monsters. These are more subtle issues having to do with data that was collected primarily to facilitate fast transactions, at the expense of carrying only minimal metadata about those transactions. The missing metadata includes dimensional units (length, currency, and count units, such as three books not being the same as three cars), the time over which a certain entity exists within the system, the provenance of the data (who entered it, why they entered it, how valid it is, where the source of record for that data is), and so on. This metadata often determines the reliability of the data.
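
One way to picture what that metadata buys you is a record that carries its unit, validity period, and provenance along with the value. The sketch below is only illustrative: the class name Quantity, its field names, and the add_quantities helper are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """A value together with the metadata needed to interpret it."""
    value: float
    unit: str                 # e.g. "book", "car", "USD", "m"
    entered_by: str           # provenance: who recorded the value
    source_of_record: str     # provenance: the authoritative system
    valid_from: date          # start of the period the value applies to
    valid_to: Optional[date] = None  # None means still current

    def is_valid_on(self, day):
        """True if the value applies on the given day."""
        if day < self.valid_from:
            return False
        return self.valid_to is None or day <= self.valid_to


def add_quantities(a, b):
    """Add two values, refusing to mix units (three books + three cars)."""
    if a.unit != b.unit:
        raise ValueError(f"cannot add {a.unit!r} to {b.unit!r}")
    return a.value + b.value, a.unit
```

With the unit stored next to the value, adding three books to three cars fails loudly instead of quietly producing a six that means nothing.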

Modeling Mayhem. A recent preprint article about COVID-19 vaccine efficacies in Wisconsin made a modeling assumption about the number of people who had been vaccinated in the state. It turned out that the number was off by a factor of 100, and what had seemed like a strong statistical case against the vaccine became instead a strong case for the vaccine. These kinds of modeling errors can break careers.

Bias Boggarts. Sampling by its very nature can be fraught with gotchas. Is the sample representative of the overall population? What hidden assumptions were made about the questions being asked or the means by which the information is gathered? For a long time, surveys were conducted over landlines, until a statistician realized that a growing number of people had abandoned them in favor of mobile phones, and those who were left were older, more conservative, and likely wealthier, skewing everything from product marketing to politics.
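
A common partial remedy for this kind of skew is to reweight each group by its share of the population rather than its share of the sample, often called post-stratification. This is a swapped-in illustration, not a method described above, and the groups and numbers in the sketch are invented purely to show the arithmetic.

```python
def weighted_estimate(group_means, shares):
    """Average the per-group means, weighting each group by its share."""
    if set(group_means) != set(shares):
        raise ValueError("group_means and shares must cover the same groups")
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    return sum(group_means[g] * shares[g] for g in group_means) / total


# Hypothetical survey: fraction in each age group answering "yes".
group_means = {"under_50": 0.40, "50_plus": 0.60}

# Hypothetical shares: the landline sample over-represents older people.
sample_shares = {"under_50": 0.20, "50_plus": 0.80}
population_shares = {"under_50": 0.60, "50_plus": 0.40}

naive = weighted_estimate(group_means, sample_shares)
adjusted = weighted_estimate(group_means, population_shares)
```

With these made-up numbers the naive estimate is about 0.56 and the adjusted one about 0.48, enough to flip a majority. Reweighting only corrects for the groups you thought to measure; it does nothing about differences within a group between people who answer a landline and people who do not.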

Interpretation Imps. Having created a model and run the data, ultimately the question is how to interpret the results, and it is here that the imps of the perverse delight in ruining a data scientist’s day. Are the conclusions supported by the analysis? Is it possible that those who have commissioned the analysis will ignore all of the caveats about probabilities and will treat the results as absolute statements? (Yes.) Will people justify their own agendas based upon your conclusions, even when the conclusions do not support those agendas at all? Oh, definitely.

Data Science can be fun and exciting, but it can also be filled with deadly traps and snarling beasts. Sometimes the best that you can do is to be aware of all the goblins and ghoulies, and of course, read Data Science Central.

Goodnight, sleep tight … don’t let the bedbugs bite!

Kurt Cagle
Community Editor,
Data Science Central

To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free! 


Data Science Central Editorial Calendar

DSC is looking for editorial content specifically in these areas for October, with these topics having higher priority than other incoming articles.

  • AI-Enabled Hardware
  • Knowledge Graphs
  • Metaverse
  • JavaScript and AI
  • GANs and Simulations
  • ML in Weather Forecasting
  • UI, UX and AI
  • GNNs and LNNs
  • Digital Twins


This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.


Copyright 2021 TechTarget, Inc. All rights reserved. Designated trademarks, brands, logos and service marks are the property of their respective owners.


Source: Data Science Central
