Correlation & Causality: Why You Should Make Them Known When Visualizing Data
by Glenn Stegall, Systems Engineer at IOP
In a world full of information, where the transfer of data from one business to another needs to be swift and fluid, responding rapidly to emerging problems or spikes in traffic is key. How do we view, and then display, data? More specifically, how does a client see the data that we give them?
Generally, data visualization can be boiled down to three questions:
1. Correlation/Causality. How will the data be used?
2. Interpretation. What is the expertise of the person observing the data?
3. Presentation. Which format most clearly displays the data and its meaning?
This post addresses the first of these three questions: how the data itself will be used in practice. Simply possessing information serves no practical purpose. Data must “speak” in some way, and the way it is visualized can make all the difference.
In statistics, it is a fundamental principle that correlation does not necessarily imply causation. Graphs and statistical summaries may show that two or three points of observation are related to each other, but that alone does not establish that one causes the other. An astute observer learns to factor out personal bias and look beyond what may appear obvious to many in order to determine the actual cause behind a given segment of data, a task that can take some digging combined with out-of-the-box thinking.
Why This Matters
Data visualization is fundamentally a statistical subject. It may not go to such depths as computing α’s or σ’s, but the way data is presented to the client should follow directly from how it will be used. And when creating a chart or graph that can be updated in near real time, it is important for the user to understand quickly how A influences C, as well as why.
To illustrate this point, let us examine a hypothetical example of the sale of a newly released gaming console called PlayUrWay 5.
Assume you have a business that specializes in general electronics and entertainment, such as televisions, projectors, computers, and consoles. After a lot of hard work and messy contractual obligations, you finally manage to be one of the first businesses to sell this latest gaming rig! It is indeed a big deal and a major step toward remaining relevant in the modern era of online electronics sellers.
The day you have been anxiously awaiting comes, and your online portal is open to the public. Much to your surprise, all fifty thousand units of the console you have in inventory sell out in five minutes! Demand shot straight past your wildest expectations, and your hard work paid off! Later that day, you contact the maker of the PlayUrWay 5 to get more consoles on order, but much to your chagrin, your request is rejected. When you ask why, the company directs you to your own website. You hesitantly check the page where you sold out all fifty thousand PlayUrWay 5 units and scroll down to the reviews, where you find that the product has thousands of 1-star ratings from angry customers whose orders were rejected even after they had received confirmations. Most of these customers are so upset that they vow never to return to your store. Sales were great on this first day, but in the future, your business might very well be ruined. What happened here?
This is a case of a practice known as “scalping.” It is a problem that can be avoided through either preparation or real-time observation of website traffic.
Scalping is the practice of resellers purchasing an item before a normal customer can, with the express intent of buying up all available quantities of that item, only to sell it again, publicly, at a significantly higher price. For popular items, such as the latest console, “bots” are deployed to purchase an item from a website the exact moment it becomes available. This means the average customer cannot possibly order the product before it is sold out, as they are competing with computers placing the same orders almost instantly.
What does this have to do with data visualization, much less causation or correlation? Everything. A business owner might correlate rapid sales with high demand and good profit return. However, in this example, with good data visualization, a technician who is watching web traffic in real time could easily see a causal pattern emerging, as 98% of purchases for the popular PlayUrWay 5 console would be coming from a small number of IP addresses. This is, in fact, a data problem that can be solved with proper analytics and/or a proactive response.
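As a rough illustration of the pattern such a technician would be looking for, the sketch below computes what fraction of purchases comes from the busiest few IP addresses. It is a minimal example rather than a production detector: the function name top_ip_share, the sample addresses, and the choice of the top two addresses are all assumptions made for illustration.

```python
from collections import Counter


def top_ip_share(purchase_ips, top_n=5):
    """Return the fraction of purchases made by the top_n busiest IPs."""
    if not purchase_ips:
        return 0.0
    counts = Counter(purchase_ips)
    # Sum the purchase counts of the top_n most frequent addresses.
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(purchase_ips)


# Hypothetical purchase log: 98 orders from two addresses, 2 from ordinary shoppers.
sample_ips = (
    ["203.0.113.7"] * 60
    + ["203.0.113.8"] * 38
    + ["198.51.100.21", "192.0.2.55"]
)

share = top_ip_share(sample_ips, top_n=2)
print(f"Top 2 IPs account for {share:.0%} of purchases")
```

Plotted on a live dashboard, a number like this climbing toward 100% is the kind of signal that separates genuine demand from a handful of machines buying everything.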
Bots present a serious problem for online portals, not just because a select few people can use computers to purchase products en masse nearly instantaneously, but also because they clog web traffic for that company and create dangerous scenarios. Let us explore the sale of PlayUrWay 5 a little more in depth.
Customer Nicole tries to purchase the PlayUrWay 5 for her son. It is shown to be in stock. Nicole enters her credit card information, and the information is validated. As the purchase is being certified and a confirmation email is generated for her account, the PlayUrWay 5 inventory suddenly registers as zero. In the time it took her to complete an order via the round trip involving the credit card company, the inventory warehouse, and the purchase portal, several computers ordered ten thousand units of the PlayUrWay 5 console to hundreds of similar addresses and P.O. boxes. Now, Nicole has an order confirmation for a product she will never receive. Sometimes, systems can catch this in real time and notify the customer that the product is no longer available before the transaction has been completed. Other times, the customer may receive an email update after the fact, informing them that the product is unfortunately out of stock and that their order cannot be fulfilled.
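One common way to avoid confirming an order that cannot be fulfilled is to reserve stock before charging the card and to release it if payment fails. The sketch below shows that idea for a single process, under stated assumptions: the Inventory class, the checkout function, and the charge_card callback are hypothetical names, and a real store would need the same guarantee from its database or inventory service rather than from an in-memory lock.

```python
import threading


class Inventory:
    """In-memory stock counter guarded by a lock (illustrative only)."""

    def __init__(self, units):
        self._units = units
        self._lock = threading.Lock()

    def reserve(self, quantity=1):
        """Atomically hold stock; return True only if enough units remained."""
        with self._lock:
            if self._units >= quantity:
                self._units -= quantity
                return True
            return False

    def release(self, quantity=1):
        """Return previously reserved units to stock."""
        with self._lock:
            self._units += quantity

    @property
    def available(self):
        with self._lock:
            return self._units


def checkout(inventory, charge_card):
    """Reserve first, charge second, so a confirmation always has stock behind it."""
    if not inventory.reserve():
        return "out_of_stock"
    if not charge_card():
        # Payment failed: put the unit back for the next customer.
        inventory.release()
        return "payment_failed"
    return "confirmed"


# Hypothetical usage: one unit left and two customers.
stock = Inventory(units=1)
print(checkout(stock, charge_card=lambda: True))
print(checkout(stock, charge_card=lambda: True))
```

With this ordering, the second customer is told the item is out of stock before any payment or confirmation email is generated.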
This company does not appear to have a system for automatically detecting and rejecting automated purchases from scalpers, but with proper visualization, a technician can easily figure out several things:
A. There are thousands of individual purchases coming from similar IP addresses in the same time frame.
B. These accounts are all using similar generically scrambled names that could not be real users.
C. Most web traffic successfully completing an order in less than ten seconds from start to finish is likely not human.
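The three observations above can be turned into simple flags that a dashboard could highlight. The sketch below is one possible way to do so, assuming all orders come from the same time window. Every name and threshold in it is an assumption for illustration: "similar IP addresses" is approximated by a shared first three octets, "scrambled" names by a crude vowel and digit ratio, and only the ten-second cutoff comes from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Order:
    account_name: str
    ip: str
    checkout_seconds: float


def ip_prefix(ip):
    """Group IPv4 addresses by their first three octets (assumed notion of 'similar')."""
    return ".".join(ip.split(".")[:3])


def looks_scrambled(name):
    """Crude heuristic: very few vowels or many digits suggests a generated name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return True
    vowels = sum(ch in "aeiou" for ch in letters)
    digits = sum(ch.isdigit() for ch in name)
    return vowels / len(letters) < 0.2 or digits / len(name) > 0.3


def flag_orders(orders, min_cluster=3, max_human_seconds=10.0):
    """Return (account_name, signals) for orders matching at least two signals."""
    cluster_sizes = Counter(ip_prefix(o.ip) for o in orders)
    flagged = []
    for o in orders:
        signals = []
        if cluster_sizes[ip_prefix(o.ip)] >= min_cluster:
            signals.append("ip_cluster")        # observation A
        if looks_scrambled(o.account_name):
            signals.append("scrambled_name")    # observation B
        if o.checkout_seconds < max_human_seconds:
            signals.append("fast_checkout")     # observation C
        if len(signals) >= 2:
            flagged.append((o.account_name, signals))
    return flagged


# Hypothetical orders from one time window.
orders = [
    Order("xk7qz9pl", "203.0.113.7", 2.1),
    Order("qwrt8821zx", "203.0.113.8", 1.8),
    Order("zzp4k2mn", "203.0.113.9", 2.4),
    Order("nicole.r", "198.51.100.21", 95.0),
]

for name, signals in flag_orders(orders):
    print(name, signals)
```

Requiring two signals rather than one is a deliberate choice here: a single fast checkout or an unusual username on its own could easily belong to a real customer.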
Knowing this information, a technician or IT department can then take steps to mitigate abuse of the company’s web services by restricting automated activity with bot detectors, CAPTCHAs, or enforced security checks, such as requiring a valid email account to initiate the purchase. Without the ability to see the data in real time, none of these adaptive responses is possible.
Good data visualization allows this type of malicious activity to be seen and understood. In this case, it properly displays the amount of web traffic, the type of web traffic, and the details of the accounts associated with that traffic and their subsequent purchases. Using data visualization techniques, a business can mitigate an issue in real time and perform damage control after the fact, protecting its reputation from permanent damage.