Bethan Blakeley gets back to basics when it comes to data analysis
Most of us, by now, will have heard of the infamous “data lake”. Another annoying piece of jargon being thrown about by data scientists, data analysts, business intelligence analysts, data magicians, or whatever other crazy job titles you might have come across.
It seems like we’re always talking in metaphors nowadays, and the data lake is no exception – if you’re dealing with anything vaguely resembling data, you’ll struggle to avoid it. Personally, I don’t like it.
Firstly, it’s completely unhelpful for those who don’t work with data or haven’t come across it before – if you’re one of those people, it essentially means “a heck of a lot of data” (enough to fill a lake I guess!). Secondly, I think it conveys a desperate image of a huge expanse of water, not being able to see any land, being stuck, lost, and forlorn in a never-ending sea of doubts. If you ask me, this isn’t the ideal picture we want to create of our industry, the people working in it, or the problems we’re solving and the solutions we’re creating.
I’ve always felt as though my personal opinions here have never quite aligned with the world’s in general (anybody who knows me will not be shocked at this in the slightest). I find the more time passes, the more complicated our industry becomes to the innocent bystander. Suddenly, what was just plain old, boring statistics becomes a new, shiny, “trendy” field of data science. We start throwing words around that essentially don’t mean a whole lot: machine learning and artificial intelligence being two of the biggest offenders.
Don’t get me wrong, I know there are a lot of people who are incredibly skilled in these areas. I’m not saying they don’t exist, or are pointless, or should be ignored. They shouldn’t. But they should be made more accessible, less intimidating, and easier to understand for Average Joe (or Josephine).
With that in mind, I have some tips that might help, no matter what your level of understanding. I may not be able to teach you how to “navigate the ever-expanding and terrifying death trap of the data lake”, but everyone loves a bit of doggy paddle whilst they’re learning to swim, right?
Paddle with a Purpose
Don’t set off on your analysis quest without an objective in mind. If you’re not looking for anything in particular, the likelihood is you won’t find anything in particular. Make sure you have some questions at the forefront of what you’re doing, and keep bringing yourself back to these questions.
Clean up your act
No matter what sort of data you’re working with, how long it’s been around, or where you got it from – it could probably do with a good clean. The cleaning process is likely to change according to your different questions, too. Is your currency £s, $s, €s or something else? Do you need to convert these? How many NAs do you have? Is an NA a valid response? In some cases, you may want to replace an NA with a zero. In others, that would skew your results and change your data.
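To make the NA question concrete, here is a minimal Python sketch. The exchange rates, the `parse_amount` helper and the sample values are all made up for illustration; real rates and real missing-value markers will differ.

```python
from typing import Optional

# Hypothetical fixed conversion rates to pounds, for illustration only.
RATES_TO_GBP = {"£": 1.0, "$": 0.8, "€": 0.85}


def parse_amount(raw: str) -> Optional[float]:
    """Convert a string such as '$12.50' to pounds; return None for a missing value."""
    text = raw.strip()
    if text in ("", "NA"):
        return None  # keep the gap visible rather than silently writing 0
    symbol, number = text[0], text[1:]
    return round(float(number.replace(",", "")) * RATES_TO_GBP[symbol], 2)


raw_values = ["£10.00", "$12.50", "NA", "€20.00"]
cleaned = [parse_amount(value) for value in raw_values]
missing = sum(1 for value in cleaned if value is None)

# Two different answers, depending on how the NA is treated.
present = [value for value in cleaned if value is not None]
mean_ignoring_na = sum(present) / len(present)
mean_na_as_zero = sum(value or 0.0 for value in cleaned) / len(cleaned)
```

With these invented numbers, treating the NA as zero pulls the average down, which is exactly the kind of skew worth deciding on deliberately rather than by accident.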
Bitesize Chunks
Yes, if you start by opening your dataset of however many million records, you might feel a bit overwhelmed. Start with something smaller and more manageable. There are several ways you can split up your data:
By sampling
Sometimes, it’s as simple as giving yourself a fraction of the original data to begin with. Make sure your data is randomly sorted, and take a smaller selection of it. Use this to test your hypotheses and answer your original questions, and then go back to the bigger dataset to ensure these answers are the same when looking at everything in one go.
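As a rough sketch of what that can look like in Python, the snippet below draws a random 1% sample and compares one simple statistic against the full dataset. The list of numbers is a stand-in for real records, and the seed value is arbitrary.

```python
import random

records = list(range(1, 1_000_001))  # stand-in for a million rows of real data

rng = random.Random(42)  # fixed seed so the same sample can be drawn again later
sample = rng.sample(records, k=10_000)  # 1% of the rows, chosen without replacement

# Answer the question on the sample first, then check it against everything.
sample_mean = sum(sample) / len(sample)
full_mean = sum(records) / len(records)
```

Fixing the seed is a design choice rather than a requirement: it simply means the "smaller selection" is the same one each time the analysis is rerun.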
By objective
I’m a big believer in only taking what you need (unless it’s chocolate, and then always take a bit extra). You may not need all your data points to answer some of your questions. If you decide that your customer’s brother’s dog’s favourite day of the week probably isn’t relevant, it probably isn’t. Get rid of it. Focus on what you need to begin with, and you can always add in more data at a later stage.
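In code, "only taking what you need" can be as small as the sketch below. The column names and rows are invented for the example.

```python
# Hypothetical records; the column names are made up for illustration.
rows = [
    {"customer_id": 1, "order_total": 25.0, "dogs_favourite_day": "Tuesday"},
    {"customer_id": 2, "order_total": 40.0, "dogs_favourite_day": "Sunday"},
]

# Keep only the columns the current question needs.
needed = ["customer_id", "order_total"]
trimmed = [{key: row[key] for key in needed} for row in rows]
```

The original `rows` are left untouched, so anything dropped here can still be added back at a later stage.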
Eyes wide open
You may have heard of EDA (or exploratory data analysis). This is always one of the most useful parts of the process for me. Get to know what’s going on in your dataset. Explore it. Summarise it. Visualise it. Get to know the context of your variables. With your currency variables, is £10 normal, or is £100,000 closer to the average? What does this mean for you, for the data, for the questions you’re asking?
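A first summary does not need to be fancy. The sketch below uses Python's built-in `statistics` module on a handful of invented order values, one of which is deliberately enormous.

```python
import statistics

# Invented order values in pounds; the last one is a deliberate outlier.
order_values = [8.0, 10.0, 12.0, 9.0, 11.0, 100_000.0]

summary = {
    "count": len(order_values),
    "min": min(order_values),
    "median": statistics.median(order_values),
    "mean": statistics.mean(order_values),
    "max": max(order_values),
}
```

Here the median sits close to £10 while the mean is dragged into the thousands by a single value, so "what is normal?" depends on which summary you look at.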
Mind the gap
You’ll often find yourself thinking “I wish I knew…” There is a huge amount of data out there, in the public domain, for anyone to use. Have a look and consider whether any of this additional data could add another dimension. Looking at clothes sales? How much does weather have an impact on this? Analysing customer engagement? Would it help to know the demographics of the customer’s local area?
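Adding an extra dimension usually comes down to matching on something both datasets share. In this sketch, made-up coat sales are matched to made-up temperatures by date; the field names and figures are assumptions for illustration.

```python
# Invented daily sales and temperatures, matched on the date they share.
sales = [
    {"date": "2024-01-01", "coats_sold": 30},
    {"date": "2024-01-02", "coats_sold": 12},
    {"date": "2024-01-03", "coats_sold": 25},
]
weather_by_date = {"2024-01-01": 2.0, "2024-01-02": 11.0}  # degrees Celsius

# Attach the temperature to each sales row; days with no match get None.
enriched = [
    {**row, "temp_c": weather_by_date.get(row["date"])}
    for row in sales
]

# Check what failed to match before trusting the combined data.
unmatched = [row["date"] for row in enriched if row["temp_c"] is None]
```

Keeping a list of unmatched dates is worth the extra line: gaps in the additional data are just another kind of NA.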
Question time
They say common sense is not so common, and in the context of data analysis, they may be right. It’s important to question all your findings. All the time.
Does it make sense? Is it what I expected? Does it answer my original questions? Is it telling me something I can use? Is it useful for the business? Could I explain it to someone else? Would they believe me?
Sometimes, talking someone through the insight can be beneficial – if you’re met with blank stares or confused faces, chances are you may have missed something.
Beware the black box
Automation is everywhere now. But that doesn’t mean we should use it everywhere now. It’s easy to find software and algorithms that can “analyse your data for you, at the click of a button” *insert cheesy sales pitch here*. Please approach these types of applications with caution. You know so much that they don’t. You know more about the data, the context, the analysis process so far. You know and understand your questions. You know what’s logical in the real world and what isn’t. By all means, use these types of solutions to aid your analysis – but don’t blindly rely on them to do everything, no matter what the cheesy sales pitch says.
Document everything
It may seem like a pain when you begin, but I can guarantee after you’ve accidentally misplaced your file, deleted your code, or killed off your laptop, you will be glad you did. Keep a note of all your steps, even if they seem small and insignificant. We’re not talking a publishable novel here; these notes are for your eyes only. Just jot everything down as you’re going; your variable transformations, how you decided to clean your currency, what you did with those NAs. It means if mistakes are made, it’s easy enough to go back to where you were.
And there you have it. My 8-step guide to paddling your way through the data lake, no need for an oar, or a snazzy automated-machine-learning-data-wizardry-speedboat-with-added-AI. Just you, your knowledge, your expertise, some confidence, and some curiosity. And that’s all you need.