The data science toolkit
September 08, 2015
Give a data scientist a text editor and they will get stuff done. Budget for a data scientist’s toolkit and they can rock your world!
As a data scientist I tend to carry my toolkit around with me. My laptop is filled with software and utilities that make it easy to ingest, transform, analyse and visualise any dataset. First and foremost is my text editor. My canvas. My gateway to code. I once uttered these immortal lines in a customer meeting: “Just give me a text editor and I can do anything”. Strong statement, I’ll give you that. But regardless of whether it’s Vim, Notepad++, Sublime or something else, not a day passes when I don’t open up at least one text editor.
Then there is my core processing engine. I have Python and R installed but it’s the new kid, Spark, that’s taking all my attention right now – more on this in later posts. I’ve got Octave for hard maths (this gives away that I embraced the excellent machine learning course from Andrew Ng), Weka for machine learning, Tableau and Excel for data visualisation, and the rest of the Microsoft Office suite for communicating my results. Because I work in security, I also carry tools for more specialised analysis: Wireshark for network traffic analysis and a hex editor. The list is endless and, what’s more, it’s so easy to install them on my laptop and get going. And therein lies my problem.
Most security teams don’t have a data scientist, which means that most security departments haven’t got a data scientist toolkit. When I step on site with a customer for the first time I know I am going to have Excel and Notepad (not ++ though!) but beyond that it’s very likely that the rest of my toolbox is empty. How do I query the Qualys API to pull down vulnerability data for analysis? How do I handle badly formatted text fields that include file delimiters? How do I join these two datasets together? (VLOOKUP – really?!) How do I build my feature set? How do I analyse last month’s proxy logs? (For reasons to do so, see my last blog post.) Where am I processing this data – locally? You get the picture. Security teams aren’t set up for data scientists and security policies are definitely not set up to allow a data science contractor to just install any old software.
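To make two of those questions concrete, here is a minimal sketch of what even a bare Python install gives you: reading text fields that contain the delimiter, and joining two datasets on a shared key. The sample data, the column names and the read_rows and left_join helpers are all made up for illustration; real exports will look different and will need more care.

```python
import csv
import io

# Hypothetical sample data; column names and values are assumptions for illustration.
VULNS_CSV = '''host,severity,title
web01,5,"Buffer overflow in foo, bar and baz"
db01,3,"Weak cipher suites enabled"
web01,2,"Banner discloses version"
'''

ASSETS_CSV = '''host,owner
web01,Web team
db01,Database team
'''


def read_rows(text):
    """Parse CSV text into a list of dicts; quoted fields may contain commas."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(left, right, key):
    """Add the columns of the matching right-hand row (if any) to each left-hand row."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key], {})
        # Left-hand values win if both sides share a column name.
        joined.append({**match, **row})
    return joined


vulns = read_rows(VULNS_CSV)
assets = read_rows(ASSETS_CSV)
joined = left_join(vulns, assets, "host")

for row in joined:
    print(row["host"], row.get("owner", "unknown"), row["severity"], row["title"])
```

Nothing clever, but it is already out of reach when the only approved tools are Excel and Notepad.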
So as that time of year comes round when those who hold the purse strings start asking what money you need for next year, think about the poor data scientist who might join your team. Give them more than a fish; give them the means to fish with, and they will be far more productive.
Think about the environment in which the data analysis will be conducted. Can you set up a data lab? Start small; a simple VM would probably do the trick. What’s your policy on cloud computing? Is AWS an option? Get comfortable with the processes and policies surrounding the analysis environment which will be hosting security data. Also, think about scalability – you don’t want to have to revisit this every three months.
Then for the software. Most organisations use data analysis packages (SAS, for example) somewhere in the organisation, but are these right for security missions? If using open source, make sure you understand the implications and properly capture the cost of supporting the software. Consider the enterprise versions (Revolution Analytics for R and Continuum for Python are some options) and what you get for your money. Think through the process a user has to follow to install additional libraries and packages to unlock new insight.
Data science for security still needs to be budgeted for, supported and managed just like any other part of the business. Recognise this, and let your data scientists start delivering insight.
Good luck!