Blog posts

2021

Improving Machine Learning Outcomes

less than 1 minute read

Published: July 15, 2021

Improving Machine Learning Outcomes: Focusing on Framing, Timing, and Targets

2020

textplainer : Intuitive explanations for text based machine learning models

less than 1 minute read

Published: October 10, 2020

Text data is a powerful and useful source of signal in machine learning systems. There is a set of standard approaches to preparing raw text for machine learning and statistics. We routinely use bag-of-words and n-gram models, TF-IDF, topic modelling, or word embeddings derived from one of the neural network language models such as Word2Vec, GloVe or fastText. All of these approaches have proven themselves as effective text pre-processing techniques that require no domain knowledge and are widely applicable.
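As a rough illustration of the pre-processing techniques listed above, here is a minimal sketch of n-gram extraction and TF-IDF weighting using only the Python standard library. The function names (`tokenize`, `ngrams`, `tf_idf`) are hypothetical and chosen for this example, the tokenizer is deliberately naive, and the weighting is the plain textbook form (term frequency times log inverse document frequency), not the exact variant any particular library uses.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def ngrams(tokens, n):
    # Contiguous runs of n tokens, joined back into strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    # docs: list of strings. Returns one {term: weight} dict per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat down"]
weights = tf_idf(docs)
bigrams = ngrams(tokenize(docs[0]), 2)
```

Terms that occur in every document (here "the" and "sat") receive a weight of zero, while terms unique to one document are weighted up, which is the intuition behind TF-IDF.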

texturizer : Exploring diverse text derived features for machine learning

3 minute read

Published: September 06, 2020

Text data is a fascinating source of information for data scientists. It can betray subtle clues as to the mood, motives and behaviours of people, in both conscious and unconscious expressions. We can extract text from a wide variety of sources: internal documents, email records, web forms, social media posts, and even the text descriptions from financial transactions.

dfsummarizer : A command line application for summarizing data frames

less than 1 minute read

Published: July 03, 2020

Summarizing data is one of those small tasks that data scientists and analysts need to do routinely. However, we often need to write bespoke scripts to get exactly what we want, coping with missing values and assorted data types. We then need to go through a tedious process to format it for sharing or publication.
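To make the kind of bespoke script described above concrete, here is a minimal sketch of a column summary that copes with missing values and mixed data types. It is not the dfsummarizer implementation; the `summarize` function, the use of `None` to mark missing values, and the output keys are all assumptions made for this example.

```python
from statistics import mean


def summarize(rows):
    # rows: list of dicts, one per record; None marks a missing value (assumption).
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # bool is a subclass of int in Python, so exclude it explicitly.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        info = {
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        if present and len(numeric) == len(present):
            info.update(
                type="numeric",
                min=min(numeric),
                max=max(numeric),
                mean=mean(numeric),
            )
        else:
            info["type"] = "text"
        summary[col] = info
    return summary


rows = [
    {"age": 31, "city": "Sydney"},
    {"age": None, "city": "Perth"},
    {"age": 45, "city": "Sydney"},
]
report = summarize(rows)
```

Each column gets a count of missing and unique values, and numeric columns additionally get a minimum, maximum and mean computed over the non-missing values only.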

2018

Will your job be automated out of existence by AI?

less than 1 minute read

Published: December 08, 2018

Are you feeling anxious about whether your career is in danger of being automated out of existence?

2017

Why all scientists are not data scientists

less than 1 minute read

Published: November 03, 2017

There is a meme you will see floating around the internet that comes in many forms; one version is shown in the header image above. It is part of the vague internet resistance to this new occupation. The response is somewhat justified: Data Scientist is a job title that requires no specific qualification and garners differing opinions on what the core skill set is.

Mean Imputation in Apache Spark

less than 1 minute read

Published: September 26, 2017

If you are interested in building predictive models on Big Data, then there is a good chance you are looking to use Apache Spark, either with MLlib or with one of the growing number of machine learning extensions built to work with Spark, such as Elephas, which lets you use Keras and Spark together.

John Hawkins
