In order to build successful machine learning solutions, there are certain fundamental ideas that everyone involved needs to understand. In this blog post, we look at three key early stages of the design process that managers can focus on to ensure that the project is headed toward a successful outcome.
However, most of these approaches are relatively opaque when it comes to explaining what is driving the performance or outputs of the models. Text data is less amenable to SHAP values for local feature explanations, and there is no intuitive way to do permutation-based feature importance correctly.
With this package we will explore methods of understanding the sub-structures of text that drive predictive performance.
There is a set of very standard approaches for preparing raw text for machine learning and statistics: from bag-of-words and n-gram models, through TF-IDF and topic modelling, up to the various flavours of neural-network-derived word embeddings like Word2Vec, GloVe, fastText or BERT. This toolset provides us with a powerful set of options for feature engineering that require no domain knowledge and are widely applicable.
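As a minimal illustration of the first two of these approaches, bag-of-words and n-gram features can be sketched with nothing more than the standard library (the tokeniser here is deliberately naive, purely for the sake of the example):

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokeniser: lowercase and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    # Map each word to its count in the document.
    return Counter(tokenize(text))

def ngrams(text, n=2):
    # Count contiguous n-word sequences (bigrams by default).
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

doc = "the cat sat on the mat"
print(bag_of_words(doc)["the"])      # 2
print(ngrams(doc)[("the", "cat")])   # 1
```

TF-IDF and the embedding models then build on exactly these token counts, reweighting or replacing them with dense vectors.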
However, there are also a virtually unlimited number of alternative ways to transform text into features for machine learning. Many of these are only appropriate for particular kinds of problems because they involve domain specific feature engineering scripts, and potentially bespoke dictionaries (or embedding models). Discovering what will work can present a daunting task for a time limited project.
To accelerate my own experimentation I have put together a text feature engineering package that will take a tabular dataset and a list of text column names and generate new features, appended to the data as additional columns. The user can choose the types of features to explore through command line switches, and can likewise switch off features that turn out to be ineffective.
The current set of options is:

-topics
Indicators for the presence of words from common topics (e.g. Politics or Family). Note that these indicator words are chosen as unambiguous indicators. In other words it is not a complete vocabulary, but a vocabulary that is specific to the given topic. This switch has an additional option, -topics=count, which counts all word matches from common topics.

-pos
Part-of-speech proportions in the text, using a SpaCy language model to process and tag each word.

-literacy
Checks for common literacy markers such as capitalisation problems or typos.

-traits
Checks for common stylistic elements or traits that suggest personality type, such as pronoun usage.

-rhetoric
Checks for rhetorical devices used for persuasion, for example an appeal to authority.

-profanity
Profanity check flags, including masked profanities and racial slurs.

-sentiment
Sentiment word counts and score, plus a sentiment score from the TextBlob package.

-emoticons
A wide variety of flags for different kinds of emoticons and emoticon sentiment.

-comparison
Cross-column comparisons using a variety of edit distance metrics. Note this is only applicable if you are running the application across multiple columns of text.

Many of the features in the package are derived from custom word lists and regular expressions which users can edit and customise to their heart's content. This means that the package should be easily customisable to make it work for your domain (or an alternative language).
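To illustrate the general word-list-and-regex technique (this is a sketch of the idea, not the package's actual internals), here is how a topic indicator flag and a simple cross-column similarity might be computed. The FAMILY_WORDS list and the column names are invented for the example, and the standard library's SequenceMatcher ratio stands in for the edit distance metrics:

```python
import re
from difflib import SequenceMatcher

# A hypothetical topic word list; imagine an editable list like this per topic.
FAMILY_WORDS = re.compile(r"\b(mother|father|sister|brother|family)\b", re.IGNORECASE)

def add_text_features(rows, col_a, col_b):
    # Append new feature columns to each row, mirroring how generated
    # features are appended to the original tabular data.
    for row in rows:
        row[col_a + "_family_flag"] = int(bool(FAMILY_WORDS.search(row[col_a])))
        row[col_a + "_family_count"] = len(FAMILY_WORDS.findall(row[col_a]))
        # Cross-column comparison: similarity between the two text columns.
        row["similarity"] = SequenceMatcher(None, row[col_a], row[col_b]).ratio()
    return rows

rows = [{"bio": "My mother and father met in Paris",
         "note": "My mother and father met in Rome"}]
add_text_features(rows, "bio", "note")
print(rows[0]["bio_family_count"])  # 2
```

Swapping in your own word lists or patterns is just a matter of editing the regular expressions, which is essentially what the package's customisation amounts to.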
There is a bunch of work to do to make that customisation process easier. I also have a laundry list of additional features to add, but I feel like it is now ready for others to try. Please let me know if you get some value from it.
You can pip install it:
pip install texturizer
Many of the standard data summarising routines in analytics packages leave a lot to be desired. The pandas dataframe summary function ignores missing values and only summarises numeric data. Most data scientists require more comprehensive summaries of a dataset to be sure that they are taking a reasonable modelling approach. There is no substitute for deeper dives into specific columns and features using plots and other visualisations; however, high-level summarisation is a great guide to where to spend your efforts.
For all of these reasons I have built a small and focused command line application that will allow you to generate a summary table of your data and output it as either LaTeX or Markdown.
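The shape of such a summary can be sketched in a few lines of standard-library Python. The real application computes more statistics and has a LaTeX output mode; this sketch (with invented column names and helper names) just shows the Markdown path, counting missing values explicitly rather than dropping them:

```python
def summarise(rows):
    # Per-column stats over a list of row dicts: total count,
    # missing (None) values, and distinct non-missing values.
    stats = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = sum(v is None for v in values)
        unique = len({v for v in values if v is not None})
        stats.append((col, len(values), missing, unique))
    return stats

def to_markdown(stats):
    # Emit a Markdown table: header row, separator, one row per column.
    lines = ["| column | count | missing | unique |",
             "| --- | --- | --- | --- |"]
    for col, count, missing, unique in stats:
        lines.append(f"| {col} | {count} | {missing} | {unique} |")
    return "\n".join(lines)

rows = [
    {"age": 34, "city": "Sydney"},
    {"age": None, "city": "Melbourne"},
    {"age": 41, "city": "Sydney"},
]
print(to_markdown(summarise(rows)))
```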
You are not alone.
The growing concern is reflected in numerous recent publications from consulting firms like McKinsey and NGOs like the World Economic Forum, as well as academic studies like this one from Oxford University. Unfortunately, these ideas tend to reach the public through purveyors of click-bait, and we end up with widespread distribution of infographics or cute new web applications with all of the subtlety of the original reports removed.
Each of the original studies makes many assumptions and draws attention to the complexity of forecasting the impact of technology on employment. The difficulty stems from the existence of many factors affecting the practicality of both building an AI system and then integrating it into an organisation.
In spite of the fuzzy definition of the job, there are good reasons that this new occupation exists. To understand those reasons we first need to clear up the misunderstanding that lies at the core of this meme.