In order to build successful machine learning solutions, there are certain fundamental ideas that everyone involved needs to understand. In this blog post, we look at three key early stages of the design process that managers can focus on to ensure that the project is headed toward a successful outcome.
However, most of these approaches are relatively opaque when it comes to explaining what is driving the performance or outputs of the models. Text data is less amenable to SHAP values for local feature explanations, and there is no intuitive way to do permutation-based feature importance correctly.
With this package we will explore methods of understanding the sub-structures of text that drive predictive performance.
There is a set of very standard approaches for preparing raw text for machine learning and statistics: from bag-of-words and n-gram models, through TF-IDF and topic modelling, up to the various flavours of neural-network-derived word embeddings like Word2Vec, GloVe, fastText or BERT. This toolset provides us with a powerful set of options for feature engineering that require no domain knowledge and are widely applicable.
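As a minimal illustration of the first two of these approaches, bag-of-words and n-gram features can be sketched with nothing more than the standard library (the tokeniser here is deliberately naive, purely for the sake of the example):

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokeniser: lowercase and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    # Map each word to its count in the document.
    return Counter(tokenize(text))

def ngrams(text, n=2):
    # Count contiguous n-word sequences (bigrams by default).
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

doc = "the cat sat on the mat"
print(bag_of_words(doc)["the"])      # 2
print(ngrams(doc)[("the", "cat")])   # 1
```

TF-IDF and the embedding models then build on exactly these token counts, reweighting or replacing them with dense vectors.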
However, there are also a virtually unlimited number of alternative ways to transform text into features for machine learning. Many of these are only appropriate for particular kinds of problems because they involve domain specific feature engineering scripts, and potentially bespoke dictionaries (or embedding models). Discovering what will work can present a daunting task for a time limited project.
To accelerate my own experimentation I have put together a text feature engineering package that will take a tabular dataset and a list of text column names and generate new features, appended to the data as additional columns. The user can choose the types of features to explore through command line switches, and can likewise switch off features that turn out to be ineffective.
The current set of options is:

-topics
Indicators for the presence of words from common topics (e.g. Politics or Family). Note that these indicator words are chosen as unambiguous indicators. In other words it is not a complete vocabulary, but a vocabulary that is specific to the given topic. This switch has an additional option, -topics=count, which counts all word matches from common topics.

-pos
Part-of-speech proportions in the text, using a SpaCy language model to process and tag each word.

-literacy
Checks for common literacy markers such as capitalisation problems or typos.

-traits
Checks for common stylistic elements or traits that suggest personality type, such as pronoun usage.

-rhetoric
Checks for rhetorical devices used for persuasion, for example an appeal to authority.

-profanity
Profanity check flags, including masked profanities and racial slurs.

-sentiment
Sentiment word counts and score, plus a sentiment score from the TextBlob package.

-emoticons
A wide variety of flags for different kinds of emoticons and emoticon sentiment.

-comparison
Cross-column comparisons using a variety of edit distance metrics. Note this is only applicable if you are running the application across multiple columns of text.

Many of the features in the package are derived from custom word lists and regular expressions which users can edit and customise to their heart's content. This means that the package should be easily customisable to make it work for your domain (or an alternative language).
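To illustrate the general word-list-and-regex technique (this is a sketch of the idea, not the package's actual internals), here is how a topic indicator flag and a simple cross-column similarity might be computed. The FAMILY_WORDS list and the column names are invented for the example, and the standard library's SequenceMatcher ratio stands in for the edit distance metrics:

```python
import re
from difflib import SequenceMatcher

# A hypothetical topic word list; imagine an editable list like this per topic.
FAMILY_WORDS = re.compile(r"\b(mother|father|sister|brother|family)\b", re.IGNORECASE)

def add_text_features(rows, col_a, col_b):
    # Append new feature columns to each row, mirroring how generated
    # features are appended to the original tabular data.
    for row in rows:
        row[col_a + "_family_flag"] = int(bool(FAMILY_WORDS.search(row[col_a])))
        row[col_a + "_family_count"] = len(FAMILY_WORDS.findall(row[col_a]))
        # Cross-column comparison: similarity between the two text columns.
        row["similarity"] = SequenceMatcher(None, row[col_a], row[col_b]).ratio()
    return rows

rows = [{"bio": "My mother and father met in Paris",
         "note": "My mother and father met in Rome"}]
add_text_features(rows, "bio", "note")
print(rows[0]["bio_family_count"])  # 2
```

Swapping in your own word lists or patterns is just a matter of editing the regular expressions, which is essentially what the package's customisation amounts to.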
There is a bunch of work to do to make that customisation process easier. I also have a laundry list of additional features to add, but I feel like it is now ready for others to try. Please let me know if you get some value from it.
You can pip install it:
pip install texturizer
Many of the standard data summarising routines in analytics packages leave a lot to be desired. The pandas dataframe summary function ignores missing values and only summarises numeric data. Most data scientists require more comprehensive summaries of a dataset to be sure that they are taking a reasonable modelling approach. There is no substitute for deeper dives into specific columns and features using plots and other visualisations; however, high-level summarisation is a great guide to where to spend your efforts.
For all of these reasons I have built a small and focused command line application that will allow you to generate a summary table of your data and output it as either LaTeX or Markdown.
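The shape of such a summary can be sketched in a few lines of standard-library Python. The real application computes more statistics and has a LaTeX output mode; this sketch (with invented column names and helper names) just shows the Markdown path, counting missing values explicitly rather than dropping them:

```python
def summarise(rows):
    # Per-column stats over a list of row dicts: total count,
    # missing (None) values, and distinct non-missing values.
    stats = []
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = sum(v is None for v in values)
        unique = len({v for v in values if v is not None})
        stats.append((col, len(values), missing, unique))
    return stats

def to_markdown(stats):
    # Emit a Markdown table: header row, separator, one row per column.
    lines = ["| column | count | missing | unique |",
             "| --- | --- | --- | --- |"]
    for col, count, missing, unique in stats:
        lines.append(f"| {col} | {count} | {missing} | {unique} |")
    return "\n".join(lines)

rows = [
    {"age": 34, "city": "Sydney"},
    {"age": None, "city": "Melbourne"},
    {"age": 41, "city": "Sydney"},
]
print(to_markdown(summarise(rows)))
```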
You are not alone.
The growing concern is reflected in numerous recent publications from consulting firms like McKinsey and NGOs like the World Economic Forum, as well as academic studies like this one from Oxford University. Unfortunately, these ideas tend to reach the public through purveyors of click-bait, and we end up with widespread distribution of infographics or cute new web applications with all of the subtlety of the original reports removed.
Each of the original studies makes many assumptions and draws attention to the complexity of forecasting the impact of technology on employment. The difficulty stems from the existence of many factors affecting the practicality of both building an AI system and then integrating it into an organisation.
In spite of the fuzzy definition of the job, there are good reasons that this new occupation exists. To understand those reasons we first need to clear up the misunderstanding that lies at the core of this meme.