Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

Pages

Posts

textplainer : Intuitive explanations for text based machine learning models

less than 1 minute read

Published:

Text data is a powerful and useful source of signal in machine learning systems. There is a set of very standard approaches to dealing with raw text to prepare it for machine learning and statistics. We routinely use bag-of-words and n-gram models, TF-IDF, topic modelling, or word embeddings derived from one of the neural network language models such as Word2Vec, GloVe or fastText. All of these approaches have proven themselves as effective text pre-processing techniques that require no domain knowledge and are widely applicable.

texturizer : Exploring diverse text derived features for machine learning

3 minute read

Published:

Text data is a fascinating source of information for data scientists. It can betray subtle clues as to the mood, motives and behaviours of people, in both conscious and unconscious expressions. We can extract text from a wide variety of sources: internal documents, email records, web forms, social media posts, and even the text descriptions from financial transactions.

dfsummarizer : A command line application for summarizing data frames

less than 1 minute read

Published:

Summarizing data is one of those small tasks that data scientists and analysts need to do routinely. However, we often need to write bespoke scripts to get exactly what we want, coping with missing values and assorted data types. We then need to go through a tedious process to format it for sharing or publication.

Why all scientists are not data scientists

less than 1 minute read

Published:

There is a meme you will see floating around the internet that comes in many forms; one version is shown in the header image above. It is part of the vague internet resistance to this new occupation. The response is somewhat justified: Data Scientist is a job title that requires no specific qualification and garners differing opinions on what the core skill set is.

Mean Imputation in Apache Spark

less than 1 minute read

Published:

If you are interested in building predictive models on Big Data, then there is a good chance you are looking to use Apache Spark, either with MLlib or with one of the growing number of machine learning extensions built to work with Spark, such as Elephas, which lets you use Keras and Spark together.

books

Fury

Published:

A hungover young man in a youth hostel comes to terms with the grim reality of surviving the zombie apocalypse.

Download here

X-mas in Berlin

Published:

A young woman scours the streets of Berlin looking for signs that life is returning to the city. She clings to her memories of her missing family.

Download here

Googad Magee

Published:

Googad Magee is a children’s book about an old man struggling to find something good in his life.

A chance encounter with a happy-go-lucky snail turns things around for Googad as he learns from her that it is very easy to appreciate what you already have. All proceeds from the sale of Googad Magee are donated to OzHarvest, an amazing Australian organisation fighting food waste and feeding the needy.

Download here

GDSD - Getting Data Science Done

Published:

Getting Data Science Done outlines the essential stages in running successful data science projects. The book provides comprehensive guidelines to help you plan and manage data science projects, communicate with clients, identify and mitigate issues, and finally deploy your solutions into production systems.

publications

NFIA Controls Telencephalic Progenitor Cell Differentiation through Repression of the Notch Effector Hes1

Published in The Journal of Neuroscience, 2010

Recommended citation: Michael Piper, Guy Barry, John Hawkins, Sharon Mason, Charlotta Lindwall, Erica Little, Anindita Sarkar, Aaron Smith, Randal Moldrich, Glen Boyle, Shubha Tole, Richard Gronostajski, Timothy Bailey, and Linda Richards. (2010). "NFIA Controls Telencephalic Progenitor Cell Differentiation through Repression of the Notch Effector Hes1." The Journal of Neuroscience, July 7, 2010, 30(27):9127-9139. https://www.ncbi.nlm.nih.gov/pubmed/20610746

Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics

Published in PLoS One, 2016

Recommended citation: Ruiz-Gómez, Gloria., Hawkins, John., Philipp, Jenny., Künze, Georg., Löser, Reik., Fahmy, Karim., and Pisabarro, M. Teresa. (2016) "Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics" PLoS One. Apr 28;11(4) http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0154046

Estimating Gaze Duration Error with Eye Tracking Data

Published in , 2023

Recommended citation: Hawkins, John. (2023) "Estimating Gaze Duration Error with Eye Tracking Data" Proceedings of the 2023 5th International Conference on Image, Video and Signal Processing Pages 70-75, Mar 25, 2023

talks

Predicting Nuclear Proteins

Published:

In this talk I presented initial work done with Mikael Boden on the task of building machine learning systems to classify proteins that are bound for the nucleus after transcription. It involves the creation of new datasets, and evaluating a range of existing techniques.

Evolving PTS2 Motifs

Published:

In this talk I presented the work done with Mikael Boden on the task of designing evolutionary algorithms to create regular-expression-like motifs to distinguish proteins that carry the PTS2 motif. This is a difficult classification task due to the absence of large data sets and highly variable sequences in the signalling section of the proteins.

The Statistical Power of Phylogenetic Motif Models

Published:

In this talk I presented the results of the research paper completed with Tim Bailey on the task of exploiting the phylogenetic information in comparative gene sequence alignments to try to improve transcription factor binding site prediction.

Can Comparative Genomics Improve Transcription Factor Binding Site Prediction

Published:

In this presentation for the Institute for Molecular Bioscience at Queensland University I summarised some of the observations and conclusions that Tim Bailey and I had come to in working on the task of using information from gene sequence alignments to try and improve our ability to identify transcription factor binding sites.

Protein Structure Search Strategies

Published:

In this BIOTEC Post-Doc Seminar Series talk I gave an overview of the algorithms used to search protein databases to look for functional motifs and active sites that determine biological function and potential biomedical applications.

Being Bayesian

Published:

In this Sydney Data Science Meet-Up talk I gave an overview of the history and reasoning that led to the distinction between Frequentist and Bayesian Inference. I gave several worked examples and showed the results of simulations designed to answer the question of under which circumstances we should prefer one over the other.

Full video of the talk here

Building Model Factories with the DataRobot API

Published:

In this Sydney Data Science Sponsored Meet-Up talk I gave an introduction to the idea of Model Factories, discussing the history of the idea and how it has led to AutoML systems like DataRobot, ultimately enabling us to build new forms of automated ML systems.

DataRobot Vs The Red Queen

Published:

In this talk I gave a brief overview of the Red Queen effect that has been used in evolutionary biology to describe co-evolution of competing species. I applied this idea to the competition between organisations that are using data science and machine learning to differentiate themselves from their competitors.

Introduction to Bayesian Machine Learning

Published:

In this invited talk for the Machine Learning & Deep Learning Day I presented a ground-up introduction to understanding the fundamentals of Bayesian Machine Learning. I introduced the idea of Bayesian statistics and described the connections between maximum likelihood, maximum a posteriori and finally the Bayesian goal of a complete estimate of the posterior distribution. I introduced Markov Chain Monte Carlo and the Metropolis-Hastings algorithm. Finally, I shared some brief cautions on how people from frequentist machine learning tend to go wrong, either through their expectations or their implementations.

Event Link

Modern Machine Learning Language Models

Published:

In this invited talk for the Selenium Day I presented an overview of the technical innovations that have led to modern machine learning successes with language processing. This involved discussing what is special about processing text, the fundamentals of recurrent processing, the development of attention and self-attention models, and finally how this led to the Transformer architecture.

Event Link

Minimum Viable Model Estimates for Machine Learning Projects

Published:

Prioritization of machine learning projects requires estimates of both the potential ROI of the business case and the technical difficulty of building a model with the required characteristics. In this work we present a technique for estimating the minimum required performance characteristics of a predictive model given a set of information about how it will be used. This technique will result in robust, objective comparisons between potential projects. The resulting estimates will allow data scientists and managers to evaluate whether a proposed machine learning project is likely to succeed before any modelling needs to be done. The technique has been implemented in the open source application MinViME (Minimum Viable Model Estimator), which can be installed via the PyPI Python package management system or downloaded directly from the GitHub repository.

Analytics Problem Framing

Published:

In this guest lecture at the Australian Graduate School of Management we discussed a range of fundamental ideas in analytics projects. All of these ideas relate to framing problems such that they have a greater chance of success.

Data Science in Industry

Published:

In this guest lecture for data science and analytics students at Imperial College London we discussed the emergence of data science as a career in industry. We covered both the historical conditions that created the field and the ongoing changes and challenges that technical, detail-oriented people face when working with a wide variety of business people.

Estimating Gaze Duration Error from Eye Tracking Data

Published:

Eye tracking applications produce a series of gaze fixation points that can be attributed to objects within a subject’s field of vision. Error is typically measured on the basis of individual gaze fixation point measurements. These applications are often used to infer a gaze duration metric from a series of fixation measurements. There is no direct method for inferring the error in a gaze duration measurement from an error in fixation points.

Brands, Verticals & Contexts: Coherence Patterns in Consumer Attention

Published:

Consumers are expected to partially reveal their preferences and interests through the media they consume. The development of visual attention measurement with eye tracking technologies allows us to investigate the consistency of these preferences across the creative executions of a given brand and over all brands within a given vertical.

Evaluating Ad Creative and Web Context Alignment with Attention Measurement

Published:

Contextual targeting is a common strategy that places marketing messages in media locations that are aligned with a target audience. The challenge of contextual targeting is knowing the ideal schema and the set of categories that provide the right audience. Refinement of the contextual targeting process has been limited by the use of metrics that are either rapid but unreliable (click-through rates), or reliable but slow, expensive and inaccessible in real time (conversions or brand awareness).

teaching
