Growth Hacking with Data Science — 600% Increase in Qualified Leads with Zero Ad Budget
About two years ago, I worked as a data engineer alongside a team of extraordinary gentlemen, where our daily activities involved keeping the data lake (a 3-node Hadoop cluster) running. We had standard Hadoop ETL jobs: Sqoop jobs extracted data from our OLTP system (MSSQL Server) and loaded it into HDFS, transformations were written in the data flow language Pig Latin, and the results were loaded into Elasticsearch for search analytics, with Kibana dashboards for visualizations.
As with every other data engineer in a space where the business assumed that the role was more than maintaining data lake infrastructure and ETL pipelines, we were expected to generate insights from the data as well as create predictive algorithms.
In the rest of this blog post, I will dive into one of the use cases where we built an end-to-end ML solution, the Lead Scoring Engine. The solution started as a decision-support tool for Customer Care Representatives qualifying leads and grew into an application that created a viral effect, yielding about a 600% increase in qualified leads with zero ad budget.
Business Expectations of Data Science Projects
After a series of stakeholder consultations to formulate a business problem we could solve with data science, we committed to helping the business unit achieve one of its key objectives: increase the number of leads by X% through month-on-month growth. With this, we were able to define, together with stakeholders, measures of success (metrics) and KPIs that tracked progress on our commitment.
To be fair, our alignment with the business stakeholders was not smooth sailing at the outset. We did eventually get their buy-in, after demonstrating that the data being collected, if properly massaged into a consumable state, is a key tool in making informed business decisions.
With metrics and KPIs defined, we had to identify all potential data sources, internal and external, that would be useful for the business goal we had committed to. For the purpose of this post, I will focus only on the internal sources. They were:
• MSSQL Server (OLTP) database (structured)
• IIS Web Server Logs (semi/un-structured)
Our data sources, unstructured and structured, required different processes to land them in our HDFS data lake and, eventually, in our search analytics tool.
What both ETL workloads had in common was nightly batch processing: Sqoop jobs pulled data from MSSQL Server into HDFS, and a Python script extracted access log files from an FTP server.
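As a rough illustration of the log-extraction half of that nightly batch, here is a minimal sketch of what such a Python script could look like, using the standard library's ftplib. This is not our original script: the function name, the directory layout and the ".log" suffix are assumptions made for the example.

```python
from ftplib import FTP
from pathlib import Path


def fetch_access_logs(ftp: FTP, remote_dir: str, local_dir: str, suffix: str = ".log"):
    """Download remote files ending in `suffix` that are not yet present locally.

    `ftp` is an already connected and logged-in ftplib.FTP object.
    Returns the list of remote names downloaded in this run.
    """
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    ftp.cwd(remote_dir)

    downloaded = []
    for name in sorted(ftp.nlst()):
        # Some servers return full paths from NLST, so keep only the file name.
        target = local / Path(name).name
        if not name.endswith(suffix) or target.exists():
            continue
        # Write to a temporary file first so an interrupted transfer
        # is not mistaken for a complete log on the next nightly run.
        partial = target.with_name(target.name + ".part")
        with open(partial, "wb") as handle:
            ftp.retrbinary(f"RETR {name}", handle.write)
        partial.replace(target)
        downloaded.append(name)
    return downloaded


# Hypothetical usage (host, credentials and paths are placeholders):
#
#   with FTP("ftp.example.com") as ftp:
#       ftp.login("etl_user", "secret")
#       fetch_access_logs(ftp, "/logs", "staging/iis_logs")
```

Skipping files that already exist locally keeps the job safe to re-run, which matters for a batch that fires every night.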
The insights we gleaned from the data were astounding. Since we didn’t have enough domain knowledge, we started with the more technical insights that we were comfortable with as software engineers: latency, server response time, traffic peaks, etc. We then put on our analysts’ hats and estimated the financial implications that latency can cause.
Next, by consulting with the business analysts, we were able to leverage their domain knowledge to glean actionable insights. We started with the usual demographics and moved on to the time of day when signups happened. We then delved deeper into traffic sources, matching the sources stated in the signup forms with those recorded in the access logs via ‘utm’ tags.
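To make the ‘utm’ matching concrete, here is a minimal sketch of how traffic sources could be tallied from IIS access logs in the W3C extended format, where a "#Fields:" line names the columns. The "/signup" path, the function names and the sample log lines are made up for illustration; our actual pipeline ran on Hadoop rather than as a single script.

```python
from collections import Counter
from urllib.parse import parse_qs


def iter_w3c_records(lines):
    """Yield one dict per request, keyed by the latest '#Fields:' directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip malformed rows
                yield dict(zip(fields, values))


def count_utm_sources(lines, signup_path="/signup"):
    """Count requests to the (assumed) signup page by utm_source value."""
    counts = Counter()
    for record in iter_w3c_records(lines):
        if record.get("cs-uri-stem", "").lower() != signup_path:
            continue
        # IIS writes '-' when a request has no query string.
        query = record.get("cs-uri-query", "-")
        source = parse_qs(query).get("utm_source", ["(none)"])[0]
        counts[source.lower()] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status
2017-03-01 09:15:02 GET /signup utm_source=linkedin&utm_medium=social 200
2017-03-01 09:16:40 GET /signup - 200
2017-03-01 09:17:11 GET /signup utm_source=LinkedIn 200
2017-03-01 09:18:00 GET /about utm_source=google 200
""".splitlines()

source_counts = count_utm_sources(SAMPLE_LOG)
print(source_counts.most_common())
```

The per-source counts can then be compared with the traffic source each user declared on the signup form.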
Lead Scoring Engine
We took the idea from a chapter in the book Advanced Analytics with Spark, which walks through a churn prediction example. (In hindsight, we should probably have used scikit-learn.) Building on this example, we built a Random Forest model based on demographics data, time of day of signup, traffic source, last viewed page before signup, etc. With these features, our trained model achieved 92% accuracy on the test set.
By interpreting the model for business stakeholders, we were able to get their buy-in and provide decision support to Customer Care reps using ML.
0 to 600% Qualified Leads with Zero Ad Budget
Having looked at the demographics data and traffic sources, we formulated varying assumptions based on the insights from the data. They were:
• Assumption 1: If about 68% of signups are between ages 25 and 45, then they must be young professionals with a strong online social presence.
• Assumption 2: If these young professionals have an online social presence, then they must be active on LinkedIn.
• Assumption 3: If they are active on LinkedIn, then they would be willing to let their connections know about their career/development progress.
• Assumption 4: Every signed-up user has at least 150 first-circle LinkedIn connections.
Hypothesis: If the preceding assumptions hold true, then we can have every signed-up user spread the word to their professional circle through LinkedIn to pull in more qualified leads like them.
With this, we built a data product in Node.js that sent out emails every night to the day’s new signups, inviting them to let their professional connections know how they were advancing their careers by signing up for the program.
The email reads:
Hi <First Name>,
Let your connections know how you are advancing your career. Tell them about the <Name of the Educational Program> you just applied for.
[Button with LinkedIn Logo Here] [Button with Google Logo Here]
A click on the LinkedIn button takes the user directly to a simple HTML page asking him/her to sign in with his/her LinkedIn account. Once the user completes this step, the application redirects the user to his/her LinkedIn account. Using LinkedIn’s graph API, we pre-populated a post for the user to share:
I just applied for <Name of the Educational Program>.
I think this is a great opportunity for you to also advance your career.
Sign up now!!! <URL to Sign Up>
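For readers who want to see how the two templates above (the email and the pre-populated post) could be filled in per user, here is a minimal sketch. The original product was written in Node.js; this version uses Python's string.Template purely for illustration, and the field names (first_name, program_name, signup_url) are hypothetical. It covers only the text rendering, not the email delivery or the LinkedIn sign-in flow.

```python
from string import Template

# Wording taken from the email and post shown above; placeholders are hypothetical.
EMAIL_TEMPLATE = Template(
    "Hi $first_name,\n"
    "Let your connections know how you are advancing your career. "
    "Tell them about the $program_name you just applied for.\n"
)

POST_TEMPLATE = Template(
    "I just applied for $program_name.\n"
    "I think this is a great opportunity for you to also advance your career.\n"
    "Sign up now!!! $signup_url"
)


def render_messages(signup: dict, signup_url: str):
    """Return (email_body, linkedin_post) for one new signup record."""
    email_body = EMAIL_TEMPLATE.substitute(
        first_name=signup["first_name"],
        program_name=signup["program_name"],
    )
    linkedin_post = POST_TEMPLATE.substitute(
        program_name=signup["program_name"],
        signup_url=signup_url,
    )
    return email_body, linkedin_post


# Example with made-up values.
email_body, linkedin_post = render_messages(
    {"first_name": "Ada", "program_name": "Data Analytics Certificate"},
    "https://example.com/signup",
)
print(email_body)
print(linkedin_post)
```

Template.substitute raises a KeyError when a placeholder has no value, so a signup record with a missing field fails loudly instead of sending a half-filled message.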
The results were massive. With 14 users responding to the email in the first 10 days, and an average of around 120 LinkedIn connections per user, we grew qualified leads by 600%. We got recognition and more support from the stakeholders, and every other business unit wanted to engage the data team in achieving their business objectives.
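To put those numbers in perspective, here is a back-of-the-envelope estimate of the audience the shared posts could have reached. It is a sketch under loud assumptions: that every one of the 14 respondents actually shared the post, and that their connection lists did not overlap. The result is therefore an upper bound, not a measured figure.

```python
def estimated_reach(sharers: int, avg_connections: int) -> int:
    """Upper bound on first-circle connections who could see the shared posts.

    Assumes every sharer posts once and that no two sharers have a
    connection in common.
    """
    return sharers * avg_connections


observed = estimated_reach(14, 120)  # figures reported above
planned = estimated_reach(14, 150)   # same sharers under Assumption 4

print(observed)  # 1680
print(planned)   # 2100
```

Even with the observed average below the 150 connections we had assumed, 14 shares could put the program in front of up to 1,680 professionals at no advertising cost.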
Data Science, ML, AI, Deep Learning, whatever you want to call it (we could have a whole day of discussion on the right term), is not valuable until it addresses and solves a business objective.
Many a time, I hear “I am learning SVMs, LSTMs, Deep Reinforcement Learning, etc.” You know the rest of the story. And from the business stakeholders: “We don’t have any data. In fact, we need big data before we do data science… Wait a minute, we need a data scientist who must be able to create backend applications and be very sound in statistics and machine learning…”
In my limited experience, it is best to start with the “Why?” and not get too fancy with the shiny buzzwords. Define the problem space: if a simple linear/logistic regression would suffice, start with that and measure impact.
I hope you found this useful. Kindly share your thoughts and comments; I look forward to your feedback.