Data Mining: Practical Machine Learning Techniques for CRM

Getting Started: ARFF Files

8/31/2014

Data mining platforms convert several data sources into a common data structure that allows an ecosystem of plug-in components to emerge and "speak a common language".

In Weka, this common file format is called the Attribute-Relation File Format, or ARFF for short.
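To make the format concrete, here is a minimal, hand-written sketch of an ARFF file. The relation and attribute names are hypothetical illustrations and are not actual SObj2ARFF output; only the `@relation`, `@attribute`, and `@data` keywords, the `%` comment marker, and the `?` missing-value marker come from the ARFF format itself.

```shell
# Write a tiny ARFF file by hand. All names and values below are
# hypothetical and chosen only to show the three sections of the format.
cat > sample.arff <<'EOF'
% Lines starting with % are comments.
@relation Opportunity

@attribute Amount numeric
@attribute StageName {Prospecting,Negotiation,ClosedWon}
@attribute IsWon {true,false}

@data
15000,Prospecting,false
42000,ClosedWon,true
?,Negotiation,false
EOF

# Count the declared attributes: one line per @attribute.
grep -c '^@attribute' sample.arff   # prints 3
```

The header names the relation and declares each attribute with its type (numeric, or a nominal list in braces); every row after `@data` supplies one comma-separated value per attribute, with `?` standing in for a missing value.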
Converting Salesforce Objects to ARFF: SObj2ARFF

A data loader for converting Salesforce SObjects to ARFF is available on GitHub (https://github.com/dataminingcrm/weka). Many of the articles in this blog assume the use of ARFF files generated from a Salesforce.com CRM data source.

The tools and processes mentioned in this blog default to a "command line first" approach to enable automation long-term. All command line steps are depicted in block quotes.

Step 1
Download and Build SObj2ARFF
~/workspace$ mkdir weka
~/workspace$ git clone https://github.com/dataminingcrm/weka.git weka/
~/workspace$ cd weka
~/workspace/weka$ ./build.sh
This will clone the Salesforce converter project into a local directory named workspace/weka.
The build.sh script will build a single Java JAR with all dependencies at the location:
~/workspace/weka/bin/dataminingcrm.jar

Step 2
Configuration
Copy the provided configuration template to a file named config.properties and edit the file with Salesforce credentials and object source.
~/workspace/weka$ cp config.properties.template config.properties
~/workspace/weka$ vim config.properties

# Config file name/value pairs.
url=https://login.salesforce.com
username=username@domain.org
password=org_password
token=security_token
relation=Opportunity
query=SELECT * FROM Opportunity LIMIT 500
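Since a missing credential only surfaces when the converter runs, it can help to check the file first. The following is a minimal sketch, assuming the six keys shown in the template above are all required; the function name `check_config` is hypothetical and not part of SObj2ARFF.

```shell
# Report any expected key that is missing or empty in a properties file.
# Assumption: the key list mirrors the template above; check_config is a
# hypothetical helper, not something shipped with the converter.
check_config() {
  cfg_file="$1"
  cfg_missing=0
  for cfg_key in url username password token relation query; do
    # Require "key=" followed by at least one character.
    if ! grep -Eq "^${cfg_key}=.+" "$cfg_file"; then
      echo "missing or empty: ${cfg_key}" >&2
      cfg_missing=1
    fi
  done
  # Exit status 0 when every key is present, 1 otherwise.
  return "$cfg_missing"
}
```

Calling `check_config config.properties` prints one line per problem key to standard error and returns a non-zero status, so it can gate the conversion step in an automated script.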
Step 3
Execution
Copy config.properties to the bin directory and run the sobj2arff.sh script.
~/workspace/weka$ cp config.properties bin/
~/workspace/weka$ cd bin
~/workspace/weka/bin$ ./sobj2arff.sh
Step 3 writes an ARFF file to the console (standard output). Alternatively, redirect the output to a *.arff file.
~/workspace/weka/bin$ ./sobj2arff.sh > opportunities.arff
The ARFF file will contain the relation name, attribute declarations, and data necessary to proceed with analyzing the data in Weka Explorer.
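Before opening the file in Weka Explorer, a quick structural check can catch an empty or truncated export. This is a rough sketch that only looks for the standard ARFF sections; `check_arff` is a hypothetical helper name, and it does not validate attribute types or row contents.

```shell
# Confirm a file has the three ARFF sections and summarize its size.
# check_arff is a hypothetical helper; it checks structure only.
check_arff() {
  arff_file="$1"
  grep -Eqi '^@relation[[:space:]]' "$arff_file" || { echo "no @relation line" >&2; return 1; }
  grep -Eqi '^@data' "$arff_file" || { echo "no @data section" >&2; return 1; }
  # grep -c exits non-zero when the count is 0, so tolerate that here.
  attr_count=$(grep -Eci '^@attribute[[:space:]]' "$arff_file" || true)
  [ "$attr_count" -gt 0 ] || { echo "no @attribute lines" >&2; return 1; }
  # Count non-empty, non-comment lines that follow the @data marker.
  row_count=$(awk 'tolower($0) ~ /^@data/ {d=1; next}
                   d && NF && $0 !~ /^%/ {n++}
                   END {print n+0}' "$arff_file")
  echo "attributes: ${attr_count}, rows: ${row_count}"
}
```

Running `check_arff opportunities.arff` prints the attribute and row counts; with the sample query above, the row count should not exceed the 500-record LIMIT.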
Future articles will describe optimizing the Salesforce Object Query Language (SOQL) query for specific types of training or validation data sets.

Welcome to Data Mining CRM!

8/30/2014

"Begin With The End in Mind" - Stephen Covey 

Welcome to Data Mining CRM! This blog documents lessons learned applying various data science and machine learning techniques to Customer Relationship Management (CRM) data.

Salesforce.com CRM and Weka are my primary tools, both of which have free Developer tools available. Click on the "Resources" page for links to the tools discussed in this blog.

Audience
My interests span both business and technology, so this blog addresses three audiences:

Financial Decision Makers
CFOs, finance executives, or board members who have a fiscal or fiduciary responsibility to an organization. Blog posts categorized as "Financial" will explore the ROI of data mining and how to set up data mining initiatives for success.

Business Decision Makers
Line-of-business leaders: VPs of Sales, analysts, and other business users. Blog entries tagged "Business" will "begin with the end in mind", first identifying the business objectives to be achieved and then working backwards to apply data mining techniques.

Technical Decision Makers
Developers, analysts, architects, data scientists, statisticians: anyone who gets hands-on with implementing data mining and machine learning technology. Blog entries tagged "Technical" will explore the full lifecycle of data mining, from building training data sets and classifying to making predictions and turning data mining into a repeatable operational process.

Personal Journey to Data Mining

My personal journey to data mining began with attempts at applying Edward Tufte's information architecture print techniques to web dashboard designs and working backwards to understand how data must be structured to support rich analytics visualizations. This evolved into developing tools for analyzing site uptime logs, dabbling in predicting system behaviors, and developing a log analytics service (Logalytics.io). 

Several years were spent learning how to prepare and filter data so that it can be analyzed (NYTimes says about 50%-80% of data mining is "janitor work"... and yes, that's true).

Tufte's multi-variate visualizations help humans identify patterns and correlations that are not evident by looking at the raw data. Can computers be trained to identify these patterns? If so, what is the future impact on CRM dashboard design and information architecture?

Identifying potential customer opportunities involves creating reports and dashboards that apply some commonly understood correlations: "Show me all customers who have spent in excess of X dollars over the past Y months" or "Show me all customers who have opened a newsletter email or clicked on a particular link for a particular campaign".

But what correlations are we missing? There's just too much data today for the classic analytics model to scale. Big data gets bigger every day. Can we just dump all available customer data into a magic machine and have it reveal undiscovered correlations?

In pursuit of answers to these questions, I took Stanford's online machine learning course, which provides deep exposure to the statistical foundations of machine learning and artificial intelligence. However, my end goal of developing interactive, CRM-oriented dashboards required a more practical approach to data mining, which I ultimately discovered through the University of Waikato's online Weka courses. Weka's use of Java, coupled with some marketing-related learning recipes, provides a pragmatic approach to data mining CRM.

Next Steps
  • Amazon.com "People who bought X also bought Y"
  • Netflix.com "Recommended movies for you"
  • Google search results
  • YouTube recommended videos
  • Facebook activity feed and targeted ads

Machine learning (ML) recommendation engines were built into the foundation of the brands listed above, which gave them staggering competitive advantages. The travel and financial industries are experiencing churn as ML-focused services make exceptionally relevant predictions about customer demand and disrupt previously established business models.

We live in an extremely dynamic society where a 360° view of the Customer involves data from CRM, ERP, social, mobile, Internet of Things sensor streams, and a variety of other systems of engagement. Data mining is our only hope to make sense of it all and evolve the craft of customer relationship management. I hope you'll actively comment on these blog entries and share in this journey.

(ps: Converting this blog into a book is an eventual goal. Therefore, I will be occasionally revisiting some posts and editing for brevity, or enhancing with diagrams. Apologies in advance if this iterative approach to blogging results in some comments or inbound references appearing slightly out of context. I'll do my best to mention article changes within the comments.)


    Author

    Michael Leach
    San Francisco, CA

Photo used under Creative Commons from Damian Gadal