Text Summarization

Summarizing text is difficult. Several tools attempt it, but they are often clunky. One that works reasonably well is TextTeaser (https://github.com/MojoJolo/textteaser). It uses an algorithm called Density Based Selection to identify important sentences. It's considered a selective (extractive) text summarizer because it selects relevant sentences from the original text. Abstractive summarizers attempt to summarize […]
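
To make the idea of selecting relevant sentences concrete, here is a minimal sketch in Python. It is not TextTeaser's algorithm: it swaps in a much simpler word-frequency score, and the function names, the stop-word list, and the `max_sentences` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it",
              "that", "this", "on", "for"}


def split_sentences(text):
    """Naively split text on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and return its words, minus stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Count how often each non-stop word appears in the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average word frequency, so long sentences are not favored by length alone.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]
```

Calling `summarize(article_text, max_sentences=3)` returns up to three sentences copied verbatim from the input, which is what makes this approach selective rather than abstractive.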

Machine Learning with R

R is a popular language for machine learning, and getting started is pretty easy. First, install R on your local machine. Then, try running the script below. You may need to install the two packages first. The script uses the k-nearest neighbors (KNN) classification algorithm to learn what features or attributes may identify cancerous tumors. […]
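
For readers who want to see how k-nearest neighbors classification works before running anything in R, here is a minimal sketch in plain Python. It is not the post's R script: the function names are hypothetical, and the two-feature tumor measurements are made-up toy values, not real data.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=3):
    """Label the query by majority vote among its k closest training rows."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data (an assumption for illustration): two numeric features per tumor.
train_rows = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
              (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["benign", "benign", "benign",
                "malignant", "malignant", "malignant"]

print(knn_predict(train_rows, train_labels, (1.1, 1.0), k=3))
```

In practice the features would usually be normalized first, since a feature with a large numeric range would otherwise dominate the distance.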

Spark and Avro

To process Avro files with Spark, you need to register a serializer with Kryo. Because Spark is commonly configured to use Kryo for serialization, you need to instruct Kryo to use Avro to serialize your Avro objects. Below is an example of a registered serializer using Groovy:

Talking to HBase

Getting data out of HBase is a little more difficult than getting it out of your average database. You first have to connect to ZooKeeper, and then you need to understand that each HBase table has column families and column qualifiers. It's best to write some utility methods so that everyone on the team can quickly get up to […]
