Summarizing text is difficult. Various attempts have been made, but they are often clunky. One that works reasonably well is called TextTeaser and can be found here: https://github.com/MojoJolo/textteaser It uses an algorithm called Density Based Selection to identify important sentences. It’s considered an extractive text summarizer because it selects relevant sentences from the source text. Abstractive summarizers, by contrast, attempt to summarize […]Read more "Text Summarization"
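TextTeaser’s own scoring isn’t shown in the excerpt, but the general idea behind extractive summarization — score sentences, keep the top few — can be sketched with a simple word-frequency heuristic (not TextTeaser’s Density Based Selection, just a minimal stand-in):

```python
import re
from collections import Counter

def summarize(text, num_sentences=2):
    """Score each sentence by the average frequency of its words
    across the whole text, then return the top-scoring sentences
    in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    scored = []
    for i, sentence in enumerate(sentences):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        if not tokens:
            continue
        score = sum(freq[t] for t in tokens) / len(tokens)
        scored.append((score, i, sentence))
    # Keep the best sentences, but emit them in document order.
    top = sorted(scored, reverse=True)[:num_sentences]
    return ' '.join(s for _, _, s in sorted(top, key=lambda x: x[1]))
```

Real systems add features like sentence position and title overlap, but the select-and-reorder shape is the same.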
Getting started with entity extraction and Stanford CoreNLP takes just a few steps. Grab the property file and NER model from the Stanford CoreNLP GitHub repo (all.3class.distsim.prop and all.3class.distsim.crf.ser.gz), then run the Groovy code below:Read more "Entity Extraction with Stanford CoreNLP"
R is a popular language for Machine Learning. Getting started is pretty easy. First, install R on your local machine. Then, try running the script below. You may need to install the two packages first. The script uses the K Nearest Neighbor classification algorithm to learn what features or attributes may identify cancerous tumors. […]Read more "Machine Learning with R"
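The R script itself isn’t included in the excerpt, but the algorithm it uses is easy to sketch. Here is a stdlib-Python version of K Nearest Neighbor classification (the R post presumably uses a package like `class::knn`; the data below is made up for illustration):

```python
import math
from collections import Counter

def knn_classify(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest
    training examples, using Euclidean distance."""
    dists = sorted(
        (math.dist(x, point), label) for x, label in zip(train, labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature dataset: two well-separated clusters.
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ['benign'] * 3 + ['malignant'] * 3
```

A point near the first cluster, e.g. `(1.5, 1.5)`, gets labeled `'benign'`; one near the second, `'malignant'`. On real tumor data you would also normalize the features first, since Euclidean distance is scale-sensitive.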
To process Avro files with Spark you need to register a serializer with Kryo. Because Spark generally uses Kryo for serialization, you need to instruct Kryo to use Avro when serializing your Avro objects. Below is an example of registering a serializer using Groovy:Read more "Spark and Avro"
Using OpenNLP to extract proper nouns is pretty easy. Here’s some code showing how to do it:Read more "Using OpenNLP to Find People, Places and Organizations"
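OpenNLP’s name finder is a trained statistical model, so its output can’t be reproduced without the library and model files. As a rough illustration of the kind of spans it returns, here is a naive capitalization heuristic (explicitly not OpenNLP, just a toy stand-in):

```python
import re

def naive_proper_nouns(text):
    """Collect runs of adjacent capitalized words, e.g. 'New York'.
    Over-matches sentence-initial words; a real NER model like
    OpenNLP's NameFinderME handles that with learned context."""
    return re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*', text)
```

For example, on `"yesterday John Smith flew from New York to visit Acme Corp headquarters."` it returns `['John Smith', 'New York', 'Acme Corp']` — the same people/places/organizations shape the OpenNLP post is after, minus the type labels.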
Querying SolrCloud is pretty easy. Here’s a simple script:Read more "Querying SolrCloud"
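The post’s script isn’t shown, but the shape of a Solr query is easy to sketch. SolrCloud clients usually discover nodes via ZooKeeper (e.g. SolrJ’s CloudSolrClient), though you can also hit any node’s HTTP `/select` handler directly. Here is a sketch that just builds such a URL (host and collection names are hypothetical; no request is sent):

```python
from urllib.parse import urlencode

def build_solr_query(base_url, collection, q, rows=10, fields=None):
    """Assemble a Solr /select URL with common parameters:
    q (query), rows (page size), wt (response format), fl (field list)."""
    params = {'q': q, 'rows': rows, 'wt': 'json'}
    if fields:
        params['fl'] = ','.join(fields)
    return f"{base_url}/{collection}/select?{urlencode(params)}"
```

Calling `build_solr_query('http://localhost:8983/solr', 'articles', 'title:spark', rows=5, fields=['id', 'title'])` yields a URL you could fetch with any HTTP client.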
Getting data out of HBase is a little more difficult than your average database. You first have to connect to Zookeeper, and then you need to understand that each HBase table has column families and column qualifiers. It’s best to write some utility methods so that everyone on the team can quickly get up to […]Read more "Talking to HBase"
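The addressing model the post describes — every cell reached by row key, then column family, then column qualifier — can be mimicked with nested dicts. This is not the HBase client API, just a plain-Python illustration of the structure those utility methods would wrap (row and column names are made up):

```python
# A cell lives at (row key -> column family -> column qualifier).
table = {
    'row-001': {
        'info':  {'name': 'Ada', 'city': 'London'},  # family 'info'
        'stats': {'visits': '42'},                   # family 'stats'
    },
}

def get_cell(table, row, family, qualifier):
    """The kind of helper the post recommends: hide the nested
    family/qualifier lookups behind one call, returning None
    for missing rows, families, or qualifiers."""
    return table.get(row, {}).get(family, {}).get(qualifier)
```

With a helper like this (backed by a real client such as happybase or the Java API instead of a dict), teammates don’t need to hand-roll the family:qualifier plumbing on every read.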