Tips and Tricks in a world of Mix


Installing ElasticSearch for Windows

 

To use Elasticsearch from the command line (cmd) you should install curl.

It's best to put it in c:\ for convenience.

You should download ES from here and install it.

You can check that it is running in Chrome – http://localhost:9200/

or

run it in cmd: c:\curl -X GET http://localhost:9200/

Run a few commands to automate the process and make it easier to work with:

1) Running as a service on Windows

c:\elasticsearch-{version}\bin>service install
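Assuming Elasticsearch 1.x, the same service wrapper (service.bat in the bin folder) has a few more subcommands worth knowing – a quick sketch:

c:\elasticsearch-{version}\bin>service start      – start the installed service
c:\elasticsearch-{version}\bin>service stop       – stop it
c:\elasticsearch-{version}\bin>service remove     – uninstall the service
c:\elasticsearch-{version}\bin>service manager    – open the GUI for JVM/memory settings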
 
 
 
2) Install plugins
You'd better get the basics:
bin/plugin --install mobz/elasticsearch-head
bin/plugin --install lukas-vlcek/bigdesk
 

 

Installing Marvel will give you the monitoring tools that ES itself promotes:

bin/plugin -i elasticsearch/marvel/latest
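To verify what got installed, you can list the plugins and open the site plugins in the browser (assuming the default local install):

bin/plugin --list

http://localhost:9200/_plugin/head/
http://localhost:9200/_plugin/bigdesk/
http://localhost:9200/_plugin/marvel/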

 

3) In the VS2013 quick launch, type nuget and choose the Package Manager Console,

then paste:

Install-Package NEST -Version 1.2.1

It will get you the .NET API for Elasticsearch. The version is optional; without it you get the latest version.

ElasticSearch – adding manual settings & mappings – "make me autocomplete"

So, after the last post about Elasticsearch, which explained a bit of the technology's terminology, I'm getting to real-life problems.

So, after entering all the data into Elasticsearch in the last post, I now have to delete it all! Oh my, how did that happen? What shall I do now?

 

Well, if you are only starting out it's not that bad – just delete the index, which will remove the data and the existing mapping as well.

curl -XDELETE "http://localhost:9200/test"

WHY?

One of the requirements was to make the data searchable by just a few characters, not only by the whole word.

So .. ?

Well, actually that means that the default indexing that occurred while creating the index is not good enough; we should have defined the index settings manually, stating from the beginning what kind of analysis should be performed on the index.

 

The nGram filter allows us to break the data that we enter into small tokens which we can search later. So if you have

“Jerusalem”

and define nGram with min_gram 7 and max_gram 20 ==> you'll get tokens such as [Jerusal, Jerusale, Jerusalem] indexed.

Of course it is more logical to start with a min_gram of two characters and go on from there.

I tried to put the index settings and the mapping into one file, but it failed with the error

Analyzer [your_analyzer_name] not found for field [_all]

When I split them, it worked.

On top of the last post, I added the manual index definition, CreateIndex.js:

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "filter": {
          "your_name_for_nGram_filter": {
            "type": "nGram",
            "min_gram": "2",
            "max_gram": "20",
            "token_chars": ["letter", "digit", "punctuation", "symbol"]
          }
        },
        "analyzer": {
          "your_name_for_index_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "asciifolding", "your_name_for_nGram_filter"]
          },
          "your_name_for_search_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "asciifolding"]
          }
        }
      }
    }
  }
}

 

Then you run curl to send it:

curl -XPUT "http://localhost:9200/test" -d @c:\pathto\CreateIndex.js

{"acknowledged":true}

Now we have the index settings right, with autocomplete suggestions starting from 2 characters.
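If you want to see the analyzer at work before indexing anything, the _analyze API lets you run a sample string through it – a quick check, assuming the index and analyzer names from the file above:

curl -XGET "http://localhost:9200/test/_analyze?analyzer=your_name_for_index_analyzer&pretty" -d "Jerusalem"

It should return the list of nGram tokens that will actually be stored in the index.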

 

Now we re-enter the mapping and data from the last post, just adding some features to the mapping:

CreateMappings.js :

{
  "name_of_your_object": {
    "_all": {
      "search_analyzer": "your_name_for_search_analyzer",
      "index_analyzer": "your_name_for_index_analyzer"
    },
    "properties": {
      "field_you_dont_want_to_break_into_small_tokens": {
        "type": "string",
        "index": "not_analyzed"
      },
      "always_in_query_field": {
        "type": "string",
        "include_in_all": true
      }
    }
  }
}

Then you run curl:

curl -XPUT "http://localhost:9200/test/name_of_your_object/_mapping" -d @c:\pathto\CreateMappings.js

{"acknowledged":true}
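To double-check that the settings and the mapping (including the analyzer references) were stored as expected, you can read them back:

curl -XGET "http://localhost:9200/test/_settings?pretty"

curl -XGET "http://localhost:9200/test/_mapping?pretty"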

 

Now we'll enter the actual data, as in the last post:

curl -XPOST "http://localhost:9200/test/name_of_your_object/_bulk" --data-binary @c:\pathto\formatizedToIndex.json

 

Now you have the data inside Elasticsearch, with analyzers and autocomplete. Happy searching!
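As a quick test of the autocomplete behaviour, a match query on _all with just a few characters should already bring documents back. A sketch, using the hypothetical type name from above and a Query.js file of your own:

curl -XPOST "http://localhost:9200/test/name_of_your_object/_search?pretty" -d @c:\pathto\Query.js

where Query.js contains something like:

{
  "query": {
    "match": {
      "_all": "jer"
    }
  }
}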

Elasticsearch

Important comments

Before indexing the data, create and insert the mapping.

Date format for Elasticsearch – "dd/MM/yyyy HH:mm:ss" – HH means 24-hour presentation.
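In the mapping this translates into a date field with an explicit format; for example (the field name here is just an illustration):

"created_at": {
  "type": "date",
  "format": "dd/MM/yyyy HH:mm:ss"
}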

 

So, after creating the "test" index:

$ curl -XPUT 'http://localhost:9200/test/'

Mapping

1) $curl -XPUT "http://localhost:9200/test/targets/_mapping" -d @path_to/mapping.js

 

* Without putting it into a file, it didn't work.

* Without quotes around the URL, it didn't work either.

* Without the @ before the path, it gives:

{"error":"NullPointerException[null]","status":500}

* The mapping file includes:

{
  "targets" : {              – the name of your object
    "properties" : {         – an Elasticsearch reserved word
      ** the fields and their mappings
    }
  }
}
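For example, a minimal mapping.js for the targets type could look like this (the field names and types are made up for illustration):

{
  "targets" : {
    "properties" : {
      "name" :       { "type" : "string" },
      "created_at" : { "type" : "date", "format" : "dd/MM/yyyy HH:mm:ss" },
      "score" :      { "type" : "integer" }
    }
  }
}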

Bulk Create

2) $curl -XPOST "http://localhost:9200/test/targets/_bulk" --data-binary @path_to/data.json

* Put this line before each object serialized to JSON:

{ "create" : { "_index" : "test", "_type" : "type1", "_id" : @some_unique_param } }

 

 

The terminology

An index is a keyword summary of a large body of content. An index allows searching for the needed content much faster than without one.

Document parsing – a.k.a. text processing, text analysis, text mining and content analysis. When data is added, it is processed by the search engine to be made searchable. This scanning and processing of the data is called document parsing. In this process we create terms (a list of the data/words) and a mapping (a reference to the terms), save it all to disk and keep parts of it in memory for faster performance.

Lucene, which Elasticsearch and Solr are built on, is a full-text search engine: it goes through all of the text as part of the indexing process.

Computers have to be programmed to break text up into its distinct elements, such as words and sentences. This process is called tokenization, and the different chunks, usually words, that constitute the text are called tokens.

There are many specialized tokenizers, for example CamelCase tokenizer, URL tokenizer, path tokenizer and N-gram tokenizer.

Stop words – sometimes we want to prevent certain words from being indexed. For instance, in many cases it would make no sense to store the words on, for, a, the, us, who etc. in the index.
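In Elasticsearch this is done with a stop token filter in the analysis settings; a small sketch (the filter and analyzer names are made up):

"analysis": {
  "filter": {
    "my_stop_filter": { "type": "stop", "stopwords": ["on", "for", "a", "the", "us", "who"] }
  },
  "analyzer": {
    "my_text_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "my_stop_filter"]
    }
  }
}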

Relevancy – with this kind of handling there is a fair amount of irrelevant results. There are ways to minimize and partially eliminate them.

 

A Token is the name of a unit that we derive from the tokenizer, and the token therefore depends on the tokenizer. A token is not necessarily a word, but a word is normally a token when dealing with text. When we store the token in the index, it is usually called a term.

Forward index – stores a list of all terms for each document that we are indexing. It makes indexing fast, but it is not really efficient for querying, because querying requires the search engine to look through all entries in the index for a specific term in order to return all documents containing that term.

Document                Terms
Grandma’s tomato soup   peeled, tomatoes, carrot, basil, leaves, water, salt, stir, and, boil, …
African tomato soup     15, large, tomatoes, baobab, leaves, water, store, in, a, cool, place, …
Good ol’ tomato soup    tomato, garlic, water, salt, 400, gram, chicken, fillet, cook, for, 15, minutes, …

 

 

Inverted index – an approach where you index by the terms to get the list of relevant documents. Conventional textbook indexing is based on an inverted index.

Term      Documents
baobab    African tomato soup
basil     Grandma’s tomato soup
leaves    African tomato soup, Grandma’s tomato soup
salt      African tomato soup, Good ol’ tomato soup
tomato    African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
water     African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup

Often both the forward and inverted index are used in search engines, where the inverted index is built by sorting the forward index by its terms.
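Conceptually (shapes only, using the recipe documents above), the two structures look like this:

Forward index – document to terms:
{ "Grandma's tomato soup" : ["peeled", "tomatoes", "basil", "..."] }

Inverted index – term to documents:
{ "tomato" : ["African tomato soup", "Good ol' tomato soup", "Grandma's tomato soup"] }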

In some search engines the index includes additional information such as frequency of the terms, e.g. how often a term occurs in each document, or the position of the term in each document. The frequency of a term is often used to calculate the relevance of a search result, whereas the position is often used to facilitate searching for phrases in a document.

 

Mapping

A schema is a description of one or more fields that describes the document type and how to handle the different fields of a document.

Indexes, types and documents

