Tips and Tricks in a world of Mix

Elasticsearch

Important comments

Before indexing the data, create and insert the mapping.

Date format for Elasticsearch: "dd/MM/yyyy HH:mm:ss" (HH means 24-hour presentation)
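If the documents are produced by a script, the dates have to be written in exactly this pattern. A minimal sketch in Python, assuming the standard strftime codes are an acceptable way to produce it (the helper name to_es_date is just an illustration):

```python
from datetime import datetime

# strftime equivalent of the pattern "dd/MM/yyyy HH:mm:ss".
# %H is the 24-hour clock, matching HH (%I would be the 12-hour clock).
ES_COMPATIBLE_FORMAT = "%d/%m/%Y %H:%M:%S"


def to_es_date(moment: datetime) -> str:
    """Render a datetime as a string in the date pattern used by the mapping."""
    return moment.strftime(ES_COMPATIBLE_FORMAT)


example = to_es_date(datetime(2014, 3, 7, 18, 5, 9))
print(example)  # 07/03/2014 18:05:09
```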

 

So, first create the "test" index:

$ curl -XPUT 'http://localhost:9200/test/'

Mapping

1) $ curl -XPUT "http://localhost:9200/test/targets/_mapping" -d @path_to/mapping.js

 

*passing the mapping inline, without putting it in a file, didn't work

*without quarrels

*without the @ in front of the path, the request returns:

{"error":"NullPointerException[null]","status":500}

*the mapping file includes:

{

"targets" : {                  <- name of your object

              "properties" : {     <- Elasticsearch reserved term

                 ** fields and their mappings

                 }

          }

}
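As an illustration, the contents of such a mapping file could be generated like this. It is only a sketch: the fields name and created are hypothetical, and the "string" field type assumes an Elasticsearch version that still uses mapping types, as the URLs in this post do.

```python
import json

# Hypothetical fields for the "targets" object; replace them with your own.
mapping = {
    "targets": {
        "properties": {
            "name": {"type": "string"},
            # The date field uses the pattern from the comments above.
            "created": {"type": "date", "format": "dd/MM/yyyy HH:mm:ss"},
        }
    }
}

# Serialize to the JSON text that goes into the mapping file.
mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```

Saving mapping_json to the file that is passed after the @ gives the mapping command something to send.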

Bulk Create

2) $ curl -XPOST "http://localhost:9200/test/targets/_bulk" --data-binary @path_to/data.json

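*put this line before each object serialized to JSON; every action line and every object goes on its own single line, and the file has to end with a newline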

{ "create" : { "_index" : "test", "_type" : "targets", "_id" : @some_unique_param } }
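Writing the bulk file by hand gets tedious, so here is a rough sketch of generating it. The documents and the id field are made up for the example, and the type name targets is an assumption taken from the URL of the bulk command.

```python
import json


def build_bulk_body(documents, index="test", doc_type="targets", id_field="id"):
    """Return newline-delimited JSON: one action line, then one document line."""
    lines = []
    for doc in documents:
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents, with dates in the format from the comments above.
targets = [
    {"id": 1, "name": "first", "created": "07/03/2014 18:05:09"},
    {"id": 2, "name": "second", "created": "08/03/2014 09:30:00"},
]

bulk_body = build_bulk_body(targets)
print(bulk_body, end="")
```

The resulting text is what would be saved to the data file and sent with --data-binary, which keeps the newlines intact.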

 

 

The terminology

An index is a keyword summary of a large body of content. It lets you find the content you need much faster than searching without it.

Document parsing (a.k.a. text processing, text analysis, text mining and content analysis) – when data is added, the search engine processes it to make it searchable. This scanning and processing of the data is called document parsing. In this process we create a list of terms (the data/words) with a mapping (references to the terms), save it all to disk and keep parts of it in memory for faster performance.

Lucene, the library that both Elasticsearch and Solr are built on, is a full-text search engine because it goes through all of the text as part of the indexing process.

Computers have to be programmed to break up text into its distinct elements, such as words and sentences. This process is called tokenization, and the different chunks, usually words, that constitute the text are called tokens.

There are many specialized tokenizers, for example CamelCase tokenizer, URL tokenizer, path tokenizer and N-gram tokenizer.
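To make the idea concrete, here is a toy sketch of two tokenizers: a plain word tokenizer and a character N-gram tokenizer. These are simplified illustrations, not the tokenizers that ship with Lucene or Elasticsearch, and the function names are made up.

```python
import re


def word_tokenize(text):
    """Split text into lowercase tokens made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_tokenize(word, n=3):
    """Return the character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


word_tokens = word_tokenize("Peel the tomatoes, then boil.")
print(word_tokens)              # ['peel', 'the', 'tomatoes', 'then', 'boil']
print(ngram_tokenize("basil"))  # ['bas', 'asi', 'sil']
```

The same text gives very different tokens depending on the tokenizer, which is why the choice of tokenizer decides what can be found later.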

Stop words – sometimes we want to prevent certain words from being indexed. For instance, in many cases it would make no sense to store the words on, for, a, the, us, who etc. in the index.
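A stop-word filter can be sketched as a simple set lookup applied to the tokens before they are stored. The stop-word list below is just the handful of words mentioned above, not a real stop-word list from any search engine.

```python
# Only the example words from the text; real lists are much longer.
STOP_WORDS = {"on", "for", "a", "the", "us", "who"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stop_words]


tokens = ["store", "the", "soup", "in", "a", "cool", "place", "for", "us"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['store', 'soup', 'in', 'cool', 'place']
```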

Relevancy – with this kind of handling there is a fair amount of irrelevant results. There are ways to minimize and partially eliminate them.

 

A Token is the name of a unit that we derive from the tokenizer, and the token therefore depends on the tokenizer. A token is not necessarily a word, but a word is normally a token when dealing with text. When we store the token in the index, it is usually called a term.

Forward Index – stores a list of all terms for each document that we are indexing. Indexing is fast, but querying is not really efficient, because the search engine has to look through all entries in the index for a specific term in order to return all documents containing that term.

Document | Terms
Grandma’s tomato soup | peeled, tomatoes, carrot, basil, leaves, water, salt, stir, and, boil, …
African tomato soup | 15, large, tomatoes, baobab, leaves, water, store, in, a, cool, place, …
Good ol’ tomato soup | tomato, garlic, water, salt, 400, gram, chicken, fillet, cook, for, 15, minutes, …
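The forward index from the table can be modelled as a plain dictionary from document to terms. This sketch (the function name is made up, and the term lists are the truncated ones from the table) shows why a query is slow: every document has to be scanned.

```python
# Document -> list of terms, copied from the (truncated) table above.
forward_index = {
    "Grandma's tomato soup": ["peeled", "tomatoes", "carrot", "basil", "leaves",
                              "water", "salt", "stir", "and", "boil"],
    "African tomato soup": ["15", "large", "tomatoes", "baobab", "leaves",
                            "water", "store", "in", "a", "cool", "place"],
    "Good ol' tomato soup": ["tomato", "garlic", "water", "salt", "400", "gram",
                             "chicken", "fillet", "cook", "for", "15", "minutes"],
}


def documents_containing(term, index):
    """Scan every entry of the forward index looking for the term."""
    return [doc for doc, terms in index.items() if term in terms]


leaves_docs = documents_containing("leaves", forward_index)
print(leaves_docs)  # ["Grandma's tomato soup", 'African tomato soup']
```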

 

 

Inverted Index – an approach where you index by the terms, so that each term gives you the list of relevant documents. The conventional index at the back of a textbook is an inverted index.

Term | Documents
baobab | African tomato soup
basil | Grandma’s tomato soup
leaves | African tomato soup, Grandma’s tomato soup
salt | African tomato soup, Good ol’ tomato soup
tomato | African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
water | African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup

Often both the forward and inverted index are used in search engines, where the inverted index is built by sorting the forward index by its terms.
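The "sort the forward index by its terms" step can be sketched like this: flatten the forward index into (term, document) pairs, sort them, and group the pairs by term. The forward index here is a shortened, hypothetical version of the soup example, and real engines do this on disk with far more compact structures.

```python
from itertools import groupby

# Shortened forward index: document -> terms.
forward_index = {
    "Grandma's tomato soup": ["basil", "leaves", "water"],
    "African tomato soup": ["baobab", "leaves", "water"],
    "Good ol' tomato soup": ["garlic", "water"],
}

# Flatten to unique (term, document) pairs and sort them by term, then document.
pairs = sorted({(term, doc) for doc, terms in forward_index.items() for term in terms})

# Group the sorted pairs by term to get term -> documents.
inverted_index = {
    term: [doc for _, doc in group]
    for term, group in groupby(pairs, key=lambda pair: pair[0])
}

print(inverted_index["leaves"])  # ['African tomato soup', "Grandma's tomato soup"]
print(inverted_index["water"])
```

A lookup is now a single dictionary access instead of a scan over every document.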

In some search engines the index includes additional information such as the frequency of the terms, i.e. how often a term occurs in each document, or the position of the term in each document. The frequency of a term is often used to calculate the relevance of a search result, whereas the position is often used to facilitate searching for phrases in a document.
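Here is a small sketch of an index that keeps positions, from which the frequency falls out as the number of positions. The documents and function names are invented for the example, and the phrase check only handles two-word phrases.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {document: [positions]}; the list length is the term frequency."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(name, []).append(position)
    return index


def phrase_in_document(index, first, second, name):
    """True if `second` appears directly after `first` in the named document."""
    first_positions = index.get(first, {}).get(name, [])
    second_positions = set(index.get(second, {}).get(name, []))
    return any(position + 1 in second_positions for position in first_positions)


# Hypothetical documents.
docs = {
    "note 1": "tomato soup with fresh tomato",
    "note 2": "soup of tomato",
}
index = build_positional_index(docs)

print(len(index["tomato"]["note 1"]))                         # term frequency: 2
print(phrase_in_document(index, "tomato", "soup", "note 1"))  # True
print(phrase_in_document(index, "tomato", "soup", "note 2"))  # False
```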

 

Mapping

A schema (called a mapping in Elasticsearch) is a description of one or more fields that describes the document type and how to handle the different fields of a document.

Indexes, types and documents
