
There is no magic to Elasticsearch preprocessing: try this approach first!

1. The problems

1.1 Real-world problem 1: splitting a string

Requirement: split the _id string in Elasticsearch, then run aggregate statistics on the pieces. For example: doc 1, _id=C12345; doc 2, _id=C12456; doc 3, _id=C31268.

Using aggregation, count the ids by prefix: 2 documents start with C1 and 1 document starts with C3.

How should this be written with the Elasticsearch API?

1.2 Real-world problem 2: converting a JSON string to an object

Can the raw data be transformed at insert time, before indexing?

{
    "headers":{
        "userInfo":[
            "{  \"password\": \"test\",\n  \"username\": \"zy\"}"
        ]
    }
}

The userInfo value is already a string. Can this JSON string be converted into an object during the insertion phase?

1.3 Real-world problem 3: updating array elements

I want to append a character to every value in a list.

For example, turn a document like {"tag":["a","b","c"]} into {"tag":["a2","b2","c2"]}.

Has anyone tried combining the foreach processor with a script?

2. Breaking the problems down

Problem 1: The requirement is aggregation. A painless script could do it at query time, but with a large data volume it is bound to cause performance problems.

A better approach is to move the processing forward: extract the first two characters of _id at write time and store them as a separate field.

Problem 2: The goal is a type conversion at write time, turning a complex string into a structured Object.

Problem 3: Every element of the array needs the same update. Again, a painless script would work.

However, doing the processing at the write stage greatly reduces the burden on the later analysis stage.

For all three problems, writing a Java or Python program to preprocess the data before writing it to Elasticsearch is one option.

But if we insist on solving them inside Elasticsearch itself, is there a better plan? Can the data be preprocessed before it is written?

3. What is data preprocessing?

Normally, whether our program writes data directly or imports it from a third-party data source (MySQL, Oracle, HBase, Spark, etc.), the raw data is synchronized to Elasticsearch in bulk as-is: whatever the source data looks like is exactly what gets indexed. As shown in the figure below:

As the three problems above show, the actual business data is not necessarily what we need for real analysis.

Preprocessing the data sensibly makes the downstream analysis and data-mining steps much easier.

Data preprocessing typically involves the following steps:

  • Data cleaning.

Mainly removing duplicate data, eliminating noise (i.e., interference data), and filling in default values.

  • Data integration.

Combining data from multiple data sources into a unified data store.

  • Data transformation.

Converting the data into a form suitable for data mining or analysis.

Does Elasticsearch itself offer a way to implement preprocessing?

4. Data preprocessing in Elasticsearch

Elasticsearch's ETL tool is the ingest node. Earlier posts have covered node role separation, what ingest nodes do, ingest in practice, and the pros and cons of ingest versus Logstash for preprocessing. If any of those are blind spots for you, go back and review them first.

The essence of the ingest node: before a document is actually indexed, the ingest node preprocesses it. The ingest node intercepts bulk and single-index requests, applies the transformations, and then hands the documents back to the index or bulk API to be written.

The figure below illustrates the Elasticsearch data preprocessing flow more vividly.

In a real business scenario, preprocessing takes three steps:

  • Step 1: Define a pipeline that preprocesses the data.

Based on the characteristics of the complex data to be processed, define one or more targeted pipelines (the pink and yellow parts in the figure above).

  • Step 2: Associate the write with a pipeline.

When writing data, updating data, or reindexing, specify the pipeline that should process the index; in effect, this connects the index being written with pipeline0 and pipelineZ in the figure above.

  • Step 3: Write the data.

Key point: the ingest node preprocesses each document before the actual indexing takes place.
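The three steps above can be tried out without touching a real index: the ingest API's _simulate endpoint runs sample documents through a pipeline definition and returns the transformed _source. A minimal sketch (the set processor and the field names here are illustrative, not taken from the problems above):

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "set": { "field": "source_tag", "value": "preprocessed" } }
    ]
  },
  "docs": [
    { "_source": { "message": "hello" } }
  ]
}
```

The response shows each document as it would be indexed, so a pipeline can be iterated on before it is attached to any write path.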

5. Hands-on practice

5.1 Implementing the solution to problem 1

PUT _ingest/pipeline/split_id
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.myid_prefix = ctx.myid.substring(0,2)"
      }
    }
  ]
}

The script processor's substring extracts the first two characters into a new prefix field, which the analysis stage can aggregate on.
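End to end, the usage looks roughly like this. The index name my_index is illustrative; note that the id is stored in a regular field, myid, because the pipeline script reads ctx.myid rather than the _id metadata, and the keyword subfield assumes default dynamic mapping:

```
PUT my_index/_doc/1?pipeline=split_id
{
  "myid": "C12345"
}

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "prefix_count": {
      "terms": {
        "field": "myid_prefix.keyword"
      }
    }
  }
}
```

The terms aggregation then returns one bucket per prefix (C1, C3, ...), which is exactly the count the question asked for.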

5.2 Implementing the solution to problem 2

PUT _ingest/pipeline/json_builder
{
  "processors": [
    {
      "json": {
        "field": "headers.userInfo",
        "target_field": "headers.userInfo.target"
      }
    }
  ]
}

The json processor performs the field type conversion, turning the string into a JSON object.
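The behavior is easy to check with _simulate before associating the pipeline with an index. Note that in the motivating document headers.userInfo is an array of strings, while the json processor parses a single JSON string, so this sketch uses a scalar field; the names userInfoRaw and userInfo are illustrative:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "json": {
          "field": "userInfoRaw",
          "target_field": "userInfo"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "userInfoRaw": "{ \"password\": \"test\", \"username\": \"zy\" }"
      }
    }
  ]
}
```

In the response, userInfo appears as a structured object with username and password fields instead of a raw string.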

5.3 Implementing the solution to problem 3

PUT _ingest/pipeline/add_builder
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
for (int i = 0; i < ctx.tag.length; i++) {
  ctx.tag[i] = ctx.tag[i] + "2";
}
"""
      }
    }
  ]
}

The script processor loops over the array and rewrites each element.
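Again, _simulate gives a quick check of the pipeline above (assuming add_builder has been created as shown):

```
POST _ingest/pipeline/add_builder/_simulate
{
  "docs": [
    {
      "_source": {
        "tag": ["a", "b", "c"]
      }
    }
  ]
}
```

In the response, tag should come back as ["a2", "b2", "c2"].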

Space is limited here; for a more detailed walkthrough see:

https://github.com/mingyitianxia/deep_elasticsearch/blob/master/es_dsl_study/1.ingest_dsl.md

6. No preprocessing vs. preprocessing before writing

Option 1: Import the data into Elasticsearch as-is and handle everything with painless scripts at analysis time. Simple and crude.

Importing is fun for a moment; the processing afterwards is endless pain!

As mentioned earlier, scripts have limited processing power, and on top of that they bring performance problems.

Not recommended.

Option 2: Use ingest nodes to preprocess the data, performing the necessary cleaning (ETL) operations, even if that increases storage (for example, by adding a new field). This trades space for time and clears the way for subsequent analysis.

Writing seems to become more complicated, but it is worth it: trade space at write time for speed at analysis time.

Recommended.

7. Frequently asked questions

7.1 Do ingest nodes have to be configured explicitly?

By default, every node has the ingest role enabled, so any node can handle preprocessing tasks.

However, once the cluster's data volume and node count are large enough, it is advisable to split node roles, just as with dedicated master nodes and dedicated coordinating nodes, and set up separate, dedicated ingest nodes.
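As a sketch, a dedicated ingest node is declared in elasticsearch.yml. The exact keys depend on the Elasticsearch version; 7.9 and later use node.roles, while earlier versions use individual boolean settings:

```yaml
# elasticsearch.yml for a dedicated ingest node (7.9 and later)
node.roles: [ ingest ]

# Equivalent on 6.x / early 7.x versions:
# node.master: false
# node.data: false
# node.ingest: true
```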

7.2 When can a pipeline be specified?

A pipeline can be specified when creating an index, creating a template, updating index settings, reindexing, and in update_by_query.

7.2.1 Specifying a pipeline at index creation

PUT ms-test
{
  "settings": {
    "index.default_pipeline": "init_pipeline"
  }
}

7.2.2 Specifying a pipeline in an index template

PUT _template/template_1
{
  "index_patterns": ["te*", "bar*"],
  "settings": {
    "number_of_shards": 1,
    "index.default_pipeline":"add_builder"
  }
}

7.2.3 Specifying a pipeline via an index settings update (when the index was created without one)

PUT /my_index/_settings
{
    "index" : {
        "default_pipeline" : "my_pipeline"
    }
}

7.2.4 Adding a pipeline during reindex

POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}

7.2.5 Specifying a pipeline in update_by_query

POST twitter/_update_by_query?pipeline=set-foo

8. Summary

The three problems at the start all came up in real online business discussions in Elasticsearch QQ and WeChat groups. With the help of Elasticsearch ingest-node preprocessing, all of them are solved cleanly.

Ingest pipelines are Elasticsearch's core data-preprocessing feature. Once you apply them in a production environment, you will find them very "sweet" and hard to live without.

References:

https://dev.classmethod.jp/server-side/elasticsearch/elasticsearch-ingest-node/

Data Analysis in Practice: 45 Lectures (course)

Further reading:

Elasticsearch's ETL tool: the Ingest node


Copyright notice
This article was written by [Mingyi world]. Please include a link to the original when reposting. Thanks.
