
Project in Action 01: What happens if you write 300 Tang poems into Elasticsearch?

1、 The project

What happens when you write the Three Hundred Tang Poems into Elasticsearch?

2、 Project description

This small project is distilled from a real production project, and it covers almost all of the knowledge points explained so far.

Working through it will help you connect the earlier topics and apply them in practice: requirements analysis, overall design, data modeling, ingest pipeline usage, retrieval and aggregation selection, Kibana visual analysis, and more.

3、 Requirements

Data source: https://github.com/xuchunyang/300

Note a bug in the data source: line 1753 contains "id": 178, which needs to be changed manually to "id": 252.
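Since the file is a JSON array, the duplicate can also be located programmatically rather than by eye. A minimal sketch (the `find_duplicate_ids` helper and the sample records are hypothetical stand-ins for the parsed 300.json data):

```python
import json
from collections import Counter

def find_duplicate_ids(poems):
    """Return the ids that appear more than once in the data set."""
    counts = Counter(p["id"] for p in poems)
    return sorted(i for i, n in counts.items() if n > 1)

# A miniature stand-in for the parsed 300.json array; in the real file,
# the second record carrying "id": 178 is the one that should read "id": 252.
sample = json.loads('[{"id": 178}, {"id": 178}, {"id": 179}]')
print(find_duplicate_ids(sample))  # -> [178]
```

Running this against the real file would print the clashing id, pointing straight at the record to patch.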

3.1 Data requirements

Points to note:

  • 1) Dictionary selection

  • 2) Tokenizer (analyzer) selection

  • 3) Mapping setup

  • 4) Support for the analysis dimensions required later

  • 5) Insertion time (added dynamically by the cluster, not by hand)

3.2 Write requirements

Points to note:

  • 1) Special-character cleaning

  • 2) Adding the insertion time

3.3 Analysis requirements

Search analysis, DSL in action:

  • 1) Feihualing (the "flying flower" verse game): which verses contain each of the words "Inscription", "Yi", and "All over the world"? How many poems match each?

  • 2) How many poems does Li Bai have? Sort them by poem length

  • 3) List the authors of the TOP10 longest and shortest poems

Aggregation analysis and visualization in action:

  • 1) Who wrote the most works? TOP10 ranking

  • 2) The proportions of five-character quatrains, seven-character regulated verse, and so on, with the corresponding author percentages

  • 3) Ranking of poems that share the same title

  • 4) What word cloud do the 300 poems form after segmentation?

4、 Requirement interpretation and design

4.1 Requirement interpretation

Follow the principle: design before coding.

A common developer failing: on receiving the requirements for a new project, whether simple or complex, we should first sort out the requirements and their logical structure, designing up front to build a global picture, instead of diving straight into typing code.

The core knowledge of this project covers the following parts:

  • Elasticsearch data modeling

  • Elasticsearch bulk batch writing

  • Elasticsearch preprocessing

  • Elasticsearch retrieval

  • Elasticsearch aggregation

  • Kibana Visualize usage

  • Kibana Dashboard usage

4.2 Sorting out the logical structure

A picture is worth a thousand words.

The following logical structure was derived from the requirements; keep this data flow in mind during development.

4.3 Modeling

The importance of data modeling has come up before, so once more: data models support systems and data, and systems and data support the business.

A good data model:

  • Makes systems integrate better and simplifies interfaces.

  • Reduces data redundancy, saves disk space, and improves transmission efficiency.

  • Accommodates more data, so the implementation logic does not change when new data types are added.

  • Surfaces more business opportunities and improves business efficiency.

  • Reduces business risk and business cost.

For Elasticsearch, the core of data modeling is constructing the Mapping.

For a raw JSON record such as (values shown here in English translation; the real data is Chinese):

{
    "id": 251,
    "contents": "Drive the orioles away, don't let them cry on the branch; their cries break my dream and keep me from Liaoxi.",
    "type": "five-character quatrain",
    "author": "Jin Changxu",
    "title": "Spring Grievance"
}

Our modeling logic is as follows:

Field name   | Field type     | Remarks
_id          |                | maps to the auto-increment id
contents     | text & keyword | tokenized; note this requires fielddata: true
type         | text & keyword |
author       | text & keyword |
title        | text & keyword |
timestamp    | date           | the insertion time
cont_length  | long           | length of contents, used for sorting

Because Chinese word segmentation is involved, the choice of analyzer matters.

The recommendation here is still the ik analyzer.

On ik dictionaries: the bundled dictionary is incomplete, so supplement it with common-usage dictionaries and domain dictionaries found online (for example, a poetry dictionary).

4.4 Outline design

  • The original JSON documents are read and bulk-written through a combination of the low-level elasticsearch Python API and the higher-level elasticsearch-dsl API.

  • Data preprocessing is implemented with an ingest pipeline. Preprocessing design point: when each poem's JSON is written, insert a timestamp field.

  • The template and mapping are built through Kibana.

  • Analyzer selection: ik_max_word, for fine-grained segmentation and thus a more granular word cloud.

5、 Project practice

5.1 Data preprocessing: ingest

Create the indexed_at pipeline, whose purposes are:

  • to stamp newly added documents with an insertion timestamp field;

  • to add a length field, making later sorting easy.

PUT _ingest/pipeline/indexed_at
{
  "description": "Adds timestamp to documents",
  "processors": [
    {
      "set": {
        "field": "_source.timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "script": {
        "source": "ctx.cont_length = ctx.contents.length();"
      }
    }
  ]
}
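Before relying on the pipeline, its effect can be mirrored client-side as a sanity check. The sketch below (the `enrich` helper is hypothetical) reproduces what the two processors do: stamp an insertion timestamp and record the character length of contents:

```python
from datetime import datetime, timezone

def enrich(doc):
    """Mirror the indexed_at pipeline locally: stamp the insertion time
    and store the character length of contents, as the script processor does."""
    out = dict(doc)
    out["timestamp"] = datetime.now(timezone.utc).isoformat()
    out["cont_length"] = len(doc["contents"])
    return out

doc = enrich({"contents": "打起黄莺儿，莫教枝上啼。"})
print(doc["cont_length"])  # character count, as ctx.contents.length() returns
```

Note that Python's len() on a str counts characters, matching Painless's String.length() on the same text.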

5.2 Mapping and template construction

The following DSL builds the template my_template.

It specifies the settings, the alias, and the basic mapping.

The benefits and convenience of templates were explained in detail in an earlier chapter.

PUT _template/my_template
{
  "index_patterns": [
    "some_index*"
  ],
  "aliases": {
    "some_index": {}
  },
  "settings": {
    "index.default_pipeline": "indexed_at",
    "number_of_replicas": 1,
    "refresh_interval": "30s"
  },
  "mappings": {
    "properties": {
      "cont_length": {
        "type": "long"
      },
      "author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      },
      "contents": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word",
        "fielddata": true
      },
      "timestamp": {
        "type": "date"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      },
      "type": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        },
        "analyzer": "ik_max_word"
      }
    }
  }
}


PUT some_index_01

5.3 Data reading and writing

This is implemented with the Python code below. Note:

  • bulk batch writes perform far better than single-document writes;

  • prefer bulk processing, especially when writing large files.

import json

from elasticsearch import Elasticsearch, helpers

# assumed local cluster; adjust the address as needed
client = Elasticsearch("http://localhost:9200")


def read_and_write_index():
    # define an empty list for the Elasticsearch docs
    doc_list = []

    # load the whole JSON array of poems
    input_file = open('300.json', encoding="utf8", errors='ignore')
    json_array = json.load(input_file)

    for item in json_array:
        try:
            # build the Elasticsearch doc as a dict;
            # the "_id" key lets us specify the document ID explicitly
            dict_doc = {}
            dict_doc["_id"] = item['id']
            dict_doc["contents"] = item['contents']
            dict_doc["type"] = item['type']
            dict_doc["author"] = item['author']
            dict_doc["title"] = item['title']

            # append the dict object to the list
            doc_list += [dict_doc]

        except KeyError as err:
            # print any record that is missing an expected field
            print("ERROR for num:", item.get('id'), "-- KeyError:", err, "for doc:", dict_doc)
            print("Dict docs length:", len(doc_list))

    try:
        print("\nAttempting to index the list of docs using helpers.bulk()")

        # use the helpers library's Bulk API to index the list of docs
        resp = helpers.bulk(
            client,
            doc_list,
            index="some_index",
            doc_type="_doc"
        )

        # print the response returned by Elasticsearch
        print("helpers.bulk() RESPONSE:", resp)
        print("helpers.bulk() RESPONSE:", json.dumps(resp, indent=4))
    except Exception as err:
        # print any errors returned while making the helpers.bulk() API call
        print("Elasticsearch helpers.bulk() ERROR:", err)
        quit()
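Note that helpers.bulk() here receives plain dicts; the elasticsearch-py helpers lift underscore-prefixed keys such as "_id" out as document metadata. The equivalent explicit action format can be sketched as follows (the `to_actions` helper is hypothetical):

```python
def to_actions(doc_list, index="some_index"):
    """Rewrite plain docs into explicit bulk actions, separating the
    metadata (_index, _id) from the document body (_source)."""
    for doc in doc_list:
        body = {k: v for k, v in doc.items() if not k.startswith("_")}
        yield {"_index": index, "_id": doc["_id"], "_source": body}

actions = list(to_actions([{"_id": 251, "title": "Spring Grievance"}]))
print(actions)
```

Either shape can be passed to helpers.bulk(); the explicit one makes the metadata handling visible.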

5.4 Data analysis

5.5 Search analysis

5.5.1 Feihualing: which verses contain "Inscription", "Yi", or "All over the world"? How many poems match each?

GET some_index/_search
{
  "query": {
    "match": {
      "contents": " Inscription "
    }
  }
}


GET some_index/_search
{
  "query": {
    "match": {
      "contents": " Yi "
    }
  }
}


GET some_index/_search
{
  "query": {
    "match": {
      "contents": " All over the world "
    }
  }
}

Practice shows:

  • "Inscription": 0 poems

  • "Yi": 1 poem

  • "All over the world": 114 poems

One cannot help but sigh: the Tang poets cherished the whole world and worried for the country and its people!

5.5.2 How many poems does Li Bai have? Sorted by poem length, longest first

POST some_index/_search
{
   "query": {
    "match_phrase": {
      "author": " Li Bai "
    }
  },
  "sort": [
    {
      "cont_length": {
        "order": "desc"
      }
    }
  ]
}


POST some_index/_search
{
  "aggs": {
    "genres": {
      "terms": {
        "field": "author.keyword"
      }
    }
  }
}
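On a data set this small, the terms aggregation can be cross-checked client-side with a plain frequency count (a sketch; `top_authors` and the sample records are hypothetical):

```python
from collections import Counter

def top_authors(poems, n=10):
    """Client-side equivalent of the author.keyword terms aggregation:
    count poems per author and return the n most frequent."""
    return Counter(p["author"] for p in poems).most_common(n)

sample = [{"author": "Li Bai"}, {"author": "Du Fu"}, {"author": "Du Fu"}]
print(top_authors(sample, n=2))  # most frequent author first
```

Counter.most_common() returns (author, count) pairs sorted by count, matching the default ordering of a terms bucket.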


Among the 300 Tang poems, Li Bai contributes 33 (second only to Du Fu's 39). His longest is "Intones difficult", 353 characters in total.

Li Bai and Du Fu live up to their titles, the Poet Immortal and the Poet Sage: both were prolific!

5.5.3 List the authors of the TOP10 longest and shortest poems

POST some_index/_search
{
  "sort": [
    {
      "cont_length": {
        "order": "desc"
      }
    }
  ]
}


POST some_index/_search
{
  "sort": [
    {
      "cont_length": {
        "order": "asc"
      }
    }
  ]
}

The longest poem: Bai Juyi's "Everlasting Regret", 960 characters.

The shortest poem: Wang Wei's "Luchai", 24 characters (with many ties).
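The two sorts above can likewise be reproduced client-side once the content length is known (a sketch; `rank_by_length` and the sample records are hypothetical):

```python
def rank_by_length(poems, n=10, longest=True):
    """Top-n poems by content length, mirroring the cont_length sort."""
    ranked = sorted(poems, key=lambda p: len(p["contents"]), reverse=longest)
    return [(p["author"], len(p["contents"])) for p in ranked[:n]]

sample = [
    {"author": "Bai Juyi", "contents": "x" * 960},
    {"author": "Wang Wei", "contents": "x" * 24},
]
print(rank_by_length(sample, n=1))                 # longest first
print(rank_by_length(sample, n=1, longest=False))  # shortest first
```

The server-side sort is still preferable at scale; this is only a cross-check for small data.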

5.6 Aggregation analysis

The screenshots below were produced in Kibana; the details were all explained in the earlier Kibana visualization sections.

5.6.1 Who wrote the most works? TOP10 ranking

5.6.2 The proportions of five-character quatrains, seven-character regulated verse, and so on, with the corresponding author percentages

5.6.3 Ranking of poems that share the same title

5.6.4 What word cloud do the 300 poems form after segmentation?

5.6.5 Global view

6、 Summary

Using the Three Hundred Tang Poems as the business scenario, this small project walked through the three stages of requirements, design, and implementation, building an overall understanding of the core knowledge points of Elasticsearch and Kibana.

Core purpose: practice on small projects to improve real-world project delivery and product R&D capability.

Food for thought: the word cloud does not look great. Why?




Copyright notice: this article was written by [Mingyi world]. Please include a link to the original when reposting. Thanks.
