
Mastering Kafka from the Interview Perspective

Kafka is an excellent distributed message middleware, and many systems use Kafka for messaging. Understanding and being able to use a distributed messaging system has become a required skill for backend developers. Today, Code Byte starts from common Kafka interview questions and talks through Kafka with you.

(Figure: mind map of the topics covered)

Let's talk about distributed message middleware

Questions

  • What is distributed message middleware?

  • What is the role of message middleware?

  • What are the usage scenarios of message middleware?

  • How do you select a message middleware?

Message queue

Distributed messaging is a communication mechanism. Unlike RPC, HTTP, or RMI, message middleware communicates through a distributed intermediary. As shown in the figure, after message middleware is introduced, the upstream business system sends a message, which is first stored in the middleware and then distributed to the corresponding downstream business modules (the distributed producer-consumer pattern). This asynchronous approach reduces the coupling between services.

(Figure: message middleware architecture)

Message middleware can be defined as follows:

  • It uses an efficient and reliable messaging mechanism for platform-independent data exchange

  • It integrates distributed systems on the basis of data communication

  • By providing messaging and message-queue models, it extends inter-process communication to distributed environments

Introducing an extra component into a system architecture inevitably increases the architecture's complexity and the difficulty of operations. So what advantages does distributed message middleware bring to a system, and what role does it play?

  • Decoupling

  • Redundancy (storage)

  • Scalability

  • Peak shaving

  • Recoverability

  • Ordering guarantees

  • Buffering

  • Asynchronous communication

During interviews, interviewers often care about a candidate's ability to evaluate and select open-source components. This tests the breadth of the candidate's knowledge as well as the depth of their understanding of a particular class of system, and it also reveals their overall grasp of systems and their ability to design system architecture. There are many open-source distributed messaging systems, each with its own characteristics. Choosing one requires not only a certain understanding of the messaging systems themselves, but also a clear understanding of your own system's requirements.

Here is a comparison of several common distributed messaging systems:

(Figure: comparison table of common distributed messaging systems)

Answer keywords

  • What is distributed message middleware? Messaging, queues, distribution, the producer-consumer pattern.

  • What is the role of message middleware? Decoupling, peak shaving, asynchronous communication, buffering.

  • What are the usage scenarios of message middleware? Asynchronous communication; message storage and processing.

  • How do you select a message middleware? Language, protocol, HA, data reliability, performance, transactions, ecosystem, simplicity, push vs. pull model.

Kafka Basic concepts and Architecture

Questions

  • Briefly describe Kafka's architecture?

  • Is Kafka push-based or pull-based? What are the differences between push and pull?

  • How does Kafka broadcast messages?

  • Are Kafka messages ordered?

  • Does Kafka support read-write separation?

  • How does Kafka ensure high data availability?

  • What is ZooKeeper's role in Kafka?

  • Does Kafka support transactions?

  • Can the number of partitions be reduced?

General concepts in the Kafka architecture:

(Figure: Kafka architecture)
  • Producer: the producer, i.e. the party that sends messages. The producer creates messages and then sends them to Kafka.

  • Consumer: the consumer, i.e. the receiving party. A consumer connects to Kafka, receives messages, and then carries out the corresponding business logic.

  • Consumer Group: a consumer group contains one or more consumers. Using multiple partitions together with multiple consumers greatly improves downstream processing throughput. Consumers in the same group never consume the same message twice; likewise, consumers in different groups consuming the same topic do not affect each other. Kafka implements both the P2P mode and the broadcast mode of messaging through consumer groups (see the sketch after this list).

  • Broker: a service proxy node. A broker is a Kafka service node, i.e. a Kafka server.

  • Topic: Kafka organizes messages by topic. Producers send messages to specific topics, and consumers subscribe to topics and consume their messages.

  • Partition: a topic is a logical concept that can be subdivided into multiple partitions, each belonging to a single topic. Different partitions under the same topic contain different messages. At the storage level, a partition can be regarded as an appendable log file; each message is assigned a specific offset when it is appended to the partition's log file.

  • Offset: the unique identifier of a message within a partition. Kafka uses it to guarantee ordering within a partition, but an offset does not span partitions; in other words, Kafka guarantees partition-level ordering rather than topic-level ordering.

  • Replication: replicas, Kafka's way of ensuring high data availability. Data of the same partition can have multiple replicas on multiple brokers. Normally only the leader replica serves reads and writes; when the broker holding the leader crashes or its network fails, Kafka, under the Controller's management, elects a new leader replica to serve reads and writes.

  • Record: the message record that is actually written to Kafka and can be read back from it. Each record contains a key, a value, and a timestamp.
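As a concrete illustration of the P2P vs. broadcast point above, here is a minimal sketch with the Java client: two consumers with different group.id values each receive every message of the topic (broadcast), while consumers sharing one group.id would split the partitions between them (P2P). The broker address, topic name, and group ids are placeholders:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BroadcastDemo {
    // Hypothetical topic name, for illustration only.
    static final String TOPIC = "demo-topic";

    static KafkaConsumer<String, String> createConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(TOPIC));
        return consumer;
    }

    public static void main(String[] args) {
        // Different group.id values: each group sees every message (broadcast).
        KafkaConsumer<String, String> groupA = createConsumer("group-a");
        KafkaConsumer<String, String> groupB = createConsumer("group-b");
        // Two consumers sharing one group.id would instead divide the
        // topic's partitions between them (P2P / queue semantics).
    }
}
```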

Kafka Topic Partitions Layout

(Figure: topic and partition layout)

Kafka divides each topic into partitions, and partitions can be read and written concurrently.

Kafka Consumer Offset

(Figure: consumer offsets within a partition)

ZooKeeper

(Figure: ZooKeeper's role in a Kafka cluster)
  • Broker registration: brokers are distributed and independent of each other; ZooKeeper is used to manage all the broker nodes registered to the cluster.

  • Topic registration: in Kafka, messages of the same topic are divided into multiple partitions distributed across multiple brokers, and the mapping between these partitions and brokers is also maintained by ZooKeeper.

  • Producer load balancing: since partitions of the same topic are distributed across multiple brokers, producers need to send messages to these brokers in a sensibly balanced way.

  • Consumer load balancing: similarly to producers, so that multiple consumers receive messages from the right brokers, each consumer group contains several consumers, each message is delivered to only one consumer in the group, and different groups consume the messages of their own subscribed topics without interfering with each other.

Answer keywords

  • Briefly describe Kafka's architecture?

    Producer, Consumer, Consumer Group, Topic, Partition

  • Is Kafka push-based or pull-based? What are the differences between push and pull?

    Producers push messages to the broker; consumers pull messages from it. The pull model lets consumers manage their own offsets, which helps read performance

  • How does Kafka broadcast messages?

    Through consumer groups: each group receives the full message stream

  • Are Kafka messages ordered?

    Unordered at the Topic level; ordered within a Partition

  • Does Kafka support read-write separation?

    No; only the Leader serves reads and writes

  • How does Kafka ensure high data availability?

    Replicas, acks, HW

  • What is ZooKeeper's role in Kafka?

    Cluster management and metadata management

  • Does Kafka support transactions?

    Transactions are supported since 0.11 and enable "exactly once" semantics (see the sketch after this list)

  • Can the number of partitions be reduced?

    No; data would be lost
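For the transactions answer above, a minimal sketch of the transactional producer API available since 0.11; the broker address, topic name, and transactional.id are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Setting a transactional.id enables the idempotent, transactional producer.
        props.put("transactional.id", "demo-txn-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("demo-topic", "k1", "v1"));
            producer.send(new ProducerRecord<>("demo-topic", "k2", "v2"));
            producer.commitTransaction(); // both records become visible atomically
        } catch (KafkaException e) {
            producer.abortTransaction();  // neither record becomes visible
        } finally {
            producer.close();
        }
    }
}
```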

Using Kafka

Questions

  • What command-line tools does Kafka provide? Which ones have you used?

  • What is the Kafka Producer's execution flow?

  • What are common Kafka Producer configurations?

  • How do you keep Kafka messages ordered?

  • How does the Producer guarantee that no data is lost in transit?

  • How can Producer performance be improved?

  • If the number of consumers in a group exceeds the number of partitions, how does Kafka handle it?

  • Is the Kafka Consumer thread-safe?

  • Describe the threading model you use when consuming messages with the Kafka Consumer. Why is it designed that way?

  • What are common Kafka Consumer configurations?

  • When will a Consumer be kicked out of the group?

  • How does Kafka react when a Consumer joins or leaves?

  • What is a Rebalance, and when does it happen?

Command line tools

Kafka's command-line tools live in the /bin directory of the Kafka distribution. They mainly include service and cluster management scripts, configuration scripts, information-viewing scripts, topic scripts, and client scripts.

  • kafka-configs.sh: configuration management script

  • kafka-console-consumer.sh: Kafka console consumer

  • kafka-console-producer.sh: Kafka console producer

  • kafka-consumer-groups.sh: Kafka consumer group information

  • kafka-delete-records.sh: deletes log records up to a specified offset

  • kafka-log-dirs.sh: Kafka message log directory information

  • kafka-mirror-maker.sh: replicates data between Kafka clusters in different data centers

  • kafka-preferred-replica-election.sh: triggers a preferred-replica election

  • kafka-producer-perf-test.sh: Kafka producer performance test script

  • kafka-reassign-partitions.sh: partition reassignment script

  • kafka-replica-verification.sh: verifies replica synchronization progress

  • kafka-server-start.sh: starts the Kafka service

  • kafka-server-stop.sh: stops the Kafka service

  • kafka-topics.sh: topic management script

  • kafka-verifiable-consumer.sh: verifiable (testable) Kafka consumer

  • kafka-verifiable-producer.sh: verifiable (testable) Kafka producer

  • zookeeper-server-start.sh: starts the ZooKeeper service

  • zookeeper-server-stop.sh: stops the ZooKeeper service

  • zookeeper-shell.sh: ZooKeeper client

In everyday use, kafka-console-consumer.sh and kafka-console-producer.sh are handy for testing Kafka production and consumption, kafka-consumer-groups.sh lets you view and manage consumer groups, and kafka-topics.sh is typically used to view and manage topics.
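For example, a quick smoke test of a local broker could look like the following (a sketch assuming a broker at localhost:9092 and a topic named test, with flags as in Kafka 2.x):

```bash
# Create a topic with 3 partitions and a replication factor of 1 (single-broker dev setup)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic test --partitions 3 --replication-factor 1

# Produce messages interactively (one message per line)
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test

# In another terminal, consume the topic from the beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic test --from-beginning

# List the consumer groups known to the cluster
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```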

Kafka Producer

The normal production logic of a Kafka producer consists of the following steps:

  1. Configure the client parameters and create the producer instance.

  2. Build the message to be sent.

  3. Send the message.

  4. Close the producer instance.
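A minimal sketch of these four steps with the current Java client; the broker address, topic name, and the acks/compression settings shown are illustrative choices, not requirements:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducerDemo {
    public static void main(String[] args) throws Exception {
        // 1. Configure the client and create the producer instance.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");              // wait for all in-sync replicas: durability over latency
        props.put("compression.type", "gzip"); // trade CPU for less network IO
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // 2. Build the message to be sent.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("demo-topic", "key-1", "hello kafka");

        // 3. Send it (send() is asynchronous; get() blocks until the broker acks).
        RecordMetadata meta = producer.send(record).get();
        System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());

        // 4. Close the producer to flush any buffered messages.
        producer.close();
    }
}
```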

The producer's send path is shown in the figure below: a message passes through the interceptors, the serializer, and the partitioner, and is finally batched by the accumulator and sent to the broker.

(Figure: producer send flow)

A Kafka producer requires the following parameters:

  • bootstrap.servers: the address(es) of the Kafka brokers

  • key.serializer: the serializer for message keys

  • value.serializer: the serializer for message values

Common parameters (these come from the legacy Scala producer client):

  • batch.num.messages

    Default: 200. The number of messages per batch. Only applies to async mode.

  • request.required.acks

    Default: 0. 0 means the producer does not wait for acknowledgment from the leader; 1 means the leader must confirm the write to its local log before acknowledging; -1 means acknowledgment only after all replicas have completed the write. Only applies to async mode. Tuning this parameter is a tradeoff between data loss and throughput: if you are not sensitive to data loss and care about efficiency, setting it to 0 greatly improves the producer's send throughput.

  • request.timeout.ms

    Default: 10000. The acknowledgment timeout.

  • partitioner.class

    Default: kafka.producer.DefaultPartitioner. Must implement kafka.producer.Partitioner; provides a partitioning policy based on the key. Sometimes we need messages of the same type to be processed in order, so we have to customize the partitioning strategy to route data of the same type to the same partition (see the sketch after this list).

  • producer.type

    Default: sync. Whether messages are sent synchronously or asynchronously. Async mode batches sends through kafka.producer.AsyncProducer; sync mode uses kafka.producer.SyncProducer. The choice between synchronous and asynchronous sending also affects production throughput.

  • compression.codec

    Default: none, i.e. no compression. Other options are "gzip", "snappy", and "lz4". Compressing messages can greatly reduce network traffic and network IO, improving overall performance.

  • compressed.topics

    Default: null. With compression enabled, you can specify the topics to compress; if unspecified, all topics are compressed.

  • message.send.max.retries

    Default: 3. The maximum number of retries when sending a message.

  • retry.backoff.ms

    Default: 300. The extra interval added before each retry.

  • topic.metadata.refresh.interval.ms

    Default: 600000. The interval for periodically refreshing metadata. When a partition is lost or its leader becomes unavailable, the producer also refreshes metadata proactively. If set to 0, metadata is fetched on every send, which is not recommended. If negative, metadata is refreshed only on failure.

  • queue.buffering.max.ms

    Default: 5000. The maximum time data may sit in the producer queue. Async mode only.

  • queue.buffering.max.messages

    Default: 10000. The maximum number of messages the producer may buffer. Async mode only.

  • queue.enqueue.timeout.ms

    Default: -1. 0 means messages are dropped when the queue is full; a negative value means block when the queue is full; a positive value means block for that many milliseconds when the queue is full. Async mode only.
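The parameter names above are from the legacy client. With the current Java client, the same idea of routing same-type data to one partition is expressed through the org.apache.kafka.clients.producer.Partitioner interface. A minimal sketch, with an illustrative class name and a key-hashing scheme similar to the default partitioner's:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

/**
 * Routes all records sharing a key to the same partition, so messages of
 * the same "type" are processed in order. Register it on the producer with
 * props.put("partitioner.class", KeyHashPartitioner.class.getName()).
 */
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: pin to partition 0 (a real impl might round-robin)
        }
        // Stable hash of the key bytes: the same key always lands on the same partition.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```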

Kafka Consumer

Kafka has the concept of consumer groups. Each consumer consumes messages only from the partitions assigned to it, and each partition can be consumed by only one consumer within a group. Therefore, if the number of consumers in a group exceeds the number of partitions, the surplus consumers are assigned no partition and consume nothing. The relationship between consumer groups and consumers is shown below:

(Figure: consumer groups and partition assignment)

Consuming messages with the Kafka consumer client usually involves the following steps:

  1. Configure the client and create the consumer

  2. Subscribe to topics

  3. Pull messages and process them

  4. Commit the consumer offsets

  5. Close the consumer instance
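A minimal sketch of these five steps with the Java client, committing offsets manually; the broker address, group id, and topic name are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerDemo {
    public static void main(String[] args) {
        // 1. Configure the client and create the consumer.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // we commit offsets manually (step 4)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        try {
            // 2. Subscribe to the topic(s).
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // 3. Pull messages and process them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // 4. Commit the consumed offsets.
                consumer.commitSync();
            }
        } finally {
            // 5. Close the consumer instance.
            consumer.close();
        }
    }
}
```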

(Figure: consumer workflow)

Because the Kafka consumer client is not thread-safe, a Reactor-like model can be used on the consumer side to guarantee thread safety while improving consumer throughput: a single thread pulls the data, and a pool of worker threads processes it.

(Figure: multi-threaded consumption model)
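A hedged sketch of this model with the Java client: one thread owns the (non-thread-safe) KafkaConsumer and polls, while a worker pool does the heavy processing. Offset handling is deliberately simplified; production code must track per-partition completion before committing:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReactorStyleConsumer {
    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(8);
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "reactor-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Only ONE thread ever touches the KafkaConsumer (it is not thread-safe);
        // heavy processing is handed off to the worker pool.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    workers.submit(() -> handle(record)); // processing off the poll thread
                }
                // NOTE: committing here with in-flight async work can lose or redo
                // messages on failure; real code tracks per-partition completion first.
            }
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // business logic goes here
    }
}
```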

Kafka consumer parameters

  • bootstrap.servers: the broker addresses to connect to, in host:port format.

  • group.id: the consumer group this consumer belongs to.

  • key.deserializer: the counterpart of the producer's key.serializer; how message keys are deserialized.

  • value.deserializer: the counterpart of the producer's value.serializer; how message values are deserialized.

  • session.timeout.ms: the time after which the coordinator considers a consumer dead. Default 10s. This parameter is the interval within which the Consumer Group detects a crashed member; it works like a heartbeat expiration time.

  • auto.offset.reset: what the consumer should do when it reads a partition with no valid offset (for example, the offset is obsolete and the message has long since been deleted). The default is latest, i.e. read from the newest records (those produced after the consumer started); the other option is earliest, i.e. start reading from the beginning when the offset is invalid.

  • enable.auto.commit: whether to commit offsets automatically. If false, offsets must be committed manually in the program. For exactly-once semantics, it is best to commit offsets manually.

  • fetch.max.bytes: the maximum number of bytes returned by a single fetch.

  • max.poll.records: the maximum number of messages returned by a single poll() call. If the processing logic is light, this value can be raised. However, all max.poll.records messages must be processed within session.timeout.ms. Default 500.

  • request.timeout.ms: the maximum time to wait for a response to a request. If there is no response within the timeout, Kafka either resends the message or marks the request as failed once the retry limit is exceeded.

Kafka Rebalance

A rebalance is essentially a protocol that specifies how all the consumers under a consumer group agree on allocating every partition of the subscribed topics. For example, suppose a group has 20 consumers subscribed to a topic with 100 partitions. Under normal circumstances, Kafka assigns 5 partitions to each consumer on average. This assignment process is called a rebalance.

When does a rebalance happen?

This is also a frequently asked question. A rebalance has three triggers:

  • Group membership changes (a new consumer joins the group, an existing consumer leaves voluntarily, or a consumer crashes; the difference between the last two is discussed later)

  • The number of subscribed topics changes

  • The number of partitions of a subscribed topic changes

How are partitions assigned within a group?

Kafka provides two assignment strategies by default: Range and Round-Robin. The assignment strategy is pluggable, so you can also implement your own assignor for a different strategy, as configured in the sketch below.
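For example, switching from the default Range strategy to Round-Robin with the Java client is just a configuration change (the assignor class names below are the ones shipped with the client; broker address and group id are placeholders):

```java
import java.util.Properties;

public class AssignorConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        // Default is org.apache.kafka.clients.consumer.RangeAssignor;
        // RoundRobinAssignor spreads partitions evenly across the group's consumers.
        props.put("partition.assignment.strategy",
                  "org.apache.kafka.clients.consumer.RoundRobinAssignor");
        // A custom strategy implements the client's assignor interface
        // and is registered here by class name.
        return props;
    }
}
```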

Answer keywords

  • What command-line tools does Kafka provide? Which ones have you used? They live in the /bin directory and are used to manage the Kafka cluster, manage topics, and produce and consume messages

  • What is the Kafka Producer's execution flow? Interceptors, serializer, partitioner, and accumulator

  • What are common Kafka Producer configurations? Broker addresses, acks, network and send parameters, compression parameters

  • How do you keep Kafka messages ordered? Kafka is inherently unordered at the Topic level; only partitions are ordered. To guarantee processing order, you can use a custom partitioner to send the data that must be processed in order to the same partition

  • How does the Producer guarantee that no data is lost in transit? The acks mechanism plus retries

  • How can Producer performance be improved? Batching, asynchronous sending, compression

  • If the number of consumers in a group exceeds the number of partitions, how does Kafka handle it? The surplus consumers sit idle and consume no data

  • Is the Kafka Consumer thread-safe? No; poll on a single thread and process with multiple threads

  • Describe the threading model you use when consuming messages with the Kafka Consumer. Why is it designed that way? Pulling and processing are separated

  • What are common Kafka Consumer configurations? Broker addresses, network and fetch parameters, heartbeat parameters

  • When will a Consumer be kicked out of the group? Crashes, network anomalies, or processing that takes so long that the offset commit times out

  • How does Kafka react when a Consumer joins or leaves? It performs a Rebalance

  • What is a Rebalance, and when does it happen? It reassigns partitions within the group; it happens when topics change or consumers change

High availability and performance

Questions

  • How does Kafka ensure high availability?

  • What are Kafka's delivery semantics?

  • What is the role of replicas?

  • What are AR and ISR?

  • What are the Leader and the Follower?

  • What do HW, LEO, LSO, and LW stand for in Kafka?

  • What does Kafka do to achieve its excellent performance?

Partitions and replicas

(Figure: partitions and replicas)

In distributed data systems, partitions are commonly used to improve processing capacity, and replicas are used to guarantee high data availability. Multiple partitions mean the ability to process data concurrently. Among the multiple replicas of a partition, only one is the leader; the rest are followers. Only the leader replica serves external requests, and follower replicas are usually stored on brokers different from the leader's. Through this mechanism, high availability is achieved: when one machine goes down, a follower replica can quickly be promoted to leader and start serving requests.
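As a concrete illustration, a topic whose partitions each keep several replicas can be created with the Java AdminClient; the topic name, partition count, and replication factor below are placeholder choices:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each with 3 replicas (1 leader + 2 followers),
            // so any single broker failure leaves every partition available.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```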

Why don't follower replicas serve reads?

This is essentially a trade-off between performance and consistency. Imagine what would happen if follower replicas also served reads. Performance would certainly improve, but a series of problems would follow, similar to phantom reads and dirty reads in database transactions. For example: you write a message to Kafka topic A; consumer B consumes from topic A but sees nothing, because it reads a lagging follower replica into which the latest message has not yet been written; meanwhile, consumer C can consume the latest message, because it consumes the leader replica. Kafka uses the HW and offsets to decide which data consumers can read, namely data that has already been committed.

(Figure: high watermark)

Only the leader serves reads and writes, so how is a new leader elected?

Kafka keeps the replicas that stay synchronized with the leader in the ISR replica set. The leader replica is of course always in the ISR; in some special cases, the ISR may contain only the leader. When the leader fails, Kafka learns of it through ZooKeeper and, under the Controller, elects a new leader from the ISR to serve requests. But there is another problem: as mentioned, the ISR may contain only the leader, and when that leader's broker goes down, the ISR becomes empty. What then? If the unclean.leader.election.enable parameter is set to true, Kafka will pick a leader from the out-of-sync replicas, i.e. from replicas that are not in the ISR.

Replicas introduce the problem of replica synchronization

Kafka maintains the list of available replicas (ISR) within the set of all assigned replicas (AR). When a producer sends a message to a broker, the acks configuration determines how many replicas must have synchronized the message before the send counts as successful; inside the broker, the ReplicaManager service manages data synchronization between the followers and the leader.

(Figure: replica synchronization)

Performance optimization

  • Partition-level concurrency

  • Sequential disk reads and writes

  • Page cache: page-level reads and writes

  • Read-ahead: Kafka reads messages that are about to be consumed into memory in advance

  • High-performance (binary) serialization

  • Memory mapping

  • Lock-free offset management: improves concurrency

  • The Java NIO model

  • Batching: batched reads and writes

  • Compression: message compression and storage compression reduce network and disk IO overhead

Partition concurrency

On the one hand, since different partitions can live on different machines, the cluster can be fully exploited for parallel processing across machines. On the other hand, since a partition physically corresponds to a directory, even when multiple partitions sit on the same node, they can be placed on different disk drives, enabling parallelism across disks and making full use of multiple disks.

Sequential reading and writing

Kafka evenly splits the log in each partition directory into equal-sized data files called segments (segment files; 1 GB by default, configurable via log.segment.bytes), and every segment is written in an append-only fashion.

(Figure: append-only writes)

Answer keywords

  • How does Kafka ensure high availability?

    High data availability through replicas, together with producer acks and retries, automatic Leader election, and Consumer rebalancing

  • What are Kafka's delivery semantics?

    Delivery semantics are generally at least once, at most once, and exactly once. Kafka implements the first two through the acks mechanism.

  • What is the role of replicas?

    Achieving high data availability

  • What are AR and ISR?

    AR: Assigned Replicas, the set of replicas allocated when a partition is created after the topic is created; the count is determined by the replication factor. ISR: In-Sync Replicas, a particularly important concept in Kafka, referring to the subset of AR that stays synchronized with the Leader. A replica in AR is not necessarily in ISR, but the Leader replica is naturally included in ISR. Regarding ISR, another common interview question is how to judge whether a replica should belong to ISR. The current criterion: if the follower replica's LEO lags behind the Leader's LEO for longer than the broker-side parameter replica.lag.time.max.ms, the replica is removed from ISR.

  • What are the Leader and the Follower?

  • What is HW in Kafka?

    The high watermark. It is an important field that controls the range of messages a consumer can read: an ordinary consumer can only "see" messages in the Leader replica between the Log Start Offset and the HW (exclusive); messages at or above the HW are invisible to consumers.

  • What does Kafka do to achieve its excellent performance?

    Partition-level concurrency, sequential disk reads and writes, page cache, compression, high-performance binary serialization, memory mapping, lock-free offset management, and the Java NIO model

This article does not dig deep into Kafka's implementation details or source code, but Kafka is an excellent open-source system whose elegant architecture and source-code design are well worth studying. I strongly recommend that interested readers explore this open-source system in depth; it will help a great deal with your architecture design, coding skills, and performance optimization.


Copyright notice
This article was created by [singwhatiwanna]. Please include a link to the original when reposting.
https://cdmana.com/2020/12/20201224105612941s.html
