REINFORCEMENT LEARNING-BASED ROUTING PROTOCOLS IN VEHICULAR AND FLYING AD HOC NETWORKS – A LITERATURE SURVEY

Vehicular and flying ad hoc networks (VANETs and FANETs) are becoming increasingly important with the development of smart cities and intelligent transportation systems (ITSs). The high mobility of nodes in these networks leads to frequent link breaks, which complicates the discovery of the optimal route from source to destination and degrades network performance. One way to overcome this problem is to use machine learning (ML) in the routing process, and the most promising among the different ML types is reinforcement learning (RL). Although there are several surveys on RL-based routing protocols for VANETs and FANETs, the important issue of integrating RL with well-established modern technologies, such as software-defined networking (SDN) or blockchain, has not been adequately addressed, especially when used in complex ITSs. In this paper, we focus on performing a comprehensive categorisation of RL-based routing protocols for both network types, having in mind their simultaneous use and their integration with other technologies. A detailed comparative analysis of protocols is carried out based on different factors that influence the reward function in RL and the consequences they have on network performance. Also, the key advantages and limitations of RL-based routing are discussed in detail.


INTRODUCTION
Modern life cannot be imagined without the usage of some type of wireless ad hoc networks (WANETs) with dynamic nodes that can participate in data packet routing. The most common dynamic WANETs are mobile ad hoc networks (MANETs), vehicular ad hoc networks (VANETs) and flying ad hoc networks (FANETs). Although with VANETs a wide range of services for intelligent transportation systems (ITSs) and smart cities can be provided, the lack of fixed infrastructure, as well as an unpredictable number of nodes in ad hoc scenarios can lead to significant limitations. One of the possible solutions is to use FANETs that provide temporary connectivity in cases of low vehicle density or supplement the missing fixed infrastructure. This will lead to complex and heterogeneous environments that include both VANETs and FANETs to ensure adequate quality of service (QoS). The process of choosing the optimal route from source to destination is a challenging task in these networks since their topology is constantly changing, which can cause frequent link breaks and performance degradation. In these conditions, traditional routing techniques show significant limitations, especially for application in dynamic heterogeneous networks. One possible solution that attracts a lot of attention from researchers is the application of machine learning (ML). The most promising type of ML is reinforcement learning (RL), which monitors network changes through constant interaction with the environment and, depending on the current network state, helps in selecting the optimal route, especially in heterogeneous highly dynamic ad hoc networks.
There are several survey studies related to the application of RL in VANETs and FANETs in the literature, among which [1][2][3] can be singled out in terms of quality and importance. The authors in [1] gave an extensive overview of RL-based routing protocols in VANETs, where protocols are first categorised by routing type and then compared based

Current limitations, future trends and overall discussion are given in the fourth section. Concluding remarks are given in the last section.

REINFORCEMENT LEARNING
RL is the most common type of ML in routing protocols for dynamic WANETs. This type of learning is described in detail in [4] and involves learning through constant interaction with the environment to achieve a certain goal. The RL process in one WANET can be modelled in several ways. The most commonly used approach is that each node in the network that sends packets represents a learning agent, while the entire network represents the environment. Sending packets to one of the neighbouring nodes represents a potential action that the agent can take. Since each node has a finite set of neighbours, it represents a set of possible actions that the node can take. The feedback received by the sender contains a reward for the taken action and the new state of the environment. The reward may depend on various influencing factors, which are further discussed in the third section.
One of the simplest RL algorithms is Q-learning (QL) [4], in which each agent maintains a table of Q-values that express the usefulness of taking a specific action in a particular state. Based on these values, the agent makes decisions about future actions. Q-values are updated after each action that an agent takes, based on the current reward and the maximum possible Q-value that the agent can achieve in the following state. To improve the learning process, the deep RL (DRL) concept is introduced in [5], where the determination of Q-values is performed using a deep Q-network (DQN) that combines RL with a deep neural network (DNN). The input of the DNN is typically the state of the environment, and the output is the optimal Q-value for the action taken in the appropriate state. RL is often unstable or even diverges when a neural network is used for the determination of the Q-values. To overcome these instabilities, two new ideas have been proposed in [5]. First, the experience replay mechanism is introduced, which stores the collected data in a memory from which samples are randomly selected and used in the learning process, thus reducing correlations between data. Secondly, two DQNs are used, one to calculate action values and the other to calculate the target values, thus reducing the correlation between them.
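The tabular Q-value update described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the state is the node currently holding the packet and that an action is the choice of a next hop; the learning rate `ALPHA`, the discount factor `GAMMA` and the helper name `q_update` are hypothetical and are not taken from any of the surveyed protocols.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)

# q[node][neighbour] -> estimated usefulness of forwarding via that neighbour
q = defaultdict(lambda: defaultdict(float))


def q_update(q, node, next_hop, reward, next_hop_neighbours):
    """Apply one Q-learning step after `node` forwards a packet to `next_hop`."""
    # Maximum Q-value achievable in the following state
    # (0.0 when the next hop has no known neighbours).
    best_next = max((q[next_hop][n] for n in next_hop_neighbours), default=0.0)
    target = reward + GAMMA * best_next
    # Move the old estimate towards the target by a fraction ALPHA.
    q[node][next_hop] += ALPHA * (target - q[node][next_hop])
    return q[node][next_hop]
```

In a DQN, the table `q` would be replaced by a neural network that approximates the same values; the target on the right-hand side is then computed with the separate target network.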
on multiple criteria, such as key protocol features, optimisation criteria, performance evaluation parameters and techniques and RL algorithm parameters. In [2], the authors presented an overview of the different applications of RL in FANETs, including the application in routing protocols, where the protocols are compared according to RL type, their advantages and disadvantages. However, a detailed comparative analysis of the protocols has not been performed. The authors in [3] focused on the application of deep RL (DRL) in VANETs, but no categorisation and comparative analysis of the protocols are given. It can be noticed that the available surveys treat VANETs and FANETs separately.
To have a more comprehensive view of the future application of RL in highly dynamic and heterogeneous networks for smart cities and ITSs, it is necessary to include both VANETs and FANETs in the analysis. Also, surveys are, unfortunately, quickly becoming obsolete, given a large number of new papers that are increasingly expanding the application of RL. Thus, several important RL-based protocols for VANETs and FANETs, proposed in recently published papers, are not included in the mentioned surveys. In addition, the protocols are not classified keeping in mind the very significant issue of RL integration with other techniques such as software-defined networking (SDN), blockchain etc. Therefore, this paper provides a comprehensive categorisation of recently published RL-based protocols for VANETs and FANETs, with special emphasis on the integration of RL with other techniques. The main goal of this survey is to present, in one place, different approaches to the application of RL-based routing in VANETs and FANETs. This can be very useful for researchers to see current developments in this area, determine the direction of their future research and gain new ideas for improving routing protocols using RL techniques in heterogeneous dynamic WANETs. Besides, this paper aims to point out the shortcomings and limitations of RL technology as well as to highlight the challenges that need to be resolved for its successful application.
The rest of this paper is organised as follows. In the second section, the basic principles of RL are explained. In the third section, routing protocols are categorised based on network type, RL type and possible application of some other technique, and a comparative analysis of protocols is performed.

only in MANETs. However, with the growing use of VANETs and FANETs, in the previous decade the authors have proposed RL-based routing solutions for VANETs, and in the last few years an increasing number of RL-based algorithms for FANETs can be found as well. The great expansion of protocols for VANETs and FANETs and their wide application in smart cities and ITSs are the main reasons why the focus of this research is on routing protocols in these networks. Table 1 shows the categorisation of these protocols based on the applied network type (VANET or FANET) and the applied RL type. Having in mind that in VANETs and FANETs RL is often used in combination with some other technique, categorisation is done according to this criterion as well. Some protocols use blockchain and fuzzy logic (FL), while in several protocols the role of the decision-making agent in RL is played by the SDN controller. One representative of each category will be described in more detail.

RL-based routing protocols for VANETs
The first category in Table 1 consists of papers in which QL-based routing protocols are proposed, without combining with any additional technique. The hybrid routing algorithm (RHR) [9], which helps to solve the blind path problem in VANETs, is chosen as a typical representative of this category. This problem occurs when a certain route in the routing table has not yet expired, but due to the high mobility of nodes, the next node on the route has already gone out of the range of the sender. The RHR protocol finds multiple routes to the destination and runs the RL mechanism for each route in the forwarding table so that if a link on the route breaks, it selects a new one as soon as possible. The QL algorithm is implemented in every node so that different selections of the next hop represent appropriate states, while receiving different types of packets related to the current next hop represents corresponding actions. For the action taken in the given state, the nodes receive feedback in the form of a reward, depending on the packet type. If a broadcast packet is received, the route through which the packet arrived will get a negative reward, while in the case of a unicast packet, that route will get a positive reward. After that, the nodes calculate the Q-values and choose the next hop. In [9] authors showed that

To further improve the performance and increase the stability of RL, in [6] the duelling DRL (DDRL) concept is proposed, which represents an improvement of the DRL algorithm, retaining the application of the experience replay mechanism and target DQN. This concept involves the usage of duelling deep Q-networks (DDQNs) to determine optimal Q-values. The basic idea of DDQN is that it is not always necessary to calculate the value of each available action. Therefore, the DDQN network architecture can be divided into two main components: the value function and the advantage function.
The value function should represent how useful it is to be in a certain state, and the advantage function measures the relative importance of a particular action compared to other available actions. After a separate calculation, the results of these functions are combined to obtain a final Q-value.
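The combination of the two functions can be sketched as follows. The text above only states that the results are combined; the sketch assumes the mean-subtracted aggregation that is commonly used with duelling architectures, and the function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q(s, a).

    Assumed aggregation: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    Subtracting the mean keeps the split between V and A identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]
```

For example, with a state value of 2.0 and advantages 1.0, 0.0 and -1.0 for three candidate next hops, the mean advantage is zero and the resulting Q-values are 3.0, 2.0 and 1.0.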
Another type of RL used in routing protocols for dynamic WANETs is the SARSA [4] algorithm and its modification, SARSA-λ [7]. SARSA is very similar to the QL algorithm, except that Q-value is updated based on the current state of the agent, the action the agent chooses, the reward the agent gets for choosing this action, the next state that the agent enters after taking that action and, finally, the next action the agent chooses in its new state.
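The difference from QL can be made concrete with a short sketch of the SARSA update. The dictionary-based Q-table, the default values of `alpha` and `gamma`, and the function name `sarsa_update` are assumptions made for illustration.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """One SARSA step on a Q-table stored as a dict keyed by (state, action)."""
    current = q.get((state, action), 0.0)
    # SARSA bootstraps from the action actually chosen in the next state,
    # whereas QL would use the maximum Q-value over all next actions.
    following = q.get((next_state, next_action), 0.0)
    q[(state, action)] = current + alpha * (reward + gamma * following - current)
    return q[(state, action)]
```

If the next state offers one action with a Q-value of 1.0 and another with 5.0, and the agent chooses the first, SARSA updates towards the value 1.0 of the chosen action, while QL would update towards 5.0.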
A characteristic of all mentioned algorithms is that they are not based on the model of the environment, i.e. they all belong to the group of model-free algorithms. In [8], the authors proposed a model-based RL (MBRL) algorithm that first needs to create an internal model of the environment, and, based on it, the optimal routing policy will be determined. In this way, the optimal policy is reached faster compared to the QL algorithm. However, with this approach, it is necessary to form a dynamic state transition model, and sometimes a reward model, before applying the algorithm itself.

RL-BASED ROUTING PROTOCOLS FOR VANETS AND FANETS
In this section, a categorisation of recently published papers in which RL is applied to improve routing protocols for highly dynamic WANETs is performed. The focus is on papers published since 2018, in order to include the most current research in this field. For many years, researchers have been publishing papers based on RL with applications

until they reach their destination. In the RL process, vehicles in the network represent the states in which the agent can be, while sending packets from one vehicle to another is a possible action. When it receives the packet, the vehicle checks its Q-table and, if it knows the route to the destination, updates the table, forwards the packet and receives a positive reward. Otherwise, it drops the packet and receives a negative reward. The value of the reward is affected by the distance to the destination vehicle. The proposed algorithm increases the stability and lifetime of the clusters, and also improves network performance in terms of average delay and throughput (TH), as shown in [21].
The third category is characterised by the application of QL and blockchain techniques, and a representative of this category is the QLASS [22], which proposes a security framework for stimulating the cooperative behaviour of onboard units (OBUs) in VANETs to protect the network from potential attacks. The framework is tested on a network that consists of one roadside unit (RSU) and several OBUs. OBUs can help each other by following neighbouring OBUs' requests, but can

network performances are improved in terms of packet delivery ratio (PDR), round trip time (RTT) and overhead (OH).
An adequate representative of the second category is the adaptive self-learning clustering algorithm with reinforcement routing in SDN-based VANETs (RL-SDVN) [21], which combines the application of the QL algorithm and SDN technique for clustering and finding the optimal route. The main goal of RL-SDVN is to improve the message dissemination process and reduce the average data transfer time. The first step in the proposed algorithm is the formation of clusters and assignment of vehicles to the appropriate cluster, based on connectivity with other vehicles, their distance, the transmission range of each vehicle and the number of packets in the queue for processing in a particular vehicle. Vehicles with high connectivity and low processing queue occupancy will be selected for the cluster head (CH) nodes. Based on the quality of the corresponding routes, the SDN controller, as an RL agent, searches for the best route to the destination. The learning process is repeated for each vehicle that has packets to forward

QAGR improves end-to-end delay (E2ED), PDR and hop count (HC), as shown based on the simulations done in [23].

The fifth category includes papers based on DRL, without a combination with other techniques. One of the papers in this category is DRLV [27], in which DRL is used to establish and select the best routes in the VANET. The scenario for which the proposed model is created involves vehicle-to-infrastructure communication, where a particular RSU covers one area of the network. The entire network is divided into clusters so that each cluster has its vehicle density. Changes in vehicle density are predicted using the DRL model, trained based on vehicle speed and movement. The first phase in the proposed approach is establishing the routes using DRL, based on the location of the vehicles, the distance to the nearest RSU, vehicle density and the delay.
Factors that can help in choosing the appropriate action at this stage are the capability of packet delivery along the route, the total number of routes that exist between the source and destination node and the cumulative weight of each route. The second phase is route selection, in which the nodes choose the best nexthop using the DRL. The learning agent first predicts possible transitions from one state to another based on previous events. In this way, the optimal routes for forwarding the packet to the destination are predicted. Based on that, the agent takes the appropriate action, which changes the state of the environment, and receives the appropriate reward. The reward depends on the ratio of the maximum link utilisation in the case of using the current routing strategy and the optimal link utilisation. The authors in [27] showed that this model improves PDR, E2ED and OH.
Software-defined trust-based DRL framework (TDRL-RP) [29] is the chosen representative of the sixth category in Table 1. TDRL-RP uses a combination of DRL and SDN techniques to help find the optimal route and calculate its reliability. In the proposed approach, the role of a learning agent in the DRL is played by a centralised SDN controller, which helps in selecting the best next hop. The state of the environment includes a set of states of all vehicles that include the position and forwarding ratio of each vehicle. A potential action in the appropriate state of the environment is the agent's choice of a neighbour to which a certain vehicle should forward packets. The reward for the action depends

also be selfish and try to maximise their benefit by acting maliciously or may attack the network if it can obtain an illegal gain. OBUs learn coordination behaviour in the network by applying actions to other OBUs according to their reputations. Reputation is an important parameter shared between nodes in the network and protected using the blockchain mechanism. If an OBU does not participate in attacks, its reputation grows, and the probability that neighbouring OBUs will follow its requests will be higher. Every OBU uses QL to choose the optimal action to obtain maximum benefit. Actions can include jamming, spoofing, eavesdropping, disobeying and following the request, while the environment includes node reputation, location and speed. The authors in [22] showed that this approach has good performances in terms of PDR, reputation and utility of network nodes.
The fourth category in Table 1 consists of papers based on the application of QL and FL. An example of such a paper is the QL-based adaptive geographic routing approach (QAGR) [23], which requires the inclusion of unmanned aerial vehicles (UAVs) in the routing process. The routing scheme consists of the aerial and ground components. Within the aerial component, UAVs create a global route using the FL and depth-first-search [48] algorithms, to ensure that vehicles do not send packets in the wrong direction. The selection of the optimal global route is influenced by the average number of vehicles in a certain area and their average speed. The information about the global route is sent by UAVs to the appropriate vehicle and is used as a filter to reject deviated and congested neighbours when choosing the next hop. Within the ground component, vehicles choose the optimal next hop based on QL, following the Q-table filtered by the global route. The QL is modelled so that each state consists of the geographical area of a particular vehicle, the distance from the vehicle to its neighbour, and the number of neighbours of the neighbouring vehicle. A learning agent can be any vehicle, and a possible set of actions that an agent can take includes sending packets to one of the neighbouring vehicles. The reward the agent receives for a particular action depends on the received signal strength (RSS), transmission distance and collision between vehicles. The selection of appropriate actions is made based on Q-values.
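The idea of selecting the next hop from a Q-table filtered by the global route can be sketched as follows. This is a simplified illustration and not the QAGR implementation: it assumes that the global route is given as a set of area identifiers, that the area of each neighbour is known, and that a neighbour outside the route is simply discarded. The names `select_next_hop`, `q_row`, `neighbour_area` and `global_route` are hypothetical.

```python
def select_next_hop(q_row, neighbour_area, global_route):
    """Pick the neighbour with the highest Q-value among those on the global route.

    q_row          -- dict: neighbour id -> Q-value held by the current vehicle
    neighbour_area -- dict: neighbour id -> area identifier of that neighbour
    global_route   -- set of area identifiers forming the route sent by the UAVs
    """
    # Filter step: neighbours whose area is not on the global route are rejected.
    allowed = {n: v for n, v in q_row.items()
               if neighbour_area.get(n) in global_route}
    if not allowed:
        return None  # no admissible neighbour; the caller must handle this case
    # Selection step: greedy choice over the filtered Q-values.
    return max(allowed, key=allowed.get)
```

With Q-values of 0.9, 0.6 and 0.2 for three neighbours, of which the first lies outside the global route, the sketch returns the second neighbour even though its Q-value is not the highest overall.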
packets. After taking action, the agent receives a reward that depends on the network throughput and throughput of the blockchain system. Based on the reward, the agent computes Q-value using DDRL with prioritised experience replay. Block-SDV increases the TH in the VANETs, as shown in [34].
A representative of the ninth category is an RL-based routing protocol for clustered EV-VANET (RLRC) [7], which uses the SARSA-λ learning algorithm. In the proposed approach, the entire network represents an environment, divided into an appropriate number of clusters. Each cluster has a CH node, and the learning process is started only for these nodes. To be selected as CH, a vehicle must have available bandwidth (BW) and residual power above a predefined threshold. The vehicle that has packets for another vehicle sends those packets to its CH, its CH forwards them to the neighbouring CH using the SARSA-λ algorithm, and the neighbouring CH forwards the packets to the destination vehicle. The learning agent can be any CH node, and the set of states for a particular agent is the set of all other CHs in the network. The action that the agent can take is the selection of the appropriate CH to forward the packets. The reward for the action will have the maximum value if the current node is a neighbour of the destination node, and the minimum value if the current node does not have the next hop. In other situations, the reward depends on the HC, the link utility and the available BW. CHs periodically exchange Hello packets to update Q-values. The authors showed in [7] that applying the proposed protocol increases PDR and decreases HC.
The tenth category of papers is characterised by the application of MBRL and FL in routing protocols, and the appropriate representative is the reinforcement routing protocol for VANETs (RRPV) [8]. RRPV is based on the multi-agent RL (MARL) technique, which means that all nodes in the network represent learning agents that cooperate and at the same time try to find the optimal routing policy. The RRPV protocol consists of model learning and RL, which operate simultaneously. The FL system is used for learning and creating a model of the environment. The main goal is to create a state transition model and a reward model based on network quality, affected by connection stability (which depends on the speed and direction of nodes) and connection quality (which depends on

on the reliability of the vehicle, affected by the forwarding ratios of control and data packets. DRL uses a convolutional neural network whose input is the state of the environment, and the output is the corresponding Q-value, based on which the agent selects the optimal route. Applying the proposed approach improves PDR and TH, as shown in [29].
The seventh category includes [33], which combines DDRL and SDN techniques to find the optimal route for data transmission. This algorithm is similar to the one proposed in [29], with the difference that it uses DDRL to train a learning agent. The neural network used to calculate the Q-values is divided into two streams, the first for calculating the value function, and the second for calculating the advantage function. These two functions represent the two components of the Q-value in this algorithm. The first component indicates the value of the corresponding state, and the second is the additional value achieved by taking a certain action in a given state. It is shown in [33] that the proposed approach improves TH and E2ED.
A representative of the eighth category is a blockchain-based distributed software-defined VANET framework (block-SDV) [34] that combines the application of DDRL, SDN and blockchain techniques to establish a reliable architecture for communication management in VANETs. Block-SDV consists of three layers: device (DL), area control (ACL), domain control (DCL) and an edge computing server. The DL is formed of vehicles, while the ACL consists of SDN controllers that collect information about vehicles and links between them. Collected information is sent to the DCL, formed of SDN controllers that work in a distributed blockchain manner. The DCL is connected to the blockchain system, consisting of several blockchain nodes, among which there is one primary node that is responsible for client requests and several consensus nodes that control other nodes. Each SDN controller on the DCL represents a learning agent. The state of the environment depends on the trust features of the vehicles and the nodes in the blockchain system, the computing resources of the edge computing server, as well as the number of consensus nodes in the blockchain system. The set of actions taken by the agent includes the choice of the primary blockchain node, the edge computing server as a computing resource, the number of consensus nodes and reliable neighbouring vehicles for forwarding

A representative of the twelfth category is a routing protocol based on QL and FL (QL-FLRP) proposed in [44]. Determination of the optimal route is done with the help of link-related parameters, which refer to an individual link, and path-related parameters, which refer to the entire route from the source to the destination. Link-related parameters include transmission rate (TR), energy state and flight status (depending on the speed and direction of the node), while path-related parameters include hop count and successful packet delivery time (SPDT).
The FL system first finds the route to the destination based on the link-related parameters, after which it is possible to determine the path-related parameters. The QL algorithm calculates Q-values for path-related parameters and sends them back to the sender node. All collected parameters on the entire route represent the environment in the QL; each node that has packets to send represents an agent that changes the state by taking a certain action (selects the next node). Rewards, which affect the calculation of Q-values, are influenced by hop count and SPDT. Finally, based on both types of parameters, the optimal route is determined, using the FL system. The proposed protocol improves TR, HC and the remaining energy of nodes in the network, which is proved by the simulations done in [44].
The thirteenth category is characterised by the use of DRL in the routing protocol, and the representative of this category is the DRL-based adaptive and reliable routing protocol (ARdeep) [45]. In ARdeep the environment consists of all network nodes, and each node that has packets to send is a learning agent. For the learning agent, the state of the environment is represented by the status of all links to its neighbours. The status of each link is formed based on the expected connection time of the link, packet error rate (PER), remaining neighbour energy, the distance between neighbour and destination, and minimum distance between a two-hop neighbour and destination. The action that an agent can take is to select one of the neighbouring nodes to forward the packets. Each neighbouring node is detected by periodically sending Hello messages, which contain information about its position, speed and remaining energy. Based on the state of the environment, the agent selects the appropriate action with the help of DQN, whose input is the status of the appropriate link, and the output is its Q-value. After calculating the

the ratio of sent and received control packets). The optimal routing policy is determined based on the created model of the environment, with the help of RL. Within RL, each node that has packets to send represents a learning agent that can change the state of the environment by taking a certain action. Sending packets to the agent's neighbours represents a set of available actions. When receiving a particular packet, the node evaluates links to all of its neighbours based on a previously created model of the environment, then calculates Q-values and selects the appropriate action based on the routing policy. For the taken action, the agent receives a reward that depends on the distance and quality of links between nodes (determined in the model learning process). Based on the simulations done in [8], this protocol improves PDR, E2ED and OH.

RL-based routing protocols for FANETs
Papers that propose routing protocols for FANETs based on QL, without the application of other techniques, are classified in the eleventh category in Table 1. A representative of this category is the QL-based message prioritising and scheduling algorithm (QMPS) [36], in which messages exchanged in the network are first classified into delay-sensitive and delay-tolerant. This is done so that in case of network congestion or degradation of link quality (LQ) delay-sensitive messages have a higher priority. Delay-sensitive messages include various types of command and coordination messages that have strict delay requirements and whose timely transmission greatly affects the reliability and security of the network. Delay-tolerant messages include various messages that can tolerate increased delay and packet loss. The QL algorithm has the role of dynamically assigning different priorities to different message types. Each node in the network is a learning agent, which takes a certain action in the form of assigning the appropriate priority for sending delay-tolerant messages. The reward for the action is formed based on two metrics: the first, which represents the percentage of delay-sensitive messages in the message queue, and the second, which depends on the probability of successful reception of the message of the neighbouring node. As shown in [36], the QMPS algorithm improves E2ED, TH and PLR of delay-sensitive messages.

the basic optimisation goal of the routing process. Some of the most common factors are the link reliability (LR) and LQ to the potential next hop, the number of hops required to deliver the packet to the destination, available BW, achieved TH, delay, node speed, distance to the destination etc. It is often very important whether the next node is also the destination, as well as if the next node knows the route to the destination.
When the goal of the protocol is to optimise energy consumption (EC), energy loss will be an important influencing factor. On the other hand, if the emphasis is on protection against unwanted external interference, important factors will be the reputation of the next node on the route and the detection of jammers near that node. Performance evaluation of the proposed protocols is done using different simulation environments, and some of the most common are network simulator 3 (ns3), network simulator 2 (ns2), optimised network engineering tools (OPNET), Python, QualNet, MATLAB, objective modular network test-bed in C++ (OMNeT++), TensorFlow (TF) etc. Depending on the optimisation goal, different network performance metrics are used in the simulations, such as PDR, PLR, E2ED, TH, BW, HC, OH etc. Energy consumption and link connectivity (LC) are particularly important metrics when evaluating network performance in FANETs.

DISCUSSION
Following the development of modern cities and ITSs with high security and QoS requirements, we believe that future solutions will largely rely on heterogeneous dynamic WANETs that include fixed and ad hoc architecture with the addition of blockchain, SDN and other technologies. By analysing the literature from this survey, it can be seen that the emerging RL-based routing can achieve better network performance than traditional algorithms in both VANETs and FANETs and can be successfully integrated with other technologies. With RL, important changes in the network can be detected in real time, which makes this technology very suitable for use in complex, highly dynamic heterogeneous networks. However, RL is a new and complex technique that should be applied adequately in order to exploit its potentially very large benefits. This technology is still the subject of intensive research, and there are many open questions and limitations to overcome. One of the dilemmas that can be observed is the selection of the appropriate

Q-value, the agent forwards the packet to the neighbour with the highest Q-value. The reward that an agent receives has the maximum value if the neighbouring node is the destination and the minimum value if all neighbours of the forwarding node are further away from the destination. In other situations, the reward depends on the distance to the destination node, LQ, remaining energy and initial energy of the neighbour. The authors in [45] showed that ARdeep improves PDR and E2ED.
A representative of the last category from Table 1 is FLRL [47], which uses FL and DRL to determine the optimal route in a FANET. The FL system aims to determine the best relay node for packet forwarding based on a delay measure (which depends on the distance to the relay node), a stability rating (which depends on the speed of the current and neighbouring nodes) and bandwidth efficiency (which depends on the total number of nodes involved in the communication). In this way, a route to the destination can be found with the help of FL alone, but this route may not be the best one. Therefore, DRL is used in addition to FL. In the DRL algorithm, each node represents a learning agent, and the state of neighbouring nodes is known from the FL output. The action that an agent can take is to send a packet to one of its neighbours, for which it receives the appropriate reward. Based on FL, the reward is 0 if the neighbour is the best (optimal) choice and -1 if the neighbour is sub-optimal. Moreover, the reward has a minimum value if a link to the neighbour cannot be established and a maximum value if the neighbour is the destination. The Q-values can then be calculated, and the optimal relay node is selected based on them. In this way, hop count and connection quality are included in the route selection. This algorithm improves link connectivity and HC, as shown in [47].
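The reward rules and relay selection described for FLRL can be sketched as below. The values 0 and -1 come from the description above; the numeric minimum and maximum, the `Grade` labels for the FL output and the function names are assumptions introduced only for this illustration, and the DRL part that actually estimates the Q-values is omitted.

```python
from enum import Enum


class Grade(Enum):
    """Hypothetical labels for the FL assessment of a neighbour."""
    OPTIMAL = "optimal"
    SUB_OPTIMAL = "sub-optimal"


R_MAX = 10.0   # assumed "maximum" reward (neighbour is the destination)
R_MIN = -10.0  # assumed "minimum" reward (no link can be established)


def flrl_like_reward(grade, link_available, is_destination):
    """Reward for sending a packet to one neighbour (illustrative only)."""
    if not link_available:
        return R_MIN
    if is_destination:
        return R_MAX
    return 0.0 if grade is Grade.OPTIMAL else -1.0


def select_relay(q_values):
    """Return the neighbour with the highest Q-value (dict: neighbour -> Q)."""
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    print(flrl_like_reward(Grade.SUB_OPTIMAL, True, False))
    print(select_relay({"uav1": -0.4, "uav2": 0.2, "uav3": -1.0}))
```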

Comparative analysis of RL-based routing protocols for VANETs and FANETs
The analysis of the previously described RL-based protocols shows that their success mostly depends on the appropriate design of the reward function. Therefore, the comparison of RL-based routing protocols for VANETs (Table 2) and FANETs (Table 3) is based on the influencing factors that determine the reward function. Furthermore, the protocols are compared by the simulation software used and the network performance metrics obtained. Various influencing factors are used in different studies, depending on the optimisation goal.
As for the RL type chosen for the given routing problem, the latest literature (Figure 1a) shows that most authors (65.85%) use QL, 21.95% use DRL, DDRL and MBRL are each used by 4.88%, while the SARSA algorithm is applied in only one protocol (2.44%). In addition, authors are still searching for the optimal definition of the learning agent, its states and actions. When the network is centralised, the most common approach is to choose the central device as the learning agent, while all network vehicles or UAVs form the environment. In distributed ad hoc networks, the common solution is that all nodes act as agents, while in cluster-based routing algorithms the CH usually takes the role of the agent. In order to further improve network performance, RL can be used in combination with some other technique.
When adapting to changes in the environment, convergence that is too fast can lead to instability and frequent changes in the selected routes, while a convergence time that is too long leads to the selection of sub-optimal routes. Another important factor of the learning process that influences the choice of the optimal route is the balance between the exploitation of acquired knowledge and the exploration of the environment, which is needed because the environment changes frequently. The most commonly used action selection policy is ε-greedy, in which an agent takes the action with the highest Q-value with probability (1-ε), while with probability ε it selects a random action to explore the environment.
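The interplay between action selection and learning can be illustrated with a minimal tabular sketch. It assumes the common convention in which ε is the probability of a random (exploratory) action, and it uses the standard Q-learning update with learning rate α and discount factor γ; the nested-dictionary Q-table and the function names are hypothetical and do not correspond to any particular surveyed protocol.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Choose a next hop from q_row (dict: neighbour -> Q-value)."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))  # explore: random neighbour
    return max(q_row, key=q_row.get)    # exploit: highest Q-value


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Standard Q-learning update; q_table is dict: state -> {action: Q}."""
    # Best value achievable from the next state (0.0 if nothing is known yet).
    best_next = max(q_table.get(next_state, {}).values(), default=0.0)
    old = q_table[state][action]
    q_table[state][action] = old + alpha * (reward + gamma * best_next - old)
    return q_table[state][action]


if __name__ == "__main__":
    q = {"s": {"a": 0.0, "b": 0.0}, "n": {"d": 1.0}}
    hop = epsilon_greedy(q["s"], epsilon=0.1)
    print(hop, q_update(q, "s", hop, 0.5, "n", alpha=0.5, gamma=0.9))
```

A larger α makes the estimate follow recent rewards more closely, which speeds up adaptation at the cost of stability, while γ controls how strongly the value of the next hop influences the current decision.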
Unfortunately, most papers pay too little attention to the optimal choice of the parameters α, γ and ε; instead, typical values are adopted based on previous positive experience in other fields of application. Another important aspect in proposing new protocols is the process of their evaluation. Certainly, the best method of protocol validation is test-bed experiments that use a real-life setup for data collection. However, none of the analysed papers used this approach; instead, various simulation environments were used to evaluate the results. As can be seen from Figure 1c, most authors use open-source simulators or create their own simulation environment.

Techniques that can be combined with RL include FL, SDN and blockchain, but most authors still do not use this possibility (Figure 1b).
Since the QL technique is relatively simple and uses a tabular approach in its implementation, it is suitable for relatively small ad hoc networks, so most of the routing protocols analysed in this survey are limited to this network type. Since the learning algorithm in these networks is distributed among all nodes, which already maintain routing tables, storing data in Q-tables is a straightforward extension. However, this approach is not adequate for complex networks with a large number of nodes, because the action-value space grows exponentially. In those cases, DRL or some method for limiting the Q-table should be used. Implementation of the DRL algorithm requires substantial computational resources and involves a challenging convergence time, so it is more suitable for networks with centralised entities, such as SDN or cluster-based networks with RSUs. Practical application of those techniques must carefully consider the security aspect as well. Although a centralised approach is a very good solution, recent studies consider the integration of blockchain technology, which provides a distributed trust management system. Currently, fewer authors use DRL, especially in combination with some other technique, but the number of DRL-based protocols is constantly increasing.
Besides the most important issue of selecting the RL type, different approaches to defining the reward can be found (Tables 2 and 3), which depend on the parameters that need to be optimised. When forming the reward, the agent relies on various feedback mechanisms that typically involve the exchange of additional control packets to determine the LQ or similar QoS parameters, which increases the routing overhead. This cannot be avoided entirely, but hierarchical routing, which limits the area for the exchange of control packets, should be considered as a way of reducing the routing overhead.
One of the major challenges in RL applications is the convergence of the learning algorithm. The learning process is influenced by two key parameters: the learning rate α and the discount factor γ, which determines the importance of future rewards. It is very important to choose the values of these parameters carefully to ensure proper functioning of the learning process and timely adaptation to changes in the environment.

CONCLUSION
In this paper, an overview and classification of the RL-based routing protocols for VANETs and FANETs published since 2018 are provided. The protocols are classified into several categories based on network type, RL type and the combination of RL with other techniques. One chosen protocol from each category is explained in more detail. A comparative analysis of routing protocols is also given, based on the influencing factors that determine the value of the reward in RL and the network performance metrics used in simulations. However, a few limitations had to be accepted. Considering the current trends in this area, our classification is limited to the last couple of years, bearing in mind that the number of research papers is increasing every year. MANETs are not included in this survey, but considering the extensive experience in the application of RL-based techniques in this type of network, they will certainly be the subject of our future research. In addition, although RL dominates in routing applications, there are certain possibilities of applying supervised and unsupervised techniques, which are also not covered. Bearing in mind that future ITSs and smart cities will require both VANETs and FANETs, we wanted to give a comprehensive survey of the current state of the art for RL-based routing in both networks in one place, which should serve as a useful basis for research. In addition, papers that combine RL with other key techniques, such as blockchain, SDN and FL, for routing in VANETs and FANETs are analysed in this survey.
In all of the analysed papers, the authors reported significant improvement in the observed performance compared to that achieved using traditional routing protocols. Based on this state-of-the-art research, we can conclude that the application of RL in routing protocols yields very good results for networks with high-speed nodes and frequent topology changes. Therefore, it can be expected that more RL-based routing solutions will emerge in the following years, especially ones based on DRL and in applications that use both VANETs and FANETs. Based on this research, we plan to propose a new RL-based routing protocol that will provide easier integration of VANETs and FANETs into highly dynamic heterogeneous networks. We are certain that this paper will also serve as a good starting point for other researchers in the field of RL application in networks that include both VANETs and FANETs.