Federated-Based Deep Reinforcement Learning (Fed-DRL) for Energy Management in a Distributive Wireless Network

Future-generation wireless systems are expected to support increased infrastructure development and device subscriptions with densely deployed base stations (BSs). Economically, decreasing BS energy consumption levels and achieving “greenness” remain key factors for the industry. Some research works have proposed deep reinforcement learning techniques to solve energy management (EM) issues in cellular networks. However, these techniques are inefficient in a distributive network environment and expose the devices to privacy issues. Federated learning (FL) has been shown to enforce device privacy and train models distributively. Thus, this work proposes an autonomous switching mode framework for BSs based on federated deep reinforcement learning (Fed-DRL) to address the challenges encountered by prior studies. Specifically, we deploy multiple DRL agents to influence the decision of the BS for EM. To make DRL-based decisions feasible and satisfy device quality-of-service, we train the DRL agents distributively by employing the FL concept. The results show the effectiveness of our proposed framework under distributed network scenarios compared with other benchmark algorithms.


Introduction
Energy management (EM) is a key objective in next-generation networks. Studies reveal that base stations (BSs) consume 70-80% of operational energy, which accounts for 3% of the total energy produced globally and 2-4% of carbon dioxide (CO2) emissions, doubling the energy consumption (EC) rate of 15-20% annually (Githiru et al., 2011). According to Habibi et al. (2019), data traffic density is projected to increase 1000-fold in the next decade. This has motivated researchers to concentrate on EM in wireless networks. As the telecommunications industry sought to promote "greenness," innovative initiatives (such as the EARTH and Green Touch projects) were launched (Ahmed et al., 2017).
Researchers came up with an alternative relay-station-based energy-efficient (EE) switching algorithm that turns off BSs during low-traffic intervals. Device association and resource allocation problems were considered in Zhuang et al. (2016) and Mesodiakaki et al. (2014). The model-based technique achieves an EE solution by optimizing a particular independent task for each real-time slot. In Ashraf et al. (2010), sleep strategies were proposed to regulate devices and core networks to optimize power consumption in cellular networks. With the ability to utilize bandwidth more effectively than current cellular communication technologies, cognitive radio (CR) has recently emerged as a promising next-generation communication technology. The precision of the sensing results has a significant impact on the CR system's performance, but sensing ambiguities such as false alarms and missed detections result in underuse of the spectrum and significant interference to the primary user, respectively. To address these ambiguities, Bala and Ahuja (2023) modified the typical frame structure by adding two sensing slots as well as a transmission slot. Sensing results are recorded in a flag bit up to that period, and the initial sensing slot is kept small and fixed for the specified probability of detection. When the flag bit status in the current frame differs from the previous frame, the second (optional) sensing slot is utilized. The second sensing slot was optimized to increase the effective throughput and energy efficiency of the secondary communication system while taking the trade-off between sensing, throughput, and energy efficiency into account. Their simulation results demonstrate that the redesigned frame structure outperforms existing schemes in terms of throughput and energy efficiency.
Recently, reinforcement learning (RL) schemes became one of the techniques researchers adopted to solve EM problems in cellular networks (Cheng et al., 2017). Most of these optimization policies focused on deep reinforcement learning (DRL) and Q-learning, taking into account the load on BSs. Liu et al. (2018) use a Deep Q-Network (DQN)-based ON/OFF policy for BSs to control how much energy they use. However, these techniques are inefficient in a distributive network environment, because device behavior is unpredictable and device mobility changes the network signal in each period.
Federated learning (FL) (Jiang et al., 2020) collaboratively trains a model under the coordination of a main server while keeping the training set decentralized (or distributed), without exposing its private attributes. It serves as a wrapper over well-known traditional machine learning (ML) techniques. Its mechanism yields models similar to those trained on centralized datasets while training distributively, recording low power consumption, and ensuring device privacy. For this reason, our approach uses FL to solve the EM problem while considering the device quality-of-service (D-QoS) requirement.

Contribution
In this work, we propose federated-DRL (Fed-DRL), an autonomous switching mode framework for BSs that employs FL to train DRL agents in a distributive manner. FL improves learning performance, ensures device-data privacy, and helps solve the EM problem. We formulate the switching mode problem as a Markov decision process (MDP), where we define states, actions, rewards, and the next (future) states. We then optimize EC and meet the D-QoS satisfaction requirement with the smallest number of active BSs. Extensive simulations and comprehensive analysis are presented in terms of convergence rate, D-QoS satisfaction, and EC in a distributed fashion. The effects of Fed-DRL and the traditional DRL agent are both examined.
The work is structured as follows: Section 2 reviews prior work, and Section 3 describes the system model. Section 4 presents the problem formulation for solving the BS EM problem while providing D-QoS satisfaction. We describe our Fed-DRL framework in Section 5, provide simulation results and analysis in Section 6, and conclude in Section 7.

Knowledge Work
Decreasing operational expenditure (OPEX) has been a key objective in telecommunications, since BS EC increases daily. Wang and Zheng (2015) proposed an EM procedure to predict and adjust the traffic load of BSs according to the mobility of the device. Although such EM procedures exist, EC remains high due to the highly complex and stochastic nature of distributive training.
Recently, RL models have made considerable progress and have become an active area for reducing energy use. In Chen et al. (2021), Hoffmann et al. (2021), Hsieh et al. (2021), Kim et al. (2022), Lee et al. (2020), and Sun et al. (2020), deep learning-based algorithms were introduced to control features in an end-to-end fashion. Data-driven schemes were applied to various optimization problems in wireless communication (Li et al., 2018; Sheng et al., 2021; Sun et al., 2020; Xiong et al., 2020). These researchers introduced several learning systems to monitor data consumption rates and determine the EE of devices on the network. These developments inspired many researchers to use DRL-based methods to learn successful BS sleeping strategies. Liu et al. (2018) suggested improving the typical DQN model with action-wise experience replay and adaptive reward scaling to handle non-stationary traffic. However, the high-dimensional state and action space of the distributive network scenario can reduce the efficacy of the aforementioned methods. To reduce the network's high-dimensional state and action space, Li et al. (2014) suggested an actor-critic method to derive the ON/OFF strategy of the BS for the EC issue in the network, and they also included transfer learning (TL) in the actor-critic algorithm to utilize information gained over time. Sharma et al. (2017) also use a TL method for the ON/OFF toggling of BSs in diverse cellular networks. In Nishio & Yonetani (2019), FL was used to randomly choose clients with resource restrictions, allowing the server to combine as many clients' updates as possible and expedite the performance improvement of ML models. To lower OPEX in EM, they analyzed the issue of deploying FL in a cellular network utilized by heterogeneous devices with diverse data resources, computing capabilities, and wireless channel conditions.
In actual use, battery-powered IoT devices complete local training and communicate wirelessly with the main server. However, the constant communication between IoT devices and the main server requires many resources. Zhang and Mao (2021) propose using the intelligent reflecting surface (IRS), a newly developed technology, to reconfigure the wireless propagation environment and make the most of the available resources. In particular, they focus on the crucial problem of energy efficiency in the reconfigurable wireless communication network and formulate an energy minimization problem in an IRS-assisted FL system subject to a constraint on the total training time. The parameters are jointly configured using an iterative resource allocation technique with quick convergence.
Other authors also adopt the FL framework for computation offloading optimization to address their EM problem (Han et al., 2019; Jiao et al., 2020; Ye et al., 2020). Due to the lack of terrestrial connectivity and the limited battery life of FL users, certain FL tasks may not be possible. Pham et al. (2022) use unmanned aerial vehicles (UAVs) and wireless powered communications (WPCs) for FL networks to solve these issues. The UAV, with edge computing and WPC capabilities, is deployed as an aerial energy source and as an aerial server that carries out FL activities in order to provide sustainable FL solutions. They put forth the energy-efficient FL (E2FL) algorithm, which jointly optimizes UAV placement, power control, transmission time, model accuracy, bandwidth allocation, and computing resources, with the goal of reducing the overall EC of the aerial server and users by effectively solving the original nonconvex problem.
Although BS scheduling and EM problems in wireless networks using these techniques have been investigated, only some considered training DRL agents in a distributed fashion. Our work utilizes a federated-DRL-based framework that adjusts preferences for BS EC while still meeting the D-QoS satisfaction requirement.

Network model
Consider a collection of BSs $\mathcal{K}$. Every BS $k \in \mathcal{K}$ is linked to devices $i \in \mathcal{I} = \{1, 2, 3, \ldots, I\}$. Assume each device holds a local dataset in which every sample $l$ consists of an input vector of device $i$ and the corresponding output $y_{il}$. Each device trains a local-FL (L-FL) model on its own dataset. The input of all L-FL models is used to create a global-FL (G-FL) model. For each BS, $B$ denotes the framework bandwidth, and $P^t_k$ denotes the total energy utilization for transmission in watts. The path loss $PL(d_{ik})$ in this scenario is computed from $F$ and $d_{ik}$, where $F$ signifies the frequency bandwidth and $d_{ik}$ the distance between device $i$ and BS $k$. We adopt the channel model $g_{ik}$ of Buzzi et al. (2016), where $b_{ik}$ denotes the channel gain and $\phi$ denotes the large-scale shadow fading, a Gaussian random variable with standard deviation $\sigma$.

One important use case in our network is the ability to manage transmission between BSs and devices. A device can only be connected to one BS. However, a beamforming system can link a device to multiple BSs in edge computing. In this study, we implement a maximum received signal power (MRSP)-based device association model (Tabassum et al., 2014) to help associate devices on the network with a BS. MRSP is the conventional device association scheme in which the device selects the BS from which the maximum instantaneous signal power is received. Without loss of generality, we consider the downlink transmission rate for devices, as in Tian and Jiang (2021). The signal-to-interference-plus-noise ratio (SINR) of a device $i$ associated with a BS $k$ is defined in terms of $\omega_{ik}$, the beamforming weights from BS $k$ to device $i$, and $\sigma^2$, the spectral density of the additive white Gaussian noise. The achievable data rate can be determined from the channel bandwidth $B$ and the SINR, as in Tian and Jiang (2021). Adopting the Shannon capacity formula (SCF), the transmit rate $r_{ik}$ of device $i$ linked with BS $k$ is

$$r_{ik} = x_{ik} B \log_2\left(1 + \mathrm{SINR}_{ik}\right),$$

where $x_{ik}$ is the portion of the BS bandwidth $B$ allotted to device $i$.
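As an illustration of the Shannon-capacity rate described above, the following is a minimal Python sketch. The function names, the way interference is passed in as a list of received powers, and all numeric values are assumptions made for this example; they are not taken from the paper's simulation setup.

```python
import math

def sinr(signal_w, interference_w, noise_w):
    """Linear SINR: desired power over (sum of interferers + noise)."""
    return signal_w / (sum(interference_w) + noise_w)

def shannon_rate(x_ik, bandwidth_hz, sinr_linear):
    """Rate r_ik in bit/s for a share x_ik of the BS bandwidth B."""
    return x_ik * bandwidth_hz * math.log2(1.0 + sinr_linear)

# Hypothetical received powers in watts (illustrative only).
gamma_ik = sinr(signal_w=3e-9, interference_w=[5e-10, 4e-10], noise_w=1e-10)

# A device holding 10% of a 20 MHz carrier.
r_ik = shannon_rate(x_ik=0.1, bandwidth_hz=20e6, sinr_linear=gamma_ik)
print(f"SINR = {gamma_ik:.2f}, rate = {r_ik / 1e6:.2f} Mbit/s")
```

The rate scales linearly with the bandwidth share but only logarithmically with SINR, which is why switching BSs OFF (and thereby changing both the association and the interference) can be traded against bandwidth allocation.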
For each BS $k$, an M/G/1 processor-sharing (PS) system is used, with packets arriving according to a Poisson process (PP) with parameter $\lambda_{ik}$. The resource time for device $i$ at BS $k$ is characterized by its own parameter, and the standardized feasible rate is $r^*_{ik}$. The average packet size is denoted by $\lambda$ and is expressed as a mean. The entire time a device demands a resource while queued at BS $k$ is represented by the average delay $\tau_{ik}$, which follows from the properties of the M/G/1 PS queue. Historically, the traffic load has been determined by the arrival rate of systems and the BS sequence. Such a computational traffic model is infeasible here because of the unpredictable traffic of devices. The notations used in the system architecture are summarized in Table 1.
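For orientation, the standard mean sojourn time of an M/G/1 PS queue is $E[T] = E[S] / (1 - \rho)$ with load $\rho = \lambda E[S]$, independent of the service-time distribution beyond its mean. The sketch below applies that textbook result; treating the mean service time as mean packet size divided by link rate, and all parameter names and values, are assumptions for illustration and may differ from the paper's exact per-device expression.

```python
def mg1_ps_mean_delay(arrival_rate, mean_packet_bits, rate_bps):
    """Mean sojourn time of an M/G/1 processor-sharing queue.

    arrival_rate: Poisson packet arrival rate (packets/s)
    mean_packet_bits: average packet size (bits), assumed here
    rate_bps: service rate of the link (bit/s), assumed here
    """
    mean_service = mean_packet_bits / rate_bps   # E[S]
    load = arrival_rate * mean_service           # rho
    if load >= 1.0:
        raise ValueError("unstable queue: load must be below 1")
    return mean_service / (1.0 - load)

# Hypothetical values: 50 packets/s, 8000-bit packets, 1 Mbit/s link.
tau = mg1_ps_mean_delay(arrival_rate=50.0, mean_packet_bits=8000.0, rate_bps=1e6)
print(f"mean delay = {tau * 1e3:.2f} ms")
```

The delay grows without bound as the load approaches 1, which is the mechanism by which switching too many BSs OFF degrades D-QoS.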

Traffic model
We base this model on network evaluations indicating the variation in BS traffic arrival. Device traffic variation is modeled on the regular trapezoidal traffic pattern. Because cellular traffic is highly dynamic in time and space, we define it as a time-homogeneous PP with a traffic intensity parameter $T$ and a normalized rate $f(t)$ (Alam & Dooley, 2015). This varies the trapezoidal traffic pattern within a period and yields the probabilistic traffic $\chi(t)$ in the network, expressed as:

$$\chi(t) = f(t) \cdot \psi, \tag{6}$$

where $\psi \sim \mathrm{Poi}(T)$ is a random variable with parameter $T$. As a result, if each device arriving at a BS $k$ sustains a service time $h_{ik}$ per second, the standardized traffic load $\rho^*_k(t)$ at each BS $k$ at time $t$ follows, where $z_k$ is the portion of $B$ at BS $k$ allocated to devices.
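The traffic expression in Equation (6) can be sketched as follows. This is an illustrative sketch only: the shape of the trapezoid (ramp hours, low and high levels), the intensity value, and the helper names are assumptions, and the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import math
import random

def trapezoid_profile(hour, low=0.2, high=1.0, ramp_up=(6, 10), ramp_down=(18, 22)):
    """Normalized rate f(t): low at night, linear ramps, flat daytime peak."""
    up_start, up_end = ramp_up
    down_start, down_end = ramp_down
    if up_start <= hour < up_end:
        return low + (high - low) * (hour - up_start) / (up_end - up_start)
    if up_end <= hour < down_start:
        return high
    if down_start <= hour < down_end:
        return high - (high - low) * (hour - down_start) / (down_end - down_start)
    return low

def poisson_sample(mean, rng):
    """Draw psi ~ Poi(mean) with Knuth's method (fine for moderate means)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1

def daily_traffic(intensity, seed=0):
    """chi(t) = f(t) * psi for each hourly slot t = 0..23."""
    rng = random.Random(seed)
    return [trapezoid_profile(h) * poisson_sample(intensity, rng) for h in range(24)]

day = daily_traffic(intensity=30)
print([round(x, 1) for x in day])
```

Each hourly value is a Poisson draw scaled by the deterministic daily profile, so night-time slots carry light and noisy load while daytime slots sit near the intensity parameter.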

Energy consumption
The energy utilization of a BS consists of a load-dependent part and a constant, non-load-dependent part, corresponding to its traffic volume (Abdulmula et al., 2019). When the BS is loaded, the load-dependent energy utilization is drawn by the power amplifier and transceiver. The load-dependent energy utilization is scaled by the standardized traffic load $\rho^*_k(t)$ to determine the energy usage at a given stage. Therefore, the overall energy utilization $P^t_k$ of BS $k$ at time slot $t$ combines the load-dependent term $P^l_k$ with the constant energy term $P^c_k$. Given the variation in traffic demands, our proposed framework regulates the complex network's energy utilization by switching idle BSs OFF to manage energy efficiently.
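A minimal sketch of such a load-dependent power model is given below. It assumes the common affine form (constant power plus load-dependent power scaled by the normalized load) and a zero sleep power for BSs that are switched OFF; both the exact form and the wattages are assumptions for illustration rather than values from the paper.

```python
def bs_power(load, p_load_w, p_const_w, is_on=True, p_sleep_w=0.0):
    """Power of one BS: constant part plus load-scaled part when ON."""
    if not is_on:
        return p_sleep_w
    if not 0.0 <= load <= 1.0:
        raise ValueError("normalized load must lie in [0, 1]")
    return p_load_w * load + p_const_w

def network_power(loads, actions, p_load_w, p_const_w):
    """Total power for per-BS loads and ON/OFF actions (1 = ON, 0 = OFF)."""
    return sum(
        bs_power(load, p_load_w, p_const_w, is_on=bool(action))
        for load, action in zip(loads, actions)
    )

# Hypothetical example: three BSs, the idle one is switched OFF.
total_w = network_power(
    loads=[0.5, 0.0, 0.8], actions=[1, 0, 1], p_load_w=100.0, p_const_w=60.0
)
print(f"total power = {total_w:.1f} W")
```

Because the constant term is paid whenever a BS is ON, an idle BS still consumes energy, which is what makes switching it OFF worthwhile.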

Utility model
The prerequisite for managing the EC of BSs is to meet D-QoS requirements and ensure device network scalability. To guarantee this, the delay and required communication rate must be ensured. Given the complexities associated with the behavior of device $i$, the traffic request per connection, and service permeability, we model this part using a sigmoid function. To meet device satisfaction, we adopt $\xi(\cdot)$ as a function that sets different optimization goals for the rate constraint and the delay, as in Delaram et al. (2021). We define our QoS utility as the user's satisfaction with either data rate or delay, depending on the application type. The device satisfaction on the rate is defined in terms of $r^{\min}_{\tau}$, the minimum rate demand of device $i$, and a constant $n$ that shapes the device satisfaction curve. We can verify that (a) $\xi(r_\tau)$ is a monotonic function of the rate, since an individual device is more satisfied when it achieves a rate above its minimum requirement, and (b) the satisfaction of each device is bounded between 0 and 1, $\xi(r_\tau) \in [0, 1]$. The device satisfaction on the delay, $\xi(d_\tau)$, is defined in terms of $\tau_{\max}$, the maximum tolerable delay, which is the upper bound on delay for device $i$. In analyzing device network scalability in our distributed network environment, we make the following assumptions:
1. Time stationarity: $l_{ik}(t) = l_{ik}$
2. Device independence: $l_{ik} = f(s_i, s_k)$
3. Switch mode (ON/OFF): $l_{ik} \in [0, 1]$
Note that $l_{ik}$ is the transmission link between device $i$ and BS $k$, and $d$ denotes the geometric distance between device $i$ and BS $k$.
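The sigmoid-shaped satisfaction functions described above can be sketched as follows. The specific logistic form, in which satisfaction equals 0.5 exactly at the minimum rate or the maximum tolerable delay, is an assumption chosen for illustration; the steepness values and units are likewise hypothetical.

```python
import math

def _sigmoid(z):
    """Numerically stable logistic function with values in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rate_satisfaction(rate, rate_min, n):
    """Increases with rate; n controls how sharply it rises around rate_min."""
    return _sigmoid(n * (rate - rate_min))

def delay_satisfaction(delay, delay_max, n):
    """Decreases with delay; n controls how sharply it falls around delay_max."""
    return _sigmoid(n * (delay_max - delay))

# Hypothetical units: rates in Mbit/s, delays in seconds.
rates_mbps = [0.5, 1.0, 2.0, 4.0]
rate_scores = [rate_satisfaction(r, rate_min=1.0, n=2.0) for r in rates_mbps]
delay_scores = [delay_satisfaction(d, delay_max=0.1, n=50.0) for d in (0.02, 0.1, 0.3)]
print(rate_scores, delay_scores)
```

Both functions are monotonic and bounded in [0, 1], matching properties (a) and (b) above, so they can be averaged across devices into a single satisfaction score.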

Problem Formulation
An MDP is defined as a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ and $A$ represent the finite set of all legitimate states and the finite set of all legitimate actions, respectively. $P : S \times A \to \mathcal{P}(S)$ represents the transition probability, where $P(s^{(t+1)} \mid s^{(t)}, a^{(t)})$ is the probability of transitioning into state $s^{(t+1)}$ at time $t+1$ when the agent begins executing action $a^{(t)}$ in state $s^{(t)}$ at time $t$. The reward function is $R : S \times A \times S \to \mathbb{R}$, with $R_t = R(s^{(t)}, a^{(t)}, s^{(t+1)})$, and $\gamma$ is the discount factor. With the action $a$, the switch dynamically places BS $k$ into sleep/OFF mode by setting $a_k = 0$; otherwise it sets $a_k = 1$ (Büttner et al., 2021; Sun et al., 2022).
At any stage $t$ with a traffic-load state $s^{(t)}$, the objective is to discover the ideal policy $\pi^*$ that maps $s^{(t)}$ to an action $a^{(t)}$ maximizing the action-value function. Let $U = \{s^{(0)}, s^{(1)}, \ldots, s^{(t)}\} \times a^{(t)}$ signify the Markov chain utility. The long-term cumulative discounted reward of $s^{(t)}$ at stage $t$ is computed with the discount factor $\gamma \in [0, 1]$, and the state-value function $V^{\pi}(s^{(t)})$ of an arbitrary policy $\pi$ at stage $t$ is defined accordingly. The goal of the MDP is to find an optimum strategy that maximizes the future reward of the resulting agent. From Markov's theory, the policy $\pi$ can be defined as:

$$\pi(s^{(t)}) = \arg\max_{a^{(t)}} \Big\{ R(s^{(t)}, a^{(t)}) + \gamma \sum_{s'} P(s' \mid s^{(t)}, a^{(t)}) V^{\pi}(s') \Big\},$$

where $V^{\pi}(s^{(t)})$ is the expected utility given the optimum strategy $\pi$. With Bellman's theory, the state-value function for the best policy is expressed as:

$$V^{\pi^*}(s^{(t)}) = \max_{a^{(t)}} \Big\{ R(s^{(t)}, a^{(t)}) + \gamma \sum_{s'} P(s' \mid s^{(t)}, a^{(t)}) V^{\pi}(s') \Big\},$$

where $R(s^{(t)}, a^{(t)})$ is the present reward, $\gamma$ is the discount factor, $V^{\pi^*}(s^{(t)})$ is the current utility, and $V^{\pi}(s')$ is the future utility. Our objective is to find the optimum strategy $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ that determines the ON/OFF switching decision for each BS and thereby reduces the overall EC in this scenario. With our cell activation, the state space, action space, and reward function are as follows:
1. State Space: For all $t = 0, 1, 2, 3, \ldots, 23$, the state space of the devices and BSs is built from $T^{active}_{t-1}$ and $T^{sleep}_{t-1}$, which denote the predicted active and sleep states at a time $t$ within a day, and from the traffic arrival rate $p_t$.

2. Action Space: The learning agent must set the ON and OFF strategy at each time $t$ for the best reward. We set the action $a_k = 0$ to switch BS $k$ OFF; otherwise we set $a_k = 1$, since $a_k \in [0, 1]$ as formulated in the Boltzmann distribution probability (Shingu et al., 2021).
3. Reward: Our primary aim is to manage the BS EC level while meeting D-QoS requirements. For managing EC, $E(s, a)$, the agent obtains a reward that reflects the improvement in system EC produced by the switching procedure of the BS. For the D-QoS satisfaction, $\delta(s, a)$, the delay and rate metrics are used to assess each device's performance. On the basis of the device satisfaction, the agent sets an $r^{\min}$ threshold value. With this, we define our reward $R(s, a)$ as:

$$R(s, a) = \alpha \cdot E(s, a) + \beta \cdot \delta(s, a). \tag{17}$$

Note that $\alpha$ and $\beta$ are the coefficients that weight the significance of EC and device satisfaction, respectively. A scalarization function is often used to describe a multi-objective reward, reducing multiple goals to a single scalar that can be optimized.

In this work, we define the transition $D_{a'} = (s_{a'}, a_{a'}, s', r_{a'})$ in the FL setting as collected by agent $a'$, and the pairs of states and actions $D_{b'} = (s_{b'}, a_{b'})$ as collected by agent $b'$. The objective is to distributively build policies $\pi^*_{a'}$ and $\pi^*_{b'}$ for agents $a'$ and $b'$, respectively. For simplicity, we consider two federated participating devices; however, the same mechanism can be extended to more than two agents. State-action policies and Q-functions are defined with respect to both $a'$ and $b'$. Based on these assumptions, we aim to learn high-quality policies $\pi^*_{a'}$ and $\pi^*_{b'}$ for agents $a'$ and $b'$ by performing efficient switching and meeting D-QoS status.
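To make the formulation concrete, the sketch below runs value iteration on a toy single-BS MDP using a scalarized reward of the form in Equation (17). Everything specific here is an assumption for illustration: the two traffic states, the energy and satisfaction scores, the weights, and the action-independent transition probabilities. Value iteration is used as a simple stand-in for the learning agents and is not the paper's training method.

```python
def scalarized_reward(energy_score, satisfaction, alpha, beta):
    """R(s, a) = alpha * E(s, a) + beta * delta(s, a)."""
    return alpha * energy_score + beta * satisfaction

STATES = ("low", "high")   # hypothetical traffic-load states
ACTIONS = (0, 1)           # 0 = BS OFF, 1 = BS ON

# Hypothetical scores: OFF saves energy but hurts satisfaction under load.
ENERGY = {0: 1.0, 1: 0.0}
SATISFACTION = {("low", 0): 0.8, ("low", 1): 1.0, ("high", 0): 0.1, ("high", 1): 1.0}
# Traffic evolves independently of the action in this toy example.
TRANSITION = {"low": {"low": 0.7, "high": 0.3}, "high": {"low": 0.4, "high": 0.6}}

def reward(state, action, alpha=0.4, beta=0.6):
    return scalarized_reward(ENERGY[action], SATISFACTION[(state, action)], alpha, beta)

def q_value(state, action, values, gamma):
    future = sum(p * values[nxt] for nxt, p in TRANSITION[state].items())
    return reward(state, action) + gamma * future

def value_iteration(gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(s, a, values, gamma) for a in ACTIONS) for s in STATES}
        change = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if change < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, values, gamma)) for s in STATES}
    return values, policy

values, policy = value_iteration()
print(policy)
```

With these weights the greedy policy switches the BS OFF under low load and keeps it ON under high load; raising `alpha` relative to `beta` shifts the policy toward energy saving at the expense of satisfaction.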

Methodology
We briefly explain the concept of FL and describe our Fed-DRL framework.

FL algorithm
Algorithm 1 describes the adopted FL procedure. Specifically, the local device and the global server communicate iteratively to train a model. In this setup, the global server computes a weighted average of the resulting models after each device performs one step of gradient descent on an intermediate model using its own local data. There are five parameters: the local mini-batch size $B$, the number of local epochs $E$, the fraction of devices $i_c$ to choose for training, a learning rate decay $\lambda$, and a learning rate $\eta$. For stochastic gradient descent (SGD), $B$, $E$, $\lambda$, and $\eta$ are the parameters mostly used. $E$ is the number of local iterations each device performs before the server is updated. The global model $w_0$ is randomly initialized. A single round of communication proceeds as follows. The server selects a subset of devices $S_t$, with $|S_t| = i_c \cdot I \geq 1$, and shares the current global model $w_t$ with all devices in $S_t$. After the devices have set their local models to the distributed model, $w^i_t \leftarrow w_t$, each device splits its own local data into batches of size $B$ and performs $E$ epochs of SGD. Lastly, the local devices upload their trained models $w^i_{t+1}$ to the server, which generates the new global model by computing the weighted average of all local device models.
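The communication round described above can be sketched in plain Python as follows. This is a minimal federated-averaging sketch under stated assumptions: a linear model with squared-error loss stands in for the DRL network, the learning rate decay is omitted, the weighting uses local dataset sizes, and all names and hyperparameter values are hypothetical.

```python
import random

def local_sgd(weights, data, epochs, batch_size, lr, rng):
    """Run E epochs of mini-batch SGD on squared error for y = w . x."""
    w = list(weights)
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad = [0.0] * len(w)
            for x, y in batch:
                err = sum(wi * xi for wi, xi in zip(w, x)) - y
                for j, xj in enumerate(x):
                    grad[j] += 2.0 * err * xj / len(batch)
            w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w

def fed_avg(global_w, device_data, rounds, fraction, epochs, batch_size, lr, seed=0):
    """Server loop: sample devices, train locally, average by dataset size."""
    rng = random.Random(seed)
    w = list(global_w)
    for _ in range(rounds):
        count = max(1, round(fraction * len(device_data)))
        chosen = rng.sample(range(len(device_data)), count)
        updates = [
            (local_sgd(w, device_data[i], epochs, batch_size, lr, rng), len(device_data[i]))
            for i in chosen
        ]
        total = sum(n for _, n in updates)
        w = [sum(wi[j] * n for wi, n in updates) / total for j in range(len(w))]
    return w

# Hypothetical data: four devices observing noise-free y = 2*x1 - 1*x2.
data_rng = random.Random(1)
true_w = [2.0, -1.0]
devices = []
for _ in range(4):
    samples = []
    for _ in range(20):
        x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
        samples.append((x, sum(t * xi for t, xi in zip(true_w, x))))
    devices.append(samples)

w = fed_avg([0.0, 0.0], devices, rounds=50, fraction=0.5, epochs=2, batch_size=5, lr=0.1)
print([round(v, 3) for v in w])
```

Only model weights travel between devices and server; the raw samples in `devices` never leave their owner, which is the privacy property FL relies on.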

The proposed Fed-DRL for EM
We describe and demonstrate the effectiveness of our proposed Fed-DRL framework that manages BSs EC in a distributed network architecture.In this process, the local device and the global server communicate with each other iteratively and effectively to train the model.As stated, we consider multiple DRL agents to train our model.
In our distributive scenario, the corresponding agents in a continuous action space schedule the EC of the BS. All agents are considered to initiate their learning process synchronously and select their actions through an initial distribution. Furthermore, the agents add a neural network to receive the initial state function $Q(s_t, a_t)$ and measure $A(s_t, a_t)$ in order to improve model efficiency. Note that the action $A$ is controlled by the switching (ON/OFF) method. Thus, we fix the action as $a_k \in \{0, 1\}$: if BS $k$ is in OFF mode, $a_k = 0$; otherwise $a_k = 1$. The reward is defined by the function $R(s, a) = \beta \cdot \xi(s, a) + P(s, a) \cdot \alpha$, where $\alpha$ and $\beta$ are the constants weighting the objectives of EC and device QoS satisfaction, respectively. The reward model independently learns and adapts the scale of its value to fit the updated scenario according to the score-based merging mechanism, in which the constant $\alpha$ is defined through a function $\sigma(\cdot)$. After completing the local training process, each agent sends its trained model to the global server. Finally, all device agents receive the globally produced model at the same time. The agents synchronously resume the learning process using the given global model.
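The train-locally, merge-globally, resume cycle described above can be illustrated with a deliberately simplified sketch. It swaps the neural network for a tabular Q-function and merges by plain averaging rather than the score-based mechanism, so it shows only the synchronization pattern; the toy environment, reward values, and hyperparameters are all assumptions for illustration.

```python
import random

STATES = ("low", "high")   # hypothetical traffic-load states
ACTIONS = (0, 1)           # 0 = BS OFF, 1 = BS ON
# Hypothetical scalar rewards already combining energy and satisfaction.
REWARD = {("low", 0): 0.9, ("low", 1): 0.6, ("high", 0): 0.3, ("high", 1): 0.6}
STAY_PROB = {"low": 0.7, "high": 0.6}

def step(state, action, rng):
    """Return the reward and the next traffic state."""
    reward = REWARD[(state, action)]
    if rng.random() < STAY_PROB[state]:
        return reward, state
    return reward, ("high" if state == "low" else "low")

def local_training(q_table, steps, rng, lr=0.1, gamma=0.5, epsilon=0.2):
    """Epsilon-greedy Q-learning starting from the shared global table."""
    q = dict(q_table)
    state = rng.choice(STATES)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        reward, next_state = step(state, action, rng)
        target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += lr * (target - q[(state, action)])
        state = next_state
    return q

def federated_q_learning(num_agents=3, rounds=20, local_steps=200, seed=0):
    """Each round: agents train locally, the server averages their tables."""
    rngs = [random.Random(seed + i) for i in range(num_agents)]
    global_q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(rounds):
        local_qs = [local_training(global_q, local_steps, rng) for rng in rngs]
        global_q = {key: sum(q[key] for q in local_qs) / num_agents for key in global_q}
    return global_q

global_q = federated_q_learning()
policy = {s: max(ACTIONS, key=lambda a: global_q[(s, a)]) for s in STATES}
print(policy)
```

All agents start every round from the same averaged table, mirroring the synchronous resume step; with a neural network the same loop would average layer weights instead of table entries.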

Performance Evaluation
This part evaluates our proposed framework against other algorithms. Simulation and experimental findings reveal that our proposed Fed-DRL significantly improves EC and meets D-QoS satisfaction in a distributed network environment. We compare our framework against Q-learning, deep DQN, and dueling DQN, and we adopt performance metrics in terms of energy, D-QoS, and convergence analysis under different mobility scenarios to test the robustness of our proposed framework. The BSs are arbitrarily distributed within the coverage range of the distributed network. We presume that our distributed network system initially handles a maximum of 10 BSs. The device bandwidth is set at 20 MHz, and the device sensitivity level for edge devices is set at -120 dBm. The population of participating devices varies according to dynamic mobility to reflect the profile of the traffic model.
Furthermore, we carried out all simulations in a Python 3.8 environment. We experiment on an Ubuntu 20.04 operating system with 16 GB of RAM and an RTX 2060 12-core GPU.
Fed-DRL is implemented with both the TensorFlow and Keras Python libraries. Table 2 shows other simulation parameters of our experiment.

EC and device QoS satisfaction
In our simulation, we consider 1 h as a decision cycle and observe the performance within 24 h as an episode, as shown in Figure 1.
We see that increased traffic load is linked to a high amount of stabilized energy use by BSs.This is because more BSs are needed to meet high device demands.
To control the total amount of energy BSs use, our proposed Fed-DRL algorithm performs better than the benchmarked algorithms in light-load and heavy-load situations within an episode.Based on the results of the experiments, the training performance that was set helps to use the least amount of energy.
Relatively greater EC is seen in Q-learning, which also suffers from the curse of dimensionality as the number of participating devices and BSs increases.We measure our D-QoS satisfaction in Figure 2 to determine the superiority of the switching procedure in the distributive network.The assessment value for an appropriate distribution of resources to devices is QoS satisfaction.Comparing D-QoS satisfaction to the benchmark algorithms, we observe that Fed-DRL recorded the highest device level of satisfaction.
We can clearly state that using Q-learning increases OPEX in terms of energy utilization. The participating devices' level of QoS satisfaction is reduced by 33% and 25%, respectively, as the episode increases. For participating devices, the deep DQN and dueling DQN maintain an average QoS satisfaction level of about 52%. The proposed Fed-DRL recorded a device QoS satisfaction of about 67%.

Effects of decision epoch
We illustrate the EC variance and device satisfaction variance in Figures 3 and 4 under diverse mechanism schemes.
The mobility behavior of devices under diverse epochs permits us to describe statistically how individual BSs react to their respective mobility trails in the distributive network. In this case, we examine the differences at decision epochs of 10, 20, 40, and 60 min. We observed that a longer decision epoch increased the variance of both participants' satisfaction and BS EC. This indicates that the variance rises with the decision epoch and that the algorithm takes a long time to converge. However, as illustrated, our proposed Fed-DRL outperforms the benchmark algorithms for D-QoS satisfaction and BS EC. For BS EC, Figure 3 indicates that Fed-DRL recorded the least variance as the decision epoch increased. For the dueling DQN, the EC variance is reduced when the decision epoch is changed from 10 to 60.
Similarly, deep DQN recorded an approximate reduction when observed between 10 and 60 min. In the proposed framework, we find that setting the decision epoch to 10 min yields the best EC of the BS. Deep DQN and dueling DQN recorded very high variance owing to their poor convergence.
Figure 4 shows a comparison of the changes in D-QoS satisfaction in the same scenario. Our results show that the device requirement for D-QoS satisfaction is met as decision-making periods increase. We also found that the system is very unstable when the decision epoch is high, because of the time between decision epoch 40 and decision epoch 60. Our proposed Fed-DRL has a significance value of about 0.008, which is lower than Q-learning, deep DQN, and dueling DQN, which have values of 0.013, 0.033, and 0.028, respectively. The proposed framework has a fairly low average value, which shows that it is robust and can converge for different decision epochs.

Convergence analysis
In this part, we concentrate only on the convergence analysis of the proposed Fed-DRL framework in terms of cumulative reward, BS EC, and D-QoS satisfaction. This is because the results in preceding works indicate a slow convergence rate in the Q-learning, deep DQN, and dueling DQN algorithms. For our convergence performance, simulation was conducted under various distributed traffic loads. As shown in Figure 5, the Fed-DRL framework converges quickly, at about Episode 260, in terms of cumulative reward. The framework displays consistently lower average EC levels than the results for Q-learning, deep DQN, and dueling DQN.
Similarly, the framework recorded a 0.5 satisfaction threshold, while most participating devices in deep DQN and dueling DQN do not converge to the desired satisfaction.The Fed-DRL framework achieves the required satisfaction with lower EC, as it can attain stable convergence.Since our framework trains distributively, it is observed that EC and device satisfaction increase significantly before convergence while minimizing their variance until they attain convergence.
With the convergence performance, we concluded that the proposed framework is robust in a distributed scenario.Our proposed framework seems to outperform deep DQN and dueling DQN because it converges faster at about episode 170.This indicates significant regularity in terms of network scalability.Also, we show that our proposed framework can generate switching configurations that balance all devices and keep track of how much energy they use in a distributed network.
While FL offers several advantages, it has limitations and potential drawbacks, particularly in the above scenario. Communication overhead is one of the drawbacks of this approach. Communication is typically slower and less reliable in a wireless network than in wired networks. Transmitting model updates and gradients over wireless connections can introduce delays and increase communication overhead, impacting the training process. The limited bandwidth and potential packet loss in wireless networks can hinder the efficiency of our proposed approach. In addition, FL addresses privacy concerns by keeping the data decentralized and performing model updates locally. However, wireless networks, especially public or unsecured networks, pose additional privacy and security risks. Malicious actors could attempt to intercept or manipulate the communication during the FL process, potentially compromising the integrity and privacy of the data or model.

Conclusion
In this work, we propose a switching framework for EM using Fed-DRL. Specifically, we employ FL to help train the DRL agents efficiently because of its ability to improve learning performance, solve EM problems, and finally outperform the traditional DQN in training.
Finally, the framework solves the EM problem by balancing the levels of EC of the BS and achieving device satisfaction.

Algorithm 1 (local device update)
for each local epoch i from 1 to E do
    batches ← data P_i split into batches of size B
    for batch b in batches do
        ω ← ω − η∇π(ω; b)
    end for
end for
return ω to the server

Figure 1 Stabilized energy consumption

Figure 2 Algorithm comparison on device QoS satisfaction

Figure 4 Effects of decision epochs for Fed-DRL in device satisfaction

Table 1
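Major notation
$\rho^*_k(t)$: The normalized traffic load on BS k
$r_{ik}$: Transmit rate of the device i attached to the BS k
$\tau_{ik}$: The average delay experienced by the device i on BS k
$T$: Traffic intensity coefficient in the traffic model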

Table 2
Simulation parameters