An improved logistic probability prediction model for water shortage risk in situations with insufficient data

In drought years, it is important to have an estimate or prediction of the probability that a water shortage risk will occur to enable risk mitigation. This study developed an improved logistic probability prediction model for water shortage risk in situations when there is insufficient data. First, information flow was applied to select water shortage risk factors. Then, the logistic regression model was used to describe the relation between water shortage risk and its factors, and an alternative method of parameter estimation (maximum entropy estimation) was proposed for situations where insufficient data were available. Water shortage risk probabilities in Beijing were predicted under different inflow scenarios by using the model. There were two main findings of the study. (1) The water shortage risk probability was predicted to be very high in 2020, although this was not the case in some high inflow conditions. (2) After using the transferred and reclaimed water, the water shortage risk probability declined under all inflow conditions (59.1% on average), but the water shortage risk probability was still high in some low inflow conditions.

* Correspondence to: Ren Zhang, Institute of Meteorology and Oceanography, National University of Defense Technology, Nanjing, China, 211101. E-mail: zrpaper@163.com


Introduction
Water shortages have become a serious problem in many parts of the world due to climate change, heightened demand for water and integrated urbanization, and they have a negative impact on the security and sustainable development of water resources (Giacomelli et al., 2008; Weng et al., 2015; Christodoulou, 2011; Wang et al., 2012; Yang et al., 2015; Qian et al., 2014; Li et al., 2014). Risk is a measure of the probability and severity of adverse effects (Haimes, 2009). It is important to have an estimate or prediction of the probability that a water shortage risk will occur so that effective measures for risk mitigation can be developed, particularly in the case of precipitation deficits (drought). Hashimoto et al. (1982) stated that risk can be described by the probability that a system is in an unsatisfactory state. How to predict or estimate risk probability is still an open issue with no definite solution. Mackenzie (2014) believed that an analyst should first develop a probability distribution over the range of consequences that fully describes the risk of an event. The simulation of a probability distribution should be based on a large amount of data (Bedford and Cooke, 2001; Giannikopoulou et al., 2015). Unfortunately, a full probabilistic assessment is generally not feasible, because there is insufficient data to quantify the associated probabilities (Tidwell et al., 2005).
In some cases, frequency is used as a substitute for probability in the risk assessment of water resources (Hashimoto et al., 1982; Rajagopalan et al., 2009; Sandoval-Solis et al., 2011), while in other cases, interval-valued probabilities and fuzzy probabilities have been proposed to elaborate the concept of an imprecise probability (Karimi and Hüllermeier, 2007). However, these approaches only consider the probability of the hazard without consideration of the impact of risk factors. The risk factors include characteristics of hazards and existing conditions of vulnerability that could potentially harm exposed people, property, services and so on (UNISDR, 2009). There are many aspects of vulnerability arising from various physical, social, economic, and environmental factors (Qian et al., 2016; Haimes, 2006; UNISDR, 2009). Therefore, it has been concluded that modeling risk probability requires a consideration of vulnerability (Haimes, 2006). Although increasing attention has been given to vulnerability assessment (Villagrán, 2006; Plummer, 2012), there have been few studies of the relation between risk probability and water resources vulnerability.
A water shortage can either occur or not occur, and therefore water shortage risk is a binary categorical variable. According to statistical theory, a logistic regression model is a nonlinear regression method of studying a binary categorical or multi-categorical variable and its impact factors (Breslow, 1988). Therefore, a logistic regression model can be used to describe the relation between water shortage risk and its impact factors. The parameters of a logistic regression model are often estimated by maximum likelihood estimation; a large number of observed values of risk (i.e., samples in which water shortage risk does or does not occur) and risk factors are required for parameter estimation (Balakrishnan, 1992). However, the statistical data about risk and its factors are insufficient in China, and maximum likelihood estimation is not applicable when the sample size is small. For this reason, we propose an alternative method of parameter estimation for a logistic regression model when data are insufficient. Moreover, the backward mode is often applied for the selection of sensitive factors, but the calculation is very complicated.
The contributions of our paper are as follows. First, we used a logistic regression model to explore the nonlinear relation between water shortage risk and its factors. Then, we introduced information flow (Liang, 2014) for the selection of significant risk factors. Compared with the backward mode, it was very easy to determine whether there was a cause-and-effect relation between the water shortage risk and its factors. Finally, we proposed an alternative method of parameter estimation (maximum entropy estimation) for a logistic regression model in situations with a lack of data. The new method requires only a small amount of data, while maximum likelihood estimation requires a large amount of data.
The remainder of the paper is organized as follows. Section 2 presents the principles and structure of the logistic probability prediction model for water shortage risk. Section 3 presents the application of the model and the results of the research, and Section 4 presents some conclusions and proposes future work.

Study area
Beijing, China's capital, is located in the northwest of the North China Plain, and consists of five water systems from the east to the west (Figure 1). The average annual precipitation is 585 mm. Precipitation in summer accounts for 70% of the total for the whole year. Beijing, with a population of more than 20 million, is faced with a severe shortage of water resources. The amount of self-generated water resources is only $37.39 \times 10^{8}$ m$^3$. The amount of water resources per capita is about 200 m$^3$, which is about one eighth of the value of water resources per capita for China and one thirtieth of the global value of water resources per capita.
The available surface water and groundwater are unable to meet the needs of the city's economic and social development. Some measures, such as the use of transferred and reclaimed water, have been put in place to mitigate the water shortage.
In 2014, through the South-to-North Water Diversion Project, water was channeled from the Danjiangkou Reservoir in central China's Hubei province to Beijing.
Reclaimed water is also essential for Beijing and is mainly used for agricultural irrigation and toilet flushing.

Data collection
The data used in this paper were obtained from various sources. The inflow and

Model development
A flowchart showing the operation of the probability prediction model for water shortage risk is given in Figure 2.

Identification of water shortage risk factors
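Water shortage risk factors include characteristics of hazards and existing conditions of water resources vulnerability. Water resources vulnerability refers to the manifestation of the inherent states (e.g., physical, social, and ecological) of the water resources system that cause the system to be liable to a water shortage (Qian et al., 2016). According to the study of Plummer et al. (2012), the indicators used by existing water vulnerability assessment tools differ considerably. The indicators used here are defined as follows (Qian et al., 2014):

$$W_p = \frac{W}{N} \tag{1}$$

where $W_p$ is the water resources per capita, $W$ is the total amount of water resources, and $N$ is the population size.

$$W_c = \frac{\text{water consumption}}{\text{GDP}} \tag{2}$$

where $W_c$ is the water consumption per GDP.

$$U_r = \frac{W_{ss} + W_{gs}}{W} \tag{3}$$

where $U_r$ is the utilization rate of water resources, $W_{ss}$ is the surface water supply, $W_{gs}$ is the groundwater supply, and $W$ is the total amount of water resources.

$$DS_r = \frac{DS_t}{DS} \tag{4}$$

where $DS_r$ is the treatment rate of domestic sewage, $DS_t$ is the amount of sewage treated and $DS$ is the total amount of sewage discharged.

$$S_r = \frac{W_{as}}{W_{td}} \tag{5}$$

where $S_r$ is the satisfactory rate of water demand, $W_{as}$ is the water supply, and $W_{td}$ is the water demand.

$$IW_p = \frac{IW}{WU} \tag{6}$$

$$AW_p = \frac{AW}{WU} \tag{7}$$

$$DW_p = \frac{DW}{WU} \tag{8}$$

where $IW_p$, $AW_p$ and $DW_p$ are the proportions of industrial, agricultural and domestic water use, $IW$ is the industrial water use, $AW$ is the agricultural water use, $DW$ is the domestic water use and $WU$ is the total water use.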

Selection of important risk factors
The purpose of this section was to select the important factors that have a significant impact on water shortage risk. Liang (2014) reported that the cause and effect between two time series can be measured by the time rate of information flowing from one series to the other, and proposed a concise formula for causal analysis in which causality is measured by information flow. Therefore, we can use the information flow to unravel the cause-effect relation between the risk factors and water shortage risk.
According to Liang (2014), for series $X_1$ and $X_2$, the rate of information flowing (units: nats per unit time) from the latter to the former is

$$T_{2\to 1} = \frac{C_{11}C_{12}C_{2,d1} - C_{12}^{2}C_{1,d1}}{C_{11}^{2}C_{22} - C_{11}C_{12}^{2}}$$

where $C_{ij}$ is the sample covariance between $X_i$ and $X_j$, $C_{i,dj}$ is the covariance between $X_i$ and $\dot{X}_j$, and $\dot{X}_j$ is the difference approximation of $dX_j/dt$ using the Euler forward scheme:

$$\dot{X}_{j,n} = \frac{X_{j,n+k} - X_{j,n}}{k\,\Delta t}$$

with $k \ge 1$. According to Liang (2014), $k = 1$ would be suitable for a general time series. If $T_{2\to 1} = 0$ or the absolute value of $T_{2\to 1}$ is less than 0.01, $X_2$ does not cause $X_1$; otherwise it is causal. A positive $T_{2\to 1}$ means that $X_2$ functions to make $X_1$ more uncertain, while a negative value means that $X_2$ tends to stabilize $X_1$. Liang (2015) proposed a method of normalizing the causality between time series, and the normalized value of $T_{2\to 1}$ ranges between 0 and 1.
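The information flow estimate can be illustrated with a minimal sketch in plain Python. It implements the Liang (2014) estimator in its standard published form; the function names (`cov`, `information_flow`, `simulate`) and the synthetic series are illustrative assumptions and are not taken from the paper.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def information_flow(x1, x2, k=1, dt=1.0):
    """Estimate T_{2->1}, the information flow from x2 to x1 (nats per unit time)."""
    # Euler forward difference approximating dX1/dt with step k
    dx1 = [(x1[n + k] - x1[n]) / (k * dt) for n in range(len(x1) - k)]
    a, b = x1[:len(dx1)], x2[:len(dx1)]
    c11, c12, c22 = cov(a, a), cov(a, b), cov(b, b)
    c1d1, c2d1 = cov(a, dx1), cov(b, dx1)
    numerator = c11 * c12 * c2d1 - c12 ** 2 * c1d1
    denominator = c11 ** 2 * c22 - c11 * c12 ** 2
    return numerator / denominator

def simulate(n=5000, seed=0):
    """Hypothetical test series in which x2 drives x1 but not the reverse."""
    rng = random.Random(seed)
    x1, x2 = [0.0], [0.0]
    for _ in range(n - 1):
        x1.append(0.5 * x1[-1] + 0.8 * x2[-1] + rng.gauss(0, 1))
        x2.append(0.6 * x2[-1] + rng.gauss(0, 1))
    return x1, x2

x1, x2 = simulate()
t21 = information_flow(x1, x2)  # flow from x2 to x1, expected to be clearly non-zero
t12 = information_flow(x2, x1)  # flow from x1 to x2, expected to be near zero
```

On this synthetic pair the estimated flow from `x2` to `x1` should exceed the 0.01 threshold by a wide margin, while the reverse flow should stay close to zero, which is the asymmetry used to screen risk factors.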

Correlation analysis of selected risk factors
In theory, a probability prediction model requires variables to be mutually independent. Therefore, it is necessary to perform a correlation analysis. Because all of the factors are continuous variables, Pearson correlation coefficients are applied. If the absolute value of the correlation coefficient is greater than 0.5, the two factors are considered to be significantly correlated.
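The screening rule above can be sketched as follows. The helper names and the three short factor series are hypothetical and serve only to show how pairs with an absolute Pearson coefficient above 0.5 would be flagged.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def correlated_pairs(factors, threshold=0.5):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(factors)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(factors[a], factors[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical factor series, not data from the paper
factors = {
    "P": [1, 2, 3, 4, 5],
    "W_p": [2, 4, 6, 8, 10],
    "U_r": [1, -1, 0, -1, 1],
}
flagged = correlated_pairs(factors)
```

In this toy input only the pair (`P`, `W_p`) is flagged, so one of the two would be a candidate for removal before fitting the model.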

Risk probability prediction model using maximum entropy estimation
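A logistic regression model is a nonlinear regression method of studying a binary categorical variable and its impact factors. The model is

$$P = \frac{1}{1 + \exp\left[-\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m\right)\right]}$$

where $P$ is the probability that water shortage risk occurs, $x_1, \ldots, x_m$ are the risk factors, and $\beta_0, \beta_1, \ldots, \beta_m$ are the estimated parameters. The parameters are often determined by maximum likelihood estimation. The log-likelihood equation for computing $\beta_0, \beta_1, \beta_2, \ldots, \beta_m$ is as follows:

$$\ln L(\beta) = \sum_{i=1}^{n}\left[y_i \ln P_i + (1 - y_i)\ln(1 - P_i)\right] \tag{12}$$

where $y_i$ is the observed value of risk for the $i$th observation and $P_i$ is the corresponding predicted probability. According to Eq. (12), a large number of observed values of risk ($y_i$) and its factors are required for parameter estimation. Unfortunately, the paired samples of risk and its controlling factors are insufficient and fall far short of what is needed to estimate the parameters. In this case, maximum likelihood estimation is not applicable, and an alternative approach for parameter estimation is therefore required.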
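To make the data requirement concrete, the logistic probability and the standard log-likelihood it is usually fitted with can be sketched as below. The function names and the symbol `ys` for the observed 0/1 risk values are assumptions for illustration; the point is that every term of the sum needs an observed outcome.

```python
import math

def logistic_probability(beta, x):
    """P = 1 / (1 + exp(-(beta[0] + sum_j beta[j] * x[j-1])))."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, xs, ys):
    """Standard logistic log-likelihood; ys holds observed 0/1 outcomes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic_probability(beta, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# With all parameters at zero every observation has P = 0.5
p_half = logistic_probability([0.0, 0.0], [3.0])
ll = log_likelihood([0.0, 0.0], [[1.0], [2.0]], [1, 0])
```

Because `log_likelihood` cannot be evaluated without `ys`, maximum likelihood estimation is only as reliable as the number of paired risk observations available.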
Thus, we proposed a new parameter estimation method based on the maximum entropy principle, which we call maximum entropy estimation. The new method does not require observed values of risk; it requires only some observed values of the factors. Its principle is as follows.
For an observation, we can define its entropy to evaluate its degree of uncertainty.
According to Jones and Jones (2000), the entropy $H(P_i)$ of the $i$th observation of water shortage risk is defined up to a positive constant $C$. According to the maximum entropy principle, if the value of $H(P_i)$ reaches a maximum, the optimal parameters are obtained (Jones and Jones, 2000). The reasons for obtaining a solution based on the maximum entropy principle are as follows. ① It conforms to the principle of entropy increase, which states that the entropy of an isolated system tends to reach a maximum. ② It accords with the principle that the solution should be in line with the sample data and that the fewest possible hypotheses should be made regarding the unknown parts when the data are insufficient. ③ It fits the maximum multiplicity principle. The multiplicity of a state refers to the number of possible ways in which a system can evolve to that state. The maximum multiplicity principle states that the greater the multiplicity of a state, the larger the possibility that a system is in this state.
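As an illustration of the entropy of a single observation, the sketch below assumes the usual Shannon form for a two-outcome variable, $H(P_i) = -C\,[P_i \ln P_i + (1 - P_i)\ln(1 - P_i)]$. This form and the function name `binary_entropy` are assumptions; the exact expression used by the authors is not recoverable from the text.

```python
import math

def binary_entropy(p, c=1.0):
    """Shannon entropy (in nats, scaled by c > 0) of a two-outcome event with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -c * (p * math.log(p) + (1 - p) * math.log(1 - p))

h_uncertain = binary_entropy(0.5)   # largest possible uncertainty
h_confident = binary_entropy(0.9)   # smaller: the outcome is fairly predictable
h_certain = binary_entropy(1.0)     # zero
```

Under this assumed form, the entropy is largest when the predicted probability is 0.5 and falls to zero as the prediction becomes certain, which matches the statement that greater entropy means greater uncertainty of an observation.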

Parameter estimation
Based on the analysis above, an optimization model can be constructed. According to the extremum theory of multivariate functions (Khuri, 2003), the optimal estimate of the parameters is the one at which the entropy reaches its maximum value. According to the maximum entropy principle, the greater the entropy is, the larger the uncertainty of an observation is. Therefore, the maximum value of the entropy sequence is taken as the objective function of the optimization model.

Goodness-of-fit test
According to Brown (1982), a goodness-of-fit test should be made for evaluating the fitting effect of the logistic regression model and its ability to identify water shortage risk. In this study, the Kolmogorov-Smirnov (K-S) test and the Pearson $\chi^2$ test are used.
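Both statistics can be computed in a few lines. The sketch below assumes the textbook definitions, with the two-sample K-S value taken as the largest gap between two empirical distribution functions and the Pearson statistic as $\sum_j (O_j - E_j)^2 / E_j$; the function names and sample values are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    grid = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in grid)

def pearson_chi2(observed, expected):
    """Pearson chi-square statistic from observed and predicted frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical predicted probabilities for years without and with a shortage
no_shortage = [0.1, 0.2, 0.3]
shortage = [0.7, 0.8, 0.9]
ks = ks_statistic(no_shortage, shortage)     # fully separated samples
chi2 = pearson_chi2([10, 20], [15, 15])      # hypothetical frequencies
```

A K-S value close to 1 means the predicted probabilities of the two groups barely overlap, that is, the model separates shortage from non-shortage samples well.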

K-S test
A K-S test is often applied as a fitting test. It can be used to test the ability of the model to identify water shortage risk. The value of K-S is between 0 and 1; the greater the value is, the better the logistic model is. The idea is as follows.
Let $F_{n_1}(x)$ be the cumulative probability distribution of the samples that do not encounter a water shortage, and let $F_{n_2}(x)$ be the cumulative probability distribution of the samples that encounter a water shortage. A two-independent-samples test is then applied to compare whether the empirical distribution functions of the two samples are the same. The value of K-S is:

$$KS = \max_{x}\left|F_{n_1}(x) - F_{n_2}(x)\right|$$

When $N \to \infty$, the cumulative distribution curve and probability density curve of the two samples can be obtained. The value of K-S is the maximum difference between the two cumulative distribution functions. When the value of K-S is greater than 0.35, the logistic regression model is applicable. The international classification standard of the logistic model is shown in Table 1 (Brown, 1982). The expression of the $\chi^2$ statistic is as follows:

$$\chi^2 = \sum_{j=1}^{l}\frac{(O_j - E_j)^2}{E_j}$$

where $j = 1, 2, \ldots, l$, $l$ is the number of covariate types, $O_j$ is the observed frequency of the $j$th covariate type, and $E_j$ is the predicted frequency of the $j$th covariate type. The number of degrees of freedom is the difference between the number of covariate types and the number of parameters.

Results and discussion
In this section, a logistic probability prediction model for water shortage risk is constructed and discussed, and the risk probability in 2020 in Beijing is predicted using the proposed model.

Construction of the Logistic probability prediction model
Therefore, only 34 years of data are available.

Determination of water resources vulnerability indicators
Based on the risk factor sequences from 1979 to 2012 (Table 2) and the method of normalized information flow (Liang, 2015), the values of normalized information flow from the factors to risk are shown in Table 3. According to the normalized information flow results (Table 3), the value of the normalized information flow

Construction of the logistic risk probability prediction model
Based on Eq. (20), the predicted probability values of water shortage risk obtained by the maximum entropy estimation are shown in Fig. 3. If 0.5 is taken as the threshold used to judge whether water shortage risk occurs, then the prediction accuracy of the maximum entropy estimation can be obtained, as shown in Table 5. From Table 5, it can be seen that the average accuracy rate using the maximum entropy estimation was very high (91.18%). The maximum entropy estimation does not need observed values of risk, whereas the maximum likelihood estimation needs a large number of observed values of risk. Based on the results of the K-S test and the Pearson $\chi^2$ test, it was concluded that the model was applicable.
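The accuracy calculation with a 0.5 threshold can be sketched as follows. The function names, the choice to count a probability of exactly 0.5 as a shortage, and the five example values are all assumptions made for illustration.

```python
def classify(probabilities, threshold=0.5):
    """Label an observation 1 (shortage) when P >= threshold, otherwise 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(predicted, observed):
    """Share of observations whose predicted label matches the observed one."""
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)

# Hypothetical predicted probabilities and observed outcomes
probs = [0.97, 0.62, 0.41, 0.08, 0.55]
observed = [1, 1, 1, 0, 0]
labels = classify(probs)
rate = accuracy(labels, observed)
```

For scale, 31 correct classifications out of 34 observations would correspond to an accuracy of 91.18%.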

Risk probability prediction (without considering the use of transferred and reclaimed water)
Because the inflow of 2020 is unknown, the inflow condition in 2020 was assumed to be any of the annual inflow conditions from 1956 to 2012. In this section we predict the risk probability of 2020 under the different inflow conditions from 1956 to 2012. The sequences for the risk factors ($W_c$, $W_p$, $U_r$, $P$ and $DS_r$) were obtained and computed as follows. The precipitation in 2020 is assumed to be any of the annual precipitation values from 1956 to 2012. First, an analysis of the balance between water supply and demand was performed, and the sequences of water supply and demand under the inflow scenarios of 1956-2012 were obtained (Qian et al., 2016). The GDP of 2020 was the sum of the gross agricultural product, gross industrial product, and gross product of the tertiary industry (details of the tertiary industry are shown in Appendix A), using information taken from the literature, and was estimated to be 4711.852 billion CNY (Qian et al., 2016). $N$ (the population size of 2020) was 24.43 million (Qian et al., 2016). The total amounts of water resources from 1956 to 2012 were considered to constitute fifty-seven possible values of water resources in 2020. Substituting the total water resources sequence and $N$ of 2020 into Eq. (1), the sequence of $W_p$ could be computed. Substituting the water demand sequence and GDP of 2020 into Eq. (2), the sequence of $W_c$ could be computed. Substituting the sequences of the total water resources and water supply for 2020 into Eq. (3), the sequence of $U_r$ could be obtained. Substituting the sequences of $W_c$, $W_p$, $U_r$, $P$ and $DS_r$ into Eq. (20), the probability that a water shortage risk will occur in 2020 under the inflow scenarios of 1956-2012 was predicted, and is shown in Figure 4.
In Figure 4, the horizontal axis represents the inflow conditions of 1956-2012.
Figure 4 shows that in 2020, the water shortage risk probability exceeded 0.95 under 33 different inflow conditions (accounting for 63.5% of all the inflow conditions) and exceeded 0.5 under 38 different inflow conditions (accounting for 73.1% of all the inflow conditions). In summary, there was a high probability of a water shortage risk in 2020, although the probability was very low in some high precipitation periods.

Figure 1. Distribution of water system of Beijing

The inflow and precipitation sequences from 1956 to 2012 were provided by Beijing Hydrological Station. The water demand for 2020 was based on the Beijing City National Comprehensive Plan for Water Resources (Beijing Municipal Development and Reform Commission and Beijing Municipal Bureau of Water Affairs, 2009). The water supply sequence for 2020 in the inflow conditions of 1956-2012 was computed by an analysis of the balance between water supply and water demand. The population size and gross domestic product (GDP) from 1979 to 2012 were taken from the Statistical Yearbook 2014 of Beijing City (Statistical Bureau of Beijing City, 2014). The total amounts of water resources from 1979 to 2012 were provided by Beijing Hydrological Station. The water use statistics and data regarding the treatment of domestic sewage from 1979 to 2012 were taken from the Statistical Yearbook 2014 of Beijing City (Statistical Bureau of Beijing City, 2014).

Figure 2. Flowchart showing the operation of the improved probability prediction model for water shortage risk

According to Plummer et al. (2012), there are 50 different water vulnerability assessment tools, and the water vulnerability indicators of these tools are quite different. Therefore, a universal standard understanding of water resource vulnerability indicators is difficult to develop. We established the indicators from the perspective of hydrological conditions, water resources, water supply and water use. The risk factors are: precipitation ($P$), water resources per capita ($W_p$), water consumption per GDP ($W_c$), satisfactory rate of water demand ($S_r$), utilization rate of water resources ($U_r$), proportion of industrial water use ($IW_p$), proportion of agricultural water use ($AW_p$), proportion of domestic water use ($DW_p$) and the treatment rate of domestic sewage ($DS_r$).

The normalized information flow from $AW_p$ to water shortage risk is only 0.0031, which is very small. It was concluded that $AW_p$ does not result in a water shortage risk. Therefore, $AW_p$ was removed from the risk factors.

Figure 3. The predicted probability generated by the maximum entropy estimation from 1979 to 2012

The $DS_r$ of 2020 was about 90% (Beijing Municipal Development and Reform Commission and Beijing Municipal Bureau of Water Affairs, 2009).

Table 1. The international classification standard of the logistic model

$$H_0: \text{the fitting is good}; \qquad H_1: \text{the fitting is bad} \tag{18}$$

A sequence of risk factors was obtained for the period from 1979 to 2012, computed based on Eqs. (1)-(8). The risk sequence

Table 2. The values of the risk factors and risk from 1979 to 2012 (column headings include $U_r$, $P$ (mm) and $DS_r$ (%))

Table 4. Pearson correlation coefficients for the relations between various factors

Table 5. The prediction accuracy using the maximum entropy estimation

The K-S test and the Pearson $\chi^2$ test were performed and the results of the tests were obtained. The value of K-S is 0.955, which exceeds the applicability threshold of 0.35 for the logistic probability model (Table 1). The critical $\chi^2$ value was equal to 4.605 and was much greater than the computed value of 2.333. Therefore, the null hypothesis was accepted, i.e., the fitting of the model was very good.