The influence analysis of criteria of comparables’ selection on the accuracy of the property value estimation

Jacek Zyga

doi:10.2478/rgg-2020-0002

1. Introduction

The problem of determining regression model parameters has already been resolved and described in numerous publications: from basic, purely theoretical works (Rao et al., 2008; Sen and Srivastava, 1994), through application descriptions (e.g., Vecchia (1988); Weisberg (1980)), up to summarising works (Glumac and Des Rosiers, 2018; Manly and Alberto, 2016; McCluskey and Borst, 2017). Against the background of this literature, the application of regression analysis to real property valuation is also seen as a well-explored field. Because of the large number of relevant references, only a few are listed here: Albritton (1982); d’Amato and Kauko (2017); Bruce and Sundell (1977); French (2003, 2004); Isakson (1986); McCluskey and Borst (1997, 2017); Pagourtzi et al. (2003); Peto et al. (1996); Schlaes (1984); Shenkel and Eidson (1971); Skaff (1975); Tchira (1979); Thompson and Gordon (1987). Despite the extensive literature in this area, very little attention has been paid to the question of the similarity of sold properties. Apart from scattered voices of valuation practitioners, comments on selecting comparable objects on the basis of their mutual likeness appeared only in a few, rather old scientific publications, for example, Shenkel and Eidson (1971); Skaff (1975); Tchira (1979) or Czaja (1997). At the heart of these concepts was a collection of objects that were perhaps few, but in a certain way similar to the valued property. Today’s trends in the development of real estate valuation methodologies favour the method of data processing over the original information and its usefulness for specific valuations. The issue of similarity receives less and less attention. In some areas of the economy (mainly banking and taxation), econometric techniques are recommended as the overarching techniques.
This is contrary to the regulations of many professional associations (International Valuation Standards Council (IVSC), 2010; Royal Institution of Chartered Surveyors (RICS), 2017; Standards’ Commission of Polish Federation of Valuer’s Associations (SCPFVA), 2009) and also to direct legal directives (Bundesministerium für Umwelt Naturschutz Bau und Reaktorsicherheit (BMUB), 2006; ImmoWertV, 2009; Act, 1997; Regulation, 2004). The aims of this paper are to direct attention again to the role of similarity, seen as a factor that can define the scope of a local real estate (or land) property market, and to demonstrate the relationship between the degree of similarity of the collected comparables and the accuracy of the resulting estimations.

2. Property value model

The linear price model consists of several structural parameters x_j and the corresponding observations a_ij of the features of the sold properties. Combinations of those components, juxtaposed with the sale prices y_i, create the price model equation (1).

(1)

$$y_i = a_{i0}x_0 + a_{i1}x_1 + \ldots + a_{ij}x_j + \ldots + a_{ik}x_k + \varepsilon_i,$$

where:

i = 1, 2, …, n, i ∈ ℕ, where n is the number of observations y_A,
j = 1, 2, …, k, j ∈ ℕ, where k is the number of structural parameters x_A,
a_i0 = 1.

A set of such equations can be expressed as a system in matrix form (2). The solution of the system (2) for the observations Y_A and A is given (according to the Gauss-Markov theorem) by the best linear unbiased estimator of the structural parameters X̂_A (3), which satisfies the minimum condition (4):

(2)

$$Y_A = A X_A + \varepsilon_A,$$

(3)

$$\hat{X}_A = (A^T A)^{-1} A^T Y_A,$$

(4)

$$(Y_A - A(A^T A)^{-1}A^T Y_A)^T (Y_A - A(A^T A)^{-1}A^T Y_A) = \min_{x_0, \ldots, x_k}.$$

To emphasize the relationship between specific Y and X vectors, expressed in the dimensions n and k of matrix A, additional subscripts are used in the following equations. This matters because selecting elements y_i, and the resulting need to eliminate those independent variables a_ij that are too poorly correlated with the vector y_i, also changes the shape of matrix A itself.

Elements ε_{A_i}, collected in the vector ε_A, are the stochastic components of each i-th equation.

In property valuation, the variable Y_A can be understood as the vector of sold property prices y_{A_i} (usually converted into prices per unit of area), while the matrix A is the set of market feature evaluations recorded during the survey of comparable properties.

The vector of price observations Y_A and the set of market feature evaluations A, together with the set of listed attributes Ψ_A, create an information system (5), defined by Pawlak (1981, 1983) as:

(5)

$$\langle Y_A, \Psi_A, V_a, a(y, x) \rangle,$$

where:

Y_A is the set of price observations,
Ψ_A is the list of attributes (market features); in (2) each element of Ψ_A has its own value shown in X_A,
V_a is the set of all possible values of attributes (market features),
a(y, x) are the values of attributes, i.e. the relationship between Y_A and Ψ_A such that Y_A × Ψ_A → A.

In this paper, for short, it will be presented as:

(6)

$$\langle Y_A, \Psi_A, A \rangle$$

The system (5) also represents the common price/value model (2) used to obtain econometric models of property value (Pagourtzi et al., 2003). In econometric practice, the main goal of using regression models is to find the model itself or to confirm the significance of its individual parameters. In appraisal practice (which is based on econometrics anyway), predicting the values of particular properties is the most important.

Predicting the value Ŷ_VA of a property from outside the collected set of comparables (Y_A, A), but based on that set, requires defining the vector F_V, which comprises the evaluation marks given to the appraised property (corresponding to the list of parameters covered by X̂_A):

(7)

$$\hat{Y}_{VA} = F_V (A^T A)^{-1} A^T Y_A.$$
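The estimator (3) and the prediction (7) can be sketched numerically as follows. This is only a minimal illustration in plain Python, not the procedure used in the study: the helper names (`solve`, `fit_ols`, `predict`), the small data set and the parameter values are hypothetical, and the prices are generated without any error term so that the fitted parameters reproduce the assumed ones.

```python
def solve(m, rhs):
    """Solve m z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in m[i]] + [float(rhs[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("A^T A is singular")
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(A, Y):
    """Return X_hat = (A^T A)^-1 A^T Y, as in equation (3)."""
    cols = list(zip(*A))  # columns of A, i.e. rows of A^T
    AtA = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    AtY = [sum(u * y for u, y in zip(ci, Y)) for ci in cols]
    return solve(AtA, AtY)


def predict(F_V, X_hat):
    """Return Y_hat_VA = F_V X_hat, as in equation (7)."""
    return sum(f * x for f, x in zip(F_V, X_hat))


# Hypothetical comparables: constant column a_i0 = 1 plus two feature marks.
A = [[1, 1, 1], [1, 2, 1], [1, 1, 3], [1, 4, 2], [1, 5, 5]]
X_A = [10.0, 2.0, 3.0]  # assumed structural parameters (illustration only)
Y_A = [sum(a * x for a, x in zip(row, X_A)) for row in A]  # error-free prices

X_hat = fit_ols(A, Y_A)
Y_hat_VA = predict([1, 3, 2], X_hat)  # appraised property with marks (3, 2)
```

Solving the normal equations directly, instead of forming the inverse explicitly, is a design choice of this sketch; for an invertible A^T A it gives the same X̂_A as (3).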

Besides the value prediction itself, the accuracy of the obtained estimates and confidence in their appropriateness are essential in appraisal practice.

The statistical techniques of the least squares method (LSM) offer some ways of testing the obtained model. The useful parameters are: the residual variance S² of the model, presented in equation (8) as S_A² in order to emphasise its reference to matrix A, the coefficient of variation V (9), the variance/covariance matrix of the structural parameters (10), and the determination coefficient R² (11),

(8)

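$$S_A^2 = \frac{1}{n-k-1}(Y_A - A(A^T A)^{-1}A^T Y_A)^T (Y_A - A(A^T A)^{-1}A^T Y_A),$$

(9)

$$V = \frac{S_A}{\bar{Y}_A},$$

(10)

$$\operatorname{cov}(\hat{X}_A) = D^2(\hat{X}_A) = S_A^2 (A^T A)^{-1},$$

(11)

$$R^2 = 1 - \frac{Y_A^T Y_A - Y_A^T A (A^T A)^{-1} A^T Y_A}{Y_A^T Y_A - n\bar{Y}_A^2},$$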

where:

n is the number of observations y_A and the number of rows of Y_A,
k is the number of structural parameters x_A, and k + 1 is the number of columns of A,
Ȳ_A is the mean sale price.

The above indicators are not sufficient for valuation purposes, because they describe only the statistical quality of the model used, without judging whether this model is good enough for the valuation of a particular object (even if the object lies within the limits of the considered local market). This question can be addressed with indicators connected with the valuation object, for example with ex ante indicators such as the residual variance S_VA² of Ŷ_VA:

(12)

$$S_{VA}^2 = S_A^2 (F_V (A^T A)^{-1} F_V^T) + S_A^2$$

or with the relative prediction error:

(13)

$$V_{FP} = \left| \frac{S_{VA}}{\hat{Y}_{VA}} \right|.$$
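A small numerical sketch may help to see how the ex ante indicators (8), (12) and (13) behave. It assumes the simplest possible case, a model with a constant and a single feature (k = 1), so that (A^T A)^{-1} has a closed 2 × 2 form; the function name `ex_ante_error`, the marks and the prices are hypothetical.

```python
import math


def ex_ante_error(a, y, f):
    """Return (S2_A, S2_VA, V_FP) for the model y = x0 + x1 * a and an
    appraised property whose single feature mark is f."""
    n, k = len(a), 1
    sa, saa = sum(a), sum(v * v for v in a)
    det = n * saa - sa * sa
    # Closed-form inverse of A^T A = [[n, sa], [sa, saa]].
    inv = [[saa / det, -sa / det], [-sa / det, n / det]]
    sy = sum(y)
    say = sum(u * v for u, v in zip(a, y))
    x0 = inv[0][0] * sy + inv[0][1] * say
    x1 = inv[1][0] * sy + inv[1][1] * say
    rss = sum((yi - (x0 + x1 * ai)) ** 2 for ai, yi in zip(a, y))
    s2_a = rss / (n - k - 1)                      # residual variance (8)
    F_V = [1.0, f]
    q = sum(F_V[i] * inv[i][j] * F_V[j] for i in range(2) for j in range(2))
    s2_va = s2_a * q + s2_a                       # prediction variance (12)
    y_hat = x0 + x1 * f                           # prediction (7)
    v_fp = abs(math.sqrt(s2_va) / y_hat)          # relative error (13)
    return s2_a, s2_va, v_fp


marks = [1, 2, 3, 4, 5]                 # hypothetical feature marks
prices = [2.1, 3.9, 6.2, 7.8, 10.0]     # hypothetical unit prices
near = ex_ante_error(marks, prices, 3)  # appraised mark equals the mean mark
far = ex_ante_error(marks, prices, 5)   # appraised mark at the edge of the data
```

In this toy data set the prediction variance S_VA² is smallest when the appraised mark equals the mean of the comparables' marks and grows as the mark moves towards the edge of the data, which is the ex ante counterpart of the dissimilarity effect discussed in this paper.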

However, the strongest verification of any value prediction is the ex post comparison of the obtained result with the independent price eventually set by the market. Therefore, ‘the accuracy’ mentioned in the title will be understood in further considerations as the ex post prediction error Q, defined as the difference between the ‘true’ or ‘real’ value Y_VR and the predicted value Ŷ_VA (14), or as its relative form V_FR (15):

(14)

$$Q = Y_{VR} - \hat{Y}_{VA},$$

(15)

$$V_{FR} = \left| \frac{Y_{VR} - \hat{Y}_{VA}}{Y_{VR}} \right|.$$
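As a simple arithmetic illustration of (14) and (15), with hypothetical numbers and a hypothetical helper name `ex_post_error`:

```python
def ex_post_error(y_vr, y_hat_va):
    """Return the ex post error Q (14) and its relative form V_FR (15)."""
    q = y_vr - y_hat_va
    v_fr = abs(q / y_vr)
    return q, v_fr


# Hypothetical 'true' value 200.0 and predicted value 190.0.
q, v_fr = ex_post_error(200.0, 190.0)
```

Here the prediction falls 10 units below the reference value, which corresponds to a relative ex post error of 5%.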

3. The research problem outline

The goal of much statistical modelling is to investigate the relationship between a criterion (dependent) variable and a set of predictor (independent) variables. For many studies that model property prices on a selected local market, however, the primary objective remains to predict the price of the next element of the market. The more similar the properties defining the local market are to the valued element (the appraisal subject), the more convincing the prediction. Such an evaluation might be weak from a statistical point of view. But when the valued element is close to the nearest sold properties (in terms of their feature evaluations), their sale prices are the best data for predicting anything among them.

Therefore, in this type of real estate modelling application, it is important to emphasize the mutual similarity factor. Its presence in the price model was demonstrated in Zyga (2016, 2019). This study, in turn, focuses on the importance of the similarity criterion for the construction of the price model. The valued object was taken as the benchmark for assessing this similarity.

To make the problem easier to describe, a dissimilarity factor is used in the next steps. The difference (dissimilarity) d_i,j between the element a_i,j of the set A and the corresponding element f_{V_1,j} of the set F_V describing the subject of the valuation is defined as:

(16)

$$d_{i,j} = a_{i,j} - f_{V_1,j},$$

where:

a_i,j is the (i, j) element of matrix A,
i = 1, 2, …, n, i ∈ ℕ, where n is the number of observations y_A,
j = 0, 1, 2, …, k, j ∈ ℕ, where k is the number of structural parameters x_A,
a_i0 = 1,
f_{V_1,j} is the j-th element of the single-row matrix of the F_V pattern.

For the whole set of data, we obtain respectively:

(17)

$$D_A = A - [1] F_V,$$

where:

D_A is the matrix of dissimilarity (the differences matrix) between the sold properties, whose evaluations are collected in matrix A, and the appraised property; its dimensions (rows × columns) are (n × k);
F_V is the vector comprising the evaluation marks given to the appraised property; it is a single-row matrix with dimensions (1 × k);
[1] is the vector of elements equal to 1, with dimensions (rows × columns) (n × 1).

A small example of constructing the matrix D_A is shown in equation (18), which juxtaposes the assumed matrix A, the replicated single-row pattern matrix F_V (F_V = [4 1 5 1]), and the dissimilarity matrix D_A. The matrices are set in the same order as in (17).

(18)

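$$
\underbrace{\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ -2 & 0 & -4 & 0 \\ -2 & 0 & 0 & 4 \\ -2 & 4 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 4 & -4 & 4 \\ 1 & 0 & -4 & 3 \\ 1 & 0 & -3 & 4 \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}}_{\text{dissimilarity matrix } D_A}
=
\underbrace{\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ 2 & 1 & 1 & 1 \\ 2 & 1 & 5 & 5 \\ 2 & 5 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 5 & 1 & 5 \\ 5 & 1 & 1 & 4 \\ 5 & 1 & 2 & 5 \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}}_{\text{assumed matrix } A}
-
\underbrace{\begin{bmatrix} \vdots & \vdots & \vdots & \vdots \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ 4 & 1 & 5 & 1 \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix}}_{\text{replicated } F_V \text{ pattern}}
$$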
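The construction (17) amounts to subtracting the pattern F_V from every row of A. A minimal sketch in plain Python, using F_V = [4 1 5 1] and several rows of A taken from example (18); the function name `dissimilarity_matrix` is introduced here only for illustration:

```python
def dissimilarity_matrix(A, F_V):
    """Return D_A = A - [1] F_V (17): each row of A minus the pattern F_V."""
    return [[a - f for a, f in zip(row, F_V)] for row in A]


F_V = [4, 1, 5, 1]  # evaluation marks of the appraised property
A = [
    [2, 1, 1, 1],
    [2, 1, 5, 5],
    [2, 5, 5, 1],
    [4, 1, 5, 1],   # identical to the appraised property
    [4, 5, 1, 5],
    [5, 1, 1, 4],
    [5, 1, 2, 5],
]
D_A = dissimilarity_matrix(A, F_V)
```

A row of zeros in D_A marks a comparable whose feature marks are identical to those of the appraised property.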

Substituting (17) into (7) gives a solution for Ŷ_VA that reveals the involvement of the dissimilarity factor in the prediction process (Zyga, 2019):

(19)

$$\hat{Y}_{VA} = F_V [([1]F_V + D_A)^T ([1]F_V + D_A)]^{-1} ([1]F_V + D_A)^T Y_A.$$

Equation (19) can easily be transformed back into equation (7), because the modification (19) does not bias the solution in any way. It shows, however, how the dissimilarity factor works within this structure and proves that a connection between the dissimilarity and the final result of estimation in the LS method really exists. The matrix D_A makes it easy to select, from the whole collected set of comparables (Y_A, A), those sale prices y_{A_i}, with the corresponding subsets of market feature evaluations a_ij (recorded in the corresponding rows of matrix A), whose dissimilarity indicators (16) are the lowest. This step creates a smaller set of selected comparables (Y_A, A). Reducing the number of rows of matrix A inevitably limits the variability of the variables represented by the individual columns of A. This, in turn, forces the rejection from X_A of those variables that are too poorly correlated with the new vector Y_A. This reduction is the last step in modifying the selected comparables and gives the final shape of the data set, still named (Y_A, A), but with a reduced number of observations n and a new number k of sufficiently significant variables.

The reduction of the initially collected set of comparables (Y_A, A) must be controlled by assumed criterion indicators. The dissimilarity d_i,j between the j-th component of the description of the i-th comparable object and the corresponding element of the benchmark evaluation vector F_V can be accepted when

(20)

$$d_{i,j} \le d_{\max j},$$

(21)

$$d_{\max j} = K_{\max} (\max(a_{ij}) - \min(a_{ij})),$$

where:

i = 1, 2, …, n_initial, i ∈ ℕ, where n_initial is the number of initially collected observations y_A,
j = 0, 1, 2, …, k_initial, j ∈ ℕ, where k_initial is the number of initially defined structural parameters x_A,
K_max ∈ (0, 1〉 is a number assumed by the tester for subsequent tests.

To calculate d_maxj, a criterion value K_max, assumed by the tester for each test, is needed. It can take values greater than 0 up to 1 and sets the upper limit of the accepted dissimilarities d_i,j.
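The selection rule (20) with the limit (21) can be sketched as a simple row filter. The sketch below makes one assumption that is not stated in (20): the dissimilarity is compared by its absolute value, so that differences in both directions are limited. The function name `select_comparables` is hypothetical, and the marks are rows taken from example (18).

```python
def select_comparables(A, F_V, k_max):
    """Return indices of the rows of A that satisfy (20) with the limit (21).

    Assumption of this sketch: |d_ij| is compared with d_maxj.
    """
    columns = list(zip(*A))
    d_max = [k_max * (max(c) - min(c)) for c in columns]   # (21)
    return [
        i for i, row in enumerate(A)
        if all(abs(a - f) <= limit for a, f, limit in zip(row, F_V, d_max))
    ]


F_V = [4, 1, 5, 1]
A = [
    [2, 1, 1, 1],
    [2, 1, 5, 5],
    [2, 5, 5, 1],
    [4, 1, 5, 1],
    [4, 5, 1, 5],
    [5, 1, 1, 4],
    [5, 1, 2, 5],
]
all_rows = select_comparables(A, F_V, 1.0)    # weakest criterion
strict_rows = select_comparables(A, F_V, 0.5)  # sharper criterion
```

With K_max = 1 every row of this small set is accepted, while K_max = 0.5 leaves only the row identical to the appraised property, which illustrates how quickly a sharp criterion can shrink the set of comparables.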

In a similar way, the correlation between the modified vectors A^〈j〉 and the modified Y_A must be checked afterwards. For this purpose, another criterion indicator is used. This indicator, K_corrmin, is a declared minimum limit of the accepted correlation r_{Y_A,A_j}, so:

(22)

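$$|r_{Y_A, A_j}| = \left| \frac{\operatorname{cov}(Y_A, A^{\langle j \rangle})}{\sigma_{Y_A} \sigma_{A^{\langle j \rangle}}} \right| \ge K_{corrmin},$$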

where:

cov(Y_A, A^〈j〉) is the covariance between the indicated variables,
σ is the standard deviation of the variable indicated in its subscript,
K_corrmin ∈ (0, 1〉 is a number assumed by the tester for subsequent tests.

Technically, K_corrmin can take values greater than 0 up to 1, but for practical use it should be greater than 0.2 or even 0.3. On the other hand, it should not be greater than 0.5. In real estate conditions, forcing solutions based on strong correlations (K_corrmin greater than 0.5) can leave the reduced matrix A empty, with no real answer about the significance of any predictor variable. In the tests carried out and described below, K_corrmin values were assumed in the range from 0.05 to 0.6 in order to test the marginal results as well.
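The correlation criterion (22) can be sketched as a column filter. The helper names (`pearson`, `significant_columns`) and the data are hypothetical; treating a constant column as uncorrelated is an assumption of this sketch, since the correlation coefficient is undefined when a standard deviation is zero.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (su * sv)


def significant_columns(A, Y, k_corrmin):
    """Return indices j of the columns of A with |r| >= K_corrmin (22)."""
    return [j for j, col in enumerate(zip(*A))
            if abs(pearson(col, Y)) >= k_corrmin]


# Hypothetical reduced set: the second column has lost all variability.
A = [[1, 3, 2], [2, 3, 5], [3, 3, 1], [4, 3, 4], [5, 3, 3]]
Y = [10.0, 20.0, 30.0, 40.0, 50.0]
kept_medium = significant_columns(A, Y, 0.30)
kept_weak = significant_columns(A, Y, 0.05)
```

In this example the third column has a correlation of about 0.1 with the prices, so it passes the very weak criterion (0.05) but is rejected at K_corrmin = 0.30; the constant second column is rejected in both cases.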

4. Simulation experiment

The effect of the new parameter, the K_max criterion, on the selection from the set of initially collected records (Y_A, A), and on the accuracy of estimates made with such (limited) subsets, was tested in the experiments described below. In the individual experiments, the reduced sets of comparable properties, collected separately for each final appraisal, included properties satisfying condition (20) with respect to (21).

The accuracy of the obtained estimations was evaluated by the difference Q (14) between the response value Ŷ_VA and its reference (‘true’) value Y_VR, which was modelled directly from the model parameter values X_A assumed at the start of the simulation:

(23)

$$Y_{VR} = A X_A.$$

To illustrate the possible variability of the price estimator results on the considered property market, an artificial (simulated) picture of a local market was used. The simulated market (demonstrated earlier in Zyga, 2019) was represented by 104 real estate properties randomly drawn from 625 = 5⁴ possible records (4 features, each described by marks from 1 to 5). Each simulated object had its own set of feature marks (without any detailed description). Although real estate attributes are usually qualitative variables (measured on an ordinal scale), in the experiment the simulated marks can be understood as measurements of the attributes made on an interval scale.

Each simulated property was priced as Y_A = AX_A and disturbed with ε_A, according to (2). The errors ε_A were specified randomly with an assumed standard deviation: E(ER) = 0.000, max(ER) = 0.593, min(ER) = −0.620, std(ER) = 0.200, ε_{A_i} = (1 + ER_i)AX_A.

True (modelled) values of the above properties were calculated simultaneously as (23).
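The construction of such a simulated market can be sketched as follows. Several elements are assumptions of this sketch and are not given in the text: the parameter values in `X_A`, the normal distribution of the relative errors ER, drawing the records without repetition, and reading the error specification as a relative disturbance of the modelled price. The function name `simulate_market` is hypothetical.

```python
import itertools
import random


def simulate_market(x_a, n_properties=104, sd=0.2, seed=0):
    """Draw feature marks, model 'true' values (23) and disturbed prices."""
    rng = random.Random(seed)
    # All 5**4 = 625 possible records of 4 features marked from 1 to 5.
    records = list(itertools.product(range(1, 6), repeat=4))
    marks = rng.sample(records, n_properties)      # assumed: no repetition
    A = [(1,) + m for m in marks]                  # constant column a_i0 = 1
    y_true = [sum(a * x for a, x in zip(row, x_a)) for row in A]
    # Assumed reading: observed price = (1 + ER_i) * modelled price.
    y_obs = [(1.0 + rng.gauss(0.0, sd)) * y for y in y_true]
    return A, y_true, y_obs


X_A = [100.0, 10.0, 20.0, 5.0, 15.0]  # hypothetical structural parameters
A, Y_VR, Y_A = simulate_market(X_A)
```

Fixing the seed makes the drawn market reproducible, which is convenient when the same set of 104 properties has to be appraised repeatedly under different criteria.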

With this data, each of the 104 simulated properties was appraised, and not just once. Each time, the simulated market was analysed to check whether it was similar enough to the property being appraised. Each appraisal was performed several times, once for each assumed pair of values of K_max and K_corrmin. At least 16014 single experiments were performed.

K_max was the criterion factor determining the accepted maximum dissimilarity, represented by the distance d_maxj (21) between the compared properties (each comparable and the appraised one). In each appraisal process, K_max limited the set of accepted (sufficiently similar) comparables passed to the next steps of the process, according to rule (20). K_max varied from 0.25 to 1.00. For technical reasons and because of the expected number of tests, the K_max values were assumed at a fixed interval of 0.05. The results themselves fell into three ranges of values, as can be seen in Figure 1. This provides indirect evidence that, with the limited variability of the values a_ij, there was no need to use a smaller step of the tested K_max. Values K_max ∈ 〈0.25, 0.45〉 created a sharp dissimilarity criterion, which forced the selection of very similar properties only (almost the same as the appraised one). At the other end of the range, values K_max ∈ 〈0.75, 0.95〉 allowed almost all properties from the prepared set to be included in the calculations. The middle interval, K_max ∈ 〈0.50, 0.70〉, allowed typically similar properties to be taken into account. At the edge of the range, K_max = 1 made it possible to take into account the whole prepared set of 104 ‘sold’ properties. The only cleaning of the initial set in this case was the rejection of objects with outlying prices.

Figure 1

Chart of the total, relative assessment (TRA) of the estimation accuracy according to the imposed criteria parameters K_max and K_corrmin (22)

https://rgg.edu.pl/f/fulltexts/173254/j_rgg-2020-0002_fig_001_min.jpg

Next, the selection of significant variables was performed in each calculation. Each value of K_max involved a separate calculation with the K_corrmin parameter varying from 0.05 to 0.6. K_corrmin was the criterion factor determining the accepted minimum significance. K_corrmin ∈ 〈0.05, 0.25〉 gave a very weak criterion for the rejection of variables (a case formally unacceptable from a statistical point of view) and allowed the algorithm to accept all or almost all initially prepared variables. A weak correlation requirement between the vectors A^〈j〉 and Y_A (K_corrmin ∈ 〈0.30, 0.35〉) sometimes caused the rejection of some variables as not significant, while for K_corrmin ∈ 〈0.40, 0.60〉 most of the variables were rejected. Therefore, calculations with K_corrmin > 0.60 (the most desirable from a technical point of view) offered no advantage in the performed research.

Because of the huge number of single calculations and final results, it was decided to present them in an aggregated way. Each single experiment resulted in the estimate Ŷ_VA, obtained with (7), and the corresponding reference (‘true’) value Y_VR (23). This led to the calculation of the difference Q = Y_VR − Ŷ_VA (14), which shows how accurate the result Ŷ_VA is. Because it was difficult to show all 16014 results, they were summed in a matrix indexed by the values of K_max and K_corrmin. For easier understanding, each collected sum is expressed as a quotient relative to the minimum of the collected sums (24):

(24)

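$$\frac{\sum_{K_{\max}=0.25}^{1} \sum_{K_{corrmin}=0.05}^{0.6} (Q_{K_{\max}, K_{corrmin}})}{\min\left[\sum_{K_{\max}=0.25}^{1} \sum_{K_{corrmin}=0.05}^{0.6} (Q_{K_{\max}, K_{corrmin}})\right]}$$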

The matrix of such proportional and aggregated results is shown in Table 1 and Figure 1 respectively.
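The aggregation described by (24) can be sketched as follows. The sketch assumes that the errors Q are summed by their absolute values within each pair (K_max, K_corrmin), which the text does not state explicitly; the function names and the sample results are hypothetical.

```python
def aggregate_errors(results):
    """Sum |Q| for every (K_max, K_corrmin) pair.

    results: iterable of (k_max, k_corrmin, q) tuples, one per experiment.
    Assumption of this sketch: absolute values of Q are summed.
    """
    sums = {}
    for k_max, k_corrmin, q in results:
        key = (k_max, k_corrmin)
        sums[key] = sums.get(key, 0.0) + abs(q)
    return sums


def relative_assessment(sums):
    """Divide every collected sum by the smallest one, as in (24)."""
    smallest = min(sums.values())
    return {key: value / smallest for key, value in sums.items()}


# Hypothetical single-experiment results (K_max, K_corrmin, Q).
results = [
    (1.00, 0.05, 2.0), (1.00, 0.05, -2.0),
    (0.60, 0.35, 3.0), (0.60, 0.35, -3.0),
]
tra = relative_assessment(aggregate_errors(results))
```

The configuration with the smallest aggregated error receives the value 1, and every other value states how many times larger the aggregated error of its configuration is.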

Table 1

Total, relative assessment (TRA) of the estimation accuracy according to the imposed criteria parameters K_max and K_corrmin (22)

K_max \ K_corrmin	0.05	0.10	0.15	0.20	0.25	0.30	0.35	0.40	0.45	0.50	0.55	0.60
0.25								1.8740	1.8476	1.7331	1.6677	1.6501
0.30								1.8740	1.8476	1.7331	1.6677	1.6501
0.35								1.8740	1.8476	1.7331	1.6677	1.6501
0.40								1.8740	1.8476	1.7331	1.6677	1.6501
0.45								1.8740	1.8476	1.7331	1.6677	1.6501
0.50	1.7324	1.7320	1.6219	1.6612	1.6077	1.4979	1.4698	1.4746	1.4764	1.4721	1.4797	1.4730
0.55	1.7324	1.7320	1.6219	1.6612	1.6077	1.4979	1.4698	1.4746	1.4764	1.4721	1.4797	1.4730
0.60	1.7324	1.7320	1.6219	1.6612	1.6077	1.4979	1.4672	1.4746	1.4764	1.4721	1.4797	1.4730
0.65	1.7324	1.7320	1.6219	1.6612	1.6077	1.4979	1.4698	1.4746	1.4764	1.4721	1.4797	1.4730
0.70	1.7324	1.7320	1.6219	1.6612	1.6077	1.4979	1.4698	1.4746	1.4764	1.4721	1.4797	1.4730
0.75	1.1612	1.3553	1.3024	1.2985	1.4473	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744
0.80	1.1612	1.3553	1.3024	1.2985	1.4473	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744
0.85	1.1612	1.3553	1.3024	1.2985	1.4473	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744
0.90	1.1612	1.3553	1.3024	1.2985	1.4473	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744
0.95	1.1612	1.3553	1.3024	1.2985	1.4473	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744	1.5744
1.00	1.0000	1.0913	1.0913	1.0913	1.6802	1.6802	1.6802	1.6802	1.6802	1.6802	1.6802	1.6802

5. Experiment results and conclusions

It was shown that, nominally, the best aggregated accuracy result (proportional cumulated accuracy factor equal to 1) is obtained with the configuration {K_max = 1 | K_corrmin = 0.05}. The next smallest aggregated results occurred for the configuration {K_max = 1 | K_corrmin ∈ 〈0.10, 0.20〉}, where the result is 1.0913, and for {K_max ∈ 〈0.75, 0.95〉 | K_corrmin = 0.05}, with a result equal to 1.1612. Within the range of coefficients discussed above, one can also find another slight local minimum (1.2985). All of this suggests that all (or almost all) of the data initially collected in the set (Y_A, A) can be used in each discussed calculation. Moreover, the numbers in Table 1, as well as the chart, show that when K_max is greater than 0.7, the stronger the correlation demand, the worse the value estimations obtained. In other words, this conclusion could mean that there is no need to consider either similarity issues or the significance of the independent variables. But this is not true. Setting this unacceptable proposal aside, one can find that different conclusions are to be drawn for the remaining configurations of K_max and K_corrmin. For the ranges K_max ∈ 〈0.25, 0.45〉 and K_max ∈ 〈0.50, 0.70〉, the relationship between the proportional, aggregated results of estimation accuracy and the criterion factor K_corrmin indicates the opposite tendency: better accuracy is obtained as the correlation demand rises.

The most interesting result of the above investigation is that when the required level of correlation between the vectors A^〈j〉 and Y_A is medium or strong (K_corrmin ≥ 0.30), the similarity issue starts to be significant. Note that in each column of Table 1 with K_corrmin ≥ 0.30, or in each corresponding cross-section (Figure 2) of the surface chart in Figure 1, the total, relative assessment (TRA) of accuracy is the lowest for the medium range of K_max (K_max ∈ 〈0.50, 0.70〉). Moreover, within this range it has its local minimum, equal to 1.4672, for K_max = 0.60 and K_corrmin = 0.35, which indicates that the most accurate estimations (in the sense of (14)) were obtained when the dissimilarity criterion (21) was strong enough (K_max = 0.60) and the required correlation level of the independent variables was in the medium range. Each extreme demand on the similarity of comparables or on the significance of descriptive variables acted against the effectiveness of the estimation process as well as against the estimation accuracy.

Figure 2

The cross-sections of the surface chart of TRA at selected lines of K_corrmin. Notice that for K_corrmin < 0.40 no numerical results were obtained.

https://rgg.edu.pl/f/fulltexts/173254/j_rgg-2020-0002_fig_002_min.jpg

These considerations and the results of the experiment show that, in an estimation process based on a linear price model and solved by the LS method, there are some conditions, not considered so far, which may affect the prediction relevance of the endogenous variable. The essence of these conditions lies in the relationship between the characteristics of the objects taken into the analysis. In the conducted experiment, the role of the endogenous variable was assigned to the unit price of a hypothetical property. The starting point for the considerations was the real estate market, since prediction quality and reliability on that market are specifically conditioned by the question of similarity. The studies carried out demonstrate that the link between the similarity (or dissimilarity) of the sold properties used as comparables and the valued property affects the accuracy of the resulting estimations. The observations made show that there is a niche worthy of further research.

eISSN:	2391-8152
ISSN:	2391-8365