What a correlation coefficient of 0.7 means. What does the concept of correlation mean in simple terms? The use of software in correlation analysis

Stage 3. Finding the relationship between the data

Linear Correlation

The last stage of the task of studying relationships between phenomena is assessing the tightness of the connection using correlation indicators. This stage is very important for identifying dependencies between factor and resultant signs and, consequently, for the possibility of diagnosing and predicting the phenomenon under study.

Diagnosis (from the Greek diagnosis, "recognition") is the determination of the essence and features of the state of an object or phenomenon on the basis of its comprehensive study.

Forecast (from the Greek prognosis, "foresight, prediction") is any specific prediction or judgment about the state of a phenomenon in the future (a weather forecast, an election outcome, etc.). A forecast is a scientifically substantiated hypothesis about the probable future state of the system, object or phenomenon under study and the indicators characterizing this state. Forecasting is the development of a forecast: special scientific study of the specific prospects for the development of a phenomenon.

Recall the definition of correlation:

Correlation is a dependence between random variables, expressed in the fact that the distribution of one variable depends on the value of the other variable.

Correlation is observed not only between quantitative but also qualitative features. There are various ways and indicators for assessing the closeness of ties. We will focus only on the linear pair correlation coefficient, which is used when there is a linear relationship between random variables. In practice it often becomes necessary to determine the level of connection between random variables of unequal dimensions, so it is desirable to have some dimensionless characteristic of this connection. Such a characteristic (measure of connection) is the linear correlation coefficient $r_{xy}$, which is determined by the formula

$$ r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x\,\sigma_y}, $$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Denoting $u_i = x_i - \bar{x}$ and $v_i = y_i - \bar{y}$, you can get the following expression for calculating the correlation coefficient:

$$ r_{xy} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2 \cdot \sum_{i=1}^{n} v_i^2}}. $$

If we introduce the concept of the normalized deviation, which expresses the deviation of the correlated values from the mean in fractions of the standard deviation,

$$ t_x = \frac{x - \bar{x}}{\sigma_x}, \qquad t_y = \frac{y - \bar{y}}{\sigma_y}, $$

then the expression for the correlation coefficient takes the form

$$ r_{xy} = \frac{1}{n}\sum_{i=1}^{n} t_{x_i}\, t_{y_i}. $$

If the correlation coefficient is calculated from the totals of the initial random variables in the calculation table, it can be computed by the formula

$$ r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}. $$
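As an illustration, the deviation form and the totals form of the coefficient can be checked against each other in a short Python sketch (the sample data here are invented purely for illustration):

```python
import math

# Hypothetical sample data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviation form: r = sum(u*v) / sqrt(sum(u^2) * sum(v^2))
u = [xi - mean_x for xi in x]
v = [yi - mean_y for yi in y]
r_dev = sum(ui * vi for ui, vi in zip(u, v)) / math.sqrt(
    sum(ui**2 for ui in u) * sum(vi**2 for vi in v))

# Computational form from raw totals
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi**2 for xi in x)
sy2 = sum(yi**2 for yi in y)
r_tot = (n * sxy - sx * sy) / math.sqrt(
    (n * sx2 - sx**2) * (n * sy2 - sy**2))

print(round(r_dev, 6), round(r_tot, 6))  # the two forms agree
```

Both formulas give the same value, since the totals form is just the deviation form with the sums expanded.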

Properties of the linear correlation coefficient:

1). The correlation coefficient is a dimensionless quantity.

2). $|r| \le 1$, i.e. $-1 \le r_{xy} \le 1$.

3). $r_{aX,\,bY} = r_{XY}$, $a, b = \mathrm{const} > 0$: the value of the correlation coefficient will not change if all values of the random variables X and Y are multiplied (or divided) by a positive constant.

4). $r_{X+a,\,Y+b} = r_{XY}$, $a, b = \mathrm{const}$: the value of the correlation coefficient will not change if all values of the random variables X and Y are increased (or decreased) by a constant.

5). There is a relationship between the correlation coefficient and the regression coefficient:

$$ b_{yx} = r_{xy}\,\frac{\sigma_y}{\sigma_x}. $$

The values of the correlation coefficient can be interpreted using quantitative criteria for assessing the closeness of the connection: a connection is conventionally considered weak for |r| < 0.3, medium for 0.3 ≤ |r| < 0.7, and strong for |r| ≥ 0.7.

For prognostic purposes, values with |r| > 0.7 are usually used.

The correlation coefficient allows us to conclude that there is a linear relationship between two random variables, but it does not indicate which of the variables causes a change in the other. In fact, a relationship between two random variables can exist without any causal relationship between the variables themselves, because a change in both random variables can be caused by the change (influence) of a third.

The correlation coefficient $r_{xy}$ is symmetric with respect to the random variables X and Y under consideration. This means that for determining the correlation coefficient it is completely immaterial which of the quantities is independent and which is dependent.

Significance of the correlation coefficient

Even for independent quantities, the correlation coefficient may turn out to be non-zero due to random scatter of measurement results or due to a small sample of random variables. Therefore, the significance of the correlation coefficient should be checked.

The significance of the linear correlation coefficient is tested using Student's t-test:

$$ t = \frac{|r_{xy}|\sqrt{n-2}}{\sqrt{1 - r_{xy}^2}}. $$

If $t > t_{cr}(P, n-2)$, then the linear correlation coefficient is significant, and hence the statistical relationship between X and Y is also significant.

For the convenience of calculations, tables of the confidence limits of the correlation coefficient have been created for different numbers of degrees of freedom f = n − 2 (two-tailed test) and different significance levels α = 0.1, 0.05, 0.01 and 0.001. The correlation is considered significant if the calculated correlation coefficient exceeds the confidence limit of the correlation coefficient for the given f and α.

For large n and α = 0.01, the confidence limit of the correlation coefficient can be calculated using the approximate formula

$$ r_{cr} \approx \frac{2.58}{\sqrt{n}}. $$
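The t-test for the significance of r can be sketched in Python (the helper function and the sample values r = 0.6, n = 30 are illustrative assumptions; the critical value is taken from the Student distribution via SciPy rather than from a printed table):

```python
import math
from scipy.stats import t as t_dist

def r_significant(r, n, alpha=0.05):
    """Test significance of a linear correlation coefficient via Student's t-test."""
    # t = |r| * sqrt(n - 2) / sqrt(1 - r^2)
    t_calc = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    # two-tailed critical value with f = n - 2 degrees of freedom
    t_crit = t_dist.ppf(1 - alpha / 2, n - 2)
    return t_calc, t_crit, t_calc > t_crit

# Hypothetical example: r = 0.6 obtained from n = 30 observations
t_calc, t_crit, significant = r_significant(0.6, 30)
print(significant)
```

Here t_calc is about 3.97, well above the critical value of about 2.05, so the coefficient would be judged significant at α = 0.05.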

In Chapter 4, we looked at basic univariate descriptive statistics—measures of central tendency and variability—that are used to describe a single variable. In this chapter, we will look at the main correlation coefficients.

A correlation coefficient is a bivariate descriptive statistic, a quantitative measure of the relationship (joint variability) of two variables.

The history of the development and application of correlation coefficients for the study of relationships actually began simultaneously with the emergence of the measurement approach to the study of individual differences, in the 1870s and 1880s. The pioneer in measuring human abilities, and the author of the very term "correlation coefficient", was Francis Galton; the most popular correlation coefficients were developed by his follower Karl Pearson. Since then, the study of relationships using correlation coefficients has been one of the most popular activities in psychology.

To date, a great many different correlation coefficients have been developed, and hundreds of books are devoted to the problem of measuring relationships with their help. Therefore, without claiming completeness, we will consider only the most important, truly indispensable measures of connection: r-Pearson, ρ-Spearman and τ-Kendall. Their common feature is that they reflect the relationship between two features measured on a quantitative scale, rank or metric.

Generally speaking, any empirical study focuses on the study of the relationship of two or more variables.

EXAMPLES

Let us give two examples of studying the influence of the demonstration of scenes of violence on TV on the aggressiveness of adolescents. 1. The relationship of two variables measured on a quantitative (rank or metric) scale is studied: 1) "time spent watching television programs with violence"; 2) "aggressiveness".

τ is read "tau-Kendall".


CHAPTER 6. CORRELATION COEFFICIENTS

2. We study the difference in aggressiveness between two or more groups of adolescents that differ in the duration of watching TV shows demonstrating scenes of violence.

In the second example, the study of differences can be represented as the study of the relationship between two variables, one of which is nominative (the duration of TV viewing). Correlation coefficients have been developed for this situation as well.

Any study can be reduced to the study of correlations, since a variety of correlation coefficients have been invented for almost any research situation. But in what follows, we will distinguish between two classes of problems:

  • the study of correlations, when both variables are presented on a numerical scale;

  • the study of differences, when at least one of the two variables is presented on a nominative scale.


This division also corresponds to the logic of popular statistical computer programs, in which the Correlations menu offers three coefficients (r-Pearson, ρ-Spearman and τ-Kendall), while group-comparison methods are offered for solving other research problems.

THE CONCEPT OF CORRELATION

Relationships in the language of mathematics are usually described by functions, which are graphically represented as lines. Fig. 6.1 shows several graphs of functions. If a change in one variable by one unit always results in a change in the other variable by the same amount, the function is linear (its graph is a straight line); any other relationship is non-linear. If an increase in one variable is associated with an increase in the other, the relationship is positive (direct); if an increase in one variable is associated with a decrease in the other, the relationship is negative (inverse). If the direction of change of one variable does not reverse as the other variable increases (or decreases), such a function is monotonic; otherwise the function is called non-monotonic.

Functional relationships like those shown in Fig. 6.1 are idealizations. Their peculiarity is that one value of one variable corresponds to a strictly defined value of the other variable. For example, such is the relationship of two physical variables, weight and body length (linear, positive). However, even in physical experiments the empirical relationship will differ from the functional one due to unaccounted-for or unknown causes: fluctuations in the composition of the material, measurement errors, etc.

Fig. 6.1. Examples of graphs of frequently occurring functions

In psychology, as in many other sciences, when studying the relationship of features the researcher inevitably loses sight of many possible causes of the variability of these features. The result is that even a functional relationship between variables that exists in reality appears empirically as probabilistic (stochastic): the same value of one variable corresponds to a distribution of different values of the other variable (and vice versa). The simplest example is the relationship between the height and weight of people. Empirical results of studying these two features will, of course, show a positive relationship. But it is easy to guess that it will differ from a strict, linear, positive, ideal mathematical function, even with all the researcher's tricks to take into account the slenderness or stoutness of the subjects. (It is unlikely that anyone would deny, on this basis, the existence of a strict functional relationship between the length and weight of the body.)

So, in psychology, as in many other sciences, the functional interconnection of phenomena can be revealed empirically only as a probabilistic connection of the corresponding features. A visual representation of the nature of a probabilistic connection is given by the scatter diagram: a graph whose axes correspond to the values of the two variables and on which each subject is a point (Fig. 6.2). Correlation coefficients are used as a numerical characteristic of a probabilistic connection.

In scientific research, it often becomes necessary to find a relationship between resultant and factor variables (the yield of a crop and the amount of precipitation, the height and weight of a person in groups homogeneous by sex and age, pulse rate and body temperature, etc.).

Resultant signs are those that change under the influence of other signs; factor signs are those that contribute to the change of the signs associated with them.

The concept of correlation analysis

Based on the above, we can say that correlation analysis is a method used to test the hypothesis of a statistically significant relationship between two or more variables when the researcher can measure them but not change them.

There are other definitions of the concept under consideration. Correlation analysis is a processing method that examines the correlation coefficients between variables; the correlation coefficients between one pair or multiple pairs of features are compared in order to establish statistical relationships between them. Correlation analysis is a method for studying the statistical dependence between random variables, a dependence that need not be strictly functional in nature, in which the dynamics of one random variable leads to changes in the mathematical expectation of the other.

The concept of false correlation

When conducting a correlation analysis, it must be kept in mind that it can be carried out on any set of features, even ones that are absurd in relation to each other. Sometimes the features have no causal connection with each other at all.

In this case, one speaks of a spurious correlation.

Problems of correlation analysis

Based on the above definitions, we can formulate the following tasks of the described method: obtain information about one of the desired variables using the other; determine the closeness of the relationship between the variables under study.

Correlation analysis involves determining the relationship between the studied features, and therefore the tasks of correlation analysis can be supplemented with the following:

  • identification of factors that have the greatest impact on the resultant sign;
  • identification of previously unexplored causes of relationships;
  • building a correlation model with its parametric analysis;
  • study of the significance of communication parameters and their interval estimation.

Connection of correlation analysis with regression

The method of correlation analysis is often not limited to finding the closeness of the relationship between the studied quantities. Sometimes it is supplemented by the compilation of regression equations, which are obtained by the analysis of the same name and which describe the correlation between the resultant and factor attribute(s). Together with the analysis under consideration, this constitutes the method of correlation-regression analysis.

Conditions for using the method

Resultant indicators depend on one or more factors. The method of correlation analysis can be used when there is a large number of observations of the resultant and factor indicators, and the studied factors are quantitative and reflected in specific sources. The resultant indicators may follow the normal law, in which case the Pearson correlation coefficients are the result of the correlation analysis; if the signs do not obey this law, the Spearman rank correlation coefficient is used.

Rules for selecting factors of correlation analysis

When applying this method, it is necessary to determine the factors influencing the resultant indicators. Factors are selected so that there are causal relationships between the indicators. When creating a multifactor correlation model, only factors that have a significant impact on the resultant indicator are selected; interdependent factors with a pair correlation coefficient above 0.85 should preferably not be included in the model, nor should factors whose relationship with the resultant parameter is indirect or functional.

Results display

The results of correlation analysis can be presented in text and graphic forms. In the first case, they are presented as a correlation coefficient, in the second, as a scatterplot.

If there is no correlation between the parameters, the points on the diagram are scattered randomly. A medium degree of connection is characterized by greater order, with the points lying at roughly uniform distances from a trend line. A strong connection tends toward a straight line, and at r = ±1 the scatter plot is exactly a straight line. An inverse correlation is characterized by a graph running from the upper left to the lower right corner; a direct one, from the lower left to the upper right corner.

3D representation of a scatterplot (scattering)

In addition to the traditional 2D scatterplot presentation, a 3D graphical representation of the correlation analysis is currently used.

A scatterplot matrix is also used, which displays all pairwise plots in a single figure in matrix format. For n variables, the matrix contains n rows and n columns; the diagram at the intersection of the i-th row and j-th column is a plot of the variable Xi against Xj. Thus each row and column is one dimension, and a single cell displays the scatterplot of two dimensions.

Estimation of tightness of connection

The tightness of the correlation is determined by the correlation coefficient r: strong, r = ±0.7 to ±1; medium, r = ±0.3 to ±0.699; weak, r = 0 to ±0.299. This classification is not strict. The figure shows a slightly different scheme.
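The classification above is easy to express as a small Python helper (the thresholds are those given in the text; the function name is made up for illustration):

```python
def connection_strength(r):
    """Classify the tightness of connection by |r| (scheme from the text)."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.3:
        return "medium"
    return "weak"

print(connection_strength(0.716), connection_strength(-0.45), connection_strength(0.1))
```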

An example of applying the correlation analysis method

An interesting study was undertaken in the UK. It is devoted to the relationship of smoking with lung cancer, and was carried out by correlation analysis. This observation is presented below.

Initial data for correlation analysis (professional group; mortality):

  • Farmers, foresters and fishermen
  • Miners and quarry workers
  • Producers of gas, coke and chemicals
  • Glass and ceramic manufacturers
  • Workers in furnaces, forges, foundries and rolling mills
  • Electrical and electronics workers
  • Engineering and related professions
  • Woodworking production workers
  • Tanners
  • Textile workers
  • Workwear manufacturers
  • Workers in the food, beverage and tobacco industry
  • Paper and printing manufacturers
  • Producers of other products
  • Builders
  • Artists and decorators
  • Drivers of stationary engines, cranes, etc.
  • Workers not included elsewhere
  • Transport and communications workers
  • Warehouse workers, storekeepers, packers and filling machine workers
  • Office workers
  • Sellers
  • Sports and recreation service workers
  • Administrators and managers
  • Professionals, technicians and artists

We begin the correlation analysis. For clarity, it is better to start the solution with the graphical method, for which we will build a scatter diagram.

It shows a direct connection. However, it is difficult to draw an unambiguous conclusion on the basis of the graphical method alone, so we continue the correlation analysis. An example of calculating the correlation coefficient is shown below.

Using software tools (MS Excel, described below), we determine the correlation coefficient: it equals 0.716, which indicates a strong relationship between the studied parameters. Let us determine the statistical significance of the obtained value from the corresponding table. We subtract 2 from the 25 pairs of values, obtaining 23 degrees of freedom, and in that row of the table find the critical r for p = 0.01 (since these are medical data, a stricter threshold is applied; in other cases p = 0.05 is enough), which is 0.51. Since the calculated r exceeds the critical r, the correlation coefficient is considered statistically significant.
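This table lookup can be reproduced in Python: the critical r for f = 23 and p = 0.01 follows from the Student distribution, and it indeed comes out to about 0.51 (SciPy is used here instead of a printed table; the values r = 0.716 and n = 25 are those of the example):

```python
import math
from scipy.stats import t as t_dist

r, n, alpha = 0.716, 25, 0.01   # values from the smoking/mortality example
df = n - 2                      # 23 degrees of freedom

# two-tailed critical t, then the critical r it corresponds to
t_crit = t_dist.ppf(1 - alpha / 2, df)
r_crit = t_crit / math.sqrt(t_crit**2 + df)

print(round(r_crit, 2))   # matches the tabulated confidence limit
print(abs(r) > r_crit)    # the correlation is significant
```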

Use of software in correlation analysis

The described type of statistical data processing can be carried out with software, in particular MS Excel. Correlation analysis in Excel involves calculating the following parameters using built-in functions:

1. The correlation coefficient is determined using the CORREL(array1; array2) function, where array1 and array2 are the cell ranges of the resultant and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient; therefore, starting with Excel 2007, the PEARSON function can be used with the same arrays.

Graphical display of correlation analysis in Excel is done using the "Charts" panel with the "Scatter Plot" selection.

After specifying the initial data, we get a graph.

2. Evaluation of the significance of the pair correlation coefficient using Student's t-test. The calculated value of the t-criterion is compared with the tabulated (critical) value from the corresponding table, taking into account the given significance level and the number of degrees of freedom. The critical value is obtained using the STUDIV(probability; degrees_of_freedom) function.

3. Matrix of pair correlation coefficients. The analysis is performed using the "Data Analysis" tool, in which "Correlation" is selected. Statistical evaluation of the pair correlation coefficients is carried out by comparing their absolute values with the tabulated (critical) value. When a calculated pair correlation coefficient exceeds the critical one, we can say, at the given level of probability, that the null hypothesis of no linear relationship is rejected, i.e. the relationship is statistically significant.
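Outside Excel, the same matrix of pair correlation coefficients can be obtained in Python with pandas (the data frame below is a made-up example; DataFrame.corr computes pairwise Pearson coefficients much like the "Correlation" tool):

```python
import pandas as pd

# Hypothetical data: resultant variable y and two factor variables x1, x2
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "y":  [1.2, 2.1, 2.9, 4.2, 4.8, 6.1],
})

# Pairwise Pearson correlation coefficients for all column pairs
corr_matrix = df.corr()
print(corr_matrix.round(3))
```

The diagonal of the matrix is always 1 (each variable correlated with itself), and the matrix is symmetric, reflecting the symmetry of r noted earlier.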

Finally

The use of correlation analysis in scientific research makes it possible to determine the relationship between various factors and resultant indicators. At the same time, it should be kept in mind that a high correlation coefficient can also be obtained from an absurd pair or set of data, so this type of analysis should be carried out on a sufficiently large data array.

After obtaining the calculated value of r, it is desirable to compare it with the critical r to confirm the statistical significance of the value. Correlation analysis can be carried out manually using the formulas, or with software tools, in particular MS Excel, where a scatter diagram can also be built to visualize the relationship between the studied factors and the resultant feature.

The correlation coefficient is the degree of association between two variables. Its calculation gives an idea of whether there is a relationship between two data sets. Unlike regression, correlation does not allow predicting values; nevertheless, calculating the coefficient is an important step in preliminary statistical analysis. For example, suppose we found that the correlation coefficient between the level of foreign direct investment and GDP growth is high. This suggests that, to ensure prosperity, it is necessary to create a favorable climate specifically for foreign entrepreneurs. A conclusion that is not so obvious at first glance!

Correlation and causation

Perhaps no other area of statistics is so firmly established in our lives. The correlation coefficient is used in all areas of public knowledge. Its main danger is that its high values are often exploited to convince people and make them believe certain conclusions. However, a strong correlation does not at all indicate a causal relationship between the quantities.

Correlation coefficient: Pearson and Spearman formula

There are several main indicators characterizing the relationship between two variables. Historically the first is Pearson's linear correlation coefficient, which is taught in school. It was developed by K. Pearson and G. Yule on the basis of the work of F. Galton. This coefficient shows the linear relationship between variables measured on a metric scale. Its value always lies between -1 and 1. A negative value indicates an inverse relationship; zero means there is no linear relationship between the variables; a positive value indicates a direct relationship between the studied quantities. Spearman's rank correlation coefficient simplifies the calculations by ranking the values of the variables.
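The difference between the two coefficients is easy to see in Python with SciPy (the cubic data set is invented to show a monotone but non-linear relation):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical monotone but non-linear data: y = x**3
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi**3 for xi in x]

r_p, _ = pearsonr(x, y)    # linear correlation: high, but less than 1
rho, _ = spearmanr(x, y)   # rank correlation: 1 for any strictly monotone relation
print(round(r_p, 3), round(rho, 3))
```

Pearson's r falls short of 1 because the relation is not linear, while Spearman's ρ, computed on ranks, reaches its maximum for any strictly increasing relationship.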

Relationships between variables

Correlation helps to answer two questions. First, is the relationship between the variables positive or negative? Second, how strong is the dependence? Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that household incomes and expenses rise and fall proportionally; such a relationship is considered positive. On the contrary, when the price of a product rises, the demand for it falls; such a relationship is called negative. The values of the correlation coefficient lie between -1 and 1. Zero means that there is no relationship between the studied values. The closer the indicator is to the extreme values, the stronger the relationship (negative or positive). A coefficient between -0.1 and 0.1 indicates the absence of dependence. It must be understood that such a value indicates only the absence of a linear relationship.

Application features

The use of both indicators is subject to certain assumptions. First, the presence of a strong relationship does not establish that one value determines the other; there may well be a third quantity that determines each of them. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the studied variables. Third, it reflects exclusively linear dependence. Correlation can be used to evaluate meaningful quantitative data (e.g., barometric pressure, air temperature) rather than categories such as gender or favorite color.

Multiple correlation coefficient

Pearson and Spearman investigated the relationship between two variables. But what if there are three or even more of them? This is where the multiple correlation coefficient comes in. For example, the gross national product is affected not only by foreign direct investment but also by the monetary and fiscal policies of the state, as well as by the level of exports. The growth rate and the volume of GDP are the result of the interaction of a number of factors. However, it should be understood that the multiple correlation model rests on a number of simplifications and assumptions. First, multicollinearity between the quantities is excluded. Second, the relationship between the dependent variable and the variables that affect it is assumed to be linear.
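A minimal sketch of the multiple correlation coefficient in Python, under the stated assumptions (a linear model, no multicollinearity); the synthetic data and coefficients are made up for illustration. R is computed as the correlation between y and the fitted values of a least-squares regression:

```python
import numpy as np

# Hypothetical data: y depends on two factors x1 and x2 plus noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Fit y ~ b0 + b1*x1 + b2*x2 by least squares
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# Multiple correlation R = correlation between y and its fitted values
R = np.corrcoef(y, y_hat)[0, 1]
print(round(R, 3))  # close to 1: the two factors explain most of y
```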

Areas of use of correlation and regression analysis

This method of finding the relationship between quantities is widely used in statistics. It is most often resorted to in three main cases:

  1. For testing causal relationships between the values ​​of two variables. As a result, the researcher hopes to find a linear relationship and derive a formula that describes these relationships between quantities. Their units of measurement may be different.
  2. To check for a relationship between values. In this case, no one determines which variable is dependent. It may turn out that the value of both quantities determines some other factor.
  3. To derive an equation. In this case, you can simply substitute numbers into it and find out the values ​​of the unknown variable.

A man in search of a causal relationship

Consciousness is arranged in such a way that we absolutely need to explain the events occurring around us. A person is always looking for a connection between the picture of the world in which he lives and the information he receives. The brain often creates order out of chaos: it can easily see a causal relationship where there is none. Scientists have to learn specifically to overcome this tendency. The ability to evaluate relationships between data objectively is essential in an academic career.

Media bias

Consider how the presence of a correlation can be misinterpreted. A group of badly behaved British students were asked whether their parents smoked, and the study was then published in a newspaper. The result showed a strong correlation between parents' smoking and their children's delinquency. The professor who conducted the study even suggested putting a warning about this on cigarette packs. However, there are a number of problems with this conclusion. First, the correlation does not indicate which of the quantities is independent; it is therefore quite possible to assume that the parents' bad habit is caused by the children's disobedience. Second, it is impossible to say with certainty that both problems did not arise from some third factor, for example low family income. Finally, note the emotional aspect of the professor's initial conclusions: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his study in this way.

conclusions

Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing research errors. The problem is that this tendency lies at the very core of human consciousness, and many marketing tricks are based on it. Understanding the difference between causation and correlation allows you to analyze information rationally, both in everyday life and in a professional career.

    Correlation and regression analysis as two basic tools for the analysis of two-dimensional quantitative data.

    Characteristics and tasks of correlation-regression analysis.

    Correlation analysis. Scatter diagram. Correlation coefficient. Interpretation of the correlation coefficient.

    Regression analysis. Regression equation. The standard error of the estimate. Determination coefficient.

  1. Correlation and regression analysis as two basic tools for the analysis of two-dimensional quantitative data.

If, when the value of one variable changes, the second variable can, within certain limits and with some probability, take different values, and the characteristics of the second variable change according to statistical laws, then such a relationship is called statistical.

Correlation is a concept denoting the mutual dependence of two quantities. A correlation between two attributes or properties can arise in various ways; the main path is the causal dependence of one feature on another.

A correlation is a special case of a statistical relationship in which different values of one variable correspond to different mean values of the other. For example, as the values of the feature X change, the mean value of the function Y changes.

However, in practice, it can be difficult to identify cause and effect.

Since correlation is a statistical form of relationship, the main condition for studying it is the need for a significant amount of data on the totality of phenomena. It is usually believed that the number of observations should be several times greater (up to 10 times) than the number of factors studied or taken into account.

Regression is a term derived from the Latin regressio, "moving backward". In statistical analysis, regression analysis is used to predict a desired feature based on a known one.

A regression is a relationship between two features expressed explicitly. The regression function is the mathematical expectation of the relationship, and deviations from it are random variables.

  1. Characteristics and tasks of correlation-regression analysis.

Correlation-regression analysis consists in establishing the degree of connection (correlation analysis) and its form, i.e. the analytical expression linking the features (regression analysis). Correlation-regression analysis is multidimensional: one attribute is almost always influenced by many others.

The purpose of regression analysis is to evaluate the functional dependence of the conditional average value of the effective attribute (Y) on the factorial ones (x1, x2, ..., xk).

Basic conditions for applying the correlation-regression method

1. The presence of a sufficiently large sample population. It is believed that the number of observations should exceed the number of factors influencing the result by at least a factor of 10.

2. The presence of a qualitatively homogeneous study population.

3. The subordination of the distribution of the population according to the resultant and factor characteristics to the normal law or proximity to it. The fulfillment of this condition is due to the use of the least squares method (LSM) in calculating the correlation parameters and some others.

The main tasks of correlation and regression analysis

1. Measuring the closeness of the relationship between the effective and factor sign (features). Depending on the number of factors influencing the result, the problem is solved by calculating the correlation ratio, pair, partial, multiple correlation or determination coefficients.

2. Evaluation of the parameters of the regression equation, expressing the dependence of the average values ​​of the resulting attribute on the values ​​of the factor attribute (s). The problem is solved by calculating the regression coefficients.

3. Determination of the most important factors influencing the effective feature. The problem is solved by assessing the closeness of the relationship between factors and the result.

4. Forecasting the possible values of the resultant feature for given values of the factor features. The problem is solved by substituting the expected values of the factors into the regression equation.

The regression equation, or a statistical model of the relationship of socio-economic phenomena, is expressed by the function:

    Yx = f(x1, x2, …, xn),

    where "n" is the number of factors included in the model;

    Xi - factors influencing the result Y.

Stages of correlation and regression analysis:

    Preliminary (a priori) analysis.

    Collection of information and its primary processing.

    Building a model (regression equations). As a rule, this procedure is performed on a PC using standard programs.

    Evaluation of the tightness of feature relationships, evaluation of the regression equation and analysis of the model.

    Forecasting the development of the analyzed system according to the regression equation.

At the first stage the task of the study is formulated, the methodology for measuring indicators or collecting information is determined, the number of factors is determined, duplicate factors are excluded or linked into a rigidly deterministic system.

At the second stage the volume of units is analyzed: the population must be sufficiently large in terms of the number of units and observations, the number of factors "n" must correspond to the number of observations "N". The data must be quantitatively and qualitatively homogeneous.

At the third stage the form of connection and the type of the analytic function (parabola, hyperbola, straight line) are determined and its parameters are found.

At the fourth stage the reliability of all characteristics of the correlation and the regression equation is estimated using the Fisher or Student's reliability criterion, an economic and technological analysis of the parameters is performed.

At the fifth stage a forecast of possible result values is made from the best values of the factor characteristics included in the model. Here the best and worst values of the factors and of the result are selected.
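The third and fifth stages, fitting a straight-line model and forecasting from it, can be sketched in Python (the data and the new factor value x = 6 are invented for illustration):

```python
import numpy as np

# Hypothetical data for a straight-line regression (stage 3)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Least-squares fit of y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)

# Forecast the result for a new factor value (stage 5)
forecast = b0 + b1 * 6.0
print(round(forecast, 2))
```

For this nearly linear data the fitted slope is close to 2, so the forecast for x = 6 lands near 12.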