logistic regression, in statistics, a method for modeling conditional probabilities with discrete (usually binary) outcomes. In a logistic model (sometimes called a logit model), the probability of a discrete, categorical outcome variable (e.g., “yes/no” or “pass/fail/incomplete”) is modeled on the basis of one or more predictor variables, which are typically continuous rather than discrete. There are several reasons that logistic regression is favored over the more common linear regression when dealing with discrete outcomes. Most notably, when linear regression is applied to find the probability of discrete variables, the result can be probabilities greater than 1 or less than 0. Logistic regression is a common tool used in the sciences, particularly in the growing fields of machine learning and natural language processing.
Logistic regression is a nonlinear regression, meaning that the relationship between a predictor (independent) variable and the outcome (dependent) variable is not linear. Instead, the outcome variable undergoes a logit transformation, which involves finding the logarithm of the outcome odds (the logarithm of the ratio of the probability of the outcome being true to the probability of the outcome being false). The result is that the regression line is an S-shaped curve rather than a straight line.
Depending on the nature of the outcome variable, three types of logistic regression may be used: binary, in which there are only two possible outcomes (e.g., “true/false”); nominal, in which there may be several outcomes with no clear order (e.g., the suit in a deck of cards); or ordinal, in which there may be several categories that have a natural order (e.g., letter grades from F to A). A logistic regression model can include a single predictor variable, or it can contain multiple such variables, each variable contributing to the probability of the outcome variable falling into a particular category.
For a binary logistic regression with multiple predictive variables, the model can be expressed asp = e(β0 + x1β1 + x2β2 + … + xkβk)/1 + e(β0 + x1β1 + x2β2 + … + xkβk),where p is the probability of the outcome occurring; e is the base of the natural logarithm (also called Euler’s number; about 2.718); x1, x2, …, xk are the k predictive variables; and β0, β1, …, βk are the k regression coefficients (each being the degree to which a factor affects the probability of the outcome occurring).
For example, suppose one wants to track the relationship between the number of hours a student in a course studied, the number of homework assignments the student turned in, and the probability that the student will pass an exam. Using statistical software to perform a regression analysis reveals that the base regression coefficient β0 is −3.57, meaning that having done no studying and no homework assignments negatively affects the probability of passing the exam. However, the coefficient (β1) for hours spent studying is 0.75, and the coefficient (β2) for homework assignments is 0.44, meaning that, as each of these variables increases, the student’s probability of passing also increases. Suppose the student spent three hours studying (x1 = 3) and turned in four homework assignments (x2 = 4). The equation above can be filled out asp = e[−3.57 + (3 × 0.75) + (4 × 0.44)]/1 + e[−3.57 + (3 × 0.75) + (4 × 0.44)],which simplifies top = 2.7180.84/1 + 2.7180.84 ≈ 0.698.According to the model, the student has an almost 70 percent probability of passing the exam.
Logistic regression is used as a linear classifier, meaning that it can determine a clear boundary between predicted classes. For example, suppose one wants to predict whether a city has a subway system (a binary outcome variable) on the basis of its population (a continuous predictor variable). Performing a logistic regression will produce a probability curve that tells at what population size a city goes from being likely to have no subway to being likely to have a subway. Cities can then be classified into these two categories on the basis of their populations, and predictions can be made about whether particular cities have a subway or not.
This use as a classifier makes logistic regression a valuable tool in machine learning, where it is used as a discriminative classifier, meaning that it helps a model distinguish between different classes of item. For example, a logistic model may be used to help a program distinguish between different characters in optical character recognition. The program could determine the traits of an image (creating predictive variables) and build a probability model for each potential character (on the basis of a set of training data), eventually assigning the image to one of the characters in its model on the basis of those probabilities. However, logistic regression goes beyond mere classification: it not only generates boundaries but also generates levels of probability on the basis of an item’s distance from a boundary.