GIST: "We
tend to assume the near-term future of automation will be built on
man-machine partnerships. Our robot sidekicks will compensate for the
squishy inefficiencies of the human brain, while human judgment will
sand down their cold, mechanical edges. But
what if the partnership, especially in the early stages, instead
accentuates the flaws of both? For example, a formula designed to reduce
prison populations in Virginia led some judges to impose harsher
sentences on young or black defendants, and more lenient ones on
rapists. In an age when artificial intelligence is widely expected to
eat what’s left of the world, simple sentencing algorithms are a preview of the economy’s tool-assisted future. A
working paper released Tuesday
by Megan Stevenson of George Mason University and Jennifer Doleac of
Texas A&M provides one of the first examinations of the unintended
consequences that arise when algorithms and humans team up in the wild. The
algorithms are intended to remove some of the guesswork from judges’
sentencing decisions by assigning a simple risk score to defendants. In
Virginia, the score included data such as offense type, age, prior
convictions and employment status. Larceny scores higher than drug
offenses, men score higher than women, and unmarried folks score higher
than their married peers. (Marriage and employment were removed from the
score in 2013.)
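To make that concrete, here is a minimal sketch of a points-based score of the kind described here. Only two of its weights are reported in this article (13 points for being younger than 30 and 9 points for five or more prior adult incarcerations, both discussed below); the function itself and every other value are invented for illustration.

```python
# Hypothetical sketch of a points-based risk score. Only the 13- and
# 9-point weights below are reported in the article; all other values
# are invented.
def risk_score(age, prior_adult_incarcerations, offense,
               sex, married=False, employed=False):
    score = 0
    if age < 30:
        score += 13                  # reported: youth is heavily weighted
    if prior_adult_incarcerations >= 5:
        score += 9                   # reported: extensive prior record
    score += {"larceny": 5, "drug": 3}.get(offense, 4)  # larceny > drug (invented values)
    if sex == "male":
        score += 2                   # men score higher (invented value)
    if not married:
        score += 1                   # removed from the real score in 2013
    if not employed:
        score += 1                   # removed from the real score in 2013
    return score
```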
Similar algorithms are now used in 28 states (and parts of seven more). In
Virginia, they were adopted statewide in 2002 to help keep prison
populations down after discretionary parole was abolished. Stevenson and
Doleac’s analysis relies on tens of thousands of felony convictions,
with a particular focus on the period between 2000 and 2004. Judges
were supposed to use risk scores to identify felons who were least
likely to reoffend and either give them shorter jail sentences or send
them to a program such as probation or substance-abuse treatment. Rather
than focus on patterns of discrimination by
algorithms or
judges acting alone, as others have done, Stevenson and Doleac measured how the two interacted. After
controlling for the effects of time, geography and demography,
Stevenson and Doleac found the ambitious new nonviolent risk assessment
system, often held up as a model for other states to follow, did not
change the rate at which people were incarcerated, the length of their
sentences or the rate at which they reoffended after release. But that
doesn’t mean it had no effect.
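What “controlling for the effects of time, geography and demography” can look like in practice is a regression of an outcome on a post-adoption indicator plus year, circuit and demographic controls. In the minimal sketch below, the data file and column names are hypothetical, and the paper’s actual specification may well differ.

```python
# Hypothetical sketch of "controlling for time, geography and demography":
# regress an outcome on a post-adoption indicator plus fixed effects.
# The data file and column names are invented for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("virginia_felony_convictions.csv")  # hypothetical data

model = smf.ols(
    "incarcerated ~ post_risk_assessment"
    " + C(year) + C(circuit) + C(race) + C(sex) + age",
    data=df,
).fit()

# Coefficient of interest: the change associated with adoption
print(model.params["post_risk_assessment"])
```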
Judges followed the algorithm’s suggestions a bit less than half of the
time. People whom the algorithm deemed high-risk received longer
sentences than they otherwise would have, and defendants it deemed
low-risk got shorter ones.
The two adjustments offset each other, so the overall numbers didn’t
change, but the interaction between algorithmic sentencing
recommendations and judges’ discretion nonetheless produced perverse
effects. In
a statement, Meredith Farrar-Owens, director of the Virginia Criminal
Sentencing Commission, said the system’s goal was to avoid increasing
crime or recidivism rates while diverting the lowest-risk nonviolent
offenders from prison and freeing up space for violent offenders, who
were expected to serve longer terms under the sentencing reforms that
took effect in 1995. “While
the Commission’s risk assessment instruments were developed based on
empirical study of recidivism rates and patterns among Virginia felons,
use of risk assessment takes place within the context of a complex and
dynamic criminal justice system,” Farrar-Owens said. “Risk assessment
recommendations are advisory and only one of many factors judges will
consider when sentencing defendants.” She
added that judges’ ability to follow the algorithm’s suggestions may
also have been limited by plea agreements, and by a lack of alternative
options in their particular circuit. The
sentencing of young offenders is one example of the algorithm’s
surprising side effects. Stevenson said some judges may not realize it,
but someone’s risk score is largely a reflection of how old they are.
That’s because age is such a strong predictor of recidivism. You get
substantially more added to your risk score for being younger than 30
(13 points) than, for example, having been incarcerated five or more
previous times as an adult (9 points).
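Stripped down to just those two reported weights, the imbalance is easy to see: on these factors alone, a 19-year-old first offender outscores a 45-year-old who has been incarcerated five times.

```python
# Only the two weights the article reports; everything else stripped out.
def age_and_record_points(age, prior_adult_incarcerations):
    points = 0
    if age < 30:
        points += 13   # reported weight for youth
    if prior_adult_incarcerations >= 5:
        points += 9    # reported weight for five-plus prior incarcerations
    return points

print(age_and_record_points(19, 0))  # 13: a 19-year-old first offender
print(age_and_record_points(45, 5))  # 9: a 45-year-old with five priors
```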
In a 2018 analysis,
Stevenson and Vanderbilt’s Christopher Slobogin calculated that 58
percent of the widely used COMPAS algorithm’s violent recidivism risk
score can be attributed to age. “People
are getting this very stigmatic label — high risk for violent
recidivism — largely because they’re 19 years old,” she said. Judges
tended to be more merciful toward young defendants than the algorithm
recommended. Nonetheless, defendants younger than 23 were 4 percentage
points more likely to be incarcerated after risk assessment was adopted,
and their sentences were 12 percent longer than those of their older peers. “Based
on the Commission’s recidivism study, age is one of the most heavily
weighted factors on the risk assessment tool. It makes sense that, once
judges had an objective, research-based risk assessment tool, we would
see sentences of young adult offenders increase relative to older
offenders,” Farrar-Owens said. Racial
disparities also increased among those circuits that used risk
assessment most. Although computers can’t explicitly use prohibited
variables like race in their sentencing calculations, black defendants
were 4 percentage points more likely to be incarcerated after risk
assessment was adopted, compared with otherwise equivalent whites. Black
defendants’ sentences were also 17 percent longer. “This
is partially explained by the fact that black defendants have higher
risk scores, and partially because black defendants are sentenced more
harshly than white defendants with the same risk score,” Stevenson and
Doleac write.
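That two-part explanation is, in effect, a decomposition: one gap appears in the risk scores themselves, and a second appears in sentences among defendants who share the same score. A minimal sketch, assuming invented column names and data:

```python
# Hypothetical sketch of the two-part gap the authors describe.
# The data file and column names are invented for illustration.
import pandas as pd

df = pd.read_csv("sentencing_outcomes.csv")  # race, risk_score, sentence_months

# Part 1: gap in average risk score between groups
print(df.groupby("race")["risk_score"].mean())

# Part 2: gap in sentence length within each risk score; any difference
# here is not explained by the score itself
print(df.groupby(["risk_score", "race"])["sentence_months"].mean().unstack("race"))
```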
The authors also studied a similar risk-assessment program for sex
offenders. Out of an abundance of caution, the program was built so the
algorithm could only authorize longer-than-baseline sentences — yet its
net effect was to decrease how often sex offenders were imprisoned by 5
percentage points, and to shorten their sentences by about 24 percent. Stevenson
and Doleac suggest that by assigning a sex offender a low risk score,
algorithms may help protect a judge from backlash if the offender goes
on to commit another crime. This empowers judges to offer shorter
sentences than they otherwise would have meted out. “If
they sentence someone leniently and that person goes out and commits a
heinous crime, all fingers are pointed at them,” Stevenson said. “If
they make a mistake in the other direction — failing to release someone
who wouldn’t have done anything if released — nobody sees that. There are
no consequences to the judge.” The University of Maryland’s Frank Pasquale, who made the case for oversight of algorithms in 2015’s “
The Black Box Society,”
said most judges aren’t adequately trained to evaluate the claims made
by sentencing systems such as the one in Virginia. Because judges have
the option to reject the algorithm’s conclusion, Pasquale said, they may
follow it only when it provides a convenient excuse. “You
can have a lot of scenarios where the AI algorithms end up being a
rationalization for what the judge wants to do,” Pasquale said. Stevenson
and Doleac write that, especially when the goals of the algorithm’s
human partners differ from those of its designers, we should expect
unexpected results. “Virginia’s
nonviolent risk assessment reduced neither incarceration nor
recidivism; its use disadvantaged a vulnerable group (the young); and
failed to reduce racial disparities,” Stevenson and Doleac write.
“Virginia’s sex offender risk assessment lowered sentences for those
convicted of rape: a group that the Sentencing Commission had targeted
for increased sentences.” Why
did the Virginia algorithm struggle? It turns out future crime is hard
to predict. “Even under ideal conditions,” Stevenson and Doleac write,
“predictions of future offending are unable to explain more than a tiny
fraction of the variation in recidivism.”
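“Explaining variation” is a statement about R-squared. A toy example with invented numbers shows how predictions that lean the right way can still leave most of the variation in recidivism unexplained:

```python
# A model whose scores lean the right way can still leave ~90 percent of
# the variation unexplained. All numbers below are invented.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])  # reoffended or not
y_pred = np.array([0.28, 0.35, 0.30, 0.27, 0.33,
                   0.29, 0.34, 0.31, 0.28, 0.30])   # model risk scores

print(r2_score(y_true, y_pred))  # ~0.10: most variation unexplained
```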
Duke University professor Brandon Garrett, who has separately studied the
Virginia system, said: “You can’t just adopt a tool and expect it to
magically solve problems. You have to put real thought and information
into implementation.” Garrett
and his collaborators spoke with judges from across Virginia. Some said
they weren’t trained or didn’t trust the formula. Some said they didn’t
even have the programs necessary to divert people from prisons. One
judge even dryly equated using the algorithm to visiting a psychic. In
an analysis forthcoming in the California Law Review, Garrett and
collaborator John Monahan of the University of Virginia write that
judges didn’t follow risk assessments consistently — and that some
didn’t use them at all. “Nor
should that be a surprise,” they write, “given that judges and other
decisionmakers typically receive almost no training in risk assessment,
and their discretion to ignore risk assessment is virtually unchecked.” Farrar-Owens
said that newly appointed judges are taught about the risk-assessment
program during orientation but that “judicial philosophy in regards to
risk assessment certainly varies across the Commonwealth.” Aurélie
Ouss, who researches criminal sentencing, recidivism and related
subjects at the University of Pennsylvania, praised Stevenson and
Doleac’s work, and said it showed Virginia’s algorithms had neither been
the panacea some hoped for nor the nightmare others feared. Like
Garrett, she said it might come down to implementation. “There
are so many ways in which you can present the information,” Ouss
pointed out. “It may be the case that a different tool that’s designed
differently — that judges use differently — would yield different
results.” Yet
in the case of both race and age, Stevenson and Doleac find that if
judges had fully complied with the algorithm’s recommendations, the
disparities would have been even greater. “If
you’re sentencing purely on a risk rationale, then you want to lock up
all the teenagers,” Stevenson said. “But if you’re sentencing on other
rationales, like mercy for the vulnerable or evaluating the level of
culpability that individuals have, then people are young. There are a
lot of reasons why we don’t necessarily want to punish them more harshly
than older people.”