GIST: "We
We tend to assume the near-term future of automation will be built on man-machine partnerships. Our robot sidekicks will compensate for the squishy inefficiencies of the human brain, while human judgment will sand down their cold, mechanical edges.

But what if the partnership, especially in the early stages, instead accentuates the flaws of both? For example, a formula designed to reduce prison populations in Virginia led some judges to impose harsher sentences on young or black defendants, and more lenient ones on rapists.

In an age when artificial intelligence is widely expected to eat what’s left of the world, simple sentencing algorithms are a preview of the economy’s tool-assisted future.

A working paper released Tuesday by Megan Stevenson of George Mason University and Jennifer Doleac of Texas A&M provides one of the first examinations of the unintended consequences that arise when algorithms and humans team up in the wild.

The algorithms are intended to remove some of the guesswork from judges’ sentencing decisions by assigning a simple risk score to defendants. In Virginia, the score included data such as offense type, age, prior convictions and employment status. Larceny scores higher than drug offenses, men score higher than women, and unmarried folks score higher than their married peers. (Marriage and employment were removed from the score in 2013.)

Similar algorithms are now used in 28 states (and parts of seven more). In Virginia, they were adopted statewide in 2002 to help keep prison populations down after discretionary parole was abolished. Stevenson and Doleac’s analysis relies on tens of thousands of felony convictions, with a particular focus on the period between 2000 and 2004.

Judges were supposed to use risk scores to identify felons who were least likely to reoffend and either give them shorter jail sentences or send them to a program such as probation or substance-abuse treatment.

Rather than focus on patterns of discrimination by algorithms or judges acting alone, as others have done, Stevenson and Doleac measured how the two interacted.

After controlling for the effects of time, geography and demography, Stevenson and Doleac found the ambitious new nonviolent risk assessment system, often held up as a model for other states to follow, did not change the rate at which people were incarcerated, the length of their sentences or the rate at which they reoffended after release. But that doesn’t mean it had no effect.

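As a rough illustration of what “controlling for the effects of time, geography and demography” means in practice, here is a minimal sketch of the kind of fixed-effects regression common in this literature. It is not the paper’s actual specification, and every column name (incarcerated, post_adoption, year, circuit and so on) is a hypothetical placeholder.

```python
# A minimal sketch, NOT Stevenson and Doleac's actual specification, of what
# "controlling for time, geography and demography" typically means: regress
# the outcome on the policy variable plus fixed effects for year, circuit
# and demographics. All column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("felony_convictions.csv")  # hypothetical dataset

# Outcome: whether the defendant was incarcerated. "post_adoption" flags
# cases sentenced after risk assessment was adopted; the C(...) terms
# absorb year, judicial-circuit and demographic differences.
model = smf.ols(
    "incarcerated ~ post_adoption + C(year) + C(circuit) + age + C(race) + C(sex)",
    data=df,
).fit()
print(model.params["post_adoption"])  # effect of adoption, net of controls
```
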
Judges followed the algorithm’s suggestions a bit less than half the time. People the algorithm deemed high risk received longer sentences than they otherwise would have, and those it deemed low risk got shorter ones. The two adjustments offset each other, so the overall numbers didn’t change, but the interaction between algorithmic sentencing recommendations and judges’ discretion nonetheless produced perverse effects.

In a statement, Meredith Farrar-Owens, director of the Virginia Criminal Sentencing Commission, said the system’s goal was to avoid increasing crime or recidivism rates while diverting the lowest-risk nonviolent offenders from prison and freeing up space for violent offenders, who were expected to serve longer terms under the sentencing reforms that took effect in 1995.

“While the Commission’s risk assessment instruments were developed based on empirical study of recidivism rates and patterns among Virginia felons, use of risk assessment takes place within the context of a complex and dynamic criminal justice system,” Farrar-Owens said. “Risk assessment recommendations are advisory and only one of many factors judges will consider when sentencing defendants.”

She added that judges’ ability to follow the algorithm’s suggestions may also have been limited by plea agreements, and by a lack of alternative options in their particular circuit.

The sentencing of young offenders is one example of the algorithm’s surprising side effects. Stevenson said some judges may not realize it, but someone’s risk score is largely a reflection of how old they are. That’s because age is such a strong predictor of recidivism. You get substantially more added to your risk score for being younger than 30 (13 points) than, for example, for having been incarcerated five or more previous times as an adult (9 points).

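To make that arithmetic concrete, here is a minimal sketch of an additive, point-based score of the kind described above. Only the two weights the article reports (13 points for being younger than 30, 9 points for five or more prior adult incarcerations) come from the source; the function and the remaining factors and weights are hypothetical placeholders, not the commission’s actual worksheet.

```python
# A minimal sketch of an additive, point-based risk score. The two weights
# marked "reported" come from the article; every other factor and weight
# is a hypothetical placeholder, not Virginia's actual instrument.

def risk_score(age: int, prior_adult_incarcerations: int,
               offense_is_larceny: bool, is_male: bool) -> int:
    score = 0
    if age < 30:
        score += 13  # reported: youth is the most heavily weighted factor
    if prior_adult_incarcerations >= 5:
        score += 9   # reported: five or more prior adult incarcerations
    if offense_is_larceny:
        score += 6   # placeholder: larceny scores higher than drug offenses
    if is_male:
        score += 4   # placeholder: men score higher than women
    return score

# A 19-year-old first-time offender outscores a 45-year-old with six prior
# incarcerations: the dynamic Stevenson highlights.
print(risk_score(age=19, prior_adult_incarcerations=0,
                 offense_is_larceny=False, is_male=True))  # 17
print(risk_score(age=45, prior_adult_incarcerations=6,
                 offense_is_larceny=False, is_male=True))  # 13
```
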
In a 2018 analysis, Stevenson and Vanderbilt’s Christopher Slobogin calculated that 58 percent of the widely used COMPAS algorithm’s violent recidivism risk score can be attributed to age.

“People are getting this very stigmatic label — high risk for violent recidivism — largely because they’re 19 years old,” she said.

Judges tended to be more merciful toward young defendants than the algorithm recommended. Nonetheless, defendants younger than 23 were 4 percentage points more likely to be incarcerated after risk assessment was adopted, and their sentences were 12 percent longer than those of their older peers.

“Based on the Commission’s recidivism study, age is one of the most heavily weighted factors on the risk assessment tool. It makes sense that, once judges had an objective, research-based risk assessment tool, we would see sentences of young adult offenders increase relative to older offenders,” Farrar-Owens said.

Racial disparities also increased among those circuits that used risk assessment most. Although computers can’t explicitly use prohibited variables like race in their sentencing calculations, black defendants were 4 percentage points more likely to be incarcerated after risk assessment was adopted, compared with otherwise equivalent whites. Black defendants’ sentences were also 17 percent longer.

“This is partially explained by the fact that black defendants have higher risk scores, and partially because black defendants are sentenced more harshly than white defendants with the same risk score,” Stevenson and Doleac write.

The authors also studied a similar risk-assessment program for sex offenders. Out of an abundance of caution, the program was built so the algorithm could only recommend longer-than-baseline sentences — yet its net effect was to decrease how often sex offenders were imprisoned by 5 percentage points, and to shorten their sentences by about 24 percent.

Stevenson and Doleac suggest that by assigning a sex offender a low risk score, the algorithm may help protect a judge from backlash if the offender goes on to commit another crime. That cover empowers judges to hand down shorter sentences than they otherwise would have meted out.

“If they sentence someone leniently and that person goes out and commits a heinous crime, all fingers are pointed at them,” Stevenson said. “If they make a mistake in the other direction — failing to release someone who wouldn’t have done anything if released — nobody sees that. There are no consequences to the judge.”

The University of Maryland’s Frank Pasquale, who made the case for oversight of algorithms in 2015’s “The Black Box Society,” said most judges aren’t adequately trained to evaluate the claims made by sentencing systems such as the one in Virginia. Because judges have the option to reject the algorithm’s conclusion, Pasquale said, they may follow it only when it provides a convenient excuse.

“You can have a lot of scenarios where the AI algorithms end up being a rationalization for what the judge wants to do,” Pasquale said.

Stevenson and Doleac write that, especially when the goals of the algorithm’s human partners differ from those of its designers, we should expect unexpected results.

“Virginia’s nonviolent risk assessment reduced neither incarceration nor recidivism; its use disadvantaged a vulnerable group (the young); and failed to reduce racial disparities,” Stevenson and Doleac write. “Virginia’s sex offender risk assessment lowered sentences for those convicted of rape: a group that the Sentencing Commission had targeted for increased sentences.”

Why did the Virginia algorithm struggle? It turns out future crime is hard to predict. “Even under ideal conditions,” Stevenson and Doleac write, “predictions of future offending are unable to explain more than a tiny fraction of the variation in recidivism.”

Duke University professor Brandon Garrett, who has separately studied the Virginia system, said: “You can’t just adopt a tool and expect it to magically solve problems. You have to put real thought and information into implementation.”

Garrett and his collaborators spoke with judges from across Virginia. Some said they weren’t trained on the formula or didn’t trust it. Some said they didn’t even have the programs necessary to divert people from prison. One judge dryly equated using the algorithm to visiting a psychic.

In an analysis forthcoming in the California Law Review, Garrett and collaborator John Monahan of the University of Virginia write that judges didn’t follow risk assessments consistently — and that some didn’t use them at all.

“Nor should that be a surprise,” they write, “given that judges and other decisionmakers typically receive almost no training in risk assessment, and their discretion to ignore risk assessment is virtually unchecked.”

Farrar-Owens said that newly appointed judges are taught about the risk-assessment program during orientation but that “judicial philosophy in regards to risk assessment certainly varies across the Commonwealth.”

Aurélie Ouss, who researches criminal sentencing, recidivism and related subjects at the University of Pennsylvania, praised Stevenson and Doleac’s work, and said it showed Virginia’s algorithms had been neither the panacea some hoped for nor the nightmare others feared. Like Garrett, she said it might come down to implementation.

“There are so many ways in which you can present the information,” Ouss pointed out. “It may be the case that a different tool that’s designed differently — that judges use differently — would yield different results.”

Yet in the case of both race and age, Stevenson and Doleac find that if judges had fully complied with the algorithm’s recommendations, the disparities would have been even greater.

“If you’re sentencing purely on a risk rationale, then you want to lock up all the teenagers,” Stevenson said. “But if you’re sentencing on other rationales, like mercy for the vulnerable or evaluating the level of culpability that individuals have, then people are young. There are a lot of reasons why we don’t necessarily want to punish them more harshly than older people.”