How the Internet Archive Digitizes 3,500 Books a Day–the Hard Way, One Page at a Time

Does turning the pages of an old book excite you? How about 3 million pages? That’s how many pages Eliza Zhang has scanned over her ten years with the Internet Archive, using Scribe, a specialized scanning machine invented by Archive engineers over 15 years ago. “Listening to 70s and 80s R&B while she works,” Wendy Hanamura writes at the Internet Archive blog, “Eliza spends a little time each day reading the dozens of books she handles. The most challenging part of her job? ‘Working with very old, fragile books.’”

The fragile state and wide variety of the millions of books scanned by Zhang and the seventy-or-so other Scribe operators explain why this work has not been automated. “Clean, dry human hands are the best way to turn pages,” says Andrea Mills, one of the leaders of the digitization team. “Our goal is to handle the book once and to care for the original as we work with it.”

Raising the glass with a foot pedal, adjusting the two cameras, and shooting the page images are just the beginning of Eliza’s work. Some books, like the Bureau of Land Management publication featured in the video, have myriad fold-outs. Eliza must insert a slip of paper to remind her to go back and shoot each fold-out page, while at the same time inputting the page numbers into the item record. The job requires keen concentration.

If this experienced digitizer accidentally skips a page, or if an image is blurry, the publishing software created by our engineers will send her a message to return to the Scribe and scan it again.

It’s not a job for the easily bored. “It takes concentration and a love of books,” says Internet Archive founder Brewster Kahle. The painstaking process allows digitizers to preserve valuable books online while maintaining the integrity of physical copies. “We do not disbind the books,” says Kahle, and that method has allowed the Archive to partner with hundreds of institutions around the world, digitizing 28 million texts over two decades. Many of those books are rare and valuable, while many others have been deemed of little or no value. “Increasingly,” writes the Archive’s Chris Freeland, “the Archive is preserving many books that would otherwise be lost to history or the trash bin.”

In one example, Freeland cites The dictionary of costume, “one of the millions of titles that reached the end of its publishing lifecycle in the 20th century.” It is also a work cited in Wikipedia, a key source for “students of all ages… in our connected world.” The Internet Archive has preserved the only copy of the book available online, making sure Wikipedia editors can verify the citation and researchers can use the book in perpetuity. If looking up the definition of “petticoat” in an out-of-print reference work seems trivial, consider that the Archive digitizes about 3,500 books every day in its 18 digitization centers. (The dictionary of costume was identified as the Archive’s 2 millionth “modern book.”)

Libraries “have been vital in times of crisis,” writes Alistair Black, emeritus professor of Information Sciences at the University of Illinois, and “the coronavirus pandemic may prove to be a challenge that dwarfs the many episodes of anxiety and crisis through which the public library has lived in the past.” A huge part of our combined global crises involves access to reliable information, and book scanners at the Internet Archive are key agents in preserving knowledge. The collections they digitize “are critical to educating an informed populace at a time of massive disinformation and misinformation,” says Kahle. When asked what she liked best about her job, Zhang replied, “Everything! I find everything interesting…. Every collection is important to me.”

The Internet Archive offers over 20,000,000 freely downloadable books and texts. Enter the collection here.

Related Content:

Libraries & Archivists Are Digitizing 480,000 Books Published in 20th Century That Are Secretly in the Public Domain

10,000 Vintage Recipe Books Are Now Digitized in The Internet Archive’s Cookbook & Home Economics Collection

Classic Children’s Books Now Digitized and Put Online: Revisit Vintage Works from the 19th & 20th Centuries

Josh Jones is a writer and musician based in Durham, NC. Follow him at @jdmagness


Open Culture was founded by Dan Colman.