The EPFL Library prioritizes reading comfort and lets its users run full-text searches of its documents, using transcripts produced by optical character recognition (OCR). The OCR text and dynamic tables of contents make each document easy to navigate.
Digitization settings for the original documents
- The archiving format is uncompressed TIFF at 300 DPI, 24-bit RGB (16 million colors). Images are not straightened and receive no alignment correction.
- Grayscale processing is applied to documents with transparency to improve OCR results. The images are straightened in the reading direction.
- The original documents are scanned in full size at a scale of 1:1.
- Pages are digitized one at a time, with some exceptions: a double page is captured in a single shot if the image spans both pages and the document can be laid flat.
- The framing leaves a margin of 2 to 5 mm around the cover to preserve the shape of the document. For "tight books" (where margins and text disappear into the binding), images are delivered with a negative margin of 2 to 3 mm.
- Scanning covers every element of the document: cover, binding, blank pages, flaps.
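To give a sense of what these capture settings imply, the following minimal sketch estimates the pixel dimensions and raw image size of a single page scanned at 1:1 and 300 DPI. The function names and the A4 page size are assumptions chosen for illustration; they are not part of the library's workflow. The size estimate counts pixel data only and ignores TIFF header overhead.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, margin_mm=2.0, dpi=300):
    """Pixel dimensions of a 1:1 scan with a uniform margin on every side.

    A negative margin_mm models the "tight book" case, where the frame
    is cropped inside the page edge.
    """
    total_w_mm = width_mm + 2 * margin_mm
    total_h_mm = height_mm + 2 * margin_mm
    return (
        round(total_w_mm / MM_PER_INCH * dpi),
        round(total_h_mm / MM_PER_INCH * dpi),
    )


def uncompressed_bytes(width_px, height_px, bits_per_pixel=24):
    """Raw pixel data size of an uncompressed image (headers not included)."""
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical example: an A4 page (210 x 297 mm) framed with a 2 mm margin.
w, h = scan_pixels(210, 297, margin_mm=2.0)
size = uncompressed_bytes(w, h)
print(w, h, size)  # 2528 3555 26961120 (roughly 27 MB per page)
```

Under these assumptions, each archived page weighs roughly 27 MB.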
Availability settings for digitized documents
- The consultation format is a PDF/A file with a hidden OCR text layer: 300 DPI, PDF 30 (70% quality), grayscale, images straightened in the reading direction, with alignment correction.
- Images are also provided in JPEG 70 format (70% quality, 30% compression), 24-bit RGB (16 million colors, true color), straightened in the reading direction, with alignment correction. This JPEG is always produced from the TIFF file: 300 DPI, RGB, images straightened in the reading direction, with alignment correction.
- Structural metadata internal to each document is created, such as pagination, foliation, or intellectual structure (for example, chapters or parts).
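The structural metadata described in the last item can be pictured as a mapping between scanned images, their printed page or folio labels, and the intellectual sections of the document. The sketch below is only an illustration: the class and field names (`Page`, `Section`, `DocumentStructure`) are hypothetical and do not reflect the schema actually used by the library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Page:
    image_index: int  # 1-based position of the scanned image in the file
    label: str        # pagination or foliation as printed ("iv", "12", "3r"); "" if unnumbered


@dataclass
class Section:
    title: str        # chapter or part title
    first_image: int  # image_index at which the section starts


@dataclass
class DocumentStructure:
    pages: List[Page] = field(default_factory=list)
    sections: List[Section] = field(default_factory=list)

    def table_of_contents(self) -> List[Tuple[str, str]]:
        """Return (title, printed label) pairs in reading order."""
        labels = {p.image_index: p.label for p in self.pages}
        ordered = sorted(self.sections, key=lambda s: s.first_image)
        return [(s.title, labels.get(s.first_image, "")) for s in ordered]


# Hypothetical document: an unnumbered cover, one preface page, two body pages.
doc = DocumentStructure(
    pages=[Page(1, ""), Page(2, "i"), Page(3, "1"), Page(4, "2")],
    sections=[Section("Chapter 1", 3), Section("Preface", 2)],
)
toc = doc.table_of_contents()
print(toc)  # [('Preface', 'i'), ('Chapter 1', '1')]
```

Keeping the image index separate from the printed label is what would let a dynamic table of contents jump to the right scan even when the cover, blank pages, or flaps carry no page number.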