The China-U.S. Million Book Digitalization Project (CADAL) is a research and development project initiated by scientists from both China and the United States to build a digital library of one million books. The project is based on an open framework structure, and the volume of resources will reach 50-100 TB. At present, the popular e-book production standards at home and abroad are largely incompatible with each other. Published e-book formats include plain text, HTML, proprietary open formats (such as PDF and DjVu), and proprietary closed formats. This multiplicity of formats inevitably leads to duplicated effort in building resources and makes interoperability difficult.

The Open eBook Forum, established in 1998, is a standard-setting organization for e-books jointly initiated by the IT community, libraries, publishing organizations, and book sales companies. Its e-book framework structure standard, OEBPS 1.0 (Open eBook Publication Structure 1.0), was released in 1999. The purpose of the standard is to better express the content of e-books: it provides authors and publishers with the simplest and most general e-book publishing format, and it provides developers of e-book reading systems with a standard for document structure, so that e-books produced according to the standard can be accurately rendered by various reading systems. The OEBPS standard is based on XML and has the advantages of interoperability, extensibility, openness, and ease of use. As an open framework, the specification has been widely adopted by the industry. The latest version is OEBPS 1.2, released in August 2002. After joint discussions between Chinese and American experts, CADAL selected OEBPS as its engineering implementation standard.

1 The structure of e-books based on OEBPS

In the CADAL project, the production of an e-book goes through scanning, image correction, format conversion, metadata entry, table-of-contents entry, packaging, and other steps. The final semi-finished e-book data contains several OEB documents, a package file with the extension .opf, and the image files that carry the contents of the e-book.

The OEB specification stipulates that e-book documents conforming to it must be valid XML documents, and it recommends that the package file use the .opf extension. The package file describes the complete framework of an e-book, including its documents, images, and other objects and the relationships among them.

According to the specification, the package file is composed of metadata, manifest, spine, tours, and guide. As a standardized XML file, its outermost element is the package tag. The dc-metadata element contains the Dublin Core (DC) metadata of the e-book: each child element represents a DC field, and the content of the child element is the value of that field. For example, <dc:Creator>Shen Zongjing</dc:Creator> indicates that the author is Shen Zongjing.
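The package layout described above can be illustrated with a minimal sketch in Python. It is not the CADAL production tool: the function name `build_package`, the sample identifier, title, and page paths are hypothetical, and the Dublin Core namespace URI and the `text/x-oeb1-document` media type are assumptions based on OEBPS 1.x conventions. The default package namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed Dublin Core namespace URI (as used by OEBPS 1.x packages).
DC_NS = "http://purl.org/dc/elements/1.0/"
ET.register_namespace("dc", DC_NS)


def build_package(book_id, title, creator, pages):
    """Build a minimal package tree: metadata, manifest, and spine."""
    package = ET.Element("package", {"unique-identifier": "bookid"})

    # Each child of dc-metadata is one DC field; its text is the field value.
    metadata = ET.SubElement(package, "metadata")
    dc = ET.SubElement(metadata, "dc-metadata")
    ET.SubElement(dc, f"{{{DC_NS}}}Identifier", {"id": "bookid"}).text = book_id
    ET.SubElement(dc, f"{{{DC_NS}}}Title").text = title
    ET.SubElement(dc, f"{{{DC_NS}}}Creator").text = creator

    # The manifest lists every file; the spine gives the reading order.
    manifest = ET.SubElement(package, "manifest")
    spine = ET.SubElement(package, "spine")
    for n, href in enumerate(pages, start=1):
        item_id = f"page{n:04d}"  # unique ID per manifest item
        ET.SubElement(manifest, "item", {
            "id": item_id,
            "href": href,
            "media-type": "text/x-oeb1-document",  # assumed media type
        })
        ET.SubElement(spine, "itemref", {"idref": item_id})
    return package


# Hypothetical sample values, for illustration only.
pkg = build_package(
    "00000001",
    "Sample title",
    "Shen Zongjing",
    ["HTML/00000001.htm", "HTML/00000002.htm"],
)
xml_text = ET.tostring(pkg, encoding="unicode")
print(xml_text)
```

The tours and guide elements are omitted here because they are optional parts of the package.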

The manifest element lists all the files that make up the e-book. Its child element item describes the type and location of a file, and each item has a unique ID. In the OEB specification, tours (guided reading) can select some pages of the e-book in a certain order, according to different reader levels or reading purposes, to form a guided path through the book. However, the guided reading function provided by tours differs from the tree-like table-of-contents navigation required by CADAL e-books. Therefore, this element is not used in the CADAL package file; instead, the table-of-contents data is written directly into the catalog.xml file. The information is entered according to the structure of the book's own table of contents, and the generated catalog.xml file follows the METS standard.
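The requirement that every manifest item carry a unique ID lends itself to a simple check. The following is a sketch only: the helper name `duplicate_item_ids` and the sample package text are hypothetical, and it assumes the package is written without a default XML namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_item_ids(opf_text):
    """Return the manifest item IDs that occur more than once, sorted."""
    root = ET.fromstring(opf_text)
    ids = [item.get("id") for item in root.iter("item")]
    return sorted(i for i, count in Counter(ids).items() if count > 1)


# Hypothetical package fragment with one repeated ID.
SAMPLE = """
<package>
  <manifest>
    <item id="page0001" href="HTML/00000001.htm"/>
    <item id="page0002" href="HTML/00000002.htm"/>
    <item id="page0002" href="HTML/00000003.htm"/>
  </manifest>
</package>
"""

duplicates = duplicate_item_ids(SAMPLE)
print(duplicates)
```

An empty result means the manifest satisfies the uniqueness requirement.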

2 CADAL e-book format selection

CADAL e-books first used HTML as the publishing format, because HTML is the most common markup language on the Internet: readers can download and read the books in a browser without installing any plug-ins, so e-books in HTML format are highly portable.

The content in HTML format is text that has been scanned, recognized by OCR, and restored, so the recognition accuracy and restoration rate of OCR largely determine the quality of the e-books. During production, we found that OCR technology has a very low recognition rate for pages that mix text with graphics and for mixed Chinese-English layouts. At the same time, because layout restoration technology is lacking, the generated HTML file differs greatly from the original layout. The quality of e-books in HTML format is therefore unsatisfactory.

In view of the many shortcomings of HTML-format e-books, we noticed that DjVu, an image compression technology for electronic documents developed by AT&T Labs in 1996, can solve the problems mentioned above.

It is an open standard: the file format specification, the decoder implementation, and part of the encoder are open. There are free plug-ins for standard Internet browsers on various operating systems.

Table: directories and files of a single HTML-format e-book

Type                    Description
HTML directory          Stores the HTML files generated from the TIF files
MARC directory          Stores the MARC files
META directory          Stores the Dublin Core files
OEB directory           Stores the opf file and the files related to it
OTIFF directory         Stores the unprocessed scanned TIF image files
PTIFF directory         Stores the processed scanned TIF image files
TOC directory           Stores the table-of-contents files
IMAGES directory        Stores the images required for e-book display
Oebbrowser.html file    The e-book home page used for browsing

A typical DjVu document achieves a compression ratio 5 to 10 times better than existing color formats such as JPEG and GIF, and 3 to 8 times better than TIF for black-and-white documents. A page scanned at 300 DPI in full color can be compressed from 25 MB to 30 to 100 KB, and a 300 DPI black-and-white page usually occupies 5 to 30 KB after compression. A high-resolution scanned page can thus be kept to about the size of an HTML page (for example, 50 KB). For color documents containing both pictures and text, DjVu files are 5 to 10 times smaller than JPEG files of the same quality.
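The full-color figures quoted above imply a compression ratio in the hundreds. A short calculation makes this concrete. It is a sketch that assumes decimal units (1 MB = 1000 KB); the function name `compression_ratio` is chosen here for illustration.

```python
def compression_ratio(original_kb, compressed_kb):
    """Ratio of original size to compressed size (both in KB)."""
    return original_kb / compressed_kb


# 25 MB full-color page at 300 DPI, assuming 1 MB = 1000 KB.
ORIGINAL_KB = 25 * 1000

# Compressed size range quoted in the text: 30 to 100 KB.
lowest_ratio = compression_ratio(ORIGINAL_KB, 100)
highest_ratio = compression_ratio(ORIGINAL_KB, 30)

print(f"ratio range: {lowest_ratio:.0f}:1 to {highest_ratio:.0f}:1")
```

Under these assumptions, a full-color page shrinks by a factor of roughly 250 to 830.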

When a DjVu file that carries hidden text (txt) is browsed, the text selection button in the browser is activated. After pressing this button, the reader can drag a rectangle with the mouse over the corresponding text in the image, and the selected part is shown in reverse video. The selected text can be copied to the system clipboard, and searching and querying the text content of the image become possible in the same way.

In view of these advantages, CADAL's latest e-books use DjVu as the publishing format and use an automatic OCR system to generate the hidden text. This both reproduces the layout of the book and meets the needs of retrieval and editing. HTML is used only for navigation alongside the DjVu pages, to make reading easier.

3 CADAL e-book structure

Following the choice of publishing format, CADAL's e-book structure has also gone through two versions.

3.1 E-book structure in HTML format

At the beginning of the project, each e-book had an independent directory whose name was an 8-digit serial number, and each directory contained multiple subdirectories and files.

The HTML directory stores the HTML files produced from the scanned TIF page images in the PTIFF directory. Each page goes through layout analysis and OCR recognition, and its layout is then restored as an HTML file, so each TIF page generates one corresponding HTML file. The OEB directory mainly stores the opf file conforming to the OEBPS standard, which is packaged from the files in the DC and HTML directories.

3.2 E-book structure in DjVu format

The disadvantage of the HTML-format e-book structure is that it considers only the structure of a single e-book and ignores problems such as the storage and distribution of an entire e-book library. Therefore, once DjVu had been chosen as the e-book format, CADAL modified the e-book structure to address these problems.

The DjVu-based structure builds on the single e-book but is designed for a huge e-book library, in which an e-book is just one element. Because the structure of the CADAL e-book library is much more complicated than that of a single HTML e-book, the directory structure of the library is, for ease of explanation, described in XML. The description includes a note marking where the storage of single e-books begins and an element for the production unit number. In this tree structure, the Name attribute of each XML element is the name of the directory built on the computer, and the Description attribute is a brief description of that directory.

E-books are stored in the ebook directory. Under it, each project is treated as a category, and the project element can be repeated. In terms of directory storage, this means that several directories can be stored under ebook, each of which is a category. Since CADAL is a library of a million e-books, numbered-set directories are created under each project directory, and each numbered-set directory stores 200 e-books. The name of a numbered-set directory consists of the first and last numbers of the set, separated by a hyphen. For example, the numbered-set directory 000001-000200 stores the e-books numbered from 000001 to 000200.
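The naming rule for numbered-set directories can be sketched in a few lines of Python. The function name `number_set_directory` is hypothetical, and the sketch assumes 6-digit zero-padded numbers joined by a hyphen, with sets of 200 starting at number 1, as in the example in the text.

```python
SET_SIZE = 200  # e-books per numbered-set directory, as stated in the text
WIDTH = 6       # assumed zero-padded width, taken from the example


def number_set_directory(book_number, set_size=SET_SIZE, width=WIDTH):
    """Return the numbered-set directory name that holds the given e-book."""
    if book_number < 1:
        raise ValueError("book numbers are assumed to start at 1")
    # First number of the set that contains book_number.
    first = ((book_number - 1) // set_size) * set_size + 1
    last = first + set_size - 1
    return f"{first:0{width}d}-{last:0{width}d}"


print(number_set_directory(1))    # first book of the first set
print(number_set_directory(200))  # last book of the first set
print(number_set_directory(201))  # first book of the second set
```

Grouping books this way keeps each directory to a fixed, small number of entries, however large the library grows.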

The above structure replaces the single e-book with the concept of e-book library, which is more convenient for the storage, distribution and management of e-books.

4 Outlook

The implementation of the CADAL project will promote research in massive data storage, management, retrieval, and multimedia processing, and will help China reach the world's advanced level in large-scale digital library construction and information services. The study of e-book format and structure is only the most basic step. The e-book structure applied by CADAL not only complies with the OEB specification but also absorbs the advantages of the DjVu format, and so meets the requirements of building an open digital library. We also hope that this application can provide a simple, efficient, economical, and powerful e-book structure model for others engaged in digital library construction.
