PDF to ePub conversion is hard work. Converting a PDF to reflowable ePub3 gets even more daunting when the eBooks are written in native (non-English) languages. It takes a lot of sweat from the type designer and the linguistic expert, who have to work in perfect sync throughout to put the character-encoding pieces of the puzzle together.
Our Kitaboo team was able to convert PDF eBooks to ePub3 at a rate of close to 100 non-English books per week, while others were doing 6 books a month. This is Part 1 of our Digital Publishing blog series, and it covers the major obstacles you should expect to hit when taking a similar conversion route with PDF books written in local scripts.
With millions of pages of ePub3 conversions already done on our cloud publishing platform Kitaboo, our answer to creating multiple volumes of eBooks in native Indian languages from PDF files was an instant “Yes, we can!”
No sooner had we started than we realized that our estimates of the time taken for each PDF to ePub conversion had to be reset. What followed was a search for the shortest, fastest and most accurate route of ePub conversion that would still let us deliver on schedule.
The two major hurdles impacting our eBook conversion speed were:
- PDF is a character-position-driven format that places each character by its X and Y coordinates only, whereas ePub3 depends on the sequence (order) of characters in the text.
- Character encoding and font shaping had to be matched to render the linguistically correct reading order in ePub3. Thousands of errors appeared during the eBook conversion, and we realized there was a long road ahead, and in reverse gear at that, to untangle the character representation problems in HTML5.
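To make the first hurdle concrete, here is a minimal Python sketch of rebuilding a text sequence from positioned characters. It is only an illustration, not the Kitaboo tool: the `Glyph` record, the `glyphs_to_lines` helper and the `line_tolerance` value are hypothetical names and numbers chosen for this example, and it assumes a single column with PDF-style coordinates (origin at the bottom left).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character, as a PDF extractor might report it (hypothetical)."""
    char: str
    x: float
    y: float  # PDF-style coordinates: y grows upward from the bottom of the page


def glyphs_to_lines(glyphs, line_tolerance=2.0):
    """Group glyphs into lines by baseline, then order each line left to right.

    line_tolerance is an assumed threshold (in points) for treating two
    glyphs as sharing a baseline.
    """
    lines = []  # list of (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    return [
        "".join(g.char for g in sorted(row, key=lambda g: g.x))
        for _, row in lines
    ]


# Glyphs arrive in arbitrary order; only their coordinates say where they belong.
page = [Glyph("b", 20, 700), Glyph("a", 10, 700),
        Glyph("d", 20, 680), Glyph("c", 10, 680.5)]
lines = glyphs_to_lines(page)
```

Sorting by position only recovers the visual order of the characters. For Indian scripts the visual order is often not the order in which the characters must be stored, which is where the second hurdle comes in.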
The character tantrums had to be disciplined and put in order. What followed was a journey of in-depth complex script analysis, prediction of character behavior, and font shaping and matching in reflowable ePub3. The team had to look up errors manually word by word, carry out corrections on each page and proofread the final pages all along the conversion route.
Broadly, the major hurdles we faced were:
- PDF character encoding: Our first speed breaker was the appearance of multiple aberrations in the character encoding order for the local Indian languages during PDF to ePub3 conversion. The PDF did not record the sequence and order of characters in words and sentences, which is a prerequisite for reflowable ePub3. This led to many mismatch errors that had to be examined in detail with character-by-character sequence analysis.
- Character mismatch: The native language fonts imposed many preconditions on how characters are sequenced with other forms of consonants, vowels, and ligatures, and these had to be handled to sync the logical order and the visual order of the text. The linguistic, phonetic and graphical order was incorrect in the representation of many words and sentences.
- Font/character mapping: The shaping features of characters, and their composition and decomposition, were inconsistent and had to be constantly monitored across all the book pages with respect to the universal shaping engine. Continuous re-testing was needed to make sure the font rules and specifications were being applied successfully.
- Manual proofreading: The large number of character encoding errors made the ePub conversion process very slow and time-consuming, with a heavy dependence on manual proofreading. Making the eBooks error-free reflowable ePub3 was taking the team days of going back and forth over character-wise validation by hand. Even OCR did not give a 100 percent error-free document and required manual re-checking.
- Formatting errors: There were many challenges with the overall synchronized reading order, the positioning of tables, superscripts, subscripts, header and footer format, and image layout.
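The logical versus visual order mismatch can be illustrated with one well-known Devanagari case. The vowel sign I (U+093F) is drawn to the left of its consonant but must be stored after it, so text read off the page left to right comes out in the wrong sequence. The Python sketch below, with the hypothetical helper `visual_to_logical`, moves that one sign behind the consonant cluster that follows it and then applies Unicode NFC normalization so that composed and decomposed spellings compare equal. It is a simplified illustration that assumes only this single reordering rule and the basic consonant range U+0915 to U+0939; it does not describe the Kitaboo tool, and a real converter would need many more rules per script.

```python
import unicodedata

PRE_BASE_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn left of its consonant
VIRAMA = "\u094D"          # joins consonants into a conjunct
NUKTA = "\u093C"           # dot that modifies the preceding consonant


def is_consonant(ch):
    """True for the basic Devanagari consonants (assumed range for this sketch)."""
    return "\u0915" <= ch <= "\u0939"


def visual_to_logical(text):
    """Move each vowel sign I from before its consonant cluster to after it."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == PRE_BASE_MATRA and i + 1 < n and is_consonant(text[i + 1]):
            j = i + 1
            # Walk over the cluster: consonant, optional nukta, then
            # any number of (virama + consonant) continuations.
            while j < n and is_consonant(text[j]):
                j += 1
                if j < n and text[j] == NUKTA:
                    j += 1
                if j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                    j += 1
                else:
                    break
            out.append(text[i + 1:j])  # the cluster first ...
            out.append(ch)             # ... then the vowel sign
            i = j
        else:
            out.append(ch)
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


# Visual order as read off the page: sign I, SA, virama, THA, sign I, TA
visual = "\u093F\u0938\u094D\u0925\u093F\u0924"
logical = visual_to_logical(visual)
```

Text that is already in logical order passes through unchanged, because the vowel sign is then never followed directly by a consonant it belongs to in this example.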
The speed we were working at was simply unviable, and we had to find a faster way of working without compromising on quality.
The result: the Kitaboo team developed an innovative character encoding tool to support automated conversion of eBooks in native languages.
Books written in native languages need ePub super-specialists to solve the character encoding maze. The Kitaboo team, a pro at eBook publishing, succeeded in solving this cumbersome task for the world’s largest online book publisher, cutting their time to market by more than 70 percent.