GSoC 2013 Project Idea Page

 

Improving text selection and rotation in PDF.js

Srishti

Email: srishsensation@gmail.com

Short description:

PDF.js has recently been integrated into the Firefox browser. Where rendering a PDF document earlier required third-party software such as Adobe Acrobat Reader, this component provides a standard platform to parse and render PDF files within the web browser. This proposal aims to improve the text layer in PDF.js. Some text layer functionality is broken: text selection behaves incorrectly, an extra ‘newline’ character is inserted into text that is copied and pasted from PDF.js, rotating a PDF document garbles its text, and the wrong text is highlighted when searching with ‘find’. The project also involves improving text layer formatting, such as adding font styles (italics, bold, h1) and the height/width of the text in the document.

Personal Details

Name: Srishti Srivastava

Email: srishsensation@gmail.com

Telephone: +91 xxx-xxx-1234

Other contact methods: Srishti’s Blog

Country of residence: India

Timezone: UTC+5:30

Primary language: English

Main Proposal

The text layer in PDF.js is built from div elements. A div is a block-level element and acts as a container: it may contain other block-level elements as well as inline elements, and it defines a larger section of the page than an inline element does.
Block-level elements generally start on a new line and are followed by a line break. Since the text layer in PDF.js is built from these elements, an extra newline is inserted at the end of each element. This is the reason for the improper behaviour of the text layer during selection, rotation, highlighting and copy-paste in PDF.js.
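The newline effect can be illustrated with a small model. This is a minimal sketch, not PDF.js or browser code: the node shape `{ display, text }` and the function `serializeSelection` are assumptions made up for the illustration, and real browsers apply more detailed rules when copying a selection.

```javascript
// Hypothetical model: each node is { display: "block" | "inline", text: string }.
// It only imitates the line break that a block-level box adds to copied text.
function serializeSelection(nodes) {
  let out = "";
  for (const node of nodes) {
    out += node.text;
    // A block-level box ends its line, so the copied text gets a line break.
    if (node.display === "block") {
      out += "\n";
    }
  }
  return out;
}

// The same two text fragments, once as block elements and once as inline ones.
const asDivs = [
  { display: "block", text: "Hello " },
  { display: "block", text: "world" },
];
const asSpans = [
  { display: "inline", text: "Hello " },
  { display: "inline", text: "world" },
];

console.log(JSON.stringify(serializeSelection(asDivs)));  // "Hello \nworld\n"
console.log(JSON.stringify(serializeSelection(asSpans))); // "Hello world"
```

With block elements every fragment ends in a line break, while the inline version keeps the fragments on one line.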

Issues related to the text layer:

Text Selection:

Wrong text selection: (issue #2769) The text selection layer creates an HTML div for each run of text. In some places the text layer divs do not match the position of the displayed PDF text, so when text is selected on the page, the selection falls on a different location in the text layer and the wrong text is selected.
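One way to bring a text layer element in line with the rendered text is to stretch it horizontally until both have the same width. The sketch below assumes the two widths have already been measured; the helper name `horizontalScale` and the numbers are hypothetical and are not taken from the PDF.js source.

```javascript
// Hypothetical helper: returns the horizontal scale factor that makes an HTML
// text fragment as wide as the same text drawn on the canvas.
function horizontalScale(canvasWidth, domWidth) {
  if (domWidth <= 0) {
    return 1; // nothing measurable, leave the fragment unscaled
  }
  return canvasWidth / domWidth;
}

// Made-up example: the canvas draws the run 120px wide, while the browser
// lays out the same text 100px wide in its fallback font.
const scale = horizontalScale(120, 100);
const cssTransform = "scaleX(" + scale + ")";
console.log(cssTransform); // scaleX(1.2)
```

The resulting string could be applied as a CSS transform on the element, so that the selectable text covers the same area as the visible text.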

Rotation of document:

Rotation causes text layer to get messed up: (issue #2095) When a PDF document is rotated right or left, the text in the document gets garbled. The reason is that the text layer does not take into account the rotation angle of the text being printed. It assumes rotation_angle = 0 DEG, which is not always correct. On rotation the text layer collapses and the text runs from a specific symbol in the wrong direction.
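To make the rotation problem concrete, the following sketch maps the position of a text run on an unrotated page to its position after the page is rotated clockwise by a multiple of 90 degrees. It assumes screen coordinates with the origin at the top-left corner and y growing downward; `rotatePoint` is a hypothetical helper, not a PDF.js function.

```javascript
// Hypothetical helper: (x, y) is a point on the unrotated page, pageWidth and
// pageHeight are the unrotated page dimensions, rotation is clockwise degrees.
function rotatePoint(x, y, pageWidth, pageHeight, rotation) {
  // Normalise the angle so that negative values such as -90 also work.
  const angle = ((rotation % 360) + 360) % 360;
  switch (angle) {
    case 0:
      return { x: x, y: y };
    case 90:
      return { x: pageHeight - y, y: x };
    case 180:
      return { x: pageWidth - x, y: pageHeight - y };
    case 270:
      return { x: y, y: pageWidth - x };
    default:
      throw new Error("rotation must be a multiple of 90 degrees");
  }
}

// A run placed at (100, 50) on a 600 x 800 page.
console.log(rotatePoint(100, 50, 600, 800, 90));  // x: 750, y: 100
console.log(rotatePoint(100, 50, 600, 800, 180)); // x: 500, y: 750
console.log(rotatePoint(100, 50, 600, 800, 270)); // x: 50,  y: 500
```

A text layer that keeps using the unrotated coordinates corresponds to the `case 0` branch only, which is why the text ends up in the wrong place for the other three angles.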

Improper highlighting: (issue #2980) When the ‘find’ command is used in the Firefox browser, the text that gets highlighted is not the same as the text entered in the ‘find’ box. The reason is that in many places the text mapped onto the text layer elements is not the same as the text displayed in the PDF document. The details are mentioned here.
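Correct highlighting depends on mapping a match in the page text back to the text layer elements that hold it. The sketch below shows one possible mapping, assuming the page text is the plain concatenation of the fragment strings; `positionOf` and `locateMatch` are hypothetical names and do not describe how PDF.js implements its find feature.

```javascript
// Hypothetical helper: converts a character index in the concatenated text
// into a fragment index plus an offset inside that fragment.
function positionOf(fragments, charIndex) {
  let consumed = 0;
  for (let i = 0; i < fragments.length; i++) {
    const len = fragments[i].length;
    if (charIndex < consumed + len) {
      return { fragment: i, offset: charIndex - consumed };
    }
    consumed += len;
  }
  return null; // index lies beyond the last fragment
}

// Finds the first occurrence of query and reports where it begins and ends.
// The end offset is exclusive, like the end argument of String.prototype.slice.
function locateMatch(fragments, query) {
  if (query.length === 0) {
    return null;
  }
  const start = fragments.join("").indexOf(query);
  if (start === -1) {
    return null;
  }
  const first = positionOf(fragments, start);
  const last = positionOf(fragments, start + query.length - 1);
  return {
    begin: first,
    end: { fragment: last.fragment, offset: last.offset + 1 },
  };
}

// The word "text" is split across two fragments here.
const fragments = ["Improving te", "xt selec", "tion"];
console.log(JSON.stringify(locateMatch(fragments, "text")));
// {"begin":{"fragment":0,"offset":10},"end":{"fragment":1,"offset":2}}
```

If the strings stored in the text layer differ from what is drawn on the page, the offsets computed this way point at the wrong characters, which matches the behaviour described in the issue.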

Copy/paste:

Text not in one line: (issue #2140) The text layer is built from div elements, which are block-level elements. Therefore a ‘newline’ is inserted automatically at the end of each block. When text is selected in PDF.js, it is selected as a series of divs, each ending with a newline character.

When this text is pasted into another editor, each div appears on its own line. This is the reason for the poor copy/paste behaviour of PDF.js, which is described in more detail in issue #2989. The issue in Bug 810636 will also be dealt with here.

The text layer structure is made up of many div elements. Since div is a block element, it should only be used to wrap a section of a document. The problem in PDF.js is that small text fragments are enclosed in div elements, and a block element is always followed by a ‘newline’ character. This is why selected or highlighted text takes the form of one block followed by a newline.

  • To avoid this situation, we wrap small portions of text in span elements. Spans are inline elements, so they take care of the unwanted ‘newline’ that appears when text is pasted from PDF.js into an editor. In addition, text formatting can be done efficiently using span tags.
  • Build the text layer structure from the top down to handle text fragments: text runs of the same height/width/font should be merged together into span elements, and each line of text should be enclosed in a single div element.
  • Implement the div element structure so that it calculates text runs from a particular start point to the end point during selection. This will improve selection by removing the extra newline character inserted at the end of each block element.
  • This implementation will cause poor spacing between paragraphs, which can be fixed by detecting a paragraph change and inserting a <br> tag.
  • The improved use of block and inline elements is to be implemented in the text layer builder, by fixing append text.
  • The UI created for highlighting needs improvement. After the text layer is fixed, the highlightDiv, which handles text highlighting, will need changes, as highlighted text will then consist of span elements. The old implementation will be updated to work with inline elements.
  • The canvas transformation matrix provides rotation information for PDF files. The text layer needs to be fixed in this respect, as PDF.js only displays text correctly when the rotation angle = 0 DEG; the same should work for rotation angles of 90, 180 and 270 DEG using a matrix algorithm.
  • The copy/paste issue will be resolved once the text layer is fixed. Test cases will be run to verify that the correct text is placed on the clipboard.
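The merging and paragraph detection steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the input shape `{ text, y, height, font }`, the exact-match comparison of baselines, the gap threshold and the function names `buildLines` and `paragraphBreaks` are all hypothetical, and a real implementation would need tolerances and would work on DOM elements rather than plain objects.

```javascript
// Hypothetical input: text runs in reading order, each { text, y, height, font },
// where y is the baseline position and grows downward.
// Output: one entry per line (a future div), each holding merged spans.
function buildLines(runs) {
  const lines = [];
  for (const run of runs) {
    let line = lines[lines.length - 1];
    // Start a new line whenever the baseline changes.
    if (!line || line.y !== run.y) {
      line = { y: run.y, height: 0, spans: [] };
      lines.push(line);
    }
    line.height = Math.max(line.height, run.height);
    const span = line.spans[line.spans.length - 1];
    // Merge into the previous span when font and height are identical.
    if (span && span.font === run.font && span.height === run.height) {
      span.text += run.text;
    } else {
      line.spans.push({ text: run.text, font: run.font, height: run.height });
    }
  }
  return lines;
}

// Returns the indexes of lines that start a new paragraph: the vertical gap to
// the previous line is larger than factor times the previous line's height.
function paragraphBreaks(lines, factor) {
  const breaks = [];
  for (let i = 1; i < lines.length; i++) {
    const gap = lines[i].y - lines[i - 1].y;
    if (gap > factor * lines[i - 1].height) {
      breaks.push(i);
    }
  }
  return breaks;
}

const runs = [
  { text: "Improving ", y: 100, height: 12, font: "serif" },
  { text: "text ", y: 100, height: 12, font: "serif" },
  { text: "selection", y: 100, height: 12, font: "serif-bold" },
  { text: "in PDF.js", y: 114, height: 12, font: "serif" },
  { text: "Next paragraph", y: 150, height: 12, font: "serif" },
];
const lines = buildLines(runs);
console.log(lines.length);               // 3 lines
console.log(lines[0].spans.length);      // 2 spans: "Improving text " and "selection"
console.log(paragraphBreaks(lines, 1.5)); // line index 2 starts a new paragraph
```

Each line would become one div containing its spans, and a <br> tag could be inserted before every line index returned by `paragraphBreaks`.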

Why Canvas and Why not SVG

An SVG (Scalable Vector Graphics) implementation is not covered by this project. The project does not make use of an SVG backend, as Firefox does not support it. An SVG backend may also not be as effective as the Canvas backend, as the latter provides more control over how text is selected. One of the reasons why an SVG backend is not applicable can be seen here.

Academic Experience

I am in my third year as an undergraduate in Computer Science Engineering at Amrita School of Engineering, Amritapuri, Kerala, India.

I have no commitments until the last week of July, so I will be able to devote 7 to 8 hours per day to the project. After that I will also have academic engagements, so I will only be able to work 5 to 6 hours a day. Apart from this, I will take a short break of 3 to 4 days in the middle of August and September for my college exams.

Why Mozilla

Mozilla is a pioneer in developing open source software. I admire Firefox, an open source web browser that is developed from the ground up to support open internet standards across a variety of platforms and is one of the most popular browsers in the world. My first contribution to open source was to Mozilla, and the team has been really helpful and friendly to me. I want to contribute to PDF.js to improve its implementation and resolve its open issues. PDF.js is a vital component of Firefox. It has eliminated the need to download third-party software to view PDF documents in Firefox, and with this it has made Firefox a complete web browser.
