Pencil Down

September 24, 2013

Hello all

This is going to be one of my last posts under the GSoC 2013 tag. The Google Summer of Code program is coming to an end, with today being the 'strict pencil down' date. The project was about improving the textLayer of PDF.js.

My work started with fixing the canvas methods so that they generate complete details about the text (position, height, width, angle and so on), using the formulas given in the project proposal. After that I worked on the textLayer code to make use of that information and place the textLayer divs above the canvas. I worked on the vertical text issue separately; it needed a few more adjustments once the canvas and the textLayer overlapped correctly.

Then I started working on the generator code, where I needed to implement various parser operators to get the text position, angle and direction. This was the most important and most challenging part of the project, as the code had to be built from scratch. I had to read a lot of documentation to find the right formulas, and debug test PDFs to make sure all the required operators were implemented. I had to manually run tests on all the PDF documents in master/test/pdfs/ to make sure the final code was good to go. When the final code for this issue landed it was a big relief, as it will improve textLayer rendering and make parsing easier. Because of the submitted patch, we can now remove many lines of code from the canvas part of PDF.js.

The project is still in progress, as I was not able to attend to the issue of text lines appearing on one single line when copy-pasted. But because of the patch already submitted, resolving this issue won't be hard now.

I have published the code on my gh-pages. You can take a look through my GitHub account, which is https://github.com/SSk123, or view the published code directly at http://ssk123.github.io/pdf.js/

In the end, I want to thank my mentors Yury, Brendan and Bill from the bottom of my heart. If it were not for you guys, I would not have made these contributions. I really admire the patience with which my mentors have assisted and guided me. I just want to say that you guys are really the coolest people I have talked to, and it was such a pleasure and an honor to work and learn under you. I look forward to making more contributions to PDF.js and to Mozilla as a whole.

Signing off .. 🙂

Filed under TecQi Tagged with Firefox, GSoC 2013, PDF.js

Text Extraction Code for the TextLayer

September 23, 2013

Hello all

Building the TextLayer through the extract code needed certain formulas to be implemented. As mentioned in the previous post, Td, TD, Q, q, cm, BT and a few other operators had to be implemented to do the job. The formulas for a few of these operators can be seen below.

BT operator

TD and Td operators

Tm and T* operators

Q, q, cm operator

The position of the rendered text is calculated with a matrix formula, which is given below.

The above formulas lay the groundwork for how the text parser operators work.
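To make the operator rules concrete, here is a minimal sketch of a text-state tracker, written from the operator definitions in PDF 32000-1:2008 rather than from the code that landed in PDF.js. The class name `TextStateTracker`, the method `Tstar` (standing in for `T*`) and the helper `multiply` are my own hypothetical names. The sketch assumes horizontal scaling of 1 and a text rise of 0, and only saves the CTM and the leading on `q`.

```javascript
// A matrix [a, b, c, d, e, f] stands for the 3x3 matrix
// [a b 0; c d 0; e f 1] (row-vector convention of PDF 32000-1:2008).
const IDENTITY = [1, 0, 0, 1, 0, 0];

// Returns the matrix product m x n.
function multiply(m, n) {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5],
  ];
}

// Hypothetical tracker; not the PDF.js implementation.
class TextStateTracker {
  constructor() {
    this.ctm = IDENTITY.slice();        // current transformation matrix
    this.textMatrix = IDENTITY.slice(); // Tm
    this.lineMatrix = IDENTITY.slice(); // Tlm
    this.leading = 0;                   // Tl
    this.stack = [];
  }
  // q: save the graphics state (simplified to CTM and leading).
  q() { this.stack.push({ ctm: this.ctm.slice(), leading: this.leading }); }
  // Q: restore the most recently saved graphics state.
  Q() {
    const saved = this.stack.pop();
    if (saved) { this.ctm = saved.ctm; this.leading = saved.leading; }
  }
  // cm: concatenate a matrix onto the CTM.
  cm(a, b, c, d, e, f) { this.ctm = multiply([a, b, c, d, e, f], this.ctm); }
  // BT: begin a text object, resetting Tm and Tlm to identity.
  BT() { this.textMatrix = IDENTITY.slice(); this.lineMatrix = IDENTITY.slice(); }
  // Td: move to the start of the next line, offset by (tx, ty).
  Td(tx, ty) {
    this.lineMatrix = multiply([1, 0, 0, 1, tx, ty], this.lineMatrix);
    this.textMatrix = this.lineMatrix.slice();
  }
  // TD: same as Td, but also sets the leading to -ty.
  TD(tx, ty) { this.leading = -ty; this.Td(tx, ty); }
  // Tm: set Tm and Tlm directly.
  Tm(a, b, c, d, e, f) {
    this.textMatrix = [a, b, c, d, e, f];
    this.lineMatrix = [a, b, c, d, e, f];
  }
  // T*: move to the next line using the current leading.
  Tstar() { this.Td(0, -this.leading); }
  // Text rendering matrix: [fontSize 0 0 fontSize 0 0] x Tm x CTM.
  renderingMatrix(fontSize) {
    const scale = [fontSize, 0, 0, fontSize, 0, 0];
    return multiply(multiply(scale, this.textMatrix), this.ctm);
  }
}
```

Calling the methods in the order the operators appear in a content stream, and reading `renderingMatrix(fontSize)` whenever text is shown, gives the position and scale of each text run. The last two entries of the returned matrix are the x and y of the text origin.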

The implementation and the details of the code base can be seen here.

Till then happy coding 🙂

Filed under TecQi Tagged with Firefox, GSoC 2013, PDF.js

Extract Code working on TextLayer

September 8, 2013

Hello all

As per the timeline for my project, the next job is to implement the TextLayer by applying the transform matrix in the getTextContent code, that is, the extract code.

Before getting further into it, I would like to show how a PDF document looks to the parser parsing it. It's really amazing how many lines of code lie behind the document you see in a viewer.

A PDF document can be uncompressed using the pdftk toolkit. Download and install the latest version of pdftk from pdftk-download

The command line for uncompressing a pdf-document in terminal is:

pdftk doc.pdf output doc.unc.pdf uncompress

Our job is to implement operators such as TD, Td, T*, cm, q, Q and Tm in our getTextContent code to get the transformation matrix from the parser.

The function of each operator was taken from the PDF reference, PDF 32000-1:2008.

The implementation of the transform matrix is still in progress; the diff file and other details will be added to my blog shortly.
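To give a feel for what the parser sees, here is a small sketch of a tokenizer that splits a content-stream fragment into operators and their operands. The function name `parseContentStream` is hypothetical, and any stream you feed it is your own sample, not taken from a real document. It is deliberately simplified: it only understands numbers, /Names and (literal strings) without nested parentheses or escapes, and it ignores arrays, dictionaries and hex strings.

```javascript
// Splits a content-stream fragment into { operator, operands } records.
// In a PDF content stream the operands come first and the operator last.
function parseContentStream(source) {
  // Order matters: strings, then names, then numbers, then operators.
  const tokenPattern =
    /\([^()\\]*\)|\/[^\s()\/]+|[-+]?(?:\d+\.?\d*|\.\d+)|[A-Za-z'"*][A-Za-z0-9'"*]*/g;
  const operations = [];
  let operands = [];
  for (const token of source.match(tokenPattern) || []) {
    if (token[0] === '(') {
      operands.push(token.slice(1, -1)); // literal string without parentheses
    } else if (token[0] === '/') {
      operands.push(token);              // name, kept with its leading slash
    } else if (/^[-+.\d]/.test(token)) {
      operands.push(parseFloat(token));  // number
    } else {
      operations.push({ operator: token, operands });
      operands = [];
    }
  }
  return operations;
}
```

For a fragment such as `BT /F1 12 Tf 72 712 Td (Hello world) Tj ET`, this yields one record per operator, so a loop over the result can dispatch on `operator` and pass `operands` to whatever handles Td, Tm and the rest.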

Till then Happy Coding 🙂

Filed under TecQi Tagged with Fire, GSoC 2013, PDF.js

Vertical Text Fix for PDF.js

September 8, 2013

Hello all. I have fallen well behind schedule in posting updates about my work on my current GSoC project. My sincere apologies.

After my mentors and I established a good interaction between the canvas and the div textLayer floating over it, we realized we had overlooked one fact: there was still a small portion of the textLayer that was not in its place, namely the rotated/vertical text.

So our next job was to edit the canvas function createTextGeometry so that it handles text at different angles. Initially we focused only on CJK (Chinese/Japanese/Korean) text, which is written vertically, so the job was just to rotate the text by 90DEG. But then we created a sample PDF document with text at angles of 45DEG, 20DEG and 90DEG. The goal became a fix that works not just for CJK text but for all sorts of rotated text, so that the textLayer coincides with the canvas even when the page is rotated by (n * 90)DEG.

So the work was to position the text as it is visible on the canvas. The diff for this problem can be found here.

Our next job is to fix the textLayer so that it is built by the parser reading the PDF document.

I will post the details in the coming blog post.
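As a rough illustration of the idea (not the code from the diff), the rotation of a text run can be read from its transform matrix and turned into a CSS rotation for the matching div. The helper names `rotationDegrees` and `cssRotation` are hypothetical. The sketch assumes the transform is given in PDF user space, where y points up, so the sign is flipped for CSS, where y points down and positive angles turn clockwise.

```javascript
// Returns the rotation angle, in degrees, encoded in a 2D transform
// [a, b, c, d, e, f]. The x axis (1, 0) is mapped to (a, b), so the
// angle is atan2(b, a); uniform scaling does not change the result.
function rotationDegrees(transform) {
  const a = transform[0];
  const b = transform[1];
  return Math.atan2(b, a) * 180 / Math.PI;
}

// Builds a CSS transform value for a text div (hypothetical helper).
// Assumes the input matrix is in PDF user space (y up), hence the sign flip.
function cssRotation(transform) {
  const angle = -rotationDegrees(transform);
  return 'rotate(' + angle.toFixed(2) + 'deg)';
}
```

With this, vertical CJK text is not a special case: text rotated by 20, 45 or 90 degrees goes through the same path, and only the matrix differs.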

Till then Happy Coding 🙂

Filed under TecQi Tagged with Firefox, GSoC 2013, PDF.js

Resolving the messed up PDF while rotation

June 28, 2013

Hello, sorry, but I have been a little behind schedule in updating my progress on my GSoC project. As mentioned in the previous blog post, my mentors and I decided to take up one of the major issues first, so I started working on fixing the rotation issue #2095.

The text in a PDF is painted on the PDF.js canvas, but to let users select and copy text (including rotated text), the PDF.js team came up with an innovative idea: textLayer div elements that float over the canvas. This textLayer makes the job easier. Creating an entirely separate textLayer is a lot of work, as you might assume, but the end result is amazing.

PDF.js has already made lives easier 🙂, but there is an issue where the mounted textLayer does not float quite where it should, and that is where I am coming to the rescue 😉.

The text gets messed up on rotation because the textLayer does not overlap the canvas exactly, and things go really bad when we start rotating the canvas.

Below is a small sample of our tragedy.

Well, I have been busy with the code, trying to find out why things are not working out between the canvas and the textLayer, and I found that our 'dear' canvas has not been truthful enough to the 'poor' textLayer, but I must say 'It was strictly platonic' 😛

The canvas function needed angle handling so that rotation and rotated text are treated as expected. After that, the textLayer needs to be informed of those values and transformed accordingly.

Well this week I sent a pull request consisting of around 26 commits, which I squashed to 1 commit (so that it looks good :-P), and the textLayer is looking pretty good for rotated text now. Here is a sample of the same.

The pull request with the diff file can be seen here –> DIFF FILE

By the end of this week, Yury (my coolest mentor) asked me to learn how to run the tests and add new ones to test_manifest.json. So that is probably the goal for now, before the pull request can be merged into master.

Well, to sum up, I will tell you about my mentor Yury. He is such a great guy and has been so supportive. He's one of the coolest Mozilla folks (along with my other mentors). I am not a developer and did not have much experience with the language in the past, but Yury has always been patient and ready to help.

I am hoping for a great learning experience working with him.

Till then Happy Coding 🙂

Filed under TecQi Tagged with GSoC 2013, PDF.js

GSoC Community Bonding is ON !!

June 8, 2013

Hello all,

First of all, let me officially say it: 'I am a SoCian' 🙂 Congrats to all the other 20 members who got selected for GSoC and are helping Mozilla with their projects. The list of selected Mozilla student helpers is here.

The first half of June is dedicated by GSoC to the Community Bonding period, the time to get to know your mentors and discuss the project with them. My project is Improving the text selection and rotation in PDF.js. As the name says, it is a JavaScript project. My mentors Yury, Bill and Brendan have discussed and sorted out a plan for going about the project.

As my project proposal didn’t include a project time-line we decided to make one for starters and came up with this.

We have decided to work on the major issue first, that is, the text rotation issue #2095.

Hope all goes well 🙂 All the best to all participating in GSoC2013.

Filed under TecQi Tagged with GSoC 2013, PDF.js

Linting Js files for PDF.js

May 31, 2013

I was trying to run node make lint to lint JavaScript files for PDF.js and found an error saying ‘jshint not installed’. I went to the Mozilla PDF.js contributing wiki page

https://github.com/mozilla/pdf.js/wiki/Contributing#-4-run-lint-and-testing

which is a wiki page on how to run lint and tests on the PDF.js files.

The steps involve installing Syntastic if you are a VIM user, and since I am a Linux lover VIM is my favorite editor.

I faced a few problems while installing Syntastic by following the wiki page, so I did a little Google search and came up with a solution to fix the issue.

-> To install Syntastic for VIM you first need to install pathogen.vim

Open your Terminal and type the command

mkdir -p ~/.vim/autoload ~/.vim/bundle && \
curl -LSso ~/.vim/autoload/pathogen.vim https://tpo.pe/pathogen.vim

Create a ~/.vimrc file if you don't have one and add the following line to it

execute pathogen#infect()

Save and quit (:wq in command line)

-> Now you can install Syntastic as a pathogen bundle

Go to the directory

$ cd ~/.vim/bundle

Then clone Syntastic using the following command in the Terminal

$ git clone https://github.com/scrooloose/syntastic.git

-> Close all open VIM editors and open VIM in the Terminal

Type :Helptags in the command line

If you get an error do the above steps again.

Those were the steps to install Syntastic. Now we can look at how to install jshint.

If you have node.js properly installed on your system, you can install jshint directly with the following command

$ npm install jshint

And hence we are DONE.

Filed under TecQi Tagged with GSoC 2013, PDF.js

GSoC 2013 Project Idea Page

May 24, 2013

Improving text selection and rotation in PDF.js

Srishti

Email: srishsensation@gmail.com

Short description:

The PDF.js component has recently been integrated into the Firefox browser. Where rendering a PDF document earlier required third-party software like Adobe Acrobat Reader, this component brings a standard platform to parse and render PDF files within the web browser. The current proposal aims to improve the text layer in PDF.js. Some of the functionality in the text layer is broken, such as the improper behaviour of PDF.js during text selection, the insertion of an extra 'newline' character in between the text when it is copy-pasted from PDF.js, the rotation of PDF documents, and the highlighting of wrong text when searching with 'find'. The project also involves improving text layer formatting, such as adding font styles (italics, bold, h1) and the height/width of the text in the document.

Personal Details

Name:Srishti Srivastava

Email:srishsensation@gmail.com

Telephone:+91 xxx-xxx-1234

Other contact methods:Srishti’s Blog

Country of residence:India

Timezone:+5:30 UTC

Primary language:English

Main Proposal

The text layer in PDF.js is built up from div elements. A div element works as a container of data, as it is a block-level element. A block-level element may contain other block-level elements as well as inline elements. It defines a larger section of the page than inline elements do.
Generally, block-level elements are followed by a newline. Since the text layer in PDF.js is built from these elements, an extra newline is inserted at the end of each element. This is the reason for the improper behaviour of the text layer during selection, rotation, highlighting and copy-paste in PDF.js.

Issues related to the text layer:

Text Selection:

Wrong text selection: (issue #2769) The text selection layer creates an HTML div for each span of text. In some places the text layer divs don't match the displayed PDF text, so when text is selected on the HTML page, it sits at a different location on the text layer. This causes the wrong text to be selected.

Rotation of document:

Rotation causes the text layer to get messed up: (issue #2095) When the PDF document is rotated right or left, the text in the document gets garbled. The reason is that the text layer doesn't take into account the rotation angle of the text being printed. It assumes rotation_angle=0 DEG, which is not always correct. On rotation the text layer collapses and text runs from a specific symbol in the wrong direction.

Improper highlighting: (issue #2980) When the 'find' command is used in the Firefox browser, the text that gets selected is not the same as the text entered in the 'find' box. The reason is that in many places the text mapped onto the text layer element is not the same as that displayed in the PDF document. The details are mentioned here.

Copy/paste:

Text not in one line: (issue #2140) The text layer is built from div elements. These are block-level elements, so a 'new line' is inserted automatically at the end of each block. When text is selected from PDF.js, it is selected in the form of many divs, each ending with a newline character.

When this text is copied to another editor, each div element appears on its own line. This is the reason for the poor copy/paste behaviour of PDF.js. This is better elaborated in issue #2989. The issue in Bug 810636 will also be dealt with here.

The structure of the text layer is made up of many div elements. Since div is a block element, it should only be used to wrap a section of the document. The issue in PDF.js is that small text fragments are enclosed in div elements. A block element is always followed by a 'newline' character. This is why, when text is selected or highlighted, it is selected as one block followed by a newline.

To avoid this situation we wrap small portions of text in span elements. Spans are inline elements, so they take care of the unwanted 'newline' that appears when pasting text from PDF.js into an editor. Apart from that, text formatting can also be done efficiently using span tags.

Build the text layer structure from the top to handle text fragments: text runs of the same height/width/font should be merged together into span elements, and each line of text should be enclosed in a single div element.

Implement the div element structure in a way that calculates text runs from a particular start point to the end point during selection. This will help with the selection issue by removing the extra newline character inserted at the end of each block element.

This implementation will create an issue of poor spacing between paragraphs, which can be fixed by detecting paragraph changes and inserting a <br> tag.

The improvement in the use of block and inline elements is to be made in the text layer builder, by fixing append text.

The UI created for highlighting needs improvement. After the text layer is fixed, the highlightDiv, which handles text highlighting, will need changes, as highlighted text will now be in span elements. Since the text fragments are now formed of span elements, the old implementation will be updated to work with inline elements.

The canvas transformation matrix provides rotation information for PDF files. This needs to be fixed, as PDF.js only displays text correctly when the rotation angle is 0 DEG; the same should work for rotation angles of 90, 180 and 270 DEG, using a matrix algorithm.

The copy-paste issue will be resolved once the text layer is fixed. Run test cases to verify that the correct text is placed on the clipboard.
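The grouping described above could be sketched as follows. This is only an illustration of the idea, not existing PDF.js code: the function name `groupRuns` and the run fields `text`, `x`, `y`, `font` and `size` are hypothetical, and runs are assumed to share a line when their baselines differ by no more than a small tolerance.

```javascript
// Groups text runs into lines (one div each) and merges neighbouring runs
// with the same font and size into spans. Field names are hypothetical.
function groupRuns(runs, tolerance = 0.5) {
  const lines = [];
  for (const run of runs) {
    // Find an existing line whose baseline is close enough to this run.
    let line = lines.find((candidate) => Math.abs(candidate.y - run.y) <= tolerance);
    if (!line) {
      line = { y: run.y, runs: [] };
      lines.push(line);
    }
    line.runs.push(run);
  }
  // PDF y grows upward, so larger y means nearer the top of the page.
  lines.sort((first, second) => second.y - first.y);
  return lines.map((line) => {
    const ordered = line.runs.slice().sort((first, second) => first.x - second.x);
    const spans = [];
    for (const run of ordered) {
      const last = spans[spans.length - 1];
      if (last && last.font === run.font && last.size === run.size) {
        last.text += run.text; // same style: extend the current span
      } else {
        spans.push({ text: run.text, font: run.font, size: run.size });
      }
    }
    return { y: line.y, spans };
  });
}
```

Each returned entry would map to one div, with one span element per entry in `spans`, so a copied line stays on one line.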
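For the page rotations mentioned above, the matrix is simple because the sine and cosine of a multiple of 90 degrees are always 0, 1 or -1. Here is a minimal sketch with hypothetical helper names `quarterTurnMatrix` and `applyToPoint`. It uses the [a, b, c, d, e, f] layout and treats positive angles as counterclockwise in a y-up coordinate system, which is an assumption of this example.

```javascript
// Returns the transform [a, b, c, d, e, f] for a rotation by a multiple
// of 90 degrees, using exact 0 / 1 / -1 entries instead of Math.sin/cos.
function quarterTurnMatrix(degrees) {
  const turns = (((degrees / 90) % 4) + 4) % 4;
  if (!Number.isInteger(turns)) {
    throw new RangeError('rotation must be a multiple of 90 degrees');
  }
  const cos = [1, 0, -1, 0][turns];
  const sin = [0, 1, 0, -1][turns];
  return [cos, sin, -sin, cos, 0, 0];
}

// Applies a transform to the point (x, y): [x y 1] times the 3x3 matrix.
function applyToPoint(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}
```

A rotation about the page centre rather than the origin would also need translation terms in the last two entries; that part is left out here.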

Why Canvas and Why not SVG

An SVG (Scalable Vector Graphics) implementation is not covered by this project. The project does not make use of the SVG backend, as Firefox does not support it. An SVG backend may also not be as effective as the Canvas backend, as the latter provides more control over how text is selected. One of the reasons why the SVG backend is not applicable can be seen here.

Academic Experience

I am in my third year as an undergraduate in Computer Science Engineering at Amrita School of Engineering, Amritapuri, Kerala, India.

I have no commitments until the last week of July, so I will be able to devote 7 to 8 hours per day to the project. After that I will also have academic engagements, so I will only be able to work 5 to 6 hours a day. Apart from this, I will be taking a small break of 3 to 4 days in the middle of August and of September, as I will be having my college exams.

Why Mozilla

Mozilla is a pioneer in developing open source software. I admire Firefox, an open source web browser that is developed from the ground up to support open internet standards across a variety of platforms and is one of the most popular browsers in the world. My first contribution to open source was to Mozilla. The team has been really helpful and friendly to me. I want to contribute to PDF.js to improve the implementation and fix the issues of the component. PDF.js is a vital component of Firefox. It has eliminated the need to download third-party software to view PDF documents in Firefox. With this it has made Firefox a complete web browser.

Filed under TecQi Tagged with GSoC 2013, Mozilla