Archive for the 'How-To's' Category

Time to crawl back! Download Google Groups using a crawler

For a research project, I needed an archive of a Google Group forum. Unfortunately, I had never been a member of the group and thus had not received any group messages. I tried to find a way to download an archive from the site but couldn’t find one.

However, Google Groups offers an original version of each message (click on "More options" on the right of a message, then on "Show Original"). There is even a link that gives you the original message source (click on "Show only message text" on the page opened before). I needed a way to collect these pages and save them to a file.

The program I found is called Web-Harvest and calls itself an "Open Source Web Data Extraction tool". It lets you control the behavior of the crawler using XML scripts and Java expressions (it uses BeanShell internally), and extract values within a page (for example links to follow) with XPath, XSLT, regular expressions, XQuery or even Java code. So I started setting up a script, and here is what came out:

Instructions

What you need to download Google Groups forum messages:

  1. Download, install and run Web-Harvest (it requires Java). On most platforms you should be able to start it by double-clicking on the .jar file.
  2. Copy & paste the following script into the editor (UPDATE: The script is also available at http://pastebin.com/Xe6f4s9s):
    <!--
        A Web-Harvest script that crawls through the pages of a Google Group forum
        and saves the messages as an mbox file.

        By: Nils Kaiser (blog.nils-kaiser.de)
        This work is licensed under the Creative Commons Attribution 3.0 Unported License.
        To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/legalcode
        Please mention my name and blog address as above if you reuse or distribute this work.

        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
        ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
        LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
        CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
        SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
        INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
        CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
        ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
        POSSIBILITY OF SUCH DAMAGE.
    -->

    <config charset="UTF-8">

        <!-- *** EDIT: set Google Group forum to crawl (main page of a discussion);
             make sure that the params gvc=2 and hl=en are used -->
        <var-def name="discussionMainUrl" overwrite="false">http://groups.google.com/group/youtube-api-gdata/topics?hl=en&amp;gvc=2</var-def>

        <!-- *** EDIT: set output file name -->
        <var-def name="outputFile" overwrite="false">c:/output/output.mbox</var-def>

        <!--
            function download-multipage-list, removed typos
            adapted from http://web-harvest.sourceforge.net/samples.php?num=0
            Copyright © 2006 by vnikic at users.sourceforge.net
        -->
        <function name="download-multipage-list">
            <return>
                <while condition="${pageUrl.toString().trim().length() > 0}" maxloops="${maxloops}" index="i">
                    <empty>
                        <var-def name="content">
                            <html-to-xml>
                                <http url="${pageUrl}"/>
                            </html-to-xml>
                        </var-def>

                        <var-def name="nextLinkUrl">
                            <xpath expression="${nextXPath}">
                                <var name="content"/>
                            </xpath>
                        </var-def>

                        <var-def name="pageUrl">
                            <case>
                                <if condition="${nextLinkUrl.toString().length() > 0}">
                                    <template>${sys.fullUrl(pageUrl, nextLinkUrl.toString().trim())}</template>
                                </if>
                            </case>
                        </var-def>
                    </empty>

                    <xpath expression="${itemXPath}">
                        <var name="content"/>
                    </xpath>
                </while>
            </return>
        </function>

        <!-- from line included, thunderbird does not care about receiver and
             date is only used to check for new mail, i.e. useless here.
             see http://www.qmail.org/man/man5/mbox.html -->
        <var-def name="fromLine">From - Sat Jul 12 19:52:07 2008</var-def>

        <!-- clear output file as we use append later -->
        <file action="write" type="text" path="${outputFile}"/>

        <!-- collects all thread urls -->
        <var-def name="threadUrls">
            <call name="download-multipage-list">
                <call-param name="pageUrl"><var name="discussionMainUrl"/></call-param>
                <call-param name="nextXPath">//div[@class='maincontbox']//a[contains(.,'Older')]/@href</call-param>
                <call-param name="itemXPath">//div[@class='maincontoutboxatt']//a[contains(@href,'browse_thread')]/@href</call-param>
                <call-param name="maxloops"></call-param>
            </call>
        </var-def>

        <!-- open each thread -->
        <loop item="threadUrl" index="i" filter="unique">
            <list>
                <var name="threadUrls"/>
            </list>
            <body>
                <empty>

                    <!-- absolutize thread url -->
                    <var-def name="threadUrlFull">
                        <template>${sys.fullUrl(discussionMainUrl, threadUrl.toString().trim())}</template>
                    </var-def>

                    <!-- collect all original messages urls -->
                    <var-def name="originalMsgUrls">
                        <call name="download-multipage-list">
                            <call-param name="pageUrl"><var name="threadUrlFull"/></call-param>
                            <call-param name="nextXPath">//div[@class='maincontbox']/table[1]//nobr[@id='thread_page_links_site']/a[contains(.,'Newer >')]/@href</call-param>
                            <call-param name="itemXPath">//div[@class='exh']//a[.='Show original']/@href</call-param>
                            <call-param name="maxloops"></call-param>
                        </call>
                    </var-def>

                    <!-- loop through messages in thread (original view) -->
                    <loop item="originalMsgUrl" index="j" filter="unique">
                        <list>
                            <var name="originalMsgUrls"/>
                        </list>
                        <body>
                            <empty>

                                <!-- get original message -->
                                <var-def name="originalMsgContent">
                                    <http url="${sys.fullUrl(discussionMainUrl, originalMsgUrl.toString().trim() + &quot;&amp;output=gplain&quot;)}"/>
                                </var-def>

                                <!-- need to quote lines starting with "From ", ">From ", ">>From ", ">>>From "…
                                     see http://www.qmail.org/man/man5/mbox.html
                                -->
                                <var-def name="originalMsgContentQuoted">
                                    <regexp replace="true">
                                        <regexp-pattern>(?m)^([>]*From )</regexp-pattern>
                                        <regexp-source>
                                            <var name="originalMsgContent"/>
                                        </regexp-source>
                                        <regexp-result>
                                            <template>&gt;${_1}</template>
                                        </regexp-result>
                                    </regexp>
                                </var-def>

                                <!-- process message content -->
                                <var-def name="originalMsgContentProcessed">
                                    <template>${"" + fromLine + "\n" + originalMsgContentQuoted + "\n\n"}</template>
                                </var-def>

                                <!-- append content of <pre> to file -->
                                <file action="append" type="binary" path="${outputFile}">
                                    <var name="originalMsgContentProcessed"/>
                                </file>

                            </empty>
                        </body>
                    </loop>
                </empty>
            </body>
        </loop>
    </config>
  3. Change the URL of the Google Groups forum to download and adjust the path of the output file (see the sections marked "*** EDIT" above).
    The URL must point to the discussion page (be careful with Google Groups forums containing multiple discussion categories!). You also need to make sure that the URL contains the "&gvc=2" and "&hl=en" parameters, as the script can only parse the English version of the list thread view.
  4. Hit the play button in the toolbar, minimize the window and get back to whatever you were doing before… this might take a while! Also, make sure that you read the Limitations below before you start.

The instructions are provided for research purposes only, see disclaimer in the script above.
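
If you prepare start URLs for several groups, a small helper can make sure the two required parameters are always present. This is only a sketch in Python, not part of the Web-Harvest script; the function name `with_required_params` is made up for illustration, and it assumes every query parameter appears at most once.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters the script above expects in the start URL (taken from step 3).
REQUIRED_PARAMS = {"hl": "en", "gvc": "2"}


def with_required_params(url: str) -> str:
    """Return url with hl=en and gvc=2 set, keeping other query parameters.

    Hypothetical helper: repeated query keys would be collapsed to one value.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(REQUIRED_PARAMS)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_required_params(
        "http://groups.google.com/group/youtube-api-gdata/topics"))
```

Remember that inside the XML script the "&" has to be written as "&amp;amp;", as in the example URL of the script.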

How it works

The script starts by collecting the urls of every thread by following the "Older" link on every page. It then collects the urls of every original message, this time following the "Newer" link. By appending "&output=gplain" to the url, it makes sure that the original message text page is requested.
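
The same idea, stripped of Web-Harvest specifics, looks roughly like this in Python. It is a sketch only: `collect_links`, `plain_text_url` and the `parse_page` callback are names invented for this example, and the callback stands in for the download plus XPath steps of the script.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import urljoin

# A page parser returns (item links on the page, link to the next page or None).
# In the script this role is played by <http>, <html-to-xml> and <xpath>.
PageParser = Callable[[str], Tuple[Iterable[str], Optional[str]]]


def collect_links(start_url: str, parse_page: PageParser,
                  max_pages: int = 1000) -> List[str]:
    """Follow 'next page' links and return absolute, de-duplicated item URLs."""
    seen = set()
    result: List[str] = []
    page_url: Optional[str] = start_url
    for _ in range(max_pages):
        if not page_url:
            break
        items, next_link = parse_page(page_url)
        for href in items:
            full = urljoin(page_url, href.strip())
            if full not in seen:
                seen.add(full)
                result.append(full)
        # An empty or missing next link ends the loop, like the <while> condition.
        page_url = urljoin(page_url, next_link.strip()) if next_link else None
    return result


def plain_text_url(show_original_url: str) -> str:
    """Append output=gplain, as the script does, to request the raw message."""
    return show_original_url + "&output=gplain"


if __name__ == "__main__":
    # Two fake pages instead of real network requests.
    fake_site = {
        "http://example.invalid/topics": (["/t/1", "/t/2"], "/topics?start=2"),
        "http://example.invalid/topics?start=2": (["/t/2", "/t/3"], None),
    }
    print(collect_links("http://example.invalid/topics", lambda u: fake_site[u]))
```

The `max_pages` argument plays the role of the maxloops parameter: it caps how many pages one run will visit.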

The script outputs a single file in mbox format, which can easily be imported into most e-mail clients (see this page for instructions). In order to respect the mbox convention, some quoting is done and a From line is added.
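
To make the quoting step concrete, here is a rough Python equivalent of what the script does to each message. It uses the same regular expression and the same placeholder From line as the script; the function names `quote_from_lines` and `to_mbox_entry` are invented for this sketch.

```python
import re

# Same placeholder separator line the script writes before every message.
FROM_LINE = "From - Sat Jul 12 19:52:07 2008"

# Lines starting with "From ", ">From ", ">>From ", ... (same pattern as the script).
_FROM_RE = re.compile(r"^(>*From )", re.MULTILINE)


def quote_from_lines(message: str) -> str:
    """Prefix every line matching '>*From ' with one additional '>'."""
    return _FROM_RE.sub(r">\1", message)


def to_mbox_entry(message: str) -> str:
    """Build one mbox entry: separator line, quoted message, blank lines."""
    return FROM_LINE + "\n" + quote_from_lines(message) + "\n\n"


if __name__ == "__main__":
    raw = "Subject: hello\n\nFrom now on this line needs quoting.\n>From here too.\n"
    print(to_mbox_entry(raw), end="")
```

Without the quoting, a mail client would mistake a body line that starts with "From " for the beginning of a new message.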

Limitations

The approach used has the following limitations:

  1. All email addresses in the messages are obfuscated using dots (…). However, I reckon it makes sense and was not a major drawback for my purpose anyway.
  2. Requests to the Google Groups forum site are blocked after a while when a large number of automated requests is recognized. You will notice that requests take a bit longer and that the responses are only about 3 KB in size.
    If you open the Google Groups page in your browser, you will see the page shown on the right.
    How to solve this? You could either modify the script to delay the requests, or you can manually crawl in sessions of a certain number of threads (edit the start URL - look up how it changes when you browse to older discussion pages - and edit the maxloops parameter of the first call to the download-multipage-list helper function).
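
The first option, delaying the requests, boils down to pausing between downloads. The Python sketch below shows the principle outside of Web-Harvest; `fetch_politely` is a made-up name, the 2 second default is an arbitrary assumption rather than a known safe value, and the `fetch` callback stands in for whatever actually downloads a page.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep
                   ) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) pairs, pausing between consecutive requests.

    delay_seconds is a guess; the limit Google Groups applies is not known here.
    """
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        yield url, fetch(url)


if __name__ == "__main__":
    # A fake fetch function keeps the example free of network access.
    pages = fetch_politely(["page-1", "page-2"],
                           fetch=lambda u: "content of " + u,
                           delay_seconds=0.1)
    for url, content in pages:
        print(url, "->", content)
```

Passing `sleep` as a parameter is only there so the pause can be replaced in tests; in normal use the default `time.sleep` is what you want.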

 

Hope this helps! Feel free to change the script and to notify me of any useful addition. To start changing the script, I recommend having a look at the user manual and the examples. Also have a look at some other uses here and here.

How to get back “Send to OneNote” on Vista x64

I decided that my first post was going to be a useful one, so here it is:

Several users have been complaining about the missing “Send to OneNote” functionality after installing Vista 64 (see here, here and the several comments on an MSDN blog). The response from Microsoft has been rather indifferent, as they decided not to fix it before the next version of OneNote (see here). As it was one of the features I used most, I was seriously thinking about switching back to Vista 32 before I found a solution…

The workaround presented here brings back the support for printing documents into OneNote. Here is a detailed How-To:

  1. Install Zan Image Printer. The software adds one (actually two) virtual printers that generate images of the printed documents. You can get it from the Zan Image Printer download page. The tool has a 30-day trial and the full version is available for around $50.
  2. Configure the Image Printer. The options are found by accessing the printer property page and going to the “Print Settings” page on the “General” tab (sorry, I’m on a German locale):

    Printer Properties

  3. Go to the “Image” tab, select TIFF and modify the settings according to your preferences (or follow my settings):

    Zan Image Printer Properties - Image

  4. And now the magic bit! Switch to the “Settings” tab and select “Application” from the list box on the left side. Browse to the OneNote executable (normally located in C:\Program Files (x86)\Microsoft Office\Office12). In the box “Parameters”, add the following text:
    /insertdoc "[%file]"

    Zan Image Printer Properties - Settings

  5. Try it out! Open a PDF and print it to the Zan Image Printer. The virtual printer should now generate the image and run OneNote, which shows the usual dialog:

    OneNote inserts printout

BTW, once you are satisfied with your settings, you can disable the Zan Printer dialog on the “Save” tab.

Hope this is useful for you and comes in time to prevent you from reinstalling your whole system!

And hey MS, how do you want to convince users to install Vista 64 when your own software is not compatible???