Time to crawl back! Download Google Groups using a crawler
For a research project, I needed an archive of a Google Group forum. Unfortunately, I had never been a member of the group and thus had not received any group messages. I looked for a way to download an archive from the site but couldn’t find one.
However, Google Groups offers an original version of each message (click on "More options" on the right of a message, then on "Show Original"). There is even a link that returns the original message source (click on "Show only message text" on the page opened before). I needed a way to collect these pages and save them to a file.
The program I found is called Web-Harvest and calls itself an "Open Source Web Data Extraction tool". It lets you control the behavior of the crawler using XML scripts and Java expressions (it uses BeanShell internally), and extract values within a page (for example, links to follow) with XPath, XSLT, regular expressions, XQuery or even Java code. So I started setting up a script, and here is what came out:
Instructions
What you need to download Google Groups forum messages:
- Download, install and run Web-Harvest (it requires Java). On most platforms you should be able to start it by double-clicking on the .jar file.
- Copy & paste the following script into the editor:
<!--
  A Web-Harvest script that crawls through the pages of a Google Group forum
  and saves the messages as an mbox file.

  By: Nils Kaiser (blog.nils-kaiser.de)
  This work is licensed under the Creative Commons Attribution 3.0 Unported License.
  To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/legalcode
  Please mention my name and blog address as above if you reuse or distribute this work.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
  LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.
-->

<config charset="UTF-8">

    <!-- *** EDIT: set Google Group forum to crawl (main page of a discussion);
         make sure that the params gvc=2 and hl=en are used -->
    <var-def name="discussionMainUrl" overwrite="false">http://groups.google.com/group/youtube-api-gdata/topics?hl=en&amp;gvc=2</var-def>

    <!-- *** EDIT: set output file name -->
    <var-def name="outputFile" overwrite="false">c:/output/output.mbox</var-def>

    <!--
      function download-multipage-list, removed typos
      adapted from http://web-harvest.sourceforge.net/samples.php?num=0
      Copyright © 2006 by vnikic at users.sourceforge.net
    -->
    <function name="download-multipage-list">
        <return>
            <while condition="${pageUrl.toString().trim().length() > 0}" maxloops="${maxloops}" index="i">
                <empty>
                    <!-- download the current page and convert it to XML -->
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>

                    <!-- extract the link to the next page, if any -->
                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <!-- pageUrl becomes empty on the last page, which ends the loop -->
                    <var-def name="pageUrl">
                        <case>
                            <if condition="${nextLinkUrl.toString().length() > 0}">
                                <template>${sys.fullUrl(pageUrl, nextLinkUrl.toString().trim())}</template>
                            </if>
                        </case>
                    </var-def>
                </empty>

                <!-- the items found on this page are the result of the iteration -->
                <xpath expression="${itemXPath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>

    <!-- from line included, thunderbird does not care about receiver and
         date is only used to check for new mail, i.e. useless here.
         see http://www.qmail.org/man/man5/mbox.html -->
    <var-def name="fromLine">From - Sat Jul 12 19:52:07 2008</var-def>

    <!-- clear output file as we use append later -->
    <file action="write" type="text" path="${outputFile}"/>

    <!-- collects all thread urls -->
    <var-def name="threadUrls">
        <call name="download-multipage-list">
            <call-param name="pageUrl"><var name="discussionMainUrl"/></call-param>
            <call-param name="nextXPath">//div[@class='maincontbox']//a[contains(.,'Older')]/@href</call-param>
            <call-param name="itemXPath">//div[@class='maincontoutboxatt']//a[contains(@href,'browse_thread')]/@href</call-param>
            <call-param name="maxloops"></call-param>
        </call>
    </var-def>

    <!-- open each thread -->
    <loop item="threadUrl" index="i" filter="unique">
        <list>
            <var name="threadUrls"/>
        </list>
        <body>
            <empty>

                <!-- absolutize thread url -->
                <var-def name="threadUrlFull">
                    <template>${sys.fullUrl(discussionMainUrl, threadUrl.toString().trim())}</template>
                </var-def>

                <!-- collect all original messages urls -->
                <var-def name="originalMsgUrls">
                    <call name="download-multipage-list">
                        <call-param name="pageUrl"><var name="threadUrlFull"/></call-param>
                        <call-param name="nextXPath">//div[@class='maincontbox']/table[1]//nobr[@id='thread_page_links_site']/a[contains(.,'Newer >')]/@href</call-param>
                        <call-param name="itemXPath">//div[@class='exh']//a[.='Show original']/@href</call-param>
                        <call-param name="maxloops"></call-param>
                    </call>
                </var-def>

                <!-- loop through messages in thread (original view) -->
                <loop item="originalMsgUrl" index="j" filter="unique">
                    <list>
                        <var name="originalMsgUrls"/>
                    </list>
                    <body>
                        <empty>

                            <!-- get original message -->
                            <var-def name="originalMsgContent">
                                <http url='${sys.fullUrl(discussionMainUrl, originalMsgUrl.toString().trim() + "&amp;output=gplain")}'/>
                            </var-def>

                            <!-- need to quote lines starting with "From ", ">From ", ">>From ", ">>>From "…
                                 see http://www.qmail.org/man/man5/mbox.html -->
                            <var-def name="originalMsgContentQuoted">
                                <regexp replace="true">
                                    <regexp-pattern>(?m)^([>]*From )</regexp-pattern>
                                    <regexp-source>
                                        <var name="originalMsgContent"/>
                                    </regexp-source>
                                    <regexp-result>
                                        <template>&gt;${_1}</template>
                                    </regexp-result>
                                </regexp>
                            </var-def>

                            <!-- process message content -->
                            <var-def name="originalMsgContentProcessed">
                                <template>${"" + fromLine + "\n" + originalMsgContentQuoted + "\n\n"}</template>
                            </var-def>

                            <!-- append content of <pre> to file -->
                            <file action="append" type="binary" path="${outputFile}">
                                <var name="originalMsgContentProcessed"/>
                            </file>

                        </empty>
                    </body>
                </loop>
            </empty>
        </body>
    </loop>

</config>
- Change the URL of the Google Groups forum to download and adjust the path of the output file (see the sections marked "*** EDIT" above).
The URL must point to the discussion page (be careful with Google Groups forums containing multiple discussion categories!). You also need to make sure that the URL contains the "&gvc=2" and "&hl=en" parameters, as the script can only parse the English version of the thread list view.
- Hit the play button in the toolbar, minimize the window and get back to whatever you were doing before… this might take a while! Also, make sure that you read the Limitations below before you start.
The instructions are provided for research purposes only, see disclaimer in the script above.
How it works
The script starts by collecting the urls of every thread by following the "Older" link on every page. It then collects the urls of every original message, this time following the "Newer" link. By appending "&output=gplain" to the url, it makes sure that the original message text page is requested.
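The link-following loop described above can be sketched outside Web-Harvest as well. The Python below is only an illustration and is not part of the script: `collect_items`, `original_text_url`, the `fetch` callable and the `example.invalid` pages are all hypothetical names, and the fake page table stands in for real HTTP requests and XPath extraction.

```python
from urllib.parse import urljoin


def collect_items(start_url, fetch, max_pages=None):
    """Follow 'next page' links from start_url and gather all item URLs.

    fetch(url) is a hypothetical callable returning (item_hrefs, next_href);
    an empty next_href marks the last page.
    """
    items, seen = [], set()
    page_url, pages = start_url, 0
    while page_url and (max_pages is None or pages < max_pages):
        item_hrefs, next_href = fetch(page_url)
        for href in item_hrefs:
            full = urljoin(page_url, href.strip())  # make the link absolute
            if full not in seen:                    # keep each URL only once
                seen.add(full)
                items.append(full)
        # An empty next link ends the loop, like the empty pageUrl in the script.
        page_url = urljoin(page_url, next_href.strip()) if next_href else ""
        pages += 1
    return items


def original_text_url(show_original_url):
    """Request the plain message source by appending the gplain parameter."""
    return show_original_url + "&output=gplain"


# Fake two-page listing used instead of real network access.
PAGES = {
    "http://example.invalid/group/topics":
        (["/group/t/1", "/group/t/2"], "/group/topics?start=2"),
    "http://example.invalid/group/topics?start=2":
        (["/group/t/2", "/group/t/3"], ""),
}

threads = collect_items("http://example.invalid/group/topics", PAGES.__getitem__)
print(threads)
```

The `max_pages` argument plays the role of the `maxloops` parameter: leaving it unset crawls until no next link is found, while a number caps the pages visited in one run.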
The script outputs a single file in mbox format, which can easily be imported into most e-mail clients (see this page for instructions). To respect the mbox convention, lines starting with "From " are quoted and a From line is added before each message.
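To make the mbox quoting concrete, here is a minimal Python sketch of the same two steps. It reuses the regular expression and the placeholder From line from the script; the function names `quote_from_lines` and `to_mbox_entry` are made up for this illustration.

```python
import re

# Same placeholder separator line as the fromLine variable in the script.
FROM_LINE = "From - Sat Jul 12 19:52:07 2008"

# Matches "From ", ">From ", ">>From ", ... at the start of any line.
_FROM_RE = re.compile(r"^(>*From )", re.MULTILINE)


def quote_from_lines(message):
    """Prefix every line that starts with [>]*'From ' with one more '>'."""
    return _FROM_RE.sub(r">\1", message)


def to_mbox_entry(message, from_line=FROM_LINE):
    """Build one mbox entry: separator line, quoted message, blank line."""
    return from_line + "\n" + quote_from_lines(message) + "\n\n"


sample = "Subject: hi\n\nFrom here on it works.\n>From the earlier mail\nRegards From me"
print(to_mbox_entry(sample))
```

Only lines that begin with "From " (optionally preceded by ">" characters) are changed; a "From " in the middle of a line is left alone, because mail clients look for the separator at the start of a line only.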
Limitations
The approach used has the following limitations:
- All email addresses in the messages are obfuscated using dots (…). However, I reckon it makes sense and was not a major drawback for my purpose anyway.
- Requests to the Google Groups forum site are blocked after a while when a large number of automated requests is recognized. You will notice that requests take a bit longer and that the responses are only about 3 KB in size.
If you open the Google Groups page in your browser, you will see the page shown on the right.
How to solve this? You could either modify the script to delay the requests, or you could crawl manually in sessions of a certain number of threads (edit the start URL, looking up how it changes when you browse to older discussion pages, and edit the maxloops parameter of the first call to the download-multipage-list helper function).
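The first option, delaying requests, boils down to enforcing a minimum interval between two downloads. The following Python sketch shows the idea only; it is not Web-Harvest code, the `Throttle` class is a hypothetical helper, and the two-second interval is an arbitrary assumption rather than a value known to avoid blocking.

```python
import time


class Throttle:
    """Ensure that at least min_interval seconds pass between two calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call wait() right before every page download.
throttle = Throttle(min_interval=2.0)  # assumed value, tune as needed
for url in ["page-1", "page-2"]:       # placeholder URLs, nothing is downloaded here
    throttle.wait()
    print("would fetch", url)
```

Inside the script itself the same effect would need a pause before each `http` call in both the helper function and the message loop.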
Hope this helps! Feel free to change the script and to notify me of any useful additions. To start changing the script, I recommend having a look at the user manual and the examples. Also have a look at some other uses here and here.
Nils Kaiser





