====== DocSearch Plugin ======

---- plugin ----
description: Search through your uploaded documents
author     : Dominik Eckelmann
email      : dokuwiki@cosmocode.de
type       : action
lastupdate : 2016-07-18
compatible : Hogfather, 2009-08-01, 2013-05-10
depends    :
conflicts  :
similar    : elasticsearch
tags       : search

downloadurl: https://github.com/cosmocode/docsearch/zipball/master
bugtracker : https://github.com/cosmocode/docsearch/issues
sourcerepo : https://github.com/cosmocode/docsearch
donationurl:
----

This plugin allows you to search through your uploaded documents. It is integrated into the default DokuWiki search: just fill in a search string and start searching.

:!: The [[plugin:elasticsearch|elasticsearch Plugin]], with its ability to index documents, is probably a better alternative to this plugin.

[[https://www.cosmocode.de/en/open-source/dokuwiki-plugins/|{{ https://www.cosmocode.de/static/img/dokuwiki/dwplugins.png?recache|A CosmoCode Plugin}}]]

===== Download and Installation =====

Search and install the plugin using the [[plugin:extension|Extension Manager]]. Refer to [[:Plugins]] on how to install plugins manually.

==== Changes ====

{{rss>https://github.com/cosmocode/docsearch/commits/master.atom author date}}

==== Cronjob ====

To create the search index you have to set up a cronjob (or a scheduled task under Windows) that runs ''dokuwiki/lib/plugins/docsearch/cron.php''. You can also use the online cron job service https://www.easycron.com to trigger the script; a tutorial is available at https://www.easycron.com/cron-job-tutorials/how-to-set-up-cron-job-for-dokuwiki-docsearch.

The search only finds documents that are in the index. If you create the index, then upload a new file and search for it, you won't find it until you rebuild the index.

You may need to increase the ''memory_limit'' in your PHP configuration. See [[phpfn>ini.core]].

:!: Because docsearch runs cron.php as a CLI (command line) PHP script, you have to increase ''memory_limit'' in ''/etc/php5/cli/php.ini'' //Joachim 10.01.2011//

Note: if you run a DokuWiki [[:farm]], you need to run the cronjob for each animal separately, passing the animal's name as the first parameter to the script.
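For example, a nightly crontab entry might look like the following sketch. The PHP binary, wiki path, memory limit and schedule are assumptions; adjust them to your installation.

<code>
# Rebuild the document search index every night at 03:00.
# Raising memory_limit inline avoids editing the CLI php.ini.
0 3 * * * /usr/bin/php -d memory_limit=512M /var/www/dokuwiki/lib/plugins/docsearch/cron.php

# Farm setups: run the script once per animal, passing the animal name as first parameter.
30 3 * * * /usr/bin/php -d memory_limit=512M /var/www/dokuwiki/lib/plugins/docsearch/cron.php animalname
</code>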
==== Configuration ====

To configure the search you have to edit ''dokuwiki/lib/plugins/docsearch/conf/converter.php''. Use this file to set up the document-to-text converters. The plugin tries to convert every media document to a text file; in this process it uses a given set of external tools. These tools are defined per file extension: the config stores one extension and its tool per line. You can use ''%in%'' and ''%out%'' for the input and output file. The abstract syntax is:

<code>
fileextension /path/to/converter -with_calls_to_convert --from inputfile --to outputfile
</code>

^ :!: You can use the [[plugin:confmanager|ConfManager Plugin]] to edit the config ^

Example config for PDF, DOC and ODT:

<code>
# <?php exit()?>
pdf /usr/bin/pdftotext -enc UTF-8 %in% %out%
doc /usr/bin/antiword %in% > %out%
odt /usr/bin/odt2txt %in% --output=%out%
</code>

The first line disallows users to browse this file with a browser. The second line maps the PDF extension to the path of its converter, with the two placeholders ''%in%'' and ''%out%''. The third line covers DOC documents: antiword just prints the text to stdout, so ''>'' is used to redirect the text into a file.

You have to ensure that the output file is UTF-8 encoded; otherwise you might get in trouble with non-ASCII characters.

===== Todo =====

  * Allow the user to use the DokuWiki indexer to index the documents.
  * Only index documents that are new or have changed, i.e. skip already indexed documents for performance reasons.

===== Conversion settings =====

==== Office documents ====

=== Using jodconverter and OpenOffice.org ===

I would like to share some conversion settings which worked for me. I am using [[http://artofsolving.com/opensource/jodconverter|jodconverter]] together with [[http://little.bluethings.net/2008/05/30/automating-document-conversion-in-linux-using-jodconverterooo/|OpenOffice.org in headless mode]] and the following settings:

<code>
doc  java -jar jodconverter-cli-2.2.2.jar %in% %out%
docx java -jar jodconverter-cli-2.2.2.jar %in% %out%
odt  java -jar jodconverter-cli-2.2.2.jar %in% %out%
</code>

The calc formats ODS, XLS and XLSX can be handled with a wrapper script that converts them to .csv first using jodconverter and then renames the result to .txt (a minimal sketch of such a wrapper follows). Unfortunately only the first spreadsheet gets converted when the output is CSV. With PDF as the output format, all spreadsheets including their names get converted (tested only for ODS).
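A minimal sketch of such a wrapper might look like this. The script name ''ods2txt.sh'', the jodconverter path and the temp location are assumptions, and a headless OpenOffice.org instance must be running:

<code bash>
#!/bin/bash
# Hypothetical ods2txt.sh: convert a spreadsheet to CSV via jodconverter,
# then hand the CSV over as the plain-text result docsearch expects.
# Note: only the first sheet survives the CSV export (see note above).
set -e
# jodconverter picks the output format from the file extension, so
# convert to a temporary .csv file first.
TMPCSV="/tmp/$(basename "$1").csv"
/usr/bin/java -jar /opt/jodconverter/lib/jodconverter-cli-2.2.2.jar "$1" "$TMPCSV"
# "Rename" the CSV to the requested output file (e.g. *.txt).
mv "$TMPCSV" "$2"
</code>

The matching ''converter.php'' lines would then point the spreadsheet extensions at this script, e.g. ''ods /opt/ods2txt.sh %in% %out%''.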
Unfortunately jodconverter does not convert PPT or PPTX directly to TXT. It would be possible to convert them to PDF first and run the pdftotext converter afterwards, but I don't like the overhead of such a chained solution. Are there any free command line tools out there to convert the mentioned formats on a Linux machine?

**HINT:** When using OpenOffice.org in headless mode, make sure you have enough memory. Otherwise it can crash and the indexing of all following documents will fail; jodconverter then complains that it cannot connect to the OpenOffice.org server.

=== Using jodconverter, OpenOffice.org and a script ===

<code>
ppt  /office2txt.sh %in% %out%
pptx /office2txt.sh %in% %out%
odp  /office2txt.sh %in% %out%
xls  /office2txt.sh %in% %out%
xlsx /office2txt.sh %in% %out%
ods  /office2txt.sh %in% %out%
</code>

Here is the bash script I am using to do a chained conversion (first to PDF, then to TXT), because jodconverter cannot convert these formats directly to TXT files. Comments welcome, since I am no bash guru...

<code bash>
#!/bin/bash
# Converter script to convert almost everything OpenOffice.org can read to txt
# using jodconverter and the pdftotext tool.
# Because jodconverter can not convert file formats like ppt, pptx, xls, ods, xlsx
# to txt directly, a conversion to PDF is performed first using jodconverter.
# The second step is a conversion from PDF to txt using the pdftotext command line tool.
# usage: office2txt.sh inputfile outputfile
#   inputfile  is an arbitrary file OpenOffice.org can read (with correct file extension!)
#   outputfile is the filename the result should go to (txt as file extension)
#
# adapt the settings below to your own needs

echo "Input: $1"

# jodconverter jar
JODCONVERTER_CMD=/opt/jodconverter/lib/jodconverter-cli-2.2.2.jar
# pdftotext binary (find out your path using the 'which pdftotext' cmd)
PDF2TXT_CMD=/usr/bin/pdftotext
# your java binary
JAVA_CMD=/usr/bin/java
# temporary folder for storing the PDF (path without trailing /, you need write access here!)
TMP_FOLDER=/tmp/pdftmp

# extract input name
input_fullfile=$1
input_filename_w_ext=$(basename "$input_fullfile")
input_extension=${input_filename_w_ext##*.}
input_filename_wo_ext=${input_filename_w_ext%.*}

# first conversion to PDF
tmpfile="$TMP_FOLDER/$input_filename_wo_ext.pdf"
$JAVA_CMD -jar $JODCONVERTER_CMD "$input_fullfile" "$tmpfile"

# second conversion to txt
$PDF2TXT_CMD "$tmpfile" "$2"

# remove tmp file
rm -f "$tmpfile"
</code>

An alternative to OpenOffice.org is Apache Tika: http://tika.apache.org/

Example:

<code>
/usr/bin/java -jar /path/to/apache-tika/tika-app-x.xx.jar --text %in% > %out%
</code>

==== Mindmaps from FreeMind ====

=== Using XSL transformation ===

To convert files generated by FreeMind (.mm) to text, one can use an XSLT transformation with an XSL document provided by FreeMind (I took ''mm2csv.xsl'' from FreeMind 0.9beta, which worked well on files generated with 0.8.1).

<code>
mm /mm2txt.sh %in% %out%
</code>

Here is the little script which uses xmlstarlet to apply the XSL document to the FreeMind file:

<code bash>
#!/bin/bash
# Converter script to convert mindmaps generated by FreeMind to txt.
# The conversion is done by an XSL definition and the command line tool xmlstarlet.
# The used XSL file "mm2csv.xsl" can be found in the "accessories" folder of the
# FreeMind 0.9 (beta) archive, which can be downloaded at http://freemind.sourceforge.net

# Full path to the XSL file
XSL_FILE=/opt/mm2csv.xsl
# Full path to the command line converter xmlstarlet
XML_STARLET=/usr/bin/xmlstarlet

# conversion
$XML_STARLET tr $XSL_FILE "$1" > "$2"
</code>
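Before wiring a converter like this into ''converter.php'', it is worth running it by hand once to check the output. A hypothetical test run (filenames are placeholders):

<code bash>
# convert a sample mindmap and inspect the result
/mm2txt.sh /tmp/example.mm /tmp/example.txt
cat /tmp/example.txt
</code>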
==== ZIP Files ====

For ZIP files the following little script can be used. The command line tools for the conversion need to be added for each document type. The known document types get extracted to a temp folder, where they are converted to txt and joined into one big text file, which can then be indexed. Currently only conversion tools are supported that are called in the following style: ''command inputfile outputfile''.

<code bash>
#!/bin/bash
# This is a converter script to convert the content of a zip file into a single txt file.
# All files whose extensions are defined in this script get unzipped, converted to text
# and joined into one single output file.
# usage: zip2txt.sh zipfile outputfile

# adapt this:
# Folder where the zip file is unpacked.
# WARNING: DO NOT USE THIS FOLDER FOR ANYTHING ELSE -> all files in there will be converted!
TMPFOLDER="/tmp/zipconverter"
# File which is used as temporary storage.
# DO NOT PLACE THE TMPFILE INSIDE/BELOW THE TMPFOLDER IF YOU DON'T KNOW EXACTLY WHAT YOU ARE DOING
TMPFILE="/tmp/zipconversion.txt"
# commands needed for this script
UNZIP_CMD="/usr/bin/unzip"
FIND_CMD="/usr/bin/find"
# extend the extension and command arrays for your personal needs
# note: the first parameter of the cmd must be the input, the second the output filename,
# e.g. /opt/office2txt.sh
FILEEXT[0]="doc"; CMD[0]="/opt/office2txt.sh"
FILEEXT[1]="pdf"; CMD[1]="/usr/bin/pdftotext"

# IO definitions
zipfile=$1
outputfile=$2

# generate filter string from FILEEXT
filter=""
for ext in "${FILEEXT[@]}"
do
    filter="$filter *.$ext"
done

# Unzip only content with known extensions into TMPFOLDER, matching the filter
# case-insensitively (-C). The '-P \n' tells unzip that we do not have a valid
# password, so it does not ask on stdin if a file is encrypted.
$UNZIP_CMD -o -qq -C -P \n $zipfile$filter -d $TMPFOLDER

# put all filenames inside the TMPFOLDER into an array.
# Whitespace in filenames is handled correctly (from http://mywiki.wooledge.org/BashFAQ/020)
unset filenames i
while IFS= read -r -d '' file; do
    filenames[i++]=$file
done < <($FIND_CMD $TMPFOLDER -type f -print0)

# switch off case sensitivity
shopt -s nocasematch

# convert each file to txt according to the command set in CMD
for file in "${filenames[@]}"
do
    echo "Working on file: $file"
    # get file extension
    input_filename_w_ext=$(basename "$file")
    input_extension=${input_filename_w_ext##*.}
    # search extension in FILEEXT array (case insensitive)
    tLen=${#FILEEXT[@]}
    for (( i=0; i<${tLen}; i++ ));
    do
        if [[ ${FILEEXT[$i]} = $input_extension ]]
        then
            rm -f $TMPFILE # make sure it is empty
            # execute conversion cmd
            echo ${CMD[$i]} "$file" "$TMPFILE"
            ${CMD[$i]} "$file" "$TMPFILE"
            # append TMPFILE to the output file
            cat $TMPFILE >> $outputfile
            break
        fi
    done
done

# switch case sensitivity back on
shopt -u nocasematch

# remove all stuff in the temp folder and the temp file
rm -rf $TMPFOLDER/*
rm -f $TMPFILE
</code>

**WARNING:** Because this script joins all content found in the ZIP file into one huge text file, the indexing process (PHP) will need a lot of memory! You had better dump the output of this conversion script to a logfile and check it regularly for errors (a small logging sketch follows below). To increase the memory limit, have a look at the tips at the top of the page. I had to set the PHP memory limit to 250 MB because the text file generated by this script was 8.8 MByte in size. This can happen very easily if a ZIP file contains a lot of PDF documents!
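One simple way to do that is to make the script log itself. A sketch, assuming ''/var/log/zipconverter.log'' is writable by the user running the cronjob; place it near the top of ''zip2txt.sh'':

<code bash>
# Append everything this script writes (stdout and stderr) to a logfile,
# so conversion errors can be reviewed later.
# /var/log/zipconverter.log is an assumed location; adjust it to your setup.
exec >> /var/log/zipconverter.log 2>&1
echo "=== zip2txt.sh run at $(date) for $1 ==="
</code>

===== Installation on Windows 2003 =====

I run DokuWiki for our company's intranet on a Windows 2003 server with XAMPP. The following short description explains how I got //docsearch// to run on this system.

==== Converters ====

For me, the following converters worked:

  * **PDF:** pdftotext, which you can find at http://www.foolabs.com/xpdf/download.html
  * **Office documents:** catdoc, xls2csv and catppt, which you can find at http://blog.brush.co.nz/2009/09/catdoc-windows/

The conversion of DOT, XLT and the newer Office formats is not perfect, but the quality is, in my opinion, sufficient as input for docsearch.

Install the tools in an appropriate location, e.g. ''C:\TOOLS'', and adjust the converter.php file (replace ''[PATH TO]'' with your actual path, e.g. ''C:\TOOLS\XPDF'' and ''C:\TOOLS\CATDOC''):

<code>
pdf  [PATH TO]\pdftotext.exe %in% %out%
doc  [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xls  [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
ppt  [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
docx [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
xlsx [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
pptx [PATH TO]\catppt.exe -s koi8-u -d koi8-u %in% > %out%
xlt  [PATH TO]\xls2csv.exe -s koi8-u -d koi8-u %in% > %out%
dot  [PATH TO]\catdoc.exe -a -s koi8-u -d koi8-u %in% > %out%
</code>

For catdoc, //-a -s koi8-u -d koi8-u// means:

  * ''-a'': output in ASCII format
  * ''-s koi8-u'': treat the source (input) charset as KOI8-U
  * ''-d koi8-u'': write the destination (output) in the KOI8-U charset

==== Cronjob ====

Instead of a cronjob, set up a scheduled task in Windows to index new files. To do this, go to //Start->Programs->Accessories->System Tools->Scheduled Tasks//, set up a new task and enter the following into the "run" field: ''[PATH TO]\php.exe [PATH TO]\cron.php'' (the cron.php file is in a subdirectory of the docsearch plugin).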
//Example:// The "run" field should contain something like this:

<code>
C:\Program Files\php\php.exe C:\Website\dokuwiki\lib\plugins\docsearch\cron.php
</code>

or, with XAMPP as the server environment:

<code>
C:\xampp\php\php.exe C:\xampp\htdocs\dokuwiki\lib\plugins\docsearch\cron.php
</code>

The rest of the setup should be straightforward.

==== Issues ====

You may get a //"file not found"// or //"path not found"// error from ''%%cron.php%%'' when using some utilities or command line expressions in the ''%%converter.php%%'' file. This is due to the path slashes not being converted to DOS/Windows format. To fix this, insert the following code around line 87 in cron.php, after ''%%$cmd = str_replace('%out%', escapeshellarg($out), $cmd);%%'':

<code php>
if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
    $cmd = str_replace('/', '\\', $cmd);
}
</code>