Skip to content

Latest commit

 

History

History
149 lines (117 loc) · 16.4 KB

README.md

File metadata and controls

149 lines (117 loc) · 16.4 KB

Google Local Results AI Parser

Gem Version Contributors Forks Stargazers Issues Issues MIT License

AI-Powered Parser

google-local-results-ai-parser is a gem developed by SerpApi. It provides a parser for extracting structured data from Google Local Search Results using the serpapi/bert-base-local-results transformer model, enabling you to parse HTML content of Google Local Results Listings in English, extract relevant information, and classify it into different categories such as ratings, reviews, descriptions, and more.

Relevant Sources

Installation

Add this line to your application's Gemfile:

gem 'google-local-results-ai-parser'

And then execute the following in your terminal:

$ bundle install

Or install it yourself in your terminal:

$ gem install google-local-results-ai-parser

Usage

To use the google-local-results-ai-parser gem, you need to include the necessary dependencies:

require 'google-local-results-ai-parser'

Parsing HTML

The main functionality of the gem is to parse HTML content and extract structured data from it. You can use the GoogleLocalResultsAiParser.parse method to parse a single HTML document:

html = "<div jscontroller=\"AtSb\" class=\"w7Dbne\" data-record-click-time=\"false\" id=\"tsuid_25\" jsdata=\"zt2wNd;_;BvbRxs V6f1Id;_;BvbRxw\" jsaction=\"rcuQ6b:npT2md;e3EWke:kN9HDb\" data-hveid=\"CBUQAA\"><div jsname=\"jXK9ad\" class=\"uMdZh tIxNaf\" jsaction=\"mouseover:UI3Kjd\"><div class=\"VkpGBb\"><div class=\"cXedhc\"><a class=\"vwVdIc wzN8Ac rllt__link a-no-hover-decoration\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" tabindex=\"0\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEAE\"><div><div class=\"rllt__details\"><div class=\"dbg0pd\" aria-level=\"3\" role=\"heading\"><span class=\"OSrXXb\">Y Coffee</span></div><div><span class=\"Y0A0hc\"><span class=\"yi40Hd YrbPuc\" aria-hidden=\"true\">4.0</span><span class=\"z3HNkc\" aria-label=\"Rated 4.0 out of 5,\" role=\"img\"><span style=\"width:56px\"></span></span><span class=\"RDApEe YrbPuc\">(418)</span></span> · <span aria-label=\"Moderately expensive\" role=\"img\">€€</span> · Coffee shop</div><div>Nicosia</div><div class=\"pJ3Ci\"><span>Iconic Seattle-based coffeehouse chain</span></div></div></div></a><a class=\"uQ4NLd b9tNq wzN8Ac rllt__link a-no-hover-decoration\" aria-hidden=\"true\" tabindex=\"-1\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEA4\"><g-img class=\"gTrj3e\"><img id=\"pimg_3\" src=\"https://lh5.googleusercontent.com/p/AF1QipPaihclGQYWEJpMpBnBY8Nl8QWQVqZ6tF--MlwD=w184-h184-n-k-no\" class=\"YQ4gaf zr758c wA1Bge\" alt=\"\" data-atf=\"4\" data-frt=\"0\" width=\"92\" height=\"92\"></g-img></a></div></div></div></div>"
bearer_token = 'Huggingface Token or Private Server Key'
result = GoogleLocalResultsAiParser.parse(html: html, bearer_token: bearer_token)

image

The result variable will contain a hash with the extracted data classified into different categories. For example:

{
  "address" => "Nicosia",
  "description" => "Iconic Seattle-based coffeehouse chain",
  "price" => "€€",
  "reviews" => "418",
  "rating" => "4.0",
  "type" => "Coffee shop"
}

Parsing Multiple HTML Parts

If you have multiple HTML parts that you want to parse concurrently, you can use the GoogleLocalResultsAiParser.parse_multiple method. This method takes an array of HTML parts and returns an array of parsed results:

html_parts = [
                '<div jscontroller=\"AtSb\" class=\"w7Dbne\" data-record-click-time=\"false\" id=\"tsuid_25\" jsdata=\"zt2wNd;_;BvbRxs V6f1Id;_;BvbRxw\" jsaction=\"rcuQ6b:npT2md;e3EWke:kN9HDb\" data-hveid=\"CBUQAA\"><div jsname=\"jXK9ad\" class=\"uMdZh tIxNaf\" jsaction=\"mouseover:UI3Kjd\"><div class=\"VkpGBb\"><div class=\"cXedhc\"><a class=\"vwVdIc wzN8Ac rllt__link a-no-hover-decoration\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" tabindex=\"0\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEAE\"><div><div class=\"rllt__details\"><div class=\"dbg0pd\" aria-level=\"3\" role=\"heading\"><span class=\"OSrXXb\">Y Coffee</span></div><div><span class=\"Y0A0hc\"><span class=\"yi40Hd YrbPuc\" aria-hidden=\"true\">4.0</span><span class=\"z3HNkc\" aria-label=\"Rated 4.0 out of 5,\" role=\"img\"><span style=\"width:56px\"></span></span><span class=\"RDApEe YrbPuc\">(418)</span></span> · <span aria-label=\"Moderately expensive\" role=\"img\">€€</span> · Coffee shop</div><div>Nicosia</div><div class=\"pJ3Ci\"><span>Iconic Seattle-based coffeehouse chain</span></div></div></div></a><a class=\"uQ4NLd b9tNq wzN8Ac rllt__link a-no-hover-decoration\" aria-hidden=\"true\" tabindex=\"-1\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEA4\"><g-img class=\"gTrj3e\"><img id=\"pimg_3\" src=\"https://lh5.googleusercontent.com/p/AF1QipPaihclGQYWEJpMpBnBY8Nl8QWQVqZ6tF--MlwD=w184-h184-n-k-no\" class=\"YQ4gaf zr758c wA1Bge\" alt=\"\" data-atf=\"4\" data-frt=\"0\" width=\"92\" height=\"92\"></g-img></a></div></div></div></div>',
                '<div jscontroller=\"AtSb\" class=\"w7Dbne\" data-record-click-time=\"false\" id=\"tsuid_25\" jsdata=\"zt2wNd;_;BvbRxs V6f1Id;_;BvbRxw\" jsaction=\"rcuQ6b:npT2md;e3EWke:kN9HDb\" data-hveid=\"CBUQAA\"><div jsname=\"jXK9ad\" class=\"uMdZh tIxNaf\" jsaction=\"mouseover:UI3Kjd\"><div class=\"VkpGBb\"><div class=\"cXedhc\"><a class=\"vwVdIc wzN8Ac rllt__link a-no-hover-decoration\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" tabindex=\"0\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEAE\"><div><div class=\"rllt__details\"><div class=\"dbg0pd\" aria-level=\"3\" role=\"heading\"><span class=\"OSrXXb\">X Coffee</span></div><div><span class=\"Y0A0hc\"><span class=\"yi40Hd YrbPuc\" aria-hidden=\"true\">4.0</span><span class=\"z3HNkc\" aria-label=\"Rated 4.0 out of 5,\" role=\"img\"><span style=\"width:56px\"></span></span><span class=\"RDApEe YrbPuc\">(418)</span></span> · <span aria-label=\"Moderately expensive\" role=\"img\">€€</span> · Coffee shop</div><div>Nicosia</div><div class=\"pJ3Ci\"><span>Iconic Washington-based coffeehouse chain</span></div></div></div></a><a class=\"uQ4NLd b9tNq wzN8Ac rllt__link a-no-hover-decoration\" aria-hidden=\"true\" tabindex=\"-1\" jsname=\"kj0dLd\" data-cid=\"12176489206865957637\" jsaction=\"click:h5M12e;\" role=\"link\" data-ved=\"2ahUKEwiS1P3_j-P7AhXnVPEDHa0oAiAQvS56BAgVEA4\"><g-img class=\"gTrj3e\"><img id=\"pimg_3\" src=\"https://lh5.googleusercontent.com/p/AF1QipPaihclGQYWEJpMpBnBY8Nl8QWQVqZ6tF--MlwD=w184-h184-n-k-no\" class=\"YQ4gaf zr758c wA1Bge\" alt=\"\" data-atf=\"4\" data-frt=\"0\" width=\"92\" height=\"92\"></g-img></a></div></div></div></div>',
                ...
             ]
bearer_token = 'Huggingface Token or Private Server Key'
results = GoogleLocalResultsAiParser.parse_multiple(html_parts: html_parts, bearer_token: bearer_token)

The results variable will contain an array of hashes, with each hash representing the extracted data from a corresponding HTML part:

[
    {
      "address" => "Nicosia",
      "description" => "Iconic Seattle-based coffeehouse chain",
      "price" => "€€",
      "reviews" => "418",
      "rating" => "4.0",
      "type" => "Coffee shop"
    },
    ...
]

Advanced Usage

The google-local-results-ai-parser gem provides several advanced features to handle different scenarios and improve the accuracy of the parsing results.

Configuration Options

The gem provides some configuration options that you can customize according to your needs:

  • server: The API server URL for the Hugging Face model. The default value is https://api-inference.huggingface.co/models/serpapi/bert-base-local-results. You may change the value to your desired endpoint.

    • Free Inference API: This default endpoint is a Free Inference API for Fast Prototyping offered by Huggingface. It might not be up all the time, or could get rate-limited depending on your usage. You may find the relevant information about Free Inference API at Huggingface Documentation.
    • Production-Ready Inference API Endpoints: For Production or Heavy-Load Prototyping, one of the options is to set up your Private Production-Ready Inference API Endpoints from serpapi/bert-base-local-results Repository Page. You may find the relevant information about Production-Ready Inference API Endpoints at Huggingface Documentation.
    • Private Server: Another option for Production or Heavy-Load Prototyping is setting up your Private Server that mimics the Production-Ready Inference API. SerpApi has open-sourced an example server code called google-local-results-ai-server. The default path to make POST Requests will be at /models/serpapi/bert-base-local-results once you deploy or locally set up the private server.
  • separator_regex: A regular expression used to split the extracted text into separate parts. The default value is /\n|·|⋅/. This ruby regex is splitting the text where they are separated in a Google Local Result. The default value is enough for recent results, and may be changed in the future if the text structure at Google Local Results changes.

  • rejected_css: CSS selector to exclude specific elements from the parsing process. The default value is "[role='heading'], a[ping], [class*='label']". These CSS selectors contain titles, links, and labels which are excluded from the parser due to their easy-to-scrape nature. Any enrichment to these CSS selectors should be done according to Nokolexbor standards. Nokolexbor is a drop-in replacement for Nokogiri. It's 5.2x faster at parsing HTML and up to 997x faster at CSS selectors.

  • broken_css: CSS selector to break down bold text and reduce noise. The default value is "b:has(::text)". The default value is enough to break down elements that create a noise for the model by extracting different parts of the same text as separate items for recent results. It may be changed in the future if the HTML structure of the Google Local Results changes. Any enrichment to these CSS selectors should be done according to Nokolexbor standards as well.

  • iteration: The maximum number of iterations for after-corrections. The default value is 1. The gem uses after-corrections to model's predictions to serve more precise results. You may find the known limitations at the serpapi/bert-base-local-results. The 3 types of after-corrections are:

    • Safety measures: Safety measures are for protecting against the noise of the unpredicted cases. For now, there is only one kind of safety measure in the gem, and that is checking if an excerpt button text has been caught and deducted from the result. The button texts are handled by the traditional part of SerpApi's Google Local API, and should be deducted if caught.
    • Known clashes: Sometimes the model serves clashing results for one label. This may happen due to limitations of the model, the generality of meaning of the classified text, or the limitations of the dataset model was trained on. The gem can clear out the majority of the known clashes, and correct them using traditional logical algorithms.
    • General clashes: The model can compare the assurance score of two texts with same label, and pick the one with a higher score to automatically correct the results after predictions. From raw observations, doing after-corrections only once is observed to be enough. You may increase iteration parameter to force the after-corrections more in case of any further clashes.
  • debug: The parameter allows the returning of debugging time information needed to calculate maximum time it takes to connect to the server. You may take a look at example_debug.rb and example_multiple_debug.rb for reference.

  • no_cache: The parameter forces the model endpoint to make a prediciton instead of serving cached results. This is also used for debugging purposes to see the initial load of the model. You may take a look at example_debug.rb and example_multiple_debug.rb for reference.