Tag: jupyter

  • Fun with Patent Data: Thomas Edison Jupyter Notebook

    Thomas Alva Edison was a famous American inventor and businessman, “described as America’s greatest inventor”, and one of the most prolific inventors in US history. Over his lifetime (1847-1931), Thomas Edison is credited with 1084 patents.[1] He’s just one cool inventor: lamps, light bulbs, the phonograph and so many more life-changing inventions.

    Google Patents has a wonderful depth of patent history, and the history is searchable with custom search strings:

    • inventor:(Thomas Edison) before:priority:19310101
    • inventor:(Paul R Bastide) after:priority:2009-01-01

    Google provides a seriously cool feature: a downloadable CSV. Pandas anyone? The content is provided under an agreement between the USPTO and Google. Google also provides it as part of the Google APIs/Platform. The data is fundamentally public, and Google has made it very accessible with some GitHub examples. [2] The older patent data is more difficult to search, as the content has been extracted with Optical Character Recognition.
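    As a minimal sketch of the "Pandas anyone?" idea, the snippet below reads a CSV export into a DataFrame. It assumes the export begins with one line holding the search URL, followed by the real header row; the helper name load_patents, the sample rows and the column names are made up for illustration, so check them against your own download.

```python
import io

import pandas as pd


def load_patents(source, skiprows=1):
    """Read a patent CSV export into a DataFrame.

    Assumption: the first line of the export holds the search URL and
    the real header row follows it, hence skiprows=1.
    """
    df = pd.read_csv(source, skiprows=skiprows)
    # Normalise header names, e.g. "priority date" -> "priority_date".
    df.columns = [
        c.strip().lower().replace(" ", "_").replace("/", "_")
        for c in df.columns
    ]
    return df


# Tiny made-up sample standing in for the downloaded file.
SAMPLE = """search URL:,https://patents.google.com/?inventor=Thomas+Edison
id,title,inventor/author,priority date
US-0000001-A,Example lamp,Thomas A. Edison,1880-01-01
US-0000002-A,Example phonograph,"Thomas A. Edison, Jane Doe",1885-06-15
"""

patents = load_patents(io.StringIO(SAMPLE))
print(patents.head())
```

    With a real download, pass the path of the CSV file instead of the io.StringIO sample.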

    I have found a cross-section of three things I am very interested in: History, Inventing and Data Science. Time to see what cool things the Edison data reveals.

    Steps

    To start playing with the data, one must install Jupyter.

    python3 -m pip install --upgrade pip
    python3 -m pip install jupyter

    Launch Jupyter with the command below, then navigate to http://localhost:8888/tree

    jupyter notebook

    Load and Launch the notebook

    1. Download the Edison.ipynb.zip
    2. Unzip the Edison.ipynb.zip
    3. Upload the Edison.ipynb to Jupyter
    4. Launch the Edison notebook and follow along with the cells.

    The notebook renders some interesting insights using numpy, pandas, matplotlib and scipy. The notebook includes a cell to install the Python libraries, and once one executes the prerequisites cell, all is loaded.

    The Jupyter notebook loads the data using an input cell. Once it is run, the analytics show the number of co-inventors (the data needs to be cleansed first).
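    The cleansing and counting step could look roughly like the sketch below. It is not the notebook's actual code: it assumes each row carries a comma-separated string of inventor names, and the helper name count_co_inventors and the sample names are hypothetical.

```python
from collections import Counter

import pandas as pd


def count_co_inventors(inventors, primary="edison"):
    """Count how often each co-inventor appears next to the primary inventor.

    Assumption: every entry is a comma-separated string of inventor names.
    """
    counts = Counter()
    for entry in inventors.dropna():
        # Cleanse: trim, collapse repeated spaces, unify capitalisation,
        # and de-duplicate names within a single patent.
        names = {
            " ".join(name.split()).title()
            for name in entry.split(",")
            if name.strip()
        }
        # Keep everyone except the primary inventor.
        counts.update(n for n in names if primary not in n.lower())
    return counts


# Made-up rows showing the kind of mess that needs cleansing.
sample = pd.Series([
    "Thomas A. Edison",
    "Thomas A. Edison,  jane doe",
    "THOMAS A. EDISON, Jane  Doe, John Roe",
    None,
])

co = count_co_inventors(sample)
print(co.most_common())
```

    Real OCR-derived names will need more than whitespace and case fixes (initials, misspellings), so treat the counts as a first pass.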

    One notices that Thomas Alva is not an inventor in those results; as such, one needs to modify the notebook to use the API with more recent inventors. With the comprehensive APIs from the USPTO, one extracts patent data through one of a number of JSON REST APIs. Kudos to the USPTO for really opening up the data and the API.
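    To give a feel for such a JSON REST call, here is a hedged sketch that only builds the request URL and makes no network call. The endpoint and the field names are assumptions based on the PatentsView query style referenced in [4]; they may have changed or been retired, so verify them against the current API documentation. The helper name build_query_url is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(base_url, criteria, fields):
    """Build a GET URL carrying a JSON query (q) and a JSON field list (f)."""
    params = {"q": json.dumps(criteria), "f": json.dumps(fields)}
    return f"{base_url}?{urlencode(params)}"


# Assumed endpoint and field names; check the current API docs before use.
url = build_query_url(
    "https://api.patentsview.org/patents/query",
    {"inventor_last_name": "Bastide"},
    ["patent_number", "patent_title", "patent_date"],
)
print(url)

# Round-trip check: decode the query string back into Python objects.
decoded = parse_qs(urlparse(url).query)
print(json.loads(decoded["q"][0]))
```

    The resulting URL can then be fetched with any HTTP client and the JSON response loaded into pandas.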

    Conclusion

    All in all, the APIs, Python, Jupyter Notebook and analysis are for fun, and provide insight into the patent data of Thomas Edison, one focused individual.

    References

    [1] Prolific Inventors https://en.wikipedia.org/wiki/List_of_prolific_inventors (the number appears to conflict with https://en.wikipedia.org/wiki/List_of_Edison_patents, which reports 1093, inclusive of design patents)
    [2] Google / USPTO Patent Data https://www.google.com/googlebooks/uspto-patents.html
    [3] USPTO Open Data https://developer.uspto.gov/about-open-data and https://developer.uspto.gov/api-catalog
    [4] PatentsView http://www.patentsview.org/api/faqs.html

  • Jupyter Notebook: Email Analysis to a Lotus Notes View

    I wanted to do an analysis of my emails since I joined IBM, and see the flow of messages in-and-out of my inbox.

    With my preferences for Jupyter Notebooks, I built a small notebook for analysis.

    Steps
    Open IBM Lotus Notes Rich Client

    Open the Notes Database with the View you want to analyze.

    Select the View you are interested in. For instance, the ‘All Documents’ view, like my inbox (*obfuscated* on purpose).

    Click File > Export

    Enter a file name – email.csv

    Select Format “Comma Separated Value”

    Click Export

    Upload the Notebook to your Jupyter server

    The notebook describes the flow through my process. If you encounter ValueError: (‘Unknown string format:’, ’12/10/2018 08:34 AM’), the export likely contains non-ASCII characters; refer to https://stackoverflow.com/a/8562577/1873438 and strip them with iconv:

    iconv -c -f utf-8 -t ascii email.csv > email.csv.clean

    You can break the data into a year-month-day analysis with the following, and peek at the results with df_emailA.head()
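    A minimal sketch of that breakdown, assuming the exported view has a date column named Date (a hypothetical name; use whatever your export calls it) and that dates are month-first, as in the error message above:

```python
import pandas as pd

# Hypothetical stand-in for the exported view; real column names may differ.
df_emailA = pd.DataFrame({
    "Date": [
        "12/10/2018 08:34 AM",
        "12/10/2018 04:05 PM",
        "01/02/2019 09:00 AM",
    ]
})

# An explicit format avoids guessing; month-first order is an assumption.
df_emailA["Date"] = pd.to_datetime(
    df_emailA["Date"], format="%m/%d/%Y %I:%M %p"
)

# Split the timestamp into the three columns used for grouping.
df_emailA["year"] = df_emailA["Date"].dt.year
df_emailA["month"] = df_emailA["Date"].dt.month
df_emailA["day"] = df_emailA["Date"].dt.day

print(df_emailA.head())
```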

    When you run the final cell, the code generates a Year-Month-Day count as a bar graph.

        import matplotlib.pyplot as plt

        # Title: Volume of emails per day.
        # Plots volume based on known year-mm-dd;
        # to be included in the plot, a date must have data.
        # Kind is a bar graph: dates on the X axis, counts on the Y axis.
        y_m_df = df_emailA.groupby(['year','month','day']).year.count()
        y_m_df.plot(kind="bar")

        plt.title('Numbers submitted By YYYY-MM-DD')
        plt.xlabel('Year-Month-Day')
        plt.ylabel('Email Flow')
        plt.autoscale(enable=True, axis='both', tight=False)
        # rcParams only affects figures created after this line; re-run the cell to apply it.
        plt.rcParams['figure.figsize'] = [20, 200]

    You’ll see the trend of emails I receive over the years.

    Trends of Email